StageControlFileFailure: Failed to stage tricky-chrome-pfq/R53-8490.0.0-rc2
Issue description

"Failed to stage" error reported at https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2021

Logs: https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2021/steps/HWTest%20%5Bsanity%5D/logs/stdio

Autotest instance: cautotest
06-23-2016 [01:13:33] Submitted create_suite_job rpc

Error Message:
StageControlFileFailure: Failed to stage tricky-chrome-pfq/R53-8490.0.0-rc2: 500 Internal Server Error

500 Internal Server Error
The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 669, in is_staged
    response = str(dl.IsStaged(factory))
  File "/home/chromeos-test/chromiumos/src/platform/dev/downloader.py", line 211, in IsStaged
    raise DownloaderException(exceptions)
DownloaderException: An error occurred when attempting to untar /home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2/test_suites.tar.bz2: Command '['tar', 'xf', u'/home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2/test_suites.tar.bz2', '--directory', u'/home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2', '--verbose', '--use-compress-prog=pbzip2']' returned non-zero exit status 2
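For context on where the 500 comes from: the devserver downloads the build artifacts from Google Storage and then untars test_suites.tar.bz2; if tar exits non-zero (for example because the tarball is truncated or not yet fully uploaded), the failure is wrapped in a DownloaderException and surfaces as the 500 above. A minimal illustrative sketch of that untar step in Python follows; the function name and error handling are simplifications, not the actual downloader.py code, and only the tar command line is taken from the log.

import subprocess

def untar_test_suites(stage_dir, tarball):
    """Extract the test_suites tarball into stage_dir, mirroring the logged command."""
    cmd = ['tar', 'xf', tarball,
           '--directory', stage_dir,
           '--verbose', '--use-compress-prog=pbzip2']
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as e:
        # In the real devserver this is collected into a DownloaderException,
        # which CherryPy then reports as the 500 Internal Server Error above.
        raise RuntimeError('An error occurred when attempting to untar %s: %s'
                           % (tarball, e))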
Jun 23 2016
gerrit was throwing 500 failures this morning, could that be related? +infra deputies
Jun 23 2016
> gerrit was throwing 500 failures this morning, could that be related?

This particular failure is coming from the test lab infrastructure, so it's more likely a problem on a specific devserver, or (more likely) a glitch with googlestorage.
Jun 28 2016
+ Simran - can you take a look as secondary?
Jun 28 2016
Adding info: it looks like the request for testing (and therefore
to stage payloads) happened before all artifacts were fully uploaded.
The uploads completed on or around 00:27, while testing was kicked off
no later than 00:25, roughly two minutes too early.
Here are the timings from the "Archive" stage:
00:23:31: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -m cp -v -- /b/cbuild/internal_master/buildbot_archive/daisy_skate-paladin/R53-8507.0.0-rc3/UPLOADED gs://chromeos-moblab-samsung/daisy_skate-paladin/R53-8507.0.0-rc3/UPLOADED
00:27:31: INFO: Running cidb query on pid 18317, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x9bb8e50>
From the HWTest stage, we see these timestamps:
00:22:06: INFO: Waiting up to forever for payloads ...
00:24:57: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpVpV5mP/tmpaVq2KA/temp_summary.json --raw-cmd --task-name daisy_skate-paladin/R53-8507.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build daisy_skate-paladin/R53-8507.0.0-rc3 --board daisy_skate --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 10 --minimum_duts 4 --offload_failures_only True --suite_args "{'attr_filter': '(suite:bvt-inline) and (subsystem:default)'}" -c
Autotest instance: cautotest
06-27-2016 [00:25:10] Submitted create_suite_job rpc
Jun 28 2016
The test_suites tarball exists and can be untarred successfully, so this is likely a devserver issue. A good first step might be to log the devserver IP when these failures occur.
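To sketch what that could look like (a hedged example only: stage_build and stage_with_logging are hypothetical names, not existing autotest or devserver APIs), the idea is simply to record the chosen devserver alongside any staging failure:

import logging

def stage_build(devserver, build):
    """Assumed placeholder for the RPC that asks a devserver to stage a build."""
    raise NotImplementedError

def stage_with_logging(devserver, build):
    # Record which devserver handled the request so a later 500 can be tied
    # to a specific machine rather than to "the lab" in general.
    logging.info('Staging %s on devserver %s', build, devserver)
    try:
        stage_build(devserver, build)
    except Exception:
        logging.exception('Failed to stage %s on devserver %s', build, devserver)
        raise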
Jun 28 2016
Ah, Richard's comment makes sense, but why didn't the Archive stage block the HWTest stage?
Jun 28 2016
Actually, diving deeper, it looks like the test_suites.tar.bz2
file is uploaded in the "UploadTestArtifacts" stage (duh),
but the timing still suggests a race:
00:22:24: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -m cp -v -- /b/cbuild/internal_master/buildbot_archive/daisy_skate-paladin/R53-8507.0.0-rc3/test_suites.tar.bz2 gs://chromeos-image-archive/daisy_skate-paladin/R53-8507.0.0-rc3/test_suites.tar.bz2
[ ... ]
00:26:16: INFO: Running cidb query on pid 18290, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x9b0f150>
Jun 29 2016
> Ahh Richard's comment makes sense, but why didn't the Archive
> stage block the Hwtest stage?
I suspect the key is buried in this log line from HWTest:
00:22:06: INFO: Waiting up to forever for payloads ...
One should check to see whether it's waiting just for the
payloads and forgetting (say) the test artifacts.
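To make the suspicion concrete, the wait behind that log line presumably looks something like the sketch below: the stage blocks on a "payloads ready" signal but never on a "test artifacts uploaded" one, so it can proceed while test_suites.tar.bz2 is still uploading. This is an illustrative Python sketch under that assumption; the class and method names are stand-ins, not the actual chromite code.

import threading

class BoardRunAttributes(object):
    """Toy stand-in for the per-board runtime attributes shared between stages."""

    def __init__(self):
        self._events = {'payloads_generated': threading.Event(),
                        'test_artifacts_uploaded': threading.Event()}

    def mark_ready(self, name):
        self._events[name].set()

    def wait_for(self, name, timeout=None):
        # timeout=None corresponds to "Waiting up to forever ..." in the log.
        return self._events[name].wait(timeout)

def hwtest_stage(attrs):
    # Suspected behavior: only the payload signal is awaited, so the stage can
    # submit create_suite_job while test_suites.tar.bz2 is still uploading.
    attrs.wait_for('payloads_generated')
    # ... submit the run_suite.py / create_suite_job request here ...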
Aug 18 2016
I note that this failure has a long, storied history of neglect:
* bug 518591
* bug 529324
* this bug report, and now
* bug 639005
Can we find some way to escalate this for attention?
Aug 18 2016
Issue 529324 has been merged into this issue.
Aug 18 2016
Pulling in a relevant link: https://uberchromegw.corp.google.com/i/chromeos/builders/celes-release/builds/312

Test failure log message:
chromite-0.0.2-r2522: 05:25:37: WARNING: Cannot run HWTest because UploadTestArtifacts failed. See UploadTestArtifacts for details.

Test artifacts finished uploading log message:
** Finished Stage UploadTestArtifacts - Thu, 18 Aug 2016 05:39:21 -0700 (PDT)

We have a stage synchronization problem inside cbuildbot that is causing this issue.
Aug 25 2016
Re c#14, I'm confused about that "test failure log message"... That output is from the UnitTest stage, and I don't think it's a failure; it's one of the test cases: HWTestStageTest.testPayloadsNotGenerated. Basically, I'm not seeing this stage synchronization issue based on the logs in c#14.
Aug 26 2016
I think this bug has conflated two issues. The bug that got merged in here by c#13 is a different issue: receiving an error during update_engine (ErrorCode::kOmahaErrorInHTTPResponse). We might want to move those reports to bug 620815. The original issue was about running tests before the autotest artifacts were staged. Have we seen recurrences of that? I looked through the chromite code, and I do see that we're waiting for just the payloads and not the test artifacts. We could speculatively fix that.
Aug 26 2016
Bug 612230 is related to that one as well. I'd seen that issue before and thought it might be an overloaded devserver (bug 632149).
Aug 31 2016
I started on a CL to fix the stage synchronization issue here: https://chromium-review.googlesource.com/#/c/377622/
Aug 31 2016
Issue 546457 has been merged into this issue.
Sep 2 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/e1648c5bb5327295223fe9d99abbd270976d0954

commit e1648c5bb5327295223fe9d99abbd270976d0954
Author: Aseda Aboagye <aaboagye@google.com>
Date: Mon Aug 29 23:39:12 2016

cbuildbot: stages: Add runattr for test artifacts.

The UploadHWTestArtifacts stage is in charge of uploading the test payloads along with the test artifacts. There was already an existing runtime attribute which indicated when the payloads were generated, but there was nothing for the HW Test artifacts. This commit replaces the existing attributes of "payloads_generated" and "delta_payloads_generated" with the "test_artifacts_uploaded" runtime attribute which is set when all of the test artifacts are uploaded. Additionally, the HWTest stage and the AUTest stage have been modified to wait for this attribute to be set, from which the decision will be made to proceed with the stage. Both the payloads and the test artifacts must be uploaded before the HWTest or AUTest stage will be allowed to run. This in effect synchronizes the stages.

BUG=chromium:622789
BRANCH=None
TEST=python test_stages_unittest
TEST=python artifact_stages_unittest

Change-Id: Ib9b0a3d02f986b5d8181650eaae8c307cf5b89c1
Signed-off-by: Aseda Aboagye <aaboagye@google.com>
Reviewed-on: https://chromium-review.googlesource.com/377622
Commit-Ready: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aseda Aboagye <aaboagye@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/cbuildbot_run.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/test_stages_unittest.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/artifact_stages.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/test_stages.py
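In effect, the commit makes the artifact-uploading stage the producer of a single "test_artifacts_uploaded" signal and makes HWTest/AUTest consumers of it. A rough illustrative sketch of that synchronization follows; the class and method names are simplified stand-ins, not the actual chromite cbuildbot_run/stages code.

import threading

class RunAttrs(object):
    """Simplified stand-in for the runtime attribute added by the commit."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def set_test_artifacts_uploaded(self, value):
        self._value = value
        self._event.set()

    def wait_test_artifacts_uploaded(self):
        self._event.wait()
        return self._value

def upload_test_artifacts_stage(attrs):
    # ... generate payloads and upload test_suites.tar.bz2 and friends ...
    attrs.set_test_artifacts_uploaded(True)  # signal only once everything is uploaded

def hwtest_stage(attrs):
    # HWTest (and AUTest) now block here instead of waiting only for payloads,
    # which closes the race discussed earlier in this bug.
    attrs.wait_test_artifacts_uploaded()
    # ... submit create_suite_job ...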
Sep 2 2016
Let's see if the 10:00 canary run actually picked up these changes.
Sep 3 2016
Looks like the changes were picked up. If this ever happens again, someone can reopen this.
Oct 21 2016
I think this change broke cbuildbot --nobuild --notests. It looks like UploadTestArtifacts will now block forever waiting for images which aren't coming, since some other stages are disabled. https://chromium-review.googlesource.com/#/c/401993/ is a potential fix.
Oct 22 2016
I think that CL will work, but once WaitUntilReady() is used everywhere, wherever the decision is made to skip tests, it should set the runtime attribute that the stage is waiting for to None.
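Something along these lines, reusing the same toy attribute shape as the earlier sketch (again an assumption-laden illustration, not the actual chromite code): the stage that decides to skip publishes None, and the waiting stage treats None as "skip" rather than blocking forever.

import threading

class RunAttr(object):
    """Toy runtime attribute: a value plus an event recording that it was set."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._event.set()

    def wait(self):
        self._event.wait()
        return self._value

def upload_test_artifacts_stage(attr, skip_tests):
    if skip_tests:
        # With --nobuild/--notests nothing will be uploaded; publish None so
        # downstream waiters wake up instead of blocking forever.
        attr.set(None)
        return
    # ... upload the test artifacts ...
    attr.set(True)

def hwtest_stage(attr):
    # WaitUntilReady-style check: None means tests were skipped upstream.
    if attr.wait() is None:
        return  # skip the stage
    # ... run the hardware tests ...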