New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 622789 link

Starred by 7 users

Issue metadata

Status: Verified
Owner:
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

StageControlFileFailure: Failed to stage tricky-chrome-pfq/R53-8490.0.0-rc2

Project Member Reported by h...@chromium.org, Jun 23 2016

Issue description

"Failed to stage" error reported at https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2021

Logs https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2021/steps/HWTest%20%5Bsanity%5D/logs/stdio

Autotest instance: cautotest
06-23-2016 [01:13:33] Submitted create_suite_job rpc
Error Message: StageControlFileFailure: Failed to stage tricky-chrome-pfq/R53-8490.0.0-rc2: 


    
    500 Internal Server Error
    
    #powered_by {
        margin-top: 20px;
        border-top: 2px solid black;
        font-style: italic;
    }

    #traceback {
        color: red;
    }
    

    
        500 Internal Server Error
        The server encountered an unexpected condition which prevented it from fulfilling the request.
        Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 669, in is_staged
    response = str(dl.IsStaged(factory))
  File "/home/chromeos-test/chromiumos/src/platform/dev/downloader.py", line 211, in IsStaged
    raise DownloaderException(exceptions)
DownloaderException: An error occurred when attempting to untar /home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2/test_suites.tar.bz2:
Command '['tar', 'xf', u'/home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2/test_suites.tar.bz2', '--directory', u'/home/chromeos-test/images/tricky-chrome-pfq/R53-8490.0.0-rc2', '--verbose', '--use-compress-prog=pbzip2']' returned non-zero exit status 2

 
Labels: Build-PFQ-Failures
Cc: jrbarnette@chromium.org pho...@chromium.org
gerrit was throwing 500 failures this morning, could that be related?

+infra deputies

Components: -Infra>Labs Infra>Client>ChromeOS
> gerrit was throwing 500 failures this morning, could that be related?

This particular failure is coming from the test lab infrastructure,
so it's more likely a problem on a specific devserver, or (more likely)
a glitch with googlestorage.

Comment 5 by autumn@chromium.org, Jun 28 2016

Owner: sbasi@chromium.org
+ Simran - can you take a look as secondary? 
Adding info:  It looks like the request for testing (and therefore
to stage payloads) happened before all artifacts were fully uploaded.

It looks like the uploads completed on or around 00:27, while testing
was kicked off no later than 00:25, or two minutes too early.

Here's the timings from the "Archive" stage:
    00:23:31: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -m cp -v -- /b/cbuild/internal_master/buildbot_archive/daisy_skate-paladin/R53-8507.0.0-rc3/UPLOADED gs://chromeos-moblab-samsung/daisy_skate-paladin/R53-8507.0.0-rc3/UPLOADED
    00:27:31: INFO: Running cidb query on pid 18317, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x9bb8e50>

From the HWTest stage, we see these timestamps:
    00:22:06: INFO: Waiting up to forever for payloads ...
    00:24:57: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpVpV5mP/tmpaVq2KA/temp_summary.json --raw-cmd --task-name daisy_skate-paladin/R53-8507.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build daisy_skate-paladin/R53-8507.0.0-rc3 --board daisy_skate --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 10 --minimum_duts 4 --offload_failures_only True --suite_args "{'attr_filter': '(suite:bvt-inline) and (subsystem:default)'}" -c
    Autotest instance: cautotest
    06-27-2016 [00:25:10] Submitted create_suite_job rpc

Comment 7 by sbasi@chromium.org, Jun 28 2016

The test_suites tarball exists and is un-tarable, likely a devserver issue. A good first step might be to log the devserver IP when these failures occur.

Comment 8 by sbasi@chromium.org, Jun 28 2016

Ahh Richard's comment makes sense, but why didn't the Archive stage block the Hwtest stage?
Actually, diving deeper, it looks like the test_suites.tar.bz2
file is uploaded in the "UploadTestArtifacts" statge (duh),
but the timing still suggests a race:

    00:22:24: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -m cp -v -- /b/cbuild/internal_master/buildbot_archive/daisy_skate-paladin/R53-8507.0.0-rc3/test_suites.tar.bz2 gs://chromeos-image-archive/daisy_skate-paladin/R53-8507.0.0-rc3/test_suites.tar.bz2
[ ... ]
    00:26:16: INFO: Running cidb query on pid 18290, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x9b0f150>

> Ahh Richard's comment makes sense, but why didn't the Archive
> stage block the Hwtest stage?

I suspect the key is buried in this log line from HWTest:
    00:22:06: INFO: Waiting up to forever for payloads ...

One should check to see whether it's waiting just for the
payloads and forgetting (say) the test artifacts.

Owner: ----
I note that this failure has a long, storied history of neglect:
  * bug 518591
  *  bug 529324 
  * this bug report, and now
  * bug 639005

Can we find some way to escalate this for attention?

Cc: semenzato@chromium.org shchen@chromium.org
Cc: dgarr...@chromium.org snanda@chromium.org akes...@chromium.org aaboagye@chromium.org
 Issue 529324  has been merged into this issue.
Pulling in a relevant link:
  https://uberchromegw.corp.google.com/i/chromeos/builders/celes-release/builds/312


Test failure log message:
  chromite-0.0.2-r2522: 05:25:37: WARNING: Cannot run HWTest because UploadTestArtifacts failed. See UploadTestArtifacts for details.


Test artifacts finished uploading log message:
  ** Finished Stage UploadTestArtifacts - Thu, 18 Aug 2016 05:39:21 -0700 (PDT)


We have a stage synchronization problem inside cbuildbot that is causing this issue.
Labels: BuildHealth iptaskforce
Re c#14, I'm confused about that "test failure log message"... That output is from the UnitTest stage and I don't think it's a failure. It's one of the test cases: HWTestStageTest.testPayloadsNotGenerated.

Basically, I'm not seeing this stage synchronization issue based off of the logs in c#14. 
I think that this bug has conflated two issues. The bug that got merged in here by c#13 is a different issue with receiving an error during update_engine. (ErrorCode::kOmahaErrorInHTTPResponse). Might want to move those reports to  bug 620815 .

The original issue was about running tests before the autotest artifacts were staged. Have we seen re-occurrences of that?

I looked through the chromite code and I do see that we're waiting for just the payloads and not the test artifacts. We could speculatively fix that.
bug 612230 is related to that one as well. I'd seen that issue before and thought that it might be an overloaded devserver. (bug 632149) 
Labels: Week-1635
Owner: aaboagye@chromium.org
Status: Started (was: Untriaged)
I started on a CL to fix the stage synchronization issue here: https://chromium-review.googlesource.com/#/c/377622/
Issue 546457 has been merged into this issue.
Project Member

Comment 21 by bugdroid1@chromium.org, Sep 2 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/e1648c5bb5327295223fe9d99abbd270976d0954

commit e1648c5bb5327295223fe9d99abbd270976d0954
Author: Aseda Aboagye <aaboagye@google.com>
Date: Mon Aug 29 23:39:12 2016

cbuildbot: stages: Add runattr for test artifacts.

The UploadHWTestArtifacts stage is in charge of uploading the test
payloads along with the test artifacts.  There was already an existing
runtime attribute which indicated when the payloads were generated, but
there was nothing for the HW Test artifacts.  This commit replaces the
existing attributes of "payloads_generated" and
"delta_payloads_generated" with the "test_artifacts_uploaded" runtime
attribute which is set when all of the test artifacts are uploaded.

Additionally, the HWTest stage and the AUTest stage have been modified
to wait for this attribute to be set from which the decision will be
made to proceed with the stage.  Both the payloads and the test
artifacts must be uploaded before the HWTest or AUTest stage will be
allowed to run.  This in effect synchronizes the stages.

BUG= chromium:622789 
BRANCH=None
TEST=python test_stages_unittest
TEST=python artifact_stages_unittest

Change-Id: Ib9b0a3d02f986b5d8181650eaae8c307cf5b89c1
Signed-off-by: Aseda Aboagye <aaboagye@google.com>
Reviewed-on: https://chromium-review.googlesource.com/377622
Commit-Ready: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aseda Aboagye <aaboagye@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/cbuildbot_run.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/test_stages_unittest.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/artifact_stages.py
[modify] https://crrev.com/e1648c5bb5327295223fe9d99abbd270976d0954/cbuildbot/stages/test_stages.py

Status: Fixed (was: Started)
Let's see if the 10:00 canary run actually picked up these changes.
Status: Verified (was: Fixed)
Looks like the changes were picked up. If this ever happens again, someone can reopen this.
I think this change broke cbuildbot --nobuild --notests

It looks like UploadTestArtifacts will now block forever waiting for images which aren't coming since some other stages are disabled. 

https://chromium-review.googlesource.com/#/c/401993/ is a potential fix.
I think that CL will work, but once WaitUntilReady() is used everywhere, wherever the decision is made to skip tests, that should set the runtime attribute that the stage is waiting for to None.

Sign in to add a comment