Issue 736812

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 736807




All boards in auron,kunimitsu,strago families have RED build status on ToT(61). OOB suites do not run on them.

Project Member Reported by ka...@chromium.org, Jun 26 2017

Issue description

GE builds view - https://screenshot.googleplex.com/MtSNtzknJJA

Happening since: 2017-06-23, 9679.0.0 / 61.0.3138.0

Red status stages:
- PaygenTestDev - all boards
- HW tests - lots of the boards

Other families have some of their boards in the same state.

OOB test suites do not run on boards with a red build state. This is a recurring problem for me, because I track the test results that the test team uses to prioritize its daily work.
Can the dependency of the OOB suites on the PayGen tests be removed?
 
Cc: yllin@chromium.org sjg@chromium.org cernekee@chromium.org pprabhu@chromium.org
+Sheriffs.  It's not yet clear to me whether this is an infra or a
product bug.  The tests are failing.  In some cases, they may be
timing out.

Also, pprabhu@ mentioned to me a similar problem on Friday.  This
could be a duplicate.  Certainly, the auron_paine builder has been
red long enough.

Comment 2 by ka...@chromium.org, Jun 26 2017

Summary: All boards in auron,kunimitsu,strago families have RED build status on ToT(61). OOB suites do not run on them. (was: All boards in auron,kunimitsu,strago families have RED build status on ToT(61). OOB suites do not run on them.l)
Labels: -Pri-2 Pri-1
Owner: sjg@chromium.org
Status: Assigned (was: Untriaged)
I went and looked at the history of one Paygen test failure:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125175082

That test ran on chromeos4-row8-rack4-host18.  Looking at the DUT's history,
you see this:

chromeos4-row8-rack4-host18
    2017-06-25 23:59:35  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row8-rack4-host18/959995-repair/
    2017-06-25 23:57:10  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row8-rack4-host18/959991-reset/
    2017-06-25 23:17:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/125175082-chromeos-test/
    2017-06-25 23:16:49  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row8-rack4-host18/959965-reset/

That log shows that after running the test, the DUT failed reset
testing, and required repair.  The repair logs are here:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack4-host18/959995-repair/20172506235933/

The status.log from the repair task shows that the DUT was offline at
the start of repair, and required reinstallation from USB before it would
work.

This sort of symptom is caused by product bugs, not infra bugs.  Passing to
a sheriff for more evaluation.

Escalating, because this doesn't look like something we can afford to
ignore.

Comment 4 by sjg@google.com, Jun 26 2017

Status: Started (was: Assigned)

Comment 5 by sjg@google.com, Jun 26 2017

Status: Assigned (was: Started)

Comment 6 by sjg@chromium.org, Jun 26 2017

Status: Started (was: Assigned)

Comment 7 by sjg@chromium.org, Jun 26 2017

09:10:05: INFO: RunCommand: /b/c/cbuild/repository/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' stat -- gs://chromeos-releases/canary-channel/auron-paine/9687.0.0/payloads/signing/28791-140248560363328/1.payload.hash.update_signer.signed.bin
09:10:05: WARNING: GS_ERROR: No URLs matched: gs://chromeos-releases/canary-channel/auron-paine/9687.0.0/payloads/signing/28791-140248560363328/1.payload.hash.update_signer.signed.bin 


Comment 8 by sjg@chromium.org, Jun 26 2017

It seems to generate and sign the payloads OK.

Then PaygenTestCanary says this:


09:13:16: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpnOa2xh/tmpTA3zxf/temp_summary.json --raw-cmd --task-name auron_paine-release/R61-9687.0.0-paygen_au_canary --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:paygen_au_canary' '--tags=build:auron_paine-release/R61-9687.0.0' '--tags=task_name:auron_paine-release/R61-9687.0.0-paygen_au_canary' '--tags=board:auron_paine' -- /usr/local/autotest/site_utils/run_suite.py --build auron_paine-release/R61-9687.0.0 --board auron_paine --suite_name paygen_au_canary --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -m 125235548

@@@STEP_FAILURE@@@
10:02:36: ERROR: Timeout occurred- waited 27622 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
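
To put the numbers in the log excerpt above side by side, here is a small arithmetic sketch. All values are copied from the two log lines; the only assumption is that both timestamps are from the same day and the same clock, since the log prints times without dates. The variable names are illustrative only.

```python
from datetime import datetime, timedelta


def hms(seconds):
    """Format a number of seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


# Values copied from the log lines above.
BUILD_WAIT_SECONDS = 27622      # "waited 27622 seconds"
SWARMING_TIMEOUT_SECONDS = 14400  # --timeout / --io-timeout / --hard-timeout
SUITE_TIMEOUT_MINUTES = 180     # run_suite.py --timeout_mins

# Assumption: both log timestamps fall on the same day.
stage_start = datetime.strptime("09:13:16", "%H:%M:%S")
stage_fail = datetime.strptime("10:02:36", "%H:%M:%S")
stage_elapsed = int((stage_fail - stage_start).total_seconds())

print("master deadline wait :", hms(BUILD_WAIT_SECONDS))
print("swarming timeout     :", hms(SWARMING_TIMEOUT_SECONDS))
print("suite timeout        :", hms(SUITE_TIMEOUT_MINUTES * 60))
print("time since RunCommand:", hms(stage_elapsed))
```

Under that assumption, the 27622 seconds in the error is longer than the interval between the swarming RunCommand and the failure, which is consistent with the message's own wording that the deadline is set by the master for the build and that either this stage or a previous one took too long.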




https://uberchromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/1249/steps/PaygenTestCanary/logs/stdio

> 10:02:36: ERROR: Timeout occurred- waited 27622 seconds, failing.
> Timeout reason: This build has reached the timeout deadline set
> by the master. Either this stage or a previous one took too long
> (see stage timing historical summary in ReportStage) or the build
> failed to start on time.

My assumption is that this is a downstream impact of the real problem.
The logs from the test suite show failures, DUTs being forced into
repair, and aborts.  I expect that this is the causal chain:
  * DUT goes offline, as described in  bug 736807 .
  * The offline DUT causes a test failure.
  * The offline DUT (and the failure) force repair.
  * The time required to complete repair means that the test suite
    times out, and some tests abort.
  * The timeout on the Autotest side shows up as the builder message
    above.

Comment 10 by nxia@chromium.org, Jun 26 2017

Cc: nxia@chromium.org
https://bugs.chromium.org/p/chromium/issues/detail?id=722603#c29

It looks like sentry-release is also affected by this.

Comment 11 by sjg@google.com, Jun 26 2017

Blockedon: 736807

Comment 12 by sjg@google.com, Jun 27 2017

Status: Fixed (was: Started)
I believe the PaygenTestDev failure is fixed by the toolchain revert.

sentry-release just had a green run.

https://uberchromegw.corp.google.com/i/chromeos/builders/sentry-release/builds/1252

I'm closing this since I believe the root cause is fixed.

Comment 13 by ka...@chromium.org, Jun 27 2017

Status: Verified (was: Fixed)
Thanks. Yes, most of these failures are gone.
