Starred by 2 users
Status: WontFix
Owner:
Closed: Nov 13
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



canary: Big series of build timeouts
Project Member Reported by tfiga@chromium.org, Aug 16
We just had a significant series of build timeouts across multiple boards.

My guess is that canary and stable/beta kicked in at the same time and were fighting for boards. This would be consistent with HWTests having big delays between runs of particular tests for some of the builds, while for other builds swarming simply failed due to not having enough boards in the pool.

Someone should confirm whether this is WAI.

Builders failed on: 
- banon-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/banon-release/1403
- buddy-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/buddy-release/1386
- daisy_skate-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/daisy_skate-release/1680
- gandof-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/gandof-release/1394
- hana-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/hana-release/945
- kevin-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/kevin-release/1413
- lulu-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/lulu-release/1406
- peach_pit-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-release/2861
- quawks-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/quawks-release/1689
- reef-uni-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/reef-uni-release/164
- reks-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/reks-release/1402
- samus-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/samus-release/4575
- setzer-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/setzer-release/1398
- terra-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/terra-release/1407
- ultima-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/ultima-release/1407
- veyron_fievel-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/veyron_fievel-release/1403
- veyron_minnie-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-release/1406
- veyron_rialto-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/veyron_rialto-release/1402
- veyron_tiger-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/veyron_tiger-release/1402
- wolf-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/wolf-release/2410
- zako-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/zako-release/1695



 
Cc: jrbarnette@chromium.org akes...@chromium.org
Owner: nxia@chromium.org
Status: Assigned
A problem for this week's deputy.

nxia@ - Check that cautotest and the database were up and running.
Infrastructure failures sometimes cause devices to be reported bad
when the real problem is elsewhere.

 https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-release/2861

TestLabException: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 1957, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1706, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 336, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
Will return from run_suite with status: INFRA_FAILURE
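
For context, this is roughly the shape of the check that raises the error above. This is a minimal sketch only, not the actual autotest diagnosis_utils code; the host dicts and their 'available' field are assumptions made for the example.

====
class NotEnoughDutsError(Exception):
    """Raised when a board/pool has fewer usable DUTs than the suite needs."""


def check_dut_availability(hosts, required, board, pool):
    # 'hosts' is assumed to be a list of dicts describing DUTs in the pool;
    # DUTs busy running another build's suite do not count as available.
    available = [h for h in hosts if h.get('available')]
    if len(available) < required:
        # Abort with an infra failure rather than queueing and timing out.
        raise NotEnoughDutsError(
            'Not enough DUTs for board: %s, pool: %s; required: %d, found: %d'
            % (board, pool, required, len(available)))
    return available
====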

Saw a lot of error logs like this. As I recall, this is a symptom of a shard Apache problem, and we have a cron job to restart the shard Apache service periodically. I don't see the timeout anymore; I guess it has been fixed by the restart cron job.
 Issue 755882  has been merged into this issue.
> Saw a lot of error logs like this. As I recall, this is a symptom of a shard Apache problem, and we have a cron job to restart the shard Apache service periodically. I don't see the timeout anymore; I guess it has been fixed by the restart cron job.

I took a brief look at the DUT history.  This is a different problem.
I can't explain it.  Given that it's not still happening, it may or may
not need a better explanation, but this isn't a known problem.

OK, I poked around a bit more regarding the cited peach_pit failure.

This command was most useful:

====
$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b peach_pit | awk '/provision/ {print previous ; print} {previous = $0}'
    2017-08-15 19:30:10  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323459-chromeos-test/
    2017-08-15 19:21:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host22/4088869-provision/
    2017-08-15 20:24:37  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135329252-chromeos-test/
    2017-08-15 20:13:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host7/4089304-provision/
    2017-08-15 19:30:14  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323462-chromeos-test/
    2017-08-15 19:21:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host7/4088870-provision/
    2017-08-15 19:30:22  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323465-chromeos-test/
    2017-08-15 19:21:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host6/4088871-provision/
    2017-08-15 19:30:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323456-chromeos-test/
    2017-08-15 19:21:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host16/4088868-provision/
    2017-08-15 19:30:16  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323449-chromeos-test/
    2017-08-15 19:21:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host13/4088866-provision/
    2017-08-15 19:14:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135322421-chromeos-test/
    2017-08-15 19:05:00  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host10/4088835-provision/
====

That shows the last provision executed on the BVT DUTs prior to the
failed canary test, plus the first test after that provision.  These
are the test jobs:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323459
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135329252
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323462
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323465
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323456
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323449
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135322421

One of those tests above was the sanity suite for the failed R62-9845.0.0 build.
The remainder of those tests were for R60-9592.83.0 (a stable channel build, I
think).

So, it looks like the reason that the peach_pit canary failed is that the DUTs
were busy testing the stable channel build.
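
As a side note, the awk one-liner in that command can be hard to parse at a glance, so here is a rough Python equivalent of just the filtering step. The implied "dut-status ... | python" pipeline is illustrative only; the logic mirrors the awk program: for every provision entry it prints the line that immediately precedes it in the dut-status output (which, per the explanation above, is the first job run after that provision) together with the provision line itself.

====
#!/usr/bin/env python
# Rough Python equivalent of:
#   awk '/provision/ {print previous ; print} {previous = $0}'
# Reads dut-status output on stdin and, for every provision entry, prints
# the line that immediately precedes it plus the provision line itself.
import sys

previous = ''
for line in sys.stdin:
    line = line.rstrip('\n')
    if 'provision' in line:
        print(previous)
        print(line)
    previous = line
====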

I followed jrbarnette@'s procedure and checked banon

nxia@nxia:~/cidb_creds$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b banon | awk '/provision/ {print previous ; print} {previous = $0}'
    2017-08-15 20:06:15  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330868-chromeos-test/
    2017-08-15 19:54:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack17-host15/1398930-provision/
    2017-08-15 20:06:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330870-chromeos-test/
    2017-08-15 19:54:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack17-host9/1398931-provision/
    2017-08-15 20:21:22  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135334012-chromeos-test/
    2017-08-15 20:08:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host7/1399001-provision/
    2017-08-15 20:06:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330862-chromeos-test/
    2017-08-15 19:54:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host7/1398928-provision/
    2017-08-15 20:06:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330859-chromeos-test/
    2017-08-15 19:54:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host3/1398927-provision/
    2017-08-15 19:46:50  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135326364-chromeos-test/
    2017-08-15 19:34:39  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack19-host13/1398870-provision/


http://cautotest/afe/#tab_id=view_job&object_id=135330868

Job: banon-release/R60-9592.83.0/bvt-inline/provision_AutoUpdate.double (135330868-chromeos-test)

Around 2017-08-15 19:53:24, the DUTs were running tests for R60-9592.83.0.


https://uberchromegw.corp.google.com/i/chromeos_release/builders/master-release%20release-R60-9592.B/builds/71

release-R60-9592.B/builds/71 was kicked off at Tue Aug 15 16:30:06 2017 and finished at Wed Aug 16 00:55:11 2017.

Looks like the reason is as jrbarnette@ pointed out: the stable channel builds were kicked off before the canary builds and took up the DUTs, so the canaries timed out waiting for DUTs to run their tests.
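
For completeness, a quick back-of-the-envelope check of that timeline, using only the timestamps already quoted in this bug (the -u value from the dut-status commands and the R60 build window from the waterfall link above); just arithmetic, nothing pulled from cautotest:

====
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# release-R60-9592.B build 71 window, from the waterfall link above.
r60_start = datetime.strptime('2017-08-15 16:30:06', FMT)
r60_end   = datetime.strptime('2017-08-16 00:55:11', FMT)

# Approximate start time of the failed canary HWTest stages, i.e. the -u
# value used in the dut-status commands above.
hwtest_start = datetime.strptime('2017-08-15 20:32:14', FMT)

# The canary was asking for DUTs roughly four hours into the stable build.
print('HWTest start inside R60 window:', r60_start <= hwtest_start <= r60_end)
print('Time since R60 build started:', hwtest_start - r60_start)
====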

> I followed jrbarnette@'s procedure and checked banon
>
> nxia@nxia:~/cidb_creds$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b banon | awk '/provision/ {print previous ; print} {previous = $0}'

Oh. Hm.  For anyone else wanting to further reproduce this:  The
date "2017-08-15 20:32:14" isn't arbitrary.  It's taken from the
failure logs, and represents (roughly) the start time of the failed
HWTest stage.  For peach_pit it was 20:32:14; for banon it was
20:28:02, or close enough to be the same as peach_pit.  But, other
boards may be different, so pay attention.

Status: WontFix
The lack-of-DUTs symptom in the canary builds was caused by the triggering of the release-R60-9592.B builds. NotEnoughDutsError is the expected exception for this case.