canary: Big series of build timeouts
Issue description

We just had a significant series of build timeouts across multiple boards. My guess is that canary and stable/beta builds kicked in at the same time and were fighting for boards. This could be confirmed by the HWTest stages showing big delays between runs of particular tests for some of the builds, while for other builds swarming simply failed because there were not enough boards in the pool. Someone should confirm whether this is WAI.

Builders failed on:
- banon-release: https://luci-milo.appspot.com/buildbot/chromeos/banon-release/1403
- buddy-release: https://luci-milo.appspot.com/buildbot/chromeos/buddy-release/1386
- daisy_skate-release: https://luci-milo.appspot.com/buildbot/chromeos/daisy_skate-release/1680
- gandof-release: https://luci-milo.appspot.com/buildbot/chromeos/gandof-release/1394
- hana-release: https://luci-milo.appspot.com/buildbot/chromeos/hana-release/945
- kevin-release: https://luci-milo.appspot.com/buildbot/chromeos/kevin-release/1413
- lulu-release: https://luci-milo.appspot.com/buildbot/chromeos/lulu-release/1406
- peach_pit-release: https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-release/2861
- quawks-release: https://luci-milo.appspot.com/buildbot/chromeos/quawks-release/1689
- reef-uni-release: https://luci-milo.appspot.com/buildbot/chromeos/reef-uni-release/164
- reks-release: https://luci-milo.appspot.com/buildbot/chromeos/reks-release/1402
- samus-release: https://luci-milo.appspot.com/buildbot/chromeos/samus-release/4575
- setzer-release: https://luci-milo.appspot.com/buildbot/chromeos/setzer-release/1398
- terra-release: https://luci-milo.appspot.com/buildbot/chromeos/terra-release/1407
- ultima-release: https://luci-milo.appspot.com/buildbot/chromeos/ultima-release/1407
- veyron_fievel-release: https://luci-milo.appspot.com/buildbot/chromeos/veyron_fievel-release/1403
- veyron_minnie-release: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-release/1406
- veyron_rialto-release: https://luci-milo.appspot.com/buildbot/chromeos/veyron_rialto-release/1402
- veyron_tiger-release: https://luci-milo.appspot.com/buildbot/chromeos/veyron_tiger-release/1402
- wolf-release: https://luci-milo.appspot.com/buildbot/chromeos/wolf-release/2410
- zako-release: https://luci-milo.appspot.com/buildbot/chromeos/zako-release/1695
Comment 1 by bhthompson@google.com, Aug 16 2017
A problem for this week's deputy. nxia@ - Check that cautotest and the database were up and running. Infrastructure failures sometimes cause devices to be reported bad when the real problem is elsewhere.
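For reference, this is roughly what that deputy check amounts to: confirm that the cautotest web frontend answers requests and that the autotest database accepts connections. The snippet below is only an illustrative sketch; the database host, credentials, database name, and the use of the requests/pymysql libraries are assumptions for illustration, not values recorded in this issue.
====
# Hypothetical sketch of the deputy check described above. The DB host,
# credentials, and database name are placeholders, not taken from this issue.
import pymysql
import requests

AFE_URL = 'http://cautotest/afe/'        # assumed frontend URL
DB_HOST = 'autotest-db.example.com'      # placeholder database host


def check_cautotest():
    """Fail loudly if the cautotest web frontend is not serving."""
    resp = requests.get(AFE_URL, timeout=30)
    resp.raise_for_status()
    print('cautotest frontend OK (HTTP %d)' % resp.status_code)


def check_database():
    """Open a connection and run a trivial query against the autotest DB."""
    conn = pymysql.connect(host=DB_HOST, user='readonly', password='...',
                           database='chromeos_autotest_db', connect_timeout=30)
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT 1')
            cur.fetchone()
        print('database OK')
    finally:
        conn.close()


if __name__ == '__main__':
    check_cautotest()
    check_database()
====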
Aug 18 2017
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-release/2861

TestLabException: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 1957, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1706, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 336, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
Will return from run_suite with status: INFRA_FAILURE

Saw a lot of error logs like this. As I recall, this is a symptom of the shard apache problem, and we have a cron job to restart the shard apache service periodically. I don't see the timeout anymore; I guess it has been fixed by the restart cron job.
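For context on where that error comes from: run_suite counts the usable DUTs in the requested board/pool before scheduling anything and aborts if there are fewer than the suite requires. The snippet below is a simplified sketch of that idea, not the actual code in diagnosis_utils.py; the host-status strings used to exclude DUTs are assumptions for illustration.
====
# Simplified sketch (not the real diagnosis_utils code) of the kind of check
# that produces the error above: count the usable DUTs in a board/pool and
# abort the suite up front if there are fewer than required.


class NotEnoughDutsError(Exception):
    """Raised when a board/pool has fewer usable DUTs than the suite needs."""


def check_dut_availability(hosts, board, pool, required):
    """Fail fast if the board/pool cannot satisfy the suite's DUT demand."""
    # Assumed status names; anything not marked unusable counts as available.
    available = [h for h in hosts
                 if h.status not in ('Repair Failed', 'Repairing')]
    if len(available) < required:
        raise NotEnoughDutsError(
            'Not enough DUTs for board: %s, pool: %s; required: %d, found: %d'
            % (board, pool, required, len(available)))
    return available
====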
Aug 18 2017
Issue 755882 has been merged into this issue.
Aug 18 2017
> Saw a lot of error logs like this. As I recall, this is a symptom of the shard
> apache problem, and we have a cron job to restart the shard apache service
> periodically. I don't see the timeout anymore; I guess it has been fixed by the
> restart cron job.

I took a brief look at the DUT history. This is a different problem; I can't
explain it. Given that it's not still happening, it may or may not need a better
explanation, but this isn't a known problem.
Aug 18 2017
OK, I poked around a bit more regarding the cited peach_pit failure.
This command was most useful:
====
$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b peach_pit | awk '/provision/ {print previous ; print} {previous = $0}'
2017-08-15 19:30:10 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323459-chromeos-test/
2017-08-15 19:21:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host22/4088869-provision/
2017-08-15 20:24:37 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135329252-chromeos-test/
2017-08-15 20:13:30 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host7/4089304-provision/
2017-08-15 19:30:14 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323462-chromeos-test/
2017-08-15 19:21:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host7/4088870-provision/
2017-08-15 19:30:22 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323465-chromeos-test/
2017-08-15 19:21:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host6/4088871-provision/
2017-08-15 19:30:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323456-chromeos-test/
2017-08-15 19:21:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host16/4088868-provision/
2017-08-15 19:30:16 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135323449-chromeos-test/
2017-08-15 19:21:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host13/4088866-provision/
2017-08-15 19:14:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135322421-chromeos-test/
2017-08-15 19:05:00 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host10/4088835-provision/
====
That shows the last provision executed on the BVT DUTs prior to the
failed canary test, plus the first test after that provision. These
are the test jobs:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323459
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135329252
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323462
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323465
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323456
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135323449
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=135322421
One of those tests above was the sanity suite for the failed R62-9845.0.0 build.
The remainder of those tests were for R60-9592.83.0 (a stable channel build, I
think).
So, it looks like the reason that the peach_pit canary failed is that the DUTs
were busy testing the stable channel build.
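(For anyone who finds the awk idiom at the top of this comment opaque, a rough Python equivalent is sketched below; the script name is hypothetical. It prints each provision entry together with the line that immediately precedes it, which, given the newest-first ordering visible in the output above, is the first job that ran on the DUT after that provision.)
====
#!/usr/bin/env python
# Rough Python equivalent of the awk one-liner above (illustrative only).
# Reads dut-status output from stdin and prints each provision line along
# with the line just before it.
import sys

previous = ''
for line in sys.stdin:
    line = line.rstrip('\n')
    if 'provision' in line:
        print(previous)
        print(line)
    previous = line
====
Usage would be something like:
dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b peach_pit | python last_provision.py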
Aug 18 2017
I followed jrbarnette@'s procedure and checked banon
nxia@nxia:~/cidb_creds$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b banon | awk '/provision/ {print previous ; print} {previous = $0}'
2017-08-15 20:06:15 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330868-chromeos-test/
2017-08-15 19:54:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack17-host15/1398930-provision/
2017-08-15 20:06:09 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330870-chromeos-test/
2017-08-15 19:54:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack17-host9/1398931-provision/
2017-08-15 20:21:22 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135334012-chromeos-test/
2017-08-15 20:08:29 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host7/1399001-provision/
2017-08-15 20:06:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330862-chromeos-test/
2017-08-15 19:54:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host7/1398928-provision/
2017-08-15 20:06:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135330859-chromeos-test/
2017-08-15 19:54:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack18-host3/1398927-provision/
2017-08-15 19:46:50 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/135326364-chromeos-test/
2017-08-15 19:34:39 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack19-host13/1398870-provision/
http://cautotest/afe/#tab_id=view_job&object_id=135330868
Job: banon-release/R60-9592.83.0/bvt-inline/provision_AutoUpdate.double (135330868-chromeos-test)
Around 2017-08-15 19:53:24, the DUTs were running tests for R60-9592.83.0
https://uberchromegw.corp.google.com/i/chromeos_release/builders/master-release%20release-R60-9592.B/builds/71
release-R60-9592.B/builds/71 was kicked off at Tue Aug 15 16:30:06 2017 and finished at Wed Aug 16 00:55:11 2017.
Looks like the reason is as jrbarnette@ pointed out: the stable channel builds were kicked off before the canary builds and took up the DUTs, so the canaries timed out waiting for DUTs to run their tests.
Aug 19 2017
> I followed jrbarnette@'s procedure and checked banon
>
> nxia@nxia:~/cidb_creds$ dut-status -f -d 4 -u '2017-08-15 20:32:14' -p bvt -b banon | awk '/provision/ {print previous ; print} {previous = $0}'
Oh. Hm. For anyone else wanting to further reproduce this: The
date "2017-08-15 20:32:14" isn't arbitrary. It's taken from the
failure logs, and represents (roughly) the start time of the failed
HWTest stage. For peach_pit it was 20:32:14; for banon it was
20:28:02, or close enough to be the same as peach_pit. But other
boards may be different, so pay attention.
Nov 13 2017
The lack-of-DUTs symptom in the canary builds was caused by the triggering of the release-R60-9592.B builds. NotEnoughDutsError is the expected exception in this case.