elm CQ DUTs locked up between the shard and the master
Issue description:
elm-paladin builds have failed for two consecutive runs on dummy_Pass in HWTests.

Link to build or pfq page:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936923584334088048
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936942011261040112

build # for that buildbot:
7052 and 7051

Snippet of log that contains the failure:
BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 700, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 280, in PerformStage
    raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **
Aug 29
Per jrbarnett@: "AU tests change the version on the DUT, but don't change the version label," which causes the DUTs to be sent to repair and become unavailable.
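To illustrate the mismatch being described: a minimal sketch (not the lab's actual check; the hostname and label value are placeholders, and it assumes CHROMEOS_RELEASE_BUILDER_PATH is present in /etc/lsb-release on the test image) that compares what is actually installed on a DUT against the cros-version label the scheduler still holds:

# Sketch only: compare the image on the DUT with the scheduler's
# cros-version label. DUT and SCHEDULER_LABEL are placeholders.
import subprocess

DUT = 'chromeos2-rowR-rackN-hostH'                          # placeholder
SCHEDULER_LABEL = 'cros-version:elm-release/RXX-YYYY.0.0'   # placeholder

out = subprocess.check_output(
    ['ssh', 'root@%s' % DUT,
     'grep CHROMEOS_RELEASE_BUILDER_PATH /etc/lsb-release'])
installed = out.strip().split('=', 1)[1]
labeled = SCHEDULER_LABEL.split(':', 1)[1]

if installed != labeled:
    # This is the state an AU test can leave behind: the next scheduled
    # job sees a label/image mismatch and the DUT gets sent to repair.
    print('label/image mismatch: label=%s installed=%s' % (labeled, installed))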
Aug 29
The DUT pool is actually labeled correctly for CQ. BVT had provisioning problems, but those would not have caused this failure. It looks like there may be issues with one of the arc_setup CLs.
Aug 29
This failure isn't because of the DUTs at all; the scheduler information
for the DUTs is somehow out of whack.
There are two failed provision suites:
http://cautotest-prod/afe/#tab_id=view_job&object_id=231653305
http://cautotest-prod/afe/#tab_id=view_job&object_id=231611195
In both cases, the suites aborted. In the second case (at least),
all of the DUTs were working and available, yet the work was never
scheduled. I can't explain why.
I attempted to schedule more work on two of the DUTs; the job is here:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231699034
The jobs are stuck. On the shard, the jobs are recorded as Aborted:
http://cros-full-0015.mtv.corp.google.com/afe/#tab_id=view_job&object_id=231699034
So, something's really wrong. But I can't explain what.
It's mine to fix, because I'm deputy, but I'm going to need help.
So, this has to wait until the morrow.
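In case it helps whoever picks this up in the morning: a minimal sketch of cross-checking the stuck job's HQE state on the master and on the shard, assuming direct read access to both AFE MySQL databases (DB_USER/DB_PASS/DB_NAME and the master DB hostname are placeholders, and the shard's database is assumed to live on cros-full-0015):

# Sketch only: compare host queue entry state for job 231699034 between
# the master AFE database and the shard's. Connection parameters are
# placeholders; column names follow the standard afe_host_queue_entries
# schema, but verify against the live databases first.
import MySQLdb

JOB_ID = 231699034
QUERY = ('SELECT id, status, active, complete, aborted '
         'FROM afe_host_queue_entries WHERE job_id = %s')

def hqe_state(db_host):
    conn = MySQLdb.connect(host=db_host, user='DB_USER',
                           passwd='DB_PASS', db='DB_NAME')
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY, (JOB_ID,))
        return cursor.fetchall()
    finally:
        conn.close()

for name, db_host in [('master', 'MASTER_DB_HOST'),
                      ('shard', 'cros-full-0015.mtv.corp.google.com')]:
    print('%s: %r' % (name, hqe_state(db_host)))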
Aug 29
The problem still happens, as of the elm-paladin run that started
at 02:03:
https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/7055
I think this qualifies for P0: We've mitigated the problem by
marking the builder experimental, but that means we have a significant
coverage gap that we need to close ASAP. In any event, short of a global
outage, this is the top priority for the Lab Deputy.
Aug 29
Job 231699034 killed the scheduler because of a duplicate-key error while creating aborted HQE entries:
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 193, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 493, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 306, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 964, in _send_to_lucifer
    self._send_starting_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 987, in _send_starting_to_lucifer
    job.hostqueueentry_set.all())
  File "/usr/local/autotest/frontend/afe/models.py", line 2036, in abort_host_queue_entries
    AbortedHostQueueEntry.objects.bulk_create(aborted_hqes)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 152, in bulk_create
    return self.get_query_set().bulk_create(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 441, in bulk_create
    self._batched_insert(objs_with_pk, fields, batch_size)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 902, in _batched_insert
    using=self.db)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 1661, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 122, in execute
    six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '232414284' for key 'PRIMARY'")
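The duplicate key suggests a stale aborted-HQE row already exists for that queue entry. A minimal sketch of confirming that before deciding whether to clean up the row or patch the scheduler, assuming AbortedHostQueueEntry maps to the usual afe_aborted_host_queue_entries table keyed by queue_entry_id (connection parameters are placeholders):

# Sketch only: check whether an aborted-HQE row already exists for the id
# reported in the IntegrityError. Table/column names follow the standard
# AFE schema; verify against the live database before acting on this.
import MySQLdb

HQE_ID = 232414284  # primary key from the "Duplicate entry" error above

conn = MySQLdb.connect(host='DB_HOST', user='DB_USER',
                       passwd='DB_PASS', db='DB_NAME')
try:
    cursor = conn.cursor()
    cursor.execute(
        'SELECT queue_entry_id, aborted_by_id, aborted_on '
        'FROM afe_aborted_host_queue_entries WHERE queue_entry_id = %s',
        (HQE_ID,))
    # A non-empty result means the scheduler is trying to re-abort an HQE
    # that was already recorded as aborted.
    print(cursor.fetchall())
finally:
    conn.close()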
Aug 29
The provision jobs that were the original bug don't appear to ever have been sent to the shard. I'm not the primary expert on the shard sync though so I don't have any clues other than the ongoing shard heartbeat issues.
Aug 29
> The provision jobs that were the original bug don't appear to
> ever have been sent to the shard. I'm not the primary expert
> on the shard sync though so I don't have any clues other than
> the ongoing shard heartbeat issues.

TTBOMK, there's no ongoing shard heartbeat issue.

The scheduler restart problem on cros-full-0015 is pretty dramatic, though:
http://shortn/_ey4FErCX57

That graph seems to show that the problem has stopped, too.
Aug 29
> That graph seems to show that the problem has stopped, too.

Cross-checking the shard, the logs show that the scheduler was restarting roughly once every minute and 8 seconds until 09:17:52 this morning. The scheduler seems stable now. I don't know whether that means the shard is back to scheduling work, though.
Aug 29
... and checking history, the scheduler restarts post-date the start of this problem. So, we can expect that this problem is still ongoing.
Aug 29
I've created two more new jobs, to see if we can find out more:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=232004976
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231998497
So far, both of them seem to be hung up.
Aug 29
My claims to the contrary notwithstanding, multiple elm CQ DUTs are now successfully doing work:

$ dut-status -d 2 -p cq -b elm
hostname                      S   last checked         URL
chromeos2-row7-rack11-host1   OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host1/1399451-provision/
chromeos2-row7-rack11-host11  OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host11/1399454-provision/
chromeos2-row7-rack10-host1   OK  2018-08-29 09:18:33  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host1/1399448-provision/
chromeos2-row7-rack10-host7   OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host7/1399453-provision/
chromeos2-row7-rack10-host11  OK  2018-08-29 09:18:37  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host11/1399478-provision/
chromeos2-row7-rack9-host5    OK  2018-08-29 09:18:33  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack9-host5/1399447-provision/
chromeos2-row7-rack7-host1    OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack7-host1/1399450-provision/
chromeos2-row7-rack8-host13   OK  2018-08-29 09:25:29  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack8-host13/1399516-reset/
chromeos2-row7-rack11-host17  OK  2018-08-29 10:55:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host17/1400083-cleanup/
chromeos2-row7-rack10-host21  OK  2018-08-29 10:57:38  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host21/1400097-cleanup/
chromeos2-row7-rack11-host19  ??  2018-08-28 18:15:31  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host19/1399373-verify/
chromeos2-row7-rack11-host13  ??  2018-08-28 18:15:31  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host13/1399372-verify/

The problem children are the two hosts selected for job 231699034. I suspect simply resetting their state back to Ready will help.
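For the record, a minimal sketch of what "resetting their state back to Ready" could look like as a direct database poke, assuming the two "??" hosts above are the ones tied to job 231699034 (connection parameters are placeholders, the column names follow the standard afe_hosts schema, and the shard's copy of the host rows may need the same treatment):

# Sketch only: flip the two stuck hosts back to Ready in the AFE database.
# DB_HOST/DB_USER/DB_PASS/DB_NAME are placeholders; double-check the rows
# (and whether the shard needs the same change) before running anything
# like this against production.
import MySQLdb

STUCK_HOSTS = ['chromeos2-row7-rack11-host19', 'chromeos2-row7-rack11-host13']

conn = MySQLdb.connect(host='DB_HOST', user='DB_USER',
                       passwd='DB_PASS', db='DB_NAME')
try:
    cursor = conn.cursor()
    placeholders = ', '.join(['%s'] * len(STUCK_HOSTS))
    cursor.execute(
        "UPDATE afe_hosts SET status = 'Ready' "
        'WHERE hostname IN (%s)' % placeholders,
        STUCK_HOSTS)
    conn.commit()
finally:
    conn.close()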
Aug 29
And look!
https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/7056
Green paladin run.
Aug 29
No longer urgent, but still worrisome.
I note an ostensibly similar failure on auron_paine-paladin:
This build:
https://luci-milo.appspot.com/buildbot/chromeos/auron_paine-paladin/3765
This provision suite:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231888034
Aug 29
There's some evidence that this problem could be related to bug 878403.
Sep 5
https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/ is looking good. close_wait -> close.
Sep 5
https://luci-milo.appspot.com/buildbot/chromeos/auron_paine-paladin/ is also looking good.