Issue 878621

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----




elm CQ DUTs locked up between the shard and the master

Project Member Reported by skau@chromium.org, Aug 29

Issue description

elm-paladin builds have failed for two consecutive runs on dummy_Pass in HWTests.

Link to build or pfq page.
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936923584334088048
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936942011261040112

build # for that buildbot.
7052 and 7051

Snippet of log that contains the failure.
BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 700, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 280, in PerformStage
    raise cmd_result.to_raise
TestLabFailure: ** HWTest did not complete due to infrastructure issues (code 3) **


 
Cc: jrbarnette@chromium.org
Something seems to be killing the DUTs.  Investigating.
Per jrbarnette@: "AU tests change the version on the DUT, but don't change the version label," causing the DUTs to go through repair and become unavailable.
The DUT pool is actually labeled correctly for CQ.  BVT had provisioning problems, but those would not have caused this failure.  It looks like there may be an issue with one of the arc_setup CLs.
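For context on the label-mismatch mechanism: provisioning decides whether a DUT needs repair by comparing the build it is actually running against its cros-version label.  A minimal illustrative sketch follows; the lsb-release field, the label format, and ssh access are assumptions for illustration, not taken from this bug.

import subprocess

CROS_VERSION_PREFIX = 'cros-version:'

def dut_reported_build(host):
    # Read the build the DUT is actually running from /etc/lsb-release
    # (CHROMEOS_RELEASE_BUILDER_PATH is assumed to carry the builder path).
    out = subprocess.check_output(
        ['ssh', host,
         'grep CHROMEOS_RELEASE_BUILDER_PATH /etc/lsb-release']).decode()
    return out.strip().split('=', 1)[1]

def label_build(afe_labels):
    # Extract the build recorded in the scheduler's cros-version label.
    for label in afe_labels:
        if label.startswith(CROS_VERSION_PREFIX):
            return label[len(CROS_VERSION_PREFIX):]
    return None

def needs_repair(host, afe_labels):
    # An AU test that re-images the DUT without updating the label makes
    # these two values disagree; verify then sends the DUT to repair.
    return dut_reported_build(host) != label_build(afe_labels)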
Cc: pprabhu@chromium.org ayatane@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: OS-Chrome Pri-1
Owner: jrbarnette@chromium.org
Status: Assigned (was: Untriaged)
Summary: elm CQ DUTs locked up between the shard and the master (was: elm-paladin failing during HWTest Provision)
This failure isn't because of the DUTs at all; the scheduler information
for the DUTs is somehow out of whack.

There are two failed provision suites:
    http://cautotest-prod/afe/#tab_id=view_job&object_id=231653305
    http://cautotest-prod/afe/#tab_id=view_job&object_id=231611195

In both cases, the suites aborted.  In the second case (at least),
all of the DUTs were working and available, yet the work was never
scheduled.  I can't explain why.

I attempted to schedule more work on two of the DUTs; the job is here:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231699034

The jobs are stuck.  On the shard, the jobs are recorded as Aborted:
    http://cros-full-0015.mtv.corp.google.com/afe/#tab_id=view_job&object_id=231699034
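To make the master/shard disagreement concrete, the job's host queue entries can be compared on both databases.  A minimal sketch, assuming the stock autotest schema (afe_host_queue_entries) and read access to both MySQL instances; the DB hostnames and credentials below are placeholders.

import MySQLdb

QUERY = ('SELECT id, host_id, status, aborted, complete '
         'FROM afe_host_queue_entries WHERE job_id = %s')

def hqe_rows(db_host, job_id):
    # Fetch the HQE rows for one job from a given AFE database.
    conn = MySQLdb.connect(host=db_host, user='<db-user>',
                           passwd='<redacted>', db='chromeos_autotest_db')
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY, (job_id,))
        return cursor.fetchall()
    finally:
        conn.close()

# The master thinks the work is still pending; the shard shows it Aborted.
print('master:', hqe_rows('<master-db-host>', 231699034))
print('shard: ', hqe_rows('cros-full-0015', 231699034))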

So, something's really wrong.  But I can't explain what.

It's mine to fix, because I'm deputy, but I'm going to need help.
So, this has to wait until the morrow.

Labels: -Pri-1 Hotlist-Deputy Pri-0
The problem still happens, as of the elm-paladin run that started
at 02:03:
    https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/7055

I think this qualifies for P0:  We've mitigated the problem by
marking the builder experimental, but that means we have a significant
coverage gap that we need to close ASAP.  In any event, short of a global
outage, this is the top priority for the Lab Deputy.

The 231699034 job killed the scheduler because of a duplicate-key error while creating aborted HQE entries:

Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 193, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 493, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 391, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 306, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 964, in _send_to_lucifer
    self._send_starting_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 987, in _send_starting_to_lucifer
    job.hostqueueentry_set.all())
  File "/usr/local/autotest/frontend/afe/models.py", line 2036, in abort_host_queue_entries
    AbortedHostQueueEntry.objects.bulk_create(aborted_hqes)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 152, in bulk_create
    return self.get_query_set().bulk_create(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 441, in bulk_create
    self._batched_insert(objs_with_pk, fields, batch_size)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 902, in _batched_insert
    using=self.db)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 1661, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 122, in execute
    six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '232414284' for key 'PRIMARY'")
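The duplicate entry suggests an aborted-HQE row already exists for that queue entry, so abort_host_queue_entries() collides on the PRIMARY key every tick and kills the scheduler each time.  A hedged diagnostic sketch; the table and column names follow the stock autotest schema, and the DB host/credentials are placeholders.

import MySQLdb

HQE_ID = 232414284  # the colliding key from the traceback above

conn = MySQLdb.connect(host='<afe-db-host>', user='<db-user>',
                       passwd='<redacted>', db='chromeos_autotest_db')
cursor = conn.cursor()
cursor.execute(
    'SELECT queue_entry_id, aborted_by_id, aborted_on '
    'FROM afe_aborted_host_queue_entries WHERE queue_entry_id = %s',
    (HQE_ID,))
row = cursor.fetchone()
if row:
    # A pre-existing row means the scheduler is re-aborting an entry that is
    # already marked aborted, and bulk_create hits the PRIMARY key.
    print('already aborted:', row)
else:
    print('no existing aborted-HQE row; the collision must come from elsewhere')
conn.close()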
The provision jobs that were the original bug don't appear to ever have been sent to the shard.  I'm not the primary expert on the shard sync though so I don't have any clues other than the ongoing shard heartbeat issues.
> The provision jobs that were the original bug don't appear to
> ever have been sent to the shard.  I'm not the primary expert
> on the shard sync though so I don't have any clues other than
> the ongoing shard heartbeat issues.

To the best of my knowledge, there's no ongoing shard heartbeat issue.

The scheduler restart problem on cros-full-0015 is pretty
dramatic, though:  http://shortn/_ey4FErCX57

That graph seems to show that the problem has stopped, too.

> That graph seems to show that the problem has stopped, too.

Cross-checking the shard, the logs show that the scheduler was restarting
roughly once every minute and 8 seconds until 09:17:52 this morning.
The scheduler seems stable now.  I don't know if that means that the
shard is back to scheduling work, though.
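For the record, the restart cadence was pulled by eyeballing the scheduler logs; a small sketch of the same check, where the log path, the start-up line pattern, and the timestamp format are all assumptions about this deployment.

import datetime
import re

LOG = '/usr/local/autotest/logs/scheduler.latest'  # assumed path
# Assumed start-up marker; adjust to whatever line the scheduler logs on boot.
START_RE = re.compile(r'(\d{2}/\d{2} \d{2}:\d{2}:\d{2}).*monitor_db starting')

starts = []
with open(LOG) as f:
    for line in f:
        m = START_RE.search(line)
        if m:
            starts.append(
                datetime.datetime.strptime(m.group(1), '%m/%d %H:%M:%S'))

# Print the gap between consecutive start-ups; gaps of ~68 seconds match
# the "once every minute and 8 seconds" cadence noted above.
for prev, cur in zip(starts, starts[1:]):
    print(cur.strftime('%m/%d %H:%M:%S'), cur - prev)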

... and checking history, the scheduler restarts post-date the start of
this problem.  So, we can expect that this problem is still ongoing.

I've created two new jobs to see if we can find out more:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=232004976
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231998497

So far, both of them appear to be hung.
My claims to the contrary notwithstanding, multiple elm CQ DUTs are
now successfully doing work:

$ dut-status -d 2 -p cq -b elm
hostname                       S   last checked         URL
chromeos2-row7-rack11-host1    OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host1/1399451-provision/
chromeos2-row7-rack11-host11   OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host11/1399454-provision/
chromeos2-row7-rack10-host1    OK  2018-08-29 09:18:33  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host1/1399448-provision/
chromeos2-row7-rack10-host7    OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host7/1399453-provision/
chromeos2-row7-rack10-host11   OK  2018-08-29 09:18:37  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host11/1399478-provision/
chromeos2-row7-rack9-host5     OK  2018-08-29 09:18:33  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack9-host5/1399447-provision/
chromeos2-row7-rack7-host1     OK  2018-08-29 09:18:35  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack7-host1/1399450-provision/
chromeos2-row7-rack8-host13    OK  2018-08-29 09:25:29  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack8-host13/1399516-reset/
chromeos2-row7-rack11-host17   OK  2018-08-29 10:55:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host17/1400083-cleanup/
chromeos2-row7-rack10-host21   OK  2018-08-29 10:57:38  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack10-host21/1400097-cleanup/
chromeos2-row7-rack11-host19   ??  2018-08-28 18:15:31  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host19/1399373-verify/
chromeos2-row7-rack11-host13   ??  2018-08-28 18:15:31  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row7-rack11-host13/1399372-verify/

The problem children are the two hosts selected for job 231699034.
I suspect simply resetting their state back to Ready will help.
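For reference, "resetting their state back to Ready" would amount to something like the following.  This is a sketch only; directly editing afe_hosts.status is a last resort, the DB host and credentials are placeholders, and the host list is inferred from the '??' entries above.

import MySQLdb

# The two hosts showing '??' above, i.e. the ones picked for job 231699034.
STUCK_HOSTS = ('chromeos2-row7-rack11-host19', 'chromeos2-row7-rack11-host13')

conn = MySQLdb.connect(host='<master-db-host>', user='<db-user>',
                       passwd='<redacted>', db='chromeos_autotest_db')
cursor = conn.cursor()
cursor.execute(
    'UPDATE afe_hosts SET status = %s WHERE hostname IN (%s, %s)',
    ('Ready',) + STUCK_HOSTS)
conn.commit()
print('rows updated:', cursor.rowcount)
conn.close()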

Labels: -Pri-0 Pri-1
No longer urgent, but still worrisome.

I note an ostensibly similar failure on auron_paine-paladin:

This build:
    https://luci-milo.appspot.com/buildbot/chromeos/auron_paine-paladin/3765
This provision suite:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=231888034

There's some evidence that this problem could be related to bug 878403.

Owner: pprabhu@chromium.org
Status: Fixed (was: Assigned)
https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/ is looking good


close_wait -> close
