
Issue 891758

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 9
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocked on:
issue 893331
issue 893355




grunt devices not running jobs

Project Member Reported by briannorris@chromium.org, Oct 3

Issue description

The last two CQ runs have had one or more paladins fail HWTest due to "infrastructure issues." It looks to me like their suites get scheduled, but they just sit there for an hour or more and never run. Eventually, they get aborted.

See this CQ run, which failed on edgar and wolf:

https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3994
https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/18869

and this subsequent run which failed on edgar:

https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3995

For the last one:

Autotest instance created: cautotest-prod
10-03-2018 [05:15:32] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743628
The suite job has another 0:59:46.174032 till timeout.
10-03-2018 [06:00:33] printing summary of incomplete jobs (8):
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743639
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743640
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743641
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743642
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743643
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743644
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743645
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743646
The suite job has another 0:29:40.565268 till timeout.
The suite job has another -1 day, 23:59:36.955034 till timeout.
10-03-2018 [06:46:35] Suite job is finished.
Suite timed out. Started on 10-03-2018 [05:15:32], timed out on 10-03-2018 [06:46:35]
10-03-2018 [06:46:35] Start collecting test results and dump them to json.
Suite job   [ FAILED ]
Suite job     ABORT: 


It just looks like we scheduled the suite and... it timed out. No other useful logs that I can see.
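(For anyone puzzled by the "-1 day, 23:59:36" line above: that's just how Python prints a negative timedelta once the deadline has passed; the suite really did run out of time. A quick illustration:)

  from datetime import timedelta

  # Python normalizes timedeltas so that only the days field can be negative,
  # which is why a deadline that passed 24 seconds ago prints like the log above.
  print(timedelta(seconds=-24))  # -1 day, 23:59:36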
 
Owner: gu...@chromium.org
Status: Assigned (was: Untriaged)
From chat, it looks like guocb is aware?
Hmm, wolf might have been a little different. It managed to run many tests and then aborted later on. And per guocb, it shouldn't have been affected by the issues that affected edgar.
Hmm, I really don't know what to say about the wolf failures either. This is the suite result:

https://stainless.corp.google.com/browse/chromeos-autotest-results/244651137-chromeos-test/

I think it had this sub-job with an "abort" status:

http://cautotest/afe/#tab_id=view_job&object_id=244651182
http://cautotest/afe/#tab_id=view_host&object_id=2758

except I get this from the suite's suite_timing.log:

{"gs_url": "gs://chromeos-autotest-results/244651182-chromeos-test", "id": ["HWTest", 244651182], "job_id": 244651182, "name": "desktopui_KillRestart", "parent": ["Suite", 244651137], "shard": "cros-full-0008.mtv.corp.google.com", "start_time": null, "status": "pass", "try": 1}


which sounds like the test passed as well? But it has a bad 'start_time'.

Methinks something along the way screwed up in the database, but I really don't know much at all about the autotest / result database.
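If it helps, here is a rough sketch for spotting more of these inconsistent records. It assumes suite_timing.log holds one JSON object per line with the keys shown in the excerpt above (job_id, name, status, start_time); I haven't checked that against the actual writer, so treat it as a starting point only:

  import json

  def find_suspect_entries(path="suite_timing.log"):
      """Return (job_id, name) pairs whose status is 'pass' but start_time is null."""
      suspects = []
      with open(path) as f:
          for line in f:
              line = line.strip()
              if not line:
                  continue
              entry = json.loads(line)
              # A "pass" with no recorded start_time suggests the result DB never
              # actually saw the job run, which matches the wolf case above.
              if entry.get("status") == "pass" and entry.get("start_time") is None:
                  suspects.append((entry.get("job_id"), entry.get("name")))
      return suspects

  if __name__ == "__main__":
      for job_id, name in find_suspect_entries():
          print("suspect: job %s (%s) passed but has no start_time" % (job_id, name))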
Components: Infra>Client>ChromeOS>CI
I'm not sure which component this goes in.
Cc: agawronska@chromium.org
From guocb@ on-call IRC: "the server control board:enguarde, board:tricky, board:heli, board:edgar has problems since last evening"

edgar and tricky paladins are marked EXPERIMENTAL due to this bug, but passed the last few paladin runs. Is the issue resolved? I think PFQ folks were also having trouble with tricky.
To clarify: tricky-paladin was never red, and edgar-paladin has been green for the last two builds.
enguarde-release is the only one remaining red. tricky-, heli-, and edgar-release all became green in the last build.
Cc: sammiequon@chromium.org
Congbin, what is the status? Please close if fixed.
Another case: grunt-paladin, builds #2900 and #2901:

  Autotest instance created: cautotest-prod
  10-04-2018 [15:30:21] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373198
  @@@STEP_LINK@Link to suite@http://cautotest-prod/afe/#tab_id=view_job&object_id=245373198@@@
  The suite job has another 0:59:48.444652 till timeout.
  
  10-04-2018 [16:15:31] printing summary of incomplete jobs (13):
  
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373327
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373333
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373337
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373342
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373346
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373349
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373353
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373357
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373361
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373365
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373369
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373372
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373376
  The suite job has another 0:29:41.385010 till timeout.
  The suite job has another -1 day, 23:59:34.888606 till timeout.
  10-04-2018 [17:00:56] Suite job is finished.
  Suite timed out. Started on 10-04-2018 [15:30:21], timed out on 10-04-2018 [17:00:56]
  10-04-2018 [17:00:56] Start collecting test results and dump them to json.
  Suite job   [ FAILED ]
  Suite job     ABORT: 
enguarde-release is also green. Looking into the problems of grunt.
Cc: bhthompson@chromium.org
Bernie just made grunt 'important', but we still haven't managed to properly schedule jobs for it. I see the current paladin is about to time out with no progress on scheduling its provision jobs:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933480250330740016

http://cautotest-prod/afe/#tab_id=view_job&object_id=245733221

Moved back to EXPERIMENTAL in tree status
Owner: akes...@chromium.org
Cc: djkurtz@chromium.org bmgordon@chromium.org
Cc: akes...@chromium.org briannorris@chromium.org jclinton@chromium.org
Issue 893224 has been merged into this issue.
^ Coral is likely unrelated, see Issue 888628
Labels: -Pri-1 Pri-0
Summary: grunt paladins: HWTest suite timed out (was: multiple paladins: HWTest suite timed out)
Repurposing this bug for grunt.
Issue 890966 has been merged into this issue.
Cc: -briannorris@chromium.org
Cc: ecgh@chromium.org
grunt is served by shard cros-full-0039.mtv.corp.google.com 

Digging around on the shard now.
Cc: gkling@chromium.org
Owner: gkling@chromium.org
Summary: grunt devices have inconsistent "Locked" state | grunt HWTest timing out (was: grunt paladins: HWTest suite timed out)
I see an inconsistency between the shard's and the master's opinions about DUT lock state.

For instance, for host chromeos2-row3-rack9-host13 the master claims the device is unlocked and ready, whereas the shard claims it is locked by gkling@
http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_host&object_id=6635 for reason: "Locked for migration, pools will be shifted"

I see the same when spot checking a few other devices.

gkling, is this migration completed? What process did you use to lock or unlock these devices?
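A rough sketch of one way to diff lock state between the two AFEs, for reference. This is only a sketch: it assumes the autotest frontend.AFE client and the locked/lock_reason fields shown on the AFE host page, so the attribute names may need adjusting:

  from autotest_lib.server import frontend

  MASTER = 'cautotest-prod'
  SHARD = 'cros-full-0039.mtv.corp.google.com'

  def lock_state(server, hostname):
      """Ask one AFE what it thinks about a host's lock state."""
      afe = frontend.AFE(server=server)
      hosts = afe.get_hosts(hostname=hostname)
      if not hosts:
          return None
      host = hosts[0]
      # Attribute names assumed from the AFE UI; adjust if the RPC differs.
      return (getattr(host, 'locked', None), getattr(host, 'lock_reason', None))

  hostname = 'chromeos2-row3-rack9-host13'
  print('master:', lock_state(MASTER, hostname))
  print('shard: ', lock_state(SHARD, hostname))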
Blockedon: 893331
That lock message is from when I shifted the previously deployed DUT - a Samus - to chromeos6, so perhaps it is locked on the samus shard. Whatever the case, the migration was completed a long time ago:
b/110150820
Owner: gkling@google.com
The atest hosts should have been deleted before grunt was deployed in the same location, so you're correct that an error could have been made in this process. It is possible that grunt was deployed soon after the samus was deleted, or that it was deleted while it was locked. Is either of those scenarios likely to cause an issue like this?
Were the same host names reused for a grunt vs. samus device? This can potentially cause major confusion, because the host history across two unrelated devices is now combined.

In issue 893331 I managed to run sentinel once against this shard; spot-checking again to see if the locked/unlocked inconsistencies are resolved.
Cc: -agawronska@chromium.org
Devices are back to appearing unlocked on the shard. This *might* be resolved; let's see what happens with the next CQ round...
Owner: akes...@chromium.org
> Were the same host names reused for a grunt vs. samus device?

Yes. This type of scenario is very common for EVT/DVT devices. If it is a problem, we'll have to come up with a way to safely re-use hostnames or distinguish device histories.
The locked/unlocked issue is fixed, but another inconsistency is still affecting things. Investigating...
Blockedon: 893355
The other label problem is being tracked here: Issue 893355

A fix is now running, but will take several hours to complete.
Summary: grunt devices not running jobs (was: grunt devices have inconsistent "Locked" state | grunt HWTest timing out)
I'm still at a loss for why jobs aren't being scheduled on idle grunt devices on this shard.

For instance this job http://cautotest-prod/afe/#tab_id=view_job&object_id=246821395 was eligible to run on this host http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_host&object_id=1212

Maybe it is related to confusion over the platform label for the devices? I see only samus jobs in this host's history, even though its current platform is grunt.
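A related, unverified sketch for checking what the shard's AFE actually reports for one of these hosts (using the host from the lock-state example above as an illustration). It assumes get_hosts() returns the platform and label list shown on the AFE host page:

  from autotest_lib.server import frontend

  shard = frontend.AFE(server='cros-full-0039.mtv.corp.google.com')
  for host in shard.get_hosts(hostname='chromeos2-row3-rack9-host13'):
      # Field names assumed from the AFE host page; adjust if the RPC differs.
      print('platform:', getattr(host, 'platform', None))
      print('labels:  ', getattr(host, 'labels', []))
      # If the platform/board labels still say samus while the DUT is now a grunt,
      # the scheduler will never match grunt jobs (like the one above) to this host.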

Comment 34 Deleted

I've run a reverify against the grunts, because I noticed that dut-status was not displaying a known status for them (which may be why they weren't being scheduled for jobs?).
Similar failure on peach_pit-paladin (build 20290):

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?builderName=peach_pit-paladin&buildNumber=20290

  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914734
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914737
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914740
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914745
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914747
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914751
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914755
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914760
  dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914764
  The suite job has another 0:29:45.213771 till timeout.
  The suite job has another -1 day, 23:59:41.003572 till timeout.
  10-06-2018 [06:15:10] Suite job is finished.
  Suite timed out. Started on 10-06-2018 [04:43:50], timed out on 10-06-2018 [06:15:10]
  10-06-2018 [06:15:10] Start collecting test results and dump them to json.
  Suite job   [ FAILED ]
  Suite job     ABORT: 

^ That is unrelated; please confine this bug to grunt (or other boards that are not running any jobs whatsoever).
I'm flailing here and confused. I'm going to remove grunt from this shard, and then re-add it.
I clicked through all the jobs listed in comment 36 and nothing appeared to run, nor were there any logs for the object_id. Seemed to match the grunt failure IMO.
^ peach_pit-paladin is running jobs within the last day, e.g. http://cautotest-prod/afe/#tab_id=view_job&object_id=246863206

grunt-paladin is not.
By cloning a job via the AFE, I managed to create a paladin job that is actually running.

http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_job&object_id=246906558

This might mean that my action in #38 fixed the issue? Waiting until the current CQ run completes (or at least reaches HWTest) to find out.
Status: Fixed (was: Assigned)
grunt jobs are running. I don't understand what fixed this in particular, but I guess it was the shard wipe-and-re-add in #38. It may be related to the re-use of host names between two different board labels.
Issue 893252 has been merged into this issue.
