grunt devices not running jobs
Issue description

The last two CQ runs have had one or more paladins fail HWTest due to "infrastructure issues." It looks to me like their suites get scheduled, but then they just sit there for an hour or more and never run. Eventually they get aborted.

See this CQ run, which failed on edgar and wolf:
https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3994
https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/18869

and this subsequent run, which failed on edgar:
https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3995

For the last one:

Autotest instance created: cautotest-prod
10-03-2018 [05:15:32] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743628
The suite job has another 0:59:46.174032 till timeout.
10-03-2018 [06:00:33] printing summary of incomplete jobs (8):
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743639
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743640
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743641
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743642
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743643
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743644
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743645
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=244743646
The suite job has another 0:29:40.565268 till timeout.
The suite job has another -1 day, 23:59:36.955034 till timeout.
10-03-2018 [06:46:35] Suite job is finished.
Suite timed out. Started on 10-03-2018 [05:15:32], timed out on 10-03-2018 [06:46:35]
10-03-2018 [06:46:35] Start collecting test results and dump them to json.
Suite job [ FAILED ]
Suite job ABORT:

It just looks like we scheduled the suite and... it timed out. No other useful logs that I see.
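For what it's worth, the pattern in that log (the remaining-time line printed each cycle, then a negative delta, then an abort) is what a simple poll-until-deadline loop would produce. A minimal sketch of that behavior in Python, using a hypothetical suite handle (is_finished()/abort()) rather than the real run_suite internals:

    import datetime
    import time

    POLL_INTERVAL_S = 5 * 60  # hypothetical polling period

    def wait_for_suite(suite, timeout_mins):
        """Poll a suite job until it finishes or its deadline passes.

        `suite` is a hypothetical handle exposing is_finished() and abort();
        this illustrates the timeout behavior seen in the log above, not the
        actual run_suite implementation.
        """
        deadline = datetime.datetime.now() + datetime.timedelta(minutes=timeout_mins)
        while not suite.is_finished():
            remaining = deadline - datetime.datetime.now()
            # The log prints this delta each cycle; once it goes negative
            # (e.g. "-1 day, 23:59:36"), the suite gets aborted.
            print('The suite job has another %s till timeout.' % remaining)
            if remaining.total_seconds() <= 0:
                suite.abort()
                return 'ABORT'
            time.sleep(POLL_INTERVAL_S)
        return 'FINISHED'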
,
Oct 3
Hmm, wolf might have been a little different. It managed to run many tests and then aborted later on. And guocb@, it shouldn't have been affected by the issues that affected edgar.
,
Oct 3
Hmm, I really don't know what to say about the wolf failures either. This is the suite result:
https://stainless.corp.google.com/browse/chromeos-autotest-results/244651137-chromeos-test/

I think that had this sub-job with an "abort" status:
http://cautotest/afe/#tab_id=view_job&object_id=244651182
http://cautotest/afe/#tab_id=view_host&object_id=2758

except I get this from the suite's suite_timing.log:

{"gs_url": "gs://chromeos-autotest-results/244651182-chromeos-test", "id": ["HWTest", 244651182], "job_id": 244651182, "name": "desktopui_KillRestart", "parent": ["Suite", 244651137], "shard": "cros-full-0008.mtv.corp.google.com", "start_time": null, "status": "pass", "try": 1}

which sounds like the test passed as well? But it has a bad 'start_time'. Methinks something along the way screwed up in the database, but I really don't know much at all about the autotest / result database.
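As a quick check for more records like this one, the suite_timing.log entries can be scanned for the same inconsistency (a "pass" status with a null start_time). A rough sketch, assuming the file is one JSON object per line like the record above; the real format may differ:

    import json

    def find_suspect_entries(timing_log_path):
        """Yield entries that claim to have passed but have no start_time."""
        with open(timing_log_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                if entry.get('status') == 'pass' and entry.get('start_time') is None:
                    yield entry

    for e in find_suspect_entries('suite_timing.log'):
        print('%s (job %s): pass but start_time is null'
              % (e.get('name'), e.get('job_id')))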
,
Oct 4
I'm not sure which component this goes in.
,
Oct 4
From guocb@ on the on-call IRC: "the server control board:enguarde, board:tricky, board:heli, board:edgar has problems since last evening"

The edgar and tricky paladins are marked EXPERIMENTAL due to this bug, but passed the last few paladin runs. Is the issue resolved? I think the PFQ folks were also having trouble with tricky.
,
Oct 4
To clarify: tricky-paladin was never red, and edgar-paladin has been green for the last two builds. enguarde-release is the only one still red; tricky-, heli-, and edgar-release all became green in the last build.
,
Oct 4
,
Oct 4
Congbin, what is the status? Please close if fixed.
,
Oct 5
Another case: grunt-paladin, builds #2900 and #2901:

Autotest instance created: cautotest-prod
10-04-2018 [15:30:21] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373198
@@@STEP_LINK@Link to suite@http://cautotest-prod/afe/#tab_id=view_job&object_id=245373198@@@
The suite job has another 0:59:48.444652 till timeout.
10-04-2018 [16:15:31] printing summary of incomplete jobs (13):
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373327
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373333
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373337
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373342
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373346
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373349
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373353
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373357
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373361
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373365
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373369
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373372
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245373376
The suite job has another 0:29:41.385010 till timeout.
The suite job has another -1 day, 23:59:34.888606 till timeout.
10-04-2018 [17:00:56] Suite job is finished.
Suite timed out. Started on 10-04-2018 [15:30:21], timed out on 10-04-2018 [17:00:56]
10-04-2018 [17:00:56] Start collecting test results and dump them to json.
Suite job [ FAILED ]
Suite job ABORT:
,
Oct 5
enguarde-release is also green. Looking into the problems with grunt.
,
Oct 5
Bernie just made grunt 'important', but we still haven't managed to properly schedule jobs for it. I see the current paladin is about to time out with no progress on scheduling its provision jobs:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933480250330740016
http://cautotest-prod/afe/#tab_id=view_job&object_id=245733221

Moved back to EXPERIMENTAL in the tree status.
,
Oct 6
,
Oct 8
,
Oct 8
Issue 893224 has been merged into this issue.
,
Oct 8
,
Oct 8
^ Coral is likely unrelated; see Issue 888628.
,
Oct 8
Repurposing this bug for grunt.
,
Oct 8
pool:cq for grunt claims to be completely idle, so something fishy is likely going on in the scheduler or the shard:
https://viceroy.corp.google.com/chromeos/dut_utilization?board=grunt&model=&pool=managed%3Acq&status=Running&is_locked=False&topstreams=5&duration=1d&mdb_role=chrome-infra&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2
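For anyone reproducing that check without the dashboard, the same "everything idle while jobs queue up" picture can be pulled from the AFE directly. A sketch using a hypothetical AfeClient wrapper (module, method, and field names are assumptions; the real autotest RPC interface may differ):

    # Hypothetical AFE RPC wrapper; the real autotest frontend API may differ.
    from rpc_client import AfeClient

    afe = AfeClient('cautotest-prod')

    # Hosts in the CQ pool for grunt, with their current scheduler status.
    for host in afe.get_hosts(labels=['board:grunt', 'pool:cq']):
        # If every host reports Ready/idle while provision jobs are queued,
        # the problem is upstream of the DUTs (scheduler or shard).
        print(host['hostname'], host['status'],
              'locked' if host['locked'] else 'unlocked')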
,
Oct 8
Issue 890966 has been merged into this issue.
,
Oct 8
,
Oct 8
,
Oct 8
grunt is served by shard cros-full-0039.mtv.corp.google.com. Digging around on the shard now.
,
Oct 8
I see an inconsistency between the shard's and the master's opinion of DUT lock state. For instance, for host chromeos2-row3-rack9-host13 the master claims the device is unlocked and ready, whereas the shard claims it is locked by gkling@:
http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_host&object_id=6635
with the reason "Locked for migration, pools will be shifted".

I see the same when spot-checking a few other devices. gkling@: is this migration complete? What process did you use to lock or unlock these devices?
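To enumerate every host the master and the shard disagree about, rather than spot-checking by hand, something like the following could work (same hypothetical AfeClient wrapper as above; the field names are assumptions):

    # Hypothetical wrapper around the AFE RPC interface; names are illustrative.
    from rpc_client import AfeClient

    master = AfeClient('cautotest-prod')
    shard = AfeClient('cros-full-0039.mtv.corp.google.com')

    master_hosts = {h['hostname']: h for h in master.get_hosts(labels=['board:grunt'])}
    shard_hosts = {h['hostname']: h for h in shard.get_hosts(labels=['board:grunt'])}

    for name in sorted(set(master_hosts) & set(shard_hosts)):
        m, s = master_hosts[name], shard_hosts[name]
        if m['locked'] != s['locked']:
            # e.g. master: unlocked/Ready, shard: locked by gkling@ for migration
            print('%s: master locked=%s, shard locked=%s (%s)'
                  % (name, m['locked'], s['locked'], s.get('lock_reason', '')))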
,
Oct 8
,
Oct 8
That lock message is from when I shifted the previously deployed DUT (a Samus) to chromeos6, so perhaps it is locked on the samus shard. Whatever the case, the migration was completed a long time ago: b/110150820
,
Oct 8
The atest hosts should have been deleted before grunt was deployed in the same location, so you're correct that an error could have been made in this process. It is possible that grunt was deployed soon after the samus was deleted, or that the samus was deleted while it was locked. Is either of those scenarios likely to cause an issue like this?
,
Oct 8
Were the same host names reused for the grunt vs. the samus device? This can potentially cause major confusion, because the host history of two unrelated devices is now combined. In Issue 893331 I managed to run sentinel once against this shard; spot-checking again to see if the locked/unlocked inconsistencies are resolved.
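If hostname reuse is suspected elsewhere, mixed-board history should be easy to spot by listing the board labels that appear in a host's job history. A sketch, again with the hypothetical AfeClient wrapper plus a made-up get_host_history() helper:

    # Hypothetical AfeClient wrapper and get_host_history() helper; illustrative only.
    from rpc_client import AfeClient

    shard = AfeClient('cros-full-0039.mtv.corp.google.com')
    hostname = 'chromeos2-row3-rack9-host13'

    boards = set()
    for entry in shard.get_host_history(hostname):
        for label in entry.get('labels', []):
            if label.startswith('board:'):
                boards.add(label)

    if len(boards) > 1:
        # e.g. {'board:samus', 'board:grunt'} -- two unrelated devices sharing one name
        print('%s has mixed board history: %s' % (hostname, sorted(boards)))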
,
Oct 8
,
Oct 8
Devices are back to appearing unlocked on the shard. This *might* be resolved; let's see what happens with the next CQ round...
,
Oct 8
> Were the same host names reused for a grunt vs. samus device?

Yes. This type of scenario is very common for EVT/DVT devices. If it is a problem, we'll have to come up with a way to safely re-use hostnames or to distinguish device histories.
,
Oct 9
The locked/unlocked issue is fixed, but another inconsistency is still affecting things. Investigating...
,
Oct 9
The other label problem is being tracked in Issue 893355. A fix is now running, but it will take several hours to complete.
,
Oct 9
I'm still at a loss as to why jobs aren't being scheduled on idle grunt devices on this shard. For instance, this job:
http://cautotest-prod/afe/#tab_id=view_job&object_id=246821395
was eligible to run on this host:
http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_host&object_id=1212

Maybe it is related to confusion over the platform label for the devices? I see only samus jobs in this host's history, even though its current platform is grunt.
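One way to rule the label theory in or out would be to diff the job's dependency labels against the host's labels. A sketch with the same hypothetical client (the job and host IDs are the ones linked above; the accessor and field names are assumptions):

    # Hypothetical AfeClient wrapper; illustrative only.
    from rpc_client import AfeClient

    shard = AfeClient('cros-full-0039.mtv.corp.google.com')

    job = shard.get_job(246821395)   # the eligible-but-never-scheduled job
    host = shard.get_host(1212)      # the idle grunt host it should match

    job_deps = set(job['dependencies'])   # e.g. {'board:grunt', 'pool:cq', ...}
    host_labels = set(host['labels'])

    missing = job_deps - host_labels
    if missing:
        print('Host is missing labels the job depends on:', sorted(missing))
    else:
        print('Labels match; the scheduler should consider this host eligible.')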
,
Oct 9
I've run a reverify against the grunts, because I noticed that dut-status was not displaying a known status for them (which may be why they weren't being scheduled for jobs?).
,
Oct 9
Similar failure on peach_pit-paladin (20290):
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?builderName=peach_pit-paladin&buildNumber=20290

dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914734
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914737
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914740
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914745
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914747
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914751
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914755
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914760
dummy_Pass: http://cautotest-prod/afe/#tab_id=view_job&object_id=245914764
The suite job has another 0:29:45.213771 till timeout.
The suite job has another -1 day, 23:59:41.003572 till timeout.
10-06-2018 [06:15:10] Suite job is finished.
Suite timed out. Started on 10-06-2018 [04:43:50], timed out on 10-06-2018 [06:15:10]
10-06-2018 [06:15:10] Start collecting test results and dump them to json.
Suite job [ FAILED ]
Suite job ABORT:
,
Oct 9
^ That is unrelated; please confine this bug to grunt (or other boards that are not running any jobs whatsoever).
,
Oct 9
I'm flailing here and confused. I'm going to remove grunt from this shard and then re-add it.
,
Oct 9
I clicked through all the jobs listed in comment 36; nothing appeared to run, nor were there any logs for the object_id. Seemed to match the grunt failure, IMO.
,
Oct 9
^ peach_pit-paladin has run jobs within the last day, e.g.
http://cautotest-prod/afe/#tab_id=view_job&object_id=246863206
grunt-paladin has not.
,
Oct 9
By cloning a job via the AFE, I managed to create a paladin job that is actually running:
http://cros-full-0039.mtv.corp.google.com/afe/#tab_id=view_job&object_id=246906558

This might mean that my action in #38 fixed the issue? Waiting until the current CQ run completes (or at least reaches HWTest) to find out.
,
Oct 9
grunt jobs are running. I don't understand exactly what fixed this, but I guess it was the shard wipe-and-re-add in #38. It may be related to the re-use of host names between two different board labels.
,
Oct 9
Issue 893252 has been merged into this issue.
Comment 1 by briannorris@chromium.org, Oct 3
Status: Assigned (was: Untriaged)