push-to-prod failing consistently (6/12 version)
Issue description
Example job http://chromeos-autotest.hot.corp.google.com/afe/#tab_id=view_job&object_id=3244 was supposed to run on the shard. Poking around in shard logs, I see that 3244 is logged in the shard_client and scheduler logs, but not in any host-scheduler logs. I suspect there's a bug in ToT that is preventing host-scheduler from matching metahosts to hosts. https://chromium-review.googlesource.com/c/521687/ is my closest suspect: a CL that is in ToT but has not yet passed the pre-prod test. I can't explain why it would break things, but it does touch the right kind of code to do so.
Jun 12 2017
Thanks. One last musing before I drop this -- test_push failures started happening consistently around midday 06/08, which is before that CL landed, so maybe something else is responsible: http://shortn/_1YQHi3OJjS
Jun 12 2017
I cloned the failing job on the staging master and it was correctly matched to a host and executed on the shard with ToT autotest code. That contradicts the claim in the OP.
Jun 12 2017
The unmatched job was a red herring. In that suite, earlier jobs consumed all available DUTs, which is why it was never matched. Those earlier jobs were aborted because provisioning failed, e.g.: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row1-rack2-host17/1909-provision The real problem lies there.
Jun 12 2017
Turns out the primary deputy needs a relatively urgent prod push, which means he needs to get involved in fixing the push-to-prod test. My current understanding is that these failures are likely devserver related, but I'm not sure (yet).
Jun 12 2017
Focusing on job 3240, host chromeos2-row1-rack2-host17: the provision for 3240 failed, failing the job. Note, however, that a later provision for job 3253, created by me this morning, did pass. The logs from the provision for 3240 seem truncated though :( https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row1-rack2-host17/1909-provision/20171206051952/ There are no client logs, and the autoserv logs just stop at some point. The same devserver (100.115.245.253) was used in both the good and the bad provision. So... not the devserver either, then?
Jun 12 2017
That points to the autoserv process for the provision job dying mid-flight. Sure enough, in the drone_utility logs on the shard, I see:
06/12 05:20:54.972 INFO | drone_utility:0177| List of process to be killed: [<autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1d850><localhost/22812>, <autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1ddd0><localhost/22813>]
06/12 05:20:55.093 DEBUG| utils:2104| Sending signal 9 to the following pids:
06/12 05:20:55.094 DEBUG| utils:2107| Pid -22812
06/12 05:20:55.094 DEBUG| utils:2107| Pid -22813
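(Aside on reading that log: the negative pids mean the SIGKILL is delivered to whole process groups, so autoserv and everything it spawned die together. A minimal sketch of the equivalent operation, not autotest's actual code, with 22812 taken from the log above:

import os
import signal

def kill_process_group(leader_pid):
    # Passing a negative pid to kill(2) targets the whole process group,
    # which is why the log shows "Pid -22812": the autoserv process and
    # all of its children receive SIGKILL together.
    os.kill(-leader_pid, signal.SIGKILL)

kill_process_group(22812)  # pid taken from the log excerpt above
)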
Jun 12 2017
And that's because the scheduler decided to abort the provision job:
06/12 05:19:54.608 INFO | scheduler_models:0656| HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Queued [active] -> Provisioning
06/12 05:19:54.644 INFO | rdb_hosts:0219| Host chromeos2-row1-rack2-host17 in Ready updating {'status': 'Provisioning'} through rdb on behalf of: Task: Special Task 1909 (host chromeos2-row1-rack2-host17, task Provision, time 2017-06-12 05:19:52)HQE: 3271, for job: 3240
06/12 05:20:53.344 INFO | monitor_db:0936| Aborting HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Provisioning [active,aborted]
06/12 05:20:53.416 INFO | scheduler_models:0656| HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Provisioning [active,aborted] -> Aborted
Aborted within a minute of being scheduled...? What gives?
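One quick way to check whether the abort came down from the master is to look at the aborted bit on the HQEs directly. A hedged sketch against the usual AFE schema (connection parameters are placeholders; the afe_host_queue_entries table and columns are the standard ones):

import MySQLdb  # assuming the MySQL bindings normally used with the AFE database

conn = MySQLdb.connect(host='localhost', user='<afe-user>',
                       passwd='<afe-password>', db='<afe-db>')
cursor = conn.cursor()
cursor.execute(
    'SELECT id, job_id, status, aborted, complete '
    'FROM afe_host_queue_entries WHERE job_id IN (3239, 3240)')
for row in cursor.fetchall():
    print(row)  # (id, job_id, status, aborted, complete)

Running that against the master and the shard would show where the aborted bit was set first.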
Jun 12 2017
No magic here: the job was aborted from the master. Master monitor_db logs show it waiting for the HQE to be aborted on the shard:
chromeos-test@chromeos-autotest:/usr/local/autotest/logs$ grep 3240 scheduler.log.2017-06-12-05.15.35 | head
06/12 05:20:49.336 INFO | monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]
06/12 05:20:50.956 INFO | monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]
06/12 05:20:52.426 INFO | monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]
... and that's because the parent job was aborted on the master around that time (soon after it created its child jobs):
chromeos-test@chromeos-autotest:/usr/local/autotest/logs$ grep ' 3239\| 3270' scheduler.log.2017-06-12-05.15.35
06/12 05:18:46.895 INFO | scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Queued -> Starting
06/12 05:18:47.541 INFO | scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Starting [active] -> Running
06/12 05:20:48.360 WARNI| scheduler_models:0216| initialized <class 'autotest_lib.scheduler.scheduler_models.HostQueueEntry'> 3270 instance requery is updating: {'aborted': (0, 1)}
06/12 05:20:49.129 INFO | monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Running [active,aborted]
06/12 05:20:49.131 INFO | scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Running [active,aborted] -> Parsing
06/12 05:20:50.955 INFO | monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:52.425 INFO | monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:53.460 INFO | monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:53.469 INFO | scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted] -> Aborted
Now, why was the suite job aborted? My guess is that test_push had already failed for some other reason, so it decided to bail on this suite.
Takeaway from this bug: Figuring out why a certain job was aborted is extremely painful.
We don't leave enough breadcrumbs to easily figure this out.
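For example (purely illustrative, not existing autotest code), recording who requested an abort and why at the moment the aborted bit is set would be one such breadcrumb:

import logging

def mark_hqe_aborted(hqe, requested_by, reason):
    # Hypothetical helper: hqe, requested_by and reason are illustrative
    # parameters. Today the scheduler only logs the resulting status
    # transition, so tracing an abort back to its origin means grepping
    # several logs across the master and the shard.
    logging.info('Aborting HQE %s (job %s): requested by %s, reason: %s',
                 hqe.id, hqe.job_id, requested_by, reason)
    hqe.aborted = True
    hqe.save()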
The real reason for the push-to-prod failure is being tracked elsewhere.
Comment 1 by pprabhu@chromium.org, Jun 12 2017. Status: Started (was: Untriaged)