
Issue 732441

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




push-to-prod failing consistently (6/12 version)

Project Member Reported by akes...@chromium.org, Jun 12 2017

Issue description

Example job http://chromeos-autotest.hot.corp.google.com/afe/#tab_id=view_job&object_id=3244 was supposed to run on the shard.

Poking around in shard logs, I see that 3244 is logged in shard_client and scheduler logs, but not in any host-scheduler logs. I suspect there's a bug in ToT that is preventing host-scheduler from matching metahosts to hosts.

https://chromium-review.googlesource.com/c/521687/ is my closest suspect: a CL that is in ToT but has not yet passed the pre-prod test. I can't explain why this would break things, but it does touch the right kind of code to do so.
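For reference, here is a minimal sketch of the log check described above (looking for the job id in the shard_client, scheduler, and host-scheduler logs). The log directory and file-name globs are assumptions based on a stock /usr/local/autotest/logs layout, not taken from this shard.

import glob

JOB_ID = '3244'
# Assumed log locations on the shard; adjust to the real layout.
LOG_GLOBS = [
    '/usr/local/autotest/logs/scheduler.log.*',
    '/usr/local/autotest/logs/host_scheduler.log.*',
    '/usr/local/autotest/logs/shard_client.log.*',
]

for pattern in LOG_GLOBS:
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            hits = sum(1 for line in f if JOB_ID in line)
        print('%s: %d mention(s) of job %s' % (path, hits, JOB_ID))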
 
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
Taking this up in my secondary deputy sweep. It's the secondary's duty to keep push-to-prod test green (even when not pushing) so that we can push when needed.
Thanks. One last musing before I drop this -- test_push failures started happening consistently around midday 06/08, which is before that CL landed, so something else may be responsible: http://shortn/_1YQHi3OJjS
I cloned the failing job on the staging master and it was correctly matched to a host and executed on the shard on ToT autotest code.
This belies the claim in the OP.

The unmatched job was a red herring: earlier jobs in that suite had consumed all available DUTs, which is why it was never matched.

The earlier jobs were aborted because provision failed.
e.g.: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row1-rack2-host17/1909-provision

The real problem lies there.
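To spell out the matching point: a metahost HQE only gets a host when a DUT with the required labels is still in Ready, so once the earlier provision jobs claim every eligible DUT, the entry just sits in Queued. A simplified, illustrative sketch of that behavior (names and labels are made up for the example; this is not the real host-scheduler code):

import collections

Host = collections.namedtuple('Host', ['name', 'status', 'labels'])
HQE = collections.namedtuple('HQE', ['job_id', 'required_labels'])

def find_host_for_metahost(hqe, hosts):
    """Return the first Ready host satisfying the HQE's labels, or None."""
    for host in hosts:
        if host.status == 'Ready' and hqe.required_labels <= host.labels:
            return host
    return None  # all eligible DUTs consumed -> the HQE stays Queued

# Earlier provision jobs already moved the only eligible DUT out of Ready,
# so the metahost entry never gets matched.
hosts = [Host('chromeos2-row1-rack2-host17', 'Provisioning', {'board:example'})]
print(find_host_for_metahost(HQE(3244, {'board:example'}), hosts))  # -> None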
Owner: jrbarnette@chromium.org
Summary: push-to-prod failing consistently (6/12 version) (was: push-to-prod failing consistently due to timeout of jobs on its shard | suspect failure in host-scheduler)
It turns out the primary deputy needs a relatively urgent prod push, which means he needs to get involved in fixing the push-to-prod test.



My current understanding is that these failures are likely devserver related, though I'm not sure yet.
Focusing on job 3240.
host: chromeos2-row1-rack2-host17

The provision for 3240 failed, failing the job.
Note, however, that a subsequent provision for job 3253, which I created this morning, did pass.


The logs from the provision for 3240 seem truncated, though :(
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row1-rack2-host17/1909-provision/20171206051952/

There are no client logs, and the autoserv logs just stop at some point.


The same devserver (100.115.245.253) was used in both the good and bad provisions.
So... not the devserver either, then?
That points to the autoserv process for the provision job dying mid-flight.

Sure enough, in drone_utility logs on the shard, I see:

06/12 05:20:54.972 INFO |     drone_utility:0177| List of process to be killed: [<autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1d850><localhost/22812>, <autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1ddd0><localhost/22813>]
06/12 05:20:54.972 INFO |     drone_utility:0177| List of process to be killed: [<autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1d850><localhost/22812>, <autotest_lib.scheduler.drone_manager.Process object at 0x7f2f55b1ddd0><localhost/22813>]
06/12 05:20:55.093 DEBUG|             utils:2104| Sending signal 9 to the following pids:
06/12 05:20:55.094 DEBUG|             utils:2107| Pid -22812
06/12 05:20:55.094 DEBUG|             utils:2107| Pid -22813
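One detail worth noting in that excerpt: the pids are negative, which means SIGKILL was sent to the whole process group, so autoserv and anything it had spawned died together. A minimal, self-contained illustration of that OS-level pattern (this is not the actual autotest_lib utils code):

import os
import signal
import subprocess
import time

# Start a child in its own session/process group, the way autoserv's
# children end up grouped under it (illustrative only).
child = subprocess.Popen(['sleep', '60'], preexec_fn=os.setsid)
time.sleep(0.1)

pgid = os.getpgid(child.pid)
os.kill(-pgid, signal.SIGKILL)   # a negative pid signals the whole group
child.wait()
print('child exited with signal %d' % -child.returncode)   # 9 == SIGKILL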
Owner: pprabhu@chromium.org
And that's because the scheduler decided to abort the provision job:

06/12 05:19:54.608 INFO |  scheduler_models:0656| HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Queued [active] -> Provisioning
06/12 05:19:54.644 INFO |         rdb_hosts:0219| Host chromeos2-row1-rack2-host17 in Ready updating {'status': 'Provisioning'} through rdb on behalf of: Task: Special Task 1909 (host chromeos2-row1-rack2-host17, task Provision, time 2017-06-12 05:19:52)HQE: 3271, for job: 3240 
06/12 05:20:53.344 INFO |        monitor_db:0936| Aborting HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Provisioning [active,aborted]
06/12 05:20:53.416 INFO |  scheduler_models:0656| HQE: 3271, for job: 3240 and host: chromeos2-row1-rack2-host17 has status:Provisioning [active,aborted] -> Aborted

Within a minute of being scheduled...? What gives?
Status: WontFix (was: Started)
No magic here: the job was aborted from the master. The master's monitor_db logs show it waiting for the HQE to be aborted on the shard:

chromeos-test@chromeos-autotest:/usr/local/autotest/logs$ grep 3240 scheduler.log.2017-06-12-05.15.35 | head
06/12 05:20:49.336 INFO |        monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]
06/12 05:20:50.956 INFO |        monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]
06/12 05:20:52.426 INFO |        monitor_db:0933| Waiting for shard 2 to abort hqe HQE: 3271, for job: 3240 and host: no host has status:Queued [aborted]

... and that's because the parent job was aborted on the master around that time (soon after it created its child jobs):

chromeos-test@chromeos-autotest:/usr/local/autotest/logs$ grep ' 3239\| 3270' scheduler.log.2017-06-12-05.15.35
06/12 05:18:46.895 INFO |  scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Queued -> Starting
06/12 05:18:47.541 INFO |  scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Starting [active] -> Running
06/12 05:20:48.360 WARNI|  scheduler_models:0216| initialized <class 'autotest_lib.scheduler.scheduler_models.HostQueueEntry'> 3270 instance requery is updating: {'aborted': (0, 1)}
06/12 05:20:49.129 INFO |        monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Running [active,aborted]
06/12 05:20:49.131 INFO |  scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Running [active,aborted] -> Parsing
06/12 05:20:50.955 INFO |        monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:52.425 INFO |        monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:53.460 INFO |        monitor_db:0936| Aborting HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted]
06/12 05:20:53.469 INFO |  scheduler_models:0656| HQE: 3270, for job: 3239 and host: no host has status:Parsing [aborted] -> Aborted
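For context on the "Waiting for shard 2 to abort hqe" lines: the abort does not originate on the shard. The master sets the aborted bit on the HQE, the shard_client heartbeat syncs it over, and the master's tick keeps polling until the shard reports the entry finished. A rough, illustrative sketch of that wait loop (class and method names are made up for the example, not monitor_db's real API):

import time

class FakeShard(object):
    """Stand-in for the shard: reports the HQE complete after a few polls."""
    def __init__(self, shard_id):
        self.id = shard_id
        self._polls = 0
    def hqe_is_complete(self, hqe_id):
        self._polls += 1
        return self._polls >= 3

class FakeHQE(object):
    def __init__(self, hqe_id):
        self.id = hqe_id
        self.aborted = False

def wait_for_shard_abort(hqe, shard, poll_interval_s=0.1, timeout_s=30):
    hqe.aborted = True                      # bit the shard_client heartbeat syncs
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if shard.hqe_is_complete(hqe.id):   # shard reported Aborted back
            return True
        print('Waiting for shard %s to abort hqe %s' % (shard.id, hqe.id))
        time.sleep(poll_interval_s)
    return False

print(wait_for_shard_abort(FakeHQE(3271), FakeShard(2)))  # -> True after a few polls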



Now, why was the suite job aborted? My guess would be that test_push had already failed for some other reason, so it decided to bail on this suite.

Takeaway from this bug: figuring out why a given job was aborted is extremely painful.
We don't leave enough breadcrumbs to make this easy to work out.
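One possible breadcrumb, sketched below: record who requested the abort and why, alongside the aborted bit, so a single log grep answers "why" as well as "when". This is illustrative only; abort_reason and aborted_by are hypothetical fields, not existing autotest schema.

import logging

def abort_hqe(hqe, reason, requested_by):
    # Illustrative only; not autotest's scheduler_models API.
    hqe.aborted = True
    hqe.abort_reason = reason          # e.g. 'parent suite job 3239 aborted'
    hqe.aborted_by = requested_by      # e.g. 'test_push on the master'
    logging.info('Aborting HQE %s (job %s): reason=%r requested_by=%r',
                 hqe.id, hqe.job_id, reason, requested_by)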

The real reason for the push-to-prod failure is being tracked elsewhere.
