
Issue 756187 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




monitor_db recovery: Delay aborting timed out jobs on monitor_db start

Project Member Reported by pprabhu@chromium.org, Aug 16 2017

Issue description

Consider the following (very common) scenario:
- monitor_db has some issues, crash-loops
- we fix it and things are back to normal in ~1.5 hours
- monitor_db starts.

Now, there were a bunch of jobs that were running when monitor_db started crashing. For the ones in Running state, the actual test was running and continued to run during monitor_db downtime, and finished successfully.

When monitor_db restarts, in the first tick:
- it recovers all the HQEs in Running (including any autoserv pidfiles associated with them)
- these jobs go from Running --> Parsing (for the ones that finished)
- it aborts all timed-out jobs
  - this aborts all the jobs that just transitioned to Parsing.

In the next tick:
- These jobs go from Parsing --> Aborted.

But nothing was wrong with these jobs!
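
The sequence above can be reproduced with a toy model. This is only a sketch to illustrate the ordering problem, not monitor_db's actual code: the Job class and all function names here are hypothetical, and only the job ID is taken from the example below.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a job and its HQE."""
    job_id: int
    status: str              # 'Running', 'Parsing' or 'Aborted'
    autoserv_finished: bool  # autoserv exited while monitor_db was down
    timed_out: bool          # job is past its timeout at restart
    aborted: bool = False    # abort flag, acted on in a later tick


def recover_running(jobs):
    """Running jobs whose autoserv already finished move on to Parsing."""
    for job in jobs:
        if job.status == 'Running' and job.autoserv_finished:
            job.status = 'Parsing'


def abort_timed_out(jobs):
    """Flag every timed-out job that is still active, Parsing included."""
    for job in jobs:
        if job.timed_out and job.status in ('Running', 'Parsing'):
            job.aborted = True


def first_tick(jobs):
    recover_running(jobs)
    abort_timed_out(jobs)


def next_tick(jobs):
    """Flagged jobs go to Aborted."""
    for job in jobs:
        if job.aborted:
            job.status = 'Aborted'


# A suite job that finished fine during the downtime, but is past its timeout.
job = Job(135522439, 'Running', autoserv_finished=True, timed_out=True)
first_tick([job])   # Running -> Parsing, abort flag set
next_tick([job])    # Parsing -> Aborted
print(job.status)
```

The job ends up Aborted even though autoserv_finished was True the whole time; the only thing that went wrong was the wall-clock time lost to the downtime.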


Example: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15754

Three builders failed due to this exact scenario:
All three suite jobs actually succeeded (i.e., all their children had finished, and the autoserv for the suite job itself had also finished):

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=135522439
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=135522654
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=135522450

Looking at monitor_db's logs after the recovery confirms the theory:
chromeos-test@chromeos-server2:/usr/local/autotest/logs$ grep 135522439 scheduler.log.2017-08-16-12.51.51
08/16 12:53:54.197 INFO |     drone_manager:0785| monitoring pidfile /usr/local/autotest/results/135522439-chromeos-test/hostless/.autoserv_execute
08/16 12:53:58.159 INFO |        agent_task:0503| Recovering process cros-autotest-shard5.hot.corp.google.com/2996 for HostlessQueueTask at 135522439-chromeos-test/hostless
08/16 12:54:25.128 INFO |  scheduler_models:0634| HQE: 135873302, for job: 135522439 and host: no host has status:Running [active] -> Parsing
08/16 12:55:37.514 INFO |     drone_manager:0785| monitoring pidfile /usr/local/autotest/results/135522439-chromeos-test/hostless/.parser_execute
08/16 12:55:37.515 INFO |     drone_manager:0759| command = ['nice', '-n', '10', '/usr/local/autotest/tko/parse', '--write-pidfile', '--record-duration', '--suite-report', '-l', '2', '-r', '-o', u'/usr/local/autotest/results/135522439-chromeos-test/hostless']
08/16 12:55:37.515 INFO |     drone_manager:0760| log file = cros-autotest-shard5.hot.corp.google.com:/usr/local/autotest/results/135522439-chromeos-test/hostless/.parse.log
08/16 12:58:43.775 WARNI|monitor_db_cleanup:0089| Aborting job 135522439 due to job timeout
08/16 12:59:22.034 INFO |        monitor_db:0941| Aborting HQE: 135873302, for job: 135522439 and host: no host has status:Parsing [aborted]
08/16 12:59:30.897 INFO |  scheduler_models:0634| HQE: 135873302, for job: 135522439 and host: no host has status:Parsing [aborted] -> Aborted
08/16 12:59:30.902 INFO |     drone_manager:0797| forgetting pidfile /usr/local/autotest/results/135522439-chromeos-test/hostless/.autoserv_execute
08/16 12:59:30.902 INFO |     drone_manager:0797| forgetting pidfile /usr/local/autotest/results/135522439-chromeos-test/hostless/.parser_execute
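
To pull just the status transitions for a job out of a log like this, something along these lines should work. It is a rough sketch: the regular expressions are inferred from the excerpt above rather than from the scheduler's logging configuration, and status_transitions is a hypothetical helper, not an autotest tool.

```python
import re

# Line layout inferred from the excerpt: "MM/DD HH:MM:SS.mmm LEVEL|module:line| message"
LINE_RE = re.compile(
    r'^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) '
    r'(?P<level>\w+)\s*\|\s*(?P<module>\w+):(?P<lineno>\d+)\| '
    r'(?P<msg>.*)$')

# Matches messages such as "... for job: N and host: ... has status:Old [flag] -> New"
TRANSITION_RE = re.compile(
    r'for job: (?P<job>\d+) .* has status:(?P<old>\w+) \[(?P<flag>\w+)\] '
    r'-> (?P<new>\w+)')


def status_transitions(lines, job_id):
    """Return (timestamp, old_status, new_status) tuples for one job."""
    result = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # not a scheduler log line in the assumed layout
        transition = TRANSITION_RE.search(parsed.group('msg'))
        if transition and int(transition.group('job')) == job_id:
            result.append((parsed.group('ts'),
                           transition.group('old'),
                           transition.group('new')))
    return result
```

Fed the lines above for job 135522439, this yields the Running -> Parsing transition at 12:54:25 and the Parsing -> Aborted transition at 12:59:30, with the timeout abort warning logged in between.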

===================================


This could have been avoided if we had refrained from aborting any jobs for the first ~5 minutes after starting monitor_db. This is pretty easy to implement.
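
A minimal sketch of what such a startup grace period could look like. The class and constant names are hypothetical and this is not the real monitor_db_cleanup code; it only shows the guard around the timeout-abort step.

```python
import time

# Hypothetical setting: skip timeout aborts for this long after startup.
STARTUP_ABORT_GRACE_SECONDS = 5 * 60


class TimeoutAborter:
    """Decides which timed-out jobs may be aborted on a given tick."""

    def __init__(self, grace_seconds=STARTUP_ABORT_GRACE_SECONDS,
                 clock=time.monotonic):
        # The clock is injectable so the guard can be tested without sleeping.
        self._clock = clock
        self._grace_seconds = grace_seconds
        self._started_at = clock()

    def in_grace_period(self):
        return self._clock() - self._started_at < self._grace_seconds

    def jobs_to_abort(self, timed_out_job_ids):
        """Return the job IDs to abort now; none while in the grace period."""
        if self.in_grace_period():
            return []
        return list(timed_out_job_ids)
```

During the grace period, recovered jobs that had already finished get a chance to go through Parsing and complete. Jobs that are genuinely stuck are still timed out once the period is over, just a few minutes later than before.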
 
Cc: nxia@chromium.org

Comment 2 by nxia@chromium.org, Aug 16 2017

Labels: Chase-Pending
Labels: -Chase-Pending
dshi@ thinks that we should just let them abort because they did time out. :)

Is this really a bug?
(we're definitely not convinced this is chase)
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Please justify this bug (yes, that's me).
Status: WontFix (was: Assigned)
No one liked this idea.
