New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 730070 link

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

scheduler ts_mon process died or was killed, and did not come back

Project Member Reported by akes...@chromium.org, Jun 6 2017

Issue description

Follow-up to  Issue 673639 . Prior to that fix, this would have caused the scheduler to hang. Instead, the scheduler kept looping (good thing) but metrics stopped working from it (not so good thing).

Not clear whether 

From a cautotest/ scheduler log...

06/06 03:09:09.188 INFO |        thread_lib:0071| (Worker.chromeos-server7.mtv.corp.google.com) finished.
06/06 03:09:16.992 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Gauge'
06/06 03:09:16.994 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Gauge'
06/06 03:09:16.995 DEBUG|        monitor_db:1086| Starting _schedule_special_tasks
06/06 03:09:17.131 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'SecondsDistribution'
06/06 03:09:17.131 DEBUG|        monitor_db:1086| Starting _schedule_new_jobs
06/06 03:09:17.379 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter'
06/06 03:09:17.380 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter'
06/06 03:09:17.381 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter'
06/06 03:09:17.382 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter'
[snip]
06/06 10:19:00.724 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'PercentageDistribution'
06/06 10:19:00.724 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Float'
06/06 10:19:00.725 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'PercentageDistribution'
06/06 10:19:00.725 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'CumulativeMetric'
06/06 10:19:00.726 WARNI|           metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'SecondsDistribution'
 
Sorry, lost my train of though. Not clear whether flushing process crashed of its own accord, or was killed by something. This incident was at the same time as one of the drones filling up on disk space, but I think that was a coincidence.
Oh cool! So phobbs@'s recent CL about parent process not dying when the metrics pipe blocks up works. The unintended consequence is that we may not notice when the metrics process has died.

But, I like it this way. Kudos.
The logs don't indicate how the flushing process crashed, although the exit code of -9 indicates it was SIGKILLed (https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Process.exitcode). 

Comment 4 by aut...@google.com, Jun 7 2017

Labels: Hotlist-Fixit
Doesn't happen often, possible fixit candidate. Phobbs@ has already done some mitigation work. 

Possible next steps: move metrics sending process to upstart and allow it to be independently restarted. Could also add more logging on why the process crashed. 
Status: Archived (was: Available)

Sign in to add a comment