scheduler ts_mon process died or was killed, and did not come back |
|||
Issue descriptionFollow-up to Issue 673639 . Prior to that fix, this would have caused the scheduler to hang. Instead, the scheduler kept looping (good thing) but metrics stopped working from it (not so good thing). Not clear whether From a cautotest/ scheduler log... 06/06 03:09:09.188 INFO | thread_lib:0071| (Worker.chromeos-server7.mtv.corp.google.com) finished. 06/06 03:09:16.992 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Gauge' 06/06 03:09:16.994 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Gauge' 06/06 03:09:16.995 DEBUG| monitor_db:1086| Starting _schedule_special_tasks 06/06 03:09:17.131 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'SecondsDistribution' 06/06 03:09:17.131 DEBUG| monitor_db:1086| Starting _schedule_new_jobs 06/06 03:09:17.379 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter' 06/06 03:09:17.380 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter' 06/06 03:09:17.381 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter' 06/06 03:09:17.382 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Counter' [snip] 06/06 10:19:00.724 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'PercentageDistribution' 06/06 10:19:00.724 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'Float' 06/06 10:19:00.725 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'PercentageDistribution' 06/06 10:19:00.725 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'CumulativeMetric' 06/06 10:19:00.726 WARNI| metrics:0090| Flushing process has been closed (exit code -9), skipped sending metric 'SecondsDistribution'
,
Jun 6 2017
Oh cool! So phobbs@'s recent CL about parent process not dying when the metrics pipe blocks up works. The unintended consequence is that we may not notice when the metrics process has died. But, I like it this way. Kudos.
,
Jun 6 2017
The logs don't indicate how the flushing process crashed, although the exit code of -9 indicates it was SIGKILLed (https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Process.exitcode).
,
Jun 7 2017
Doesn't happen often, possible fixit candidate. Phobbs@ has already done some mitigation work. Possible next steps: move metrics sending process to upstart and allow it to be independently restarted. Could also add more logging on why the process crashed.
,
Mar 31 2018
|
|||
►
Sign in to add a comment |
|||
Comment 1 by akes...@chromium.org
, Jun 6 2017