ts_mon_config's 60 second wait to flush is too long for autoserv
Issue description

My large batch of metrics CLs has been timing out in the VMTest stage. I think the CLs that are actually stuck are:
https://chromium-review.googlesource.com/#/c/414604/
https://chromium-review.googlesource.com/#/c/414446/

The problem is that they report a metric close to the end of an autoserv process. When we request the metrics flushing process to quit, chromite's ts_mon_config._ConsumeMessages calls _WaitToFlush before exiting. This introduces a delay of anywhere between 0 and 60 seconds before the process really exits.

I compared a successful VMTest run from a couple of days ago with my failing runs, and the only difference is the roughly 1 minute added for every test run (each of which launches a new autoserv process). These delays add up until VMTest times out:

3 11/30 0: 0:20 INFO | test_runner_utils:0198| autoserv| Running (ssh) 'rm -rf "/tmp/autoserv-6DjPwH"'
2 11/30 0: 0:20 INFO | test_runner_utils:0198| autoserv| Nuking master_ssh_job.
1 11/30 0: 0:21 INFO | test_runner_utils:0198| autoserv| Cleaning master_ssh_tempdir.
0 11/30 0: 0:21 INFO | test_runner_utils:0198| autoserv| Waiting for ts_mon flushing process to finish...
1 11/30 0: 1:14 INFO | test_runner_utils:0198| autoserv| record_state_duration failed: job_or_task_id=None, hostname=127.0.0.1:9228, status=Running

Notice the 53 second jump waiting for ts_mon to finish?

This delay has not yet been seen in the lab because we're not sending any metrics late enough in autoserv's lifecycle to still be pending when the process quits.

At least some of my metrics CLs are blocked on this bug.
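To make the "anywhere between 0 and 60 seconds" range concrete, here is a minimal sketch of the arithmetic. It assumes the wait lasts until a full flush interval has passed since the previous flush. The helper name seconds_until_flush, the FLUSH_INTERVAL constant, and the sample timestamps are all hypothetical and are not taken from chromite.

```python
FLUSH_INTERVAL = 60  # seconds; taken from the issue title, not read from chromite


def seconds_until_flush(last_flush, now, interval=FLUSH_INTERVAL):
    """Return how long a process would sleep before its final flush.

    Hypothetical helper: models a wait that lasts until a full interval
    has elapsed since the previous flush, and never goes negative.
    """
    return max(0.0, interval - (now - last_flush))


# Illustrative values only: a quit request that arrives 7 seconds after the
# previous flush would hold up process exit for the remaining 53 seconds.
print(seconds_until_flush(last_flush=0.0, now=7.0))
```

Under this model the delay is largest when the quit request arrives right after a flush, and zero once a full interval has already elapsed.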
Dec 2 2016
Go for it (if it's easy to do), I suspect the impacts of what you propose are benign.
Dec 6 2016
Dec 7 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/04e59d08c79c73ca15997383552314c32f95a86b

commit 04e59d08c79c73ca15997383552314c32f95a86b
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Dec 02 19:42:39 2016

metrics: Don't wait before flushing last metrics.

This CL removes a wait before the metrics handling process flushes it's last metrics. This means that the last metrics may be flushed at an interval smaller than usual. This can impact the rate computation of metrics on the backend. But, this wait needs to go because it was causing delays in process exits leading to test failures. (The extra sleep of ~ 60 seconds is too much at the end of a task).

BUG= chromium:670548
TEST=unittests. No live testing.

Change-Id: I9272b6574080c3013c46c5ac53b57fe950ecced0
Reviewed-on: https://chromium-review.googlesource.com/416308
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/04e59d08c79c73ca15997383552314c32f95a86b/lib/ts_mon_config.py
[modify] https://crrev.com/04e59d08c79c73ca15997383552314c32f95a86b/lib/ts_mon_config_unittest.py
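The behaviour change described in the commit message can be sketched roughly as follows. This is not the code from lib/ts_mon_config.py: consume_messages, the None sentinel, the wait_on_exit switch, and the injected flush, handle, clock, and sleep callables are all assumptions made so the example runs on its own.

```python
import queue
import time

FLUSH_INTERVAL = 60  # seconds; assumed from the issue title
_SENTINEL = None     # hypothetical "please quit" marker


def consume_messages(message_q, flush, handle,
                     clock=time.time, sleep=time.sleep, wait_on_exit=False):
    """Drain message_q, flushing periodically and once more on shutdown.

    With wait_on_exit=True the final flush is delayed until a full
    interval has passed (the behaviour the issue complains about).
    With wait_on_exit=False the final flush happens immediately.
    """
    last_flush = clock()
    while True:
        remaining = max(0.0, FLUSH_INTERVAL - (clock() - last_flush))
        try:
            message = message_q.get(timeout=remaining)
        except queue.Empty:
            # A full interval passed with no messages: periodic flush.
            flush()
            last_flush = clock()
            continue
        if message is _SENTINEL:
            if wait_on_exit:
                sleep(max(0.0, FLUSH_INTERVAL - (clock() - last_flush)))
            flush()  # final flush, possibly sooner than one interval
            return
        handle(message)
```

The trade-off is the one the commit message names: flushing early keeps process exit fast, at the cost of a final flush interval shorter than usual.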
Dec 7 2016
Will be verified if my blocked CLs pass ;)
Mar 4 2017
Apr 17 2017
May 30 2017
Aug 1 2017
Oct 14 2017
Comment 1 by pprabhu@chromium.org, Dec 2 2016
Status: Started (was: Available)