New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 670548 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

ts_mon_config's 60 second wait to flush is too long for autoserv

Project Member Reported by pprabhu@chromium.org, Dec 2 2016

Issue description

My large batch of metrics CLs has been timing out in VMTest stage.

I think the CLs that are actually stuck are:
https://chromium-review.googlesource.com/#/c/414604/
https://chromium-review.googlesource.com/#/c/414446/

The problem is that they report a metric close to the finish of an autoserv process.
When we request the metrics flushing process to quit, chromite's ts_mon_config._ConsumeMessages calls _WaitToFlush before exiting. This introduces anywhere between 0 and 60 seconds delay before the process really exits.

I compared a successful run of VMTest from a couple days ago with my failing runs, and the only difference is the 1 minute added for every test run (that launches a new autoserv process). This adds up to timeout the VMTest:

  3 11/30  0: 0:20 INFO | test_runner_utils:0198| autoserv| Running (ssh) 'rm -rf "/tmp/autoserv-6DjPwH"'
  2 11/30  0: 0:20 INFO | test_runner_utils:0198| autoserv| Nuking master_ssh_job.
  1 11/30  0: 0:21 INFO | test_runner_utils:0198| autoserv| Cleaning master_ssh_tempdir.
  0 11/30  0: 0:21 INFO | test_runner_utils:0198| autoserv| Waiting for ts_mon flushing process to finish...
  1 11/30  0: 1:14 INFO | test_runner_utils:0198| autoserv| record_state_duration failed: job_or_task_id=None, hostname=127.0.0.1:9228, status=Running

Notice the 53 second jump waiting for ts_mon to finish?

This delay is not being seen already in the lab because we're not sending any metrics late enough in autoserv's lifecycle to still be pending when the process quits.

At least some of my metrics CLs are blocked on this bug.
 
Owner: pprabhu@chromium.org
Status: Started (was: Available)
Unblocking myself.


A simple fix is to not wait before the last flush when the flushing process exits.

This will bring our flushing process' behavior in line with ts_mon's own flushing thread implementation: https://cs.corp.google.com/chromeos_public/chromite/third_party/infra_libs/ts_mon/common/interface.py?l=174

A quick search of the history of chromite/lib/ts_mon_config.py doesn't tell me that we added the final delay to the flush in response to a bug.

I can see how it keeps things more "correct":
- We don't run the risk of sending a metric too soon and getting rejected somewhere in the pipeline
- We ensure that the last flush does not screw up rate calculations.

BUT,
- This is causing delays that are significant enough to cause failures
- This behaviour is different from using ts_mon's flushing (so rate calculations will differ slightly based on what flags mode we use ts_mon in).

The CL to remove this is trivial. Figuring out the impact is a bit harder.
Go for it (if it's easy to do), I suspect the impacts of what you propose are benign.
Labels: -current-issue
Project Member

Comment 4 by bugdroid1@chromium.org, Dec 7 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/04e59d08c79c73ca15997383552314c32f95a86b

commit 04e59d08c79c73ca15997383552314c32f95a86b
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Dec 02 19:42:39 2016

metrics: Don't wait before flushing last metrics.

This CL removes a wait before the metrics handling process flushes it's
last metrics. This means that the last metrics may be flushed at an
interval smaller than usual. This can impact the rate computation of
metrics on the backend.

But, this wait needs to go because it was causing delays in process
exits leading to test failures. (The extra sleep of ~ 60 seconds is too
much at the end of a task).

BUG= chromium:670548 
TEST=unittests. No live testing.

Change-Id: I9272b6574080c3013c46c5ac53b57fe950ecced0
Reviewed-on: https://chromium-review.googlesource.com/416308
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/04e59d08c79c73ca15997383552314c32f95a86b/lib/ts_mon_config.py
[modify] https://crrev.com/04e59d08c79c73ca15997383552314c32f95a86b/lib/ts_mon_config_unittest.py

Status: Fixed (was: Started)
Will be verified if my blocked CLs pass ;)

Comment 6 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 7 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 8 by dchan@google.com, May 30 2017

Labels: VerifyIn-60

Comment 9 by dchan@chromium.org, Aug 1 2017

Labels: VerifyIn-61

Comment 10 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)

Sign in to add a comment