
Issue 639499

Starred by 3 users

Issue metadata

Status: Duplicate
Owner: ----
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug




Monitoring pushes have no timeout, take minutes, stall CQ

Reported by martiniss@chromium.org (Project Member), Aug 19 2016

Issue description

The monitoring code in the CQ has no timeout when it pushes to PubSub. This causes log entries like this:

[pid:21454 tid:139782236428032 infra_libs.ts_mon.common.interface:189] Last monitoring flush took 833.833315 seconds (longer than --ts-mon-flush-interval-secs = 60.000000 seconds)


This hangs the CQ for 13 minutes, which is pretty bad. Not P0 bad, but still pretty bad.
 
It's not supposed to hang the CQ - ts_mon flushes in a separate thread. It's indeed pretty bad that it takes so long (no metrics, and alerts will probably go nuts), but it shouldn't impact the actual CQ.

PubSub connections do have a timeout of 10 sec: https://cs.chromium.org/chromium/infra/packages/infra_libs/infra_libs/ts_mon/common/monitors.py?sq=package:chromium&dr&l=173

Such a long flush usually means CQ is sending a gazillion metrics - which it shouldn't.
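For context, the flush path under discussion looks roughly like the sketch below. This is a minimal illustration, not the actual infra_libs code: the function names are made up, and only the constants (the 60-second flush interval and the 10-second PubSub connection timeout) come from the log above and the linked monitors.py. The point is that the timeout bounds only the HTTP call, not the rest of the flush.

import logging
import threading
import time

import httplib2  # the HTTP layer googleapiclient sits on; shown here only for the timeout

FLUSH_INTERVAL_SEC = 60   # the --ts-mon-flush-interval-secs value from the log above
REQUEST_TIMEOUT_SEC = 10  # the per-connection PubSub timeout mentioned above


def _flush_once(metrics):
  """Sends one batch of metrics; only the HTTP call is bounded by the timeout."""
  logging.info('ts_mon: sending %d metrics to PubSub', len(metrics))
  http = httplib2.Http(timeout=REQUEST_TIMEOUT_SEC)
  # ... the real code would build the PubSub publish request and execute it with
  # `http`; everything before that point (serialization, request construction,
  # logging) has no bound at all.


def _flush_loop(get_metrics, stop_event):
  """Background flusher: runs every FLUSH_INTERVAL_SEC, warns when a flush runs long."""
  while not stop_event.is_set():
    started = time.time()
    _flush_once(get_metrics())
    elapsed = time.time() - started
    if elapsed > FLUSH_INTERVAL_SEC:
      logging.warning('Last monitoring flush took %f seconds '
                      '(longer than --ts-mon-flush-interval-secs = %f seconds)',
                      elapsed, float(FLUSH_INTERVAL_SEC))
    stop_event.wait(max(0, FLUSH_INTERVAL_SEC - elapsed))


# The flusher runs on its own daemon thread, which is why a slow flush should not
# block the main CQ loop directly.
stop = threading.Event()
flusher = threading.Thread(target=_flush_loop, args=(lambda: [], stop))
flusher.daemon = True
flusher.start()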
Components: Infra>Monitoring
From a more detailed log in http://b/30969702 it seems the CQ is not sending too many metrics, so it's odd that the flush took so long:

08:35:30.416 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.monitors:188] ts_mon: sending 54 metrics to PubSub
08:49:24.125 [pid:21454 tid:139782236428032 googleapiclient.discovery:817] URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
08:49:24.207 [pid:21454 tid:139782316652352 utils.rietveld:449] POSTing to /search?format=json&closed=1&commit=2&keys_only=True&limit=1000&order=__key__&project=bling&cursor=CjYSMGodc35nb29nbGUuY29tOmNocm9tZXJldmlld3MtaHJyDwsSBUlzc3VlGLa84OkBDBgAIAA=, args {'payload': None, 'timeout': 15}.
08:49:24.240 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.interface:189] Last monitoring flush took 833.833315 seconds (longer than --ts-mon-flush-interval-secs = 60.000000 seconds)
08:49:24.251 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.monitors:188] ts_mon: sending 55 metrics to PubSub


Note: from "ts_mon: sending 54 metrics" to actually posting a request took 14 minutes (!). So it's not hanging in PubSub. It's hanging in the flushing thread for some reason. This *might* be a ts_mon issue; I'm adding Infra>Monitoring to help investigate further.
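One way to narrow down where those 14 minutes go (a hypothetical debugging sketch, not existing ts_mon code; build_request and send_request are placeholders standing in for whatever monitors.py actually does) would be to time each step of the flush explicitly:

import logging
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name):
  """Logs how long one named step of the flush took."""
  started = time.time()
  try:
    yield
  finally:
    logging.info('ts_mon flush step %r took %.3f seconds',
                 name, time.time() - started)


def flush_with_timings(metrics, build_request, send_request):
  # If the delay shows up before 'post', the problem is in the flushing thread
  # itself (serialization, request construction, a blocked log call), not PubSub.
  with timed_step('serialize'):
    payload = list(metrics)            # placeholder for real serialization
  with timed_step('build_request'):
    request = build_request(payload)   # placeholder for the publish request setup
  with timed_step('post'):
    send_request(request)              # the part already covered by the 10 sec timeout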
Yep, I've actually been seeing these long flushes for quite some time, but indeed it doesn't (shouldn't?) hang the CQ except on restarts, and CQ processes would be killed after ~30 seconds anyway.

Stephen, where does your conclusion come from?
Related: issue 639890
I just looked at the log and saw that the time the monitoring flush took was basically how long the CQ was stalled for. I didn't dig into the system much and didn't know much about the threading.
I'm looking into it a bit more. Searching through the logs for long ts_mon flushes (BigQuery):

SELECT
  FORMAT_UTC_USEC(UTC_USEC_TO_DAY(metadata.timestamp)) as day, COUNT(*) as count
FROM (TABLE_DATE_RANGE(build_logs.commit_queue_log_, DATE_ADD(CURRENT_TIMESTAMP(), -200, 'DAY'), CURRENT_TIMESTAMP()))
WHERE
  textPayload CONTAINS "Last monitoring flush"
GROUP BY day
ORDER BY day DESC
LIMIT
  1000

Row  day                         count
1    2016-08-22 00:00:00.000000     80
2    2016-08-19 00:00:00.000000      7
3    2016-08-18 00:00:00.000000      2
4    2016-08-17 00:00:00.000000      1
5    2016-08-03 00:00:00.000000     29
6    2016-08-02 00:00:00.000000     20
7    2016-07-19 00:00:00.000000      4
8    2016-07-08 00:00:00.000000      1
9    2016-06-10 00:00:00.000000      1
10   2016-06-07 00:00:00.000000     67
11   2016-05-10 00:00:00.000000      1
12   2016-05-05 00:00:00.000000      1
13   2016-03-07 00:00:00.000000      1

This seems to indicate some external factor delaying ts_mon. I'm not yet sure what it is...
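If it helps narrow things down further, the reported durations (not just the counts) can be pulled out of exported log lines with a small post-processing snippet like the one below; the input file name is hypothetical, and the regexp is based on the log line quoted above.

import re

# Matches e.g. "Last monitoring flush took 833.833315 seconds ..."
FLUSH_RE = re.compile(r'Last monitoring flush took ([0-9.]+) seconds')


def flush_durations(lines):
  """Yields the flush duration in seconds from every matching log line."""
  for line in lines:
    match = FLUSH_RE.search(line)
    if match:
      yield float(match.group(1))


if __name__ == '__main__':
  # commit_queue_log.txt: a hypothetical text export of the BigQuery results above.
  with open('commit_queue_log.txt') as log_file:
    durations = sorted(flush_durations(log_file), reverse=True)
  print('worst flushes (seconds): %s' % durations[:10])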
Cc: vadimsh@chromium.org dsansome@chromium.org
Could this be related to yesterday's Cloud Logging outage? ts_mon logs a "flushing" line before actually flushing, so the flush could be blocked on the logging itself.
If so, then I guess it has been resolved, hasn't it?

+vadimsh@ +dsansome@
My armchair theory would be *yes* :-)
Cloudtail blocks only if it accumulates a large enough buffer and can't push it to Cloud Logging. That happens only if Cloud Logging consistently rejects pushes (e.g. when we are over quota or when the network is really bad). I believe yesterday was the first time we were over quota, so I'm not sure what was happening on Aug 19.
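To illustrate that theory (a contrived sketch, not the real cloudtail pipeline): if the log handler that ships lines to Cloud Logging blocks because its buffer is backed up, the flush thread stalls on the logging call that precedes the actual PubSub push, which would produce exactly the gap seen above between "sending 54 metrics" and the POST.

import logging
import time


class BlockingHandler(logging.Handler):
  """Stand-in for a log pipeline whose emit() blocks when its upstream backs up
  (e.g. a full buffer that cannot be pushed to Cloud Logging)."""

  def emit(self, record):
    time.sleep(5)  # simulate waiting on the backed-up log pipeline


def flush():
  # The "sending N metrics" line is logged *before* the network push, so if the
  # handler blocks here, the whole flush looks slow even though PubSub is fine.
  logging.info('ts_mon: sending 54 metrics to PubSub')
  # ... the actual PubSub publish would only start once the log call returns ...


root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(BlockingHandler())

started = time.time()
flush()
print('flush took %.1f seconds' % (time.time() - started))  # ~5 s, all spent in logging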
Mergedinto: 639890
Status: Duplicate (was: Available)
