Monitoring pushes have no timeout, take minutes, stall CQ
Issue description

The monitoring in the CQ has no timeouts when it pushes to pubsub. This causes events like this:

[pid:21454 tid:139782236428032 infra_libs.ts_mon.common.interface:189] Last monitoring flush took 833.833315 seconds (longer than --ts-mon-flush-interval-secs = 60.000000 seconds)

This hangs the CQ for 13 minutes, which is pretty bad. Not P0 bad, but still pretty bad.
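For illustration, a minimal Python sketch of the kind of timeout the title asks for. This is not the actual infra_libs.ts_mon code; it assumes the flush publishes through googleapiclient over httplib2 (as the log above suggests), and the 30-second value, the function names, and the auth handling are assumptions.

# Hypothetical sketch, not infra_libs.ts_mon itself: bound each PubSub publish
# with an HTTP-level timeout so a single flush can't hang for 800+ seconds.
import base64

import httplib2
from googleapiclient import discovery

PUBSUB_TIMEOUT_SECS = 30  # assumed value; anything far below 833 s would help

def build_pubsub_client():
    # httplib2.Http(timeout=...) sets a socket timeout on every request made
    # through this client. (Auth is omitted; in practice the Http object
    # would be wrapped by the service account credentials.)
    http = httplib2.Http(timeout=PUBSUB_TIMEOUT_SECS)
    return discovery.build('pubsub', 'v1', http=http)

def publish_metrics(client, serialized_payload,
                    topic='projects/chrome-infra-mon-pubsub/topics/monacq'):
    # Topic name taken from the URL in the log below; the payload is an
    # opaque serialized blob.
    body = {'messages': [{'data': base64.b64encode(serialized_payload)}]}
    return client.projects().topics().publish(topic=topic, body=body).execute()

Even with a timeout, a failed publish should probably be logged and dropped rather than retried inline, so the flush thread stays close to the 60-second interval.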
Comment 1 by sergeybe...@chromium.org, Aug 20 2016
From a more detailed log in http://b/30969702 it seems CQ is not sending too many metrics - it's odd that it took so long:

08:35:30.416 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.monitors:188] ts_mon: sending 54 metrics to PubSub
08:49:24.125 [pid:21454 tid:139782236428032 googleapiclient.discovery:817] URL being requested: POST https://pubsub.googleapis.com/v1/projects/chrome-infra-mon-pubsub/topics/monacq:publish?alt=json
08:49:24.207 [pid:21454 tid:139782316652352 utils.rietveld:449] POSTing to /search?format=json&closed=1&commit=2&keys_only=True&limit=1000&order=__key__&project=bling&cursor=CjYSMGodc35nb29nbGUuY29tOmNocm9tZXJldmlld3MtaHJyDwsSBUlzc3VlGLa84OkBDBgAIAA=, args {'payload': None, 'timeout': 15}.
08:49:24.240 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.interface:189] Last monitoring flush took 833.833315 seconds (longer than --ts-mon-flush-interval-secs = 60.000000 seconds)
08:49:24.251 [pid:21454 tid:139782236428032 infra_libs.ts_mon.common.monitors:188] ts_mon: sending 55 metrics to PubSub

Note: from "ts_mon: sending 54 metrics" to actually posting a request took 14 minutes (!). So it's not hanging in PubSub. It's hanging in the flushing thread for some reason. This *might* be a ts_mon issue; I'm adding Infra>Monitoring to help investigate further.
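As an illustration of how the "hanging in the flushing thread" theory could be confirmed (hypothetical code, not existing ts_mon internals): timing each stage of a flush separately would show whether the 14 minutes go into serializing metrics, waiting on a lock, or the HTTP call itself. timed_stage, serialize_metrics and publish are made-up names.

# Hypothetical instrumentation sketch; not part of ts_mon.
import contextlib
import logging
import time

@contextlib.contextmanager
def timed_stage(name):
    # Log how long one stage of the flush took, even if it raises.
    start = time.time()
    try:
        yield
    finally:
        logging.info('ts_mon flush stage %r took %.3f s',
                     name, time.time() - start)

def flush(serialize_metrics, publish):
    # serialize_metrics() and publish(payload) stand in for the real steps.
    with timed_stage('serialize'):
        payload = serialize_metrics()
    with timed_stage('publish'):
        publish(payload)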
Aug 22 2016
Yep, I saw these long flushes for quite some time actually, but indeed it doesn't (shouldn't?) hang the CQ except on restarts, and CQ processes would be killed after ~30 seconds anyway. Stephen, where does your conclusion come from?
Aug 22 2016
Related: issue 639890
Aug 22 2016
I just looked at the log, and saw that the time the monitoring flush took was basically how long the CQ was stalled for. Didn't dig into the system much, didn't know much about threads.
Aug 26 2016
I'm looking a bit more into it. Searching through the logs for long ts_mon flushes (BigQuery):

SELECT
  FORMAT_UTC_USEC(UTC_USEC_TO_DAY(metadata.timestamp)) as day,
  COUNT(*) as count
FROM (TABLE_DATE_RANGE(build_logs.commit_queue_log_,
                       DATE_ADD(CURRENT_TIMESTAMP(), -200, 'DAY'),
                       CURRENT_TIMESTAMP()))
WHERE textPayload CONTAINS "Last monitoring flush"
GROUP BY day
ORDER BY day DESC
LIMIT 1000

 1  2016-08-22 00:00:00.000000  80
 2  2016-08-19 00:00:00.000000   7
 3  2016-08-18 00:00:00.000000   2
 4  2016-08-17 00:00:00.000000   1
 5  2016-08-03 00:00:00.000000  29
 6  2016-08-02 00:00:00.000000  20
 7  2016-07-19 00:00:00.000000   4
 8  2016-07-08 00:00:00.000000   1
 9  2016-06-10 00:00:00.000000   1
10  2016-06-07 00:00:00.000000  67
11  2016-05-10 00:00:00.000000   1
12  2016-05-05 00:00:00.000000   1
13  2016-03-07 00:00:00.000000   1

This seems to indicate some external factor delaying ts_mon. I'm not yet sure what it is...
Aug 30 2016
Could this be related to yesterday's Cloud Logging outage, since ts_mon spits out a "flushing" line before actually flushing and hence could be blocked? If so, then I guess it has been resolved, hasn't it? +vadimsh@ +dsansome@
Aug 30 2016
My armchair theory would be *yes* :-)
Aug 30 2016
Cloudtail blocks only if it accumulates a large enough buffer and can't push it to Cloud Logging. That happens only if Cloud Logging consistently rejects pushes (e.g. when we are over quota or when the network is really bad). I believe yesterday was the first time we were over quota. So I'm not sure what was happening on Aug 19.
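To make the described behavior concrete, here is an illustrative model only, not Cloudtail's actual implementation: a bounded buffer whose producer blocks only once the buffer is full, which in turn only happens when the consumer can't get its pushes accepted. Buffer and batch sizes are made-up numbers.

# Toy model of the blocking behavior described above; not Cloudtail code.
import Queue  # Python 2 stdlib module; named 'queue' in Python 3

BUFFER_SIZE = 10000  # assumed capacity
BATCH_SIZE = 500     # assumed batch size

_buf = Queue.Queue(maxsize=BUFFER_SIZE)

def emit(line):
    # Producers normally return immediately; put() blocks only once the
    # buffer is full, i.e. only when the drain below has been unable to get
    # its pushes accepted by Cloud Logging for a long stretch.
    _buf.put(line, block=True)

def drain(push_batch):
    # Meant to run in a background thread; push_batch(lines) stands in for
    # the real Cloud Logging push, including its own retries.
    while True:
        batch = [_buf.get()]
        while len(batch) < BATCH_SIZE and not _buf.empty():
            batch.append(_buf.get_nowait())
        push_batch(batch)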
Sep 1 2016