
Issue 774172

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




LogDog propagation delay is high

Reported by chanli@chromium.org (Project Member), Oct 12 2017

Issue description

Try job dashboard: https://findit-for-me.appspot.com/waterfall/try-job-dashboard?start_date=2017-10-05&end_date=2017-10-13&category=waterfall

Several try jobs are classified as 'No result report was found', but the reports are actually there.

The exception is https://ci.chromium.org/buildbot/tryserver.chromium.linux/linux_chromium_chromeos_asan_variable/29519; at the time of filing this issue, its result report is still 'Missing'.

Either there's a bug on the Findit side in reading the report, or there is some delay in producing the report.

 
Attachment: report_missing.png (22.5 KB)

Comment 1 by chanli@chromium.org, Oct 12 2017

The report for https://ci.chromium.org/buildbot/tryserver.chromium.linux/linux_chromium_chromeos_asan_variable/29519 now shows up. It seems there was a delay in producing this report.

Comment 2 by chanli@chromium.org, Oct 12 2017

Owner: hinoka@chromium.org

Comment 3 by hinoka@chromium.org, Oct 13 2017

Components: -Tools>Test>FindIt Infra>Platform>LogDog
Labels: -Pri-3 Pri-2
Owner: ----
Status: Available (was: Assigned)
Summary: LogDog propagation delay is high (was: [Findit] Several try jobs completed but at Findit side it shows as 'No result report was found.')
We've received several user reports of this since Wednesday. Using this as the tracking bug.

Comment 4 by d...@chromium.org, Oct 13 2017

FWIW there have been some latency issues due to Collector fleet scaling parameters changing: https://screenshot.googleplex.com/8pH6ud4T7C6.png

These are believed to have been resolved by doubling the number of replicas: https://chrome-internal-review.googlesource.com/c/480272

Comment 5 by chanli@chromium.org, Oct 14 2017

Cc: st...@chromium.org lijeffrey@chromium.org robert...@chromium.org wylieb@chromium.org

Comment 6 by bugdroid1@chromium.org, Oct 15 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/901c7585fea5249ee144468065d36954437ec99c

commit 901c7585fea5249ee144468065d36954437ec99c
Author: Dan Jacques <dnj@google.com>
Date: Fri Oct 13 19:50:00 2017

Comment 7 by chanli@chromium.org, Oct 16 2017

We just saw the delay again, and it has prevented Findit from getting failure signals.

Comment 8 by d...@chromium.org, Oct 16 2017

The ingest graph shows at most a 3-minute delay early in the AM, which is unfortunately within Cloud Pub/Sub's permitted variance. Can you be more specific about when you observed this?

Comment 9 by chanli@chromium.org, Oct 16 2017

It's for https://findit-for-me.appspot.com/waterfall/failure?redirect=1&url=https://build.chromium.org/p/chromium/builders/Mac/builds/33761.

Findit tries to retrieve json.output[ninja_info] for the compile step first and falls back to stdout if that fails.

Based on the pipeline status, Findit failed to get both json.output[ninja_info] and stdout on the first try.
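
For illustration only, a minimal sketch of the fallback order described above; the fetch_log helper and its parameters are hypothetical stand-ins, not Findit's actual API:

  def get_compile_failure_log(fetch_log, build_url):
      # Try the structured ninja_info output for the compile step first.
      ninja_info = fetch_log(build_url, step='compile', log='json.output[ninja_info]')
      if ninja_info is not None:
          return ninja_info
      # Structured output unavailable (or not yet propagated); fall back to stdout.
      return fetch_log(build_url, step='compile', log='stdout')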



A sheriff also mentioned on IRC having trouble opening the logs for the same build failure.

Comment 11 by d...@chromium.org, Oct 16 2017

To make this easier to examine, please include the URL of the actual log that is failing (or failed).

How sensitive is your code to delay? LogDog has propagation delay; that is part of the system. Clients that need logs should build in a healthy window of time to fetch them, retrying until that window expires or there is a hard success/failure. If you are trying once and failing immediately, and your attempt falls within the log's normal publish window, this could just be standard LogDog ingest delay rather than a sign of a larger problem with the system.
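
To make the suggestion concrete, here is a minimal retry-within-a-window sketch; the fetch_log callable and the window/interval values are illustrative assumptions, not LogDog or Findit APIs:

  import time

  def fetch_with_window(fetch_log, log_url, window_seconds=300, interval_seconds=15):
      # Keep retrying until the log appears or the window expires.
      deadline = time.time() + window_seconds
      while True:
          result = fetch_log(log_url)  # returns the log, or None if not yet available
          if result is not None:
              return result
          if time.time() + interval_seconds > deadline:
              raise RuntimeError('log not available within %ds' % window_seconds)
          time.sleep(interval_seconds)  # wait out normal propagation delay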

Comment 13 by d...@chromium.org, Oct 16 2017

> Findit will retry up to 5 times so it should be OK at our side. 

That depends on how much you delay between retries.

> And how long should be a healthy window of waiting time for the log?

Generally a log will be available within a few seconds (~5), but an 'acceptable' delay can stretch to 3 minutes or so. We are, to a very real extent, at the mercy of Cloud Pub/Sub latency, and that has been known to spike once in a while.
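
As a back-of-the-envelope check, 5 retries with exponential backoff can comfortably cover that ~3-minute worst case; the 10-second base and 2x factor below are illustrative choices, not Findit's actual settings:

  delays = [10 * 2 ** attempt for attempt in range(5)]  # [10, 20, 40, 80, 160] seconds
  print(delays, sum(delays))  # total wait of 310s, above the ~180s worst case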

Comment 14 by bugdroid1@chromium.org, Oct 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/2cd8036762402fc5f40dcf22054da178caeb37a8

commit 2cd8036762402fc5f40dcf22054da178caeb37a8
Author: Daniel Jacques <dnj@google.com>
Date: Fri Oct 20 02:39:01 2017

Comment 15 by d...@chromium.org, Oct 20 2017

Owner: d...@chromium.org
Status: Fixed (was: Available)
To add some finality here, the added latency is suspected to have been caused by a change that switched RPC dispatch to batching. Originally, the change clustered RPCs into a single batch, but each subsequent RPC blocked pending the previous RPC's completion. Throughput wasn't dramatically worse, and CPU and memory consumption were accordingly better, but the net effect was lower throughput per individual node. The interim mitigation was to double the number of nodes to 16.

I landed this CL to make batching concurrent rather than serial. Each batch, once complete, is emitted as its own individual RPC, and future batches are independent of it:
https://chromium.googlesource.com/infra/luci/luci-go/+/a90d36d07f65098da34b600f6cfb8a5f05199ac5
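
A rough Python sketch of the serial-vs-concurrent dispatch difference described above (the real change is in the Go Collector code linked in the CL; send_rpc and the batches are hypothetical):

  from concurrent.futures import ThreadPoolExecutor

  def dispatch_serial(batches, send_rpc):
      # Old behavior: each batch's RPC blocks the next one.
      for batch in batches:
          send_rpc(batch)

  def dispatch_concurrent(batches, send_rpc, max_workers=8):
      # New behavior: each completed batch is emitted as its own RPC,
      # independent of the other batches.
      with ThreadPoolExecutor(max_workers=max_workers) as pool:
          futures = [pool.submit(send_rpc, batch) for batch in batches]
          for f in futures:
              f.result()  # surface any RPC errors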

A deployment with 8 nodes has been observed not to exhibit the latency issues, so I scaled the nodes back down to 8.
Comment 16 by chanli@chromium.org

We still see cases where Findit failed to get logs during an analysis, but later I can see the logs on the page and can also fetch them locally through the remote API.

Please see https://findit-for-me.appspot.com/waterfall/try-job-dashboard?start_date=2017-10-25&end_date=2017-11-04&category=both. All the try jobs with the error 'No result report was found.' are cases like the one described above.


Comment 17 by d...@chromium.org, Nov 3 2017

The consumption pipeline still seems to be OK. The highest latency spike in the last 7 days was ~6 minutes, which isn't great but wasn't sustained. It's possible some other component is the cause (e.g., the log-serving endpoint is laggy or not working), but propagation does not look obviously problematic from the throughput metric.
