LogDog propagation delay is high
Comment 1 by chanli@chromium.org, Oct 12 2017

Issue description:

Try job dashboard: https://findit-for-me.appspot.com/waterfall/try-job-dashboard?start_date=2017-10-05&end_date=2017-10-13&category=waterfall

Several try jobs are classified as 'No result report was found', but the reports are there, except for https://ci.chromium.org/buildbot/tryserver.chromium.linux/linux_chromium_chromeos_asan_variable/29519, whose result report is still 'Missing' at the time of filing this issue. Either there is a bug on the Findit side when reading the report, or there is some delay in producing the report.
Oct 13 2017
We've gotten several user reports of this since Wednesday. Using this as the tracking bug.
Oct 13 2017
FWIW, there have been some latency issues due to Collector fleet scaling parameters changing: https://screenshot.googleplex.com/8pH6ud4T7C6.png These are believed to have been resolved by doubling the number of replicas: https://chrome-internal-review.googlesource.com/c/480272
Oct 14 2017
Oct 15 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/901c7585fea5249ee144468065d36954437ec99c

commit 901c7585fea5249ee144468065d36954437ec99c
Author: Dan Jacques <dnj@google.com>
Date: Fri Oct 13 19:50:00 2017
Oct 16 2017
We just saw the delay again, and it has impacted Findit's ability to get failure signals.
Oct 16 2017
The ingest graph shows at most a 3 minute delay early in the AM, which is unfortunately within Cloud Pub/Sub's permitted variance. Can you be specific about when you observed this?
Oct 16 2017
It's for https://findit-for-me.appspot.com/waterfall/failure?redirect=1&url=https://build.chromium.org/p/chromium/builders/Mac/builds/33761. Findit tries to retrieve json.output[ninja_info] for the compile step first, and falls back to stdout if that fails. Based on the pipeline status, Findit failed to get both json.output[ninja_info] and stdout on the first try.
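Roughly, the lookup works like this (a minimal sketch only; fetch_log and the way the streams are named here are placeholders, not Findit's actual code):

def get_compile_step_log(fetch_log):
    """Fetch the compile step's log, preferring json.output[ninja_info].

    `fetch_log` is a hypothetical callable that returns the text of a named
    log stream, or None if the stream is not (yet) available.
    """
    # Prefer the structured ninja_info output emitted by the compile step.
    log = fetch_log('json.output[ninja_info]')
    if log is not None:
        return log
    # Fall back to the step's raw stdout if the structured log is missing.
    return fetch_log('stdout')

If both fetches return nothing on the first attempt, the analysis has no failure signal to work from.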
Oct 16 2017
A sheriff also mentioned on IRC having trouble opening the logs for the same build failure.
Oct 16 2017
To make this easier to examine, please include the URL of the actual log that is failing (or failed). How latency-sensitive is your code? LogDog has propagation delay; this is part of the system. Clients that need logs should build in a healthy window of time to fetch them and retry until that window expires or there is a hard success/failure. If you are trying once and immediately failing, and your attempt falls within the log's general publish window, this could just be standard LogDog ingest delay and not indicative of a larger problem with the system.
Oct 16 2017
The logs are https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.linux%2FAndroid_Arm64_Builder__dbg_%2F46528%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Flogs%2Fjson.output_ninja_info_%2F0 and https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.linux%2FAndroid_Arm64_Builder__dbg_%2F46528%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

These logs are available now. Findit will retry up to 5 times, so it should be OK on our side. How long should a healthy waiting window for a log be?
Oct 16 2017
> Findit will retry up to 5 times, so it should be OK on our side.

That depends on how long you delay between retries.

> How long should a healthy waiting window for a log be?

Generally a log will be available within a few seconds (~5), but an "acceptable" delay can be as much as 3 minutes or so. We are, to a very real extent, at the mercy of Cloud Pub/Sub latency, and that has been known to spike once in a while.
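In concrete terms, a client could poll with a deadline sized to the worst observed case. A minimal sketch, using the numbers above (5-second poll interval, ~3-minute window) and assuming a hypothetical fetch_log callable that returns None while the stream is still missing:

import time

def fetch_log_with_window(fetch_log, stream, window_sec=180, poll_sec=5):
    """Poll for a log stream until it appears or the window expires.

    The 180-second window and 5-second poll interval come from the numbers
    quoted above; tune them to your own tolerance.
    """
    deadline = time.time() + window_sec
    while True:
        log = fetch_log(stream)
        if log is not None:
            return log
        if time.time() >= deadline:
            return None  # Hard failure: window expired without the log appearing.
        time.sleep(poll_sec)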
Oct 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/2cd8036762402fc5f40dcf22054da178caeb37a8

commit 2cd8036762402fc5f40dcf22054da178caeb37a8
Author: Daniel Jacques <dnj@google.com>
Date: Fri Oct 20 02:39:01 2017
Oct 20 2017
To add some finality here: the added latency is suspected to have been caused by a change that moved RPC dispatch to batching. Originally, the change clustered RPCs into a single batch, but subsequent RPCs blocked pending each previous RPC's completion. Overall throughput wasn't dramatically worse, and CPU and memory consumption were accordingly better, but it resulted in a net loss of throughput per individual node. The interim mitigation was to double the number of nodes to 16.

I landed this CL to make batching concurrent rather than serial; each batch, once complete, is emitted as its own individual RPC, and future batches are independent of it: https://chromium.googlesource.com/infra/luci/luci-go/+/a90d36d07f65098da34b600f6cfb8a5f05199ac5

A deployment with 8 nodes has been observed not to exhibit the latency issues, so I scaled the nodes back down to 8.
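Conceptually, the change moves from serial to concurrent batch dispatch, along these lines (a sketch for illustration only; the real change lives in the Go Collector code linked above, and send_batch here stands in for the batched RPC):

from concurrent.futures import ThreadPoolExecutor

def dispatch_serial(batches, send_batch):
    # Old behavior: each batch RPC blocks the next one, so per-node
    # throughput is bounded by the round-trip latency of every RPC.
    for batch in batches:
        send_batch(batch)

def dispatch_concurrent(batches, send_batch, max_in_flight=8):
    # New behavior: each completed batch is emitted as its own RPC, and
    # later batches do not wait on earlier ones.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(send_batch, batch) for batch in batches]
        for future in futures:
            future.result()  # Surface any RPC errors.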
Nov 3 2017
We can still see cases where Findit failed to get the log while running analyses, but later I can see the logs on the page and can also fetch them locally through the remote API. Please see https://findit-for-me.appspot.com/waterfall/try-job-dashboard?start_date=2017-10-25&end_date=2017-11-04&category=both. All the try jobs with the error 'No result report was found.' match what I described above.
Nov 3 2017
The consumption pipeline still seems to be OK. The highest latency spike in the last 7 days was ~6 minutes, which isn't great but wasn't sustained. It's possible that some other component is the cause (e.g., the log-serving endpoint is laggy or not working), but propagation is not obviously problematic judging from the throughput metric.