LUCI: M71 Chrome PFQ: Misleading Completion Status
Issue description
The M71 Chrome PFQ is showing an odd status in LUCI.
The completed PFQ run is showing a purple 'Internal Error' ('Infra Failure' on the tab):
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932403704197749520
While the top level is showing an outright failure:
https://luci-scheduler.appspot.com/jobs/chromeos/release-R71-11151.B-samus-chrome-pre-flight-branch
These are hiding the fact that the PFQ completed successfully with a good uprev.
,
Oct 18
Hi all -- This looks like perhaps another edge case regarding the completion status. We've seen similar behavior in http://crbug/860508 (I believe this is the right bug), but it has been clean for most cases since.
Tasks show success: https://chrome-swarming.appspot.com/task?id=409fa6a06bfec410&refresh=10&show_raw=1&wide_logs=true
Results are not demonstrating any errors: https://chrome-isolated.appspot.com/browse?namespace=default-gzip&digest=67231b97b01232f76bd571ce17214c46e0d51f9a&as=build-run-result.json
Thanks, Mike
,
Oct 18
This happens because the build violated the buildbucket-level API limit on the size of the build (1MB). So even though the build succeeded on the bot, it wasn't accepted on the server.
We should have done a better job of surfacing that, though. The v2 API doesn't expose that yet. V1 exposes it in result_details_json:
https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8932403704197749520&_h=1&
but it is quite buried:
build-run-result.json returned by the swarming task is bad: >= 1 MB
This should be fixed in the v2 API (there is an infra_failure_reason field for that), and then Milo should switch to the v2 API. In the meantime, please use the v1 API if the v2 API doesn't explain the reason for an infra failure.
Also, please limit the number of step links. There are 12K+ links. Is this necessary?
Assigning to myself to implement infra_failure_reason.
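To make the size constraint above concrete, here is a minimal sketch of a pre-upload check on a result payload. It is not buildbucket's or cbuildbot's actual code: the names `LIMIT_BYTES`, `result_size_bytes`, and `exceeds_limit` are hypothetical, and reading "1 MB" as 1024 * 1024 bytes is an assumption. The `>=` comparison mirrors the "bad: >= 1 MB" message quoted above.

```python
import json

# Assumption: "1 MB" is taken to mean 1024 * 1024 bytes; the server's exact
# threshold is not stated in this thread.
LIMIT_BYTES = 1 * 1024 * 1024


def result_size_bytes(result):
    """Return the size of `result` once serialized as UTF-8 JSON."""
    return len(json.dumps(result).encode("utf-8"))


def exceeds_limit(result, limit=LIMIT_BYTES):
    """Return True if the serialized result is at or above the limit."""
    return result_size_bytes(result) >= limit


if __name__ == "__main__":
    # Hypothetical payload with 12,000 link-like entries of 100 characters each.
    payload = {"links": ["x" * 100] * 12000}
    print(result_size_bytes(payload), exceeds_limit(payload))
```

A check like this could run on the bot before the result is handed back, so an oversized payload is reported as such rather than surfacing later as an unexplained infra failure.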
,
Oct 18
Apologies if this is implied in #3, but what's the timing on v2? Will the v1 change keep the overall status from going red / purple? Really need that resolved for the TPM team building Chrome OS. Thx
,
Oct 18
The change to expose the infra failure reason in v2 won't help unblock TPMs. It will only enable faster diagnosis, but won't fix the problem. The root cause is that cbuildbot is dumping too many links. This needs to be fixed.
I didn't realize there was follow-up needed for this build. I've extracted the buildbucket feature request to a separate bug 896792. Please make the necessary changes in cbuildbot.
CCI caps the number of step lines they emit. They used to emit a step line for each failed test, so sometimes tens of thousands of lines.
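The capping approach described above could look roughly like the sketch below. This is not cbuildbot's or CCI's actual code: `cap_links`, the `(label, url)` tuple shape, and the default of 100 are all hypothetical choices made for illustration.

```python
def cap_links(links, max_links=100):
    """Keep at most `max_links` (label, url) pairs, plus one summary entry.

    The (label, url) tuple shape and the default cap are assumptions made
    for this sketch, not values taken from cbuildbot.
    """
    if len(links) <= max_links:
        return list(links)
    omitted = len(links) - max_links
    kept = list(links[:max_links])
    # The summary entry carries no URL; it only records how many were dropped.
    kept.append(("... and %d more" % omitted, None))
    return kept


if __name__ == "__main__":
    # Placeholder data: 12,000 links, roughly the count mentioned in the thread.
    links = [("CL %d" % i, "https://example.com/%d" % i) for i in range(12000)]
    capped = cap_links(links)
    print(len(capped), capped[-1][0])
```

The trade-off is that dropped links are no longer reachable from the build page, so the full list would need to live somewhere else, for example in a single linked log or artifact.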
,
Oct 18
Thoughts on the best owner per #5?
,
Oct 18
Most of those links are CLs, so I don't know how we can reduce them: they are required. Can we raise the limit from 1MB to 2MB? There's a discussion going on right now about moving to Milo's native support for showing changes, but Milo will need support for Repo manifests instead of DEPS/git submodules for that to work. At least, that's what I've gathered so far.
,
Oct 18
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/8f878b9effe685fce4edd27a516298e7e5ea849a

commit 8f878b9effe685fce4edd27a516298e7e5ea849a
Author: Nodir Turakulov <nodir@google.com>
Date: Thu Oct 18 21:03:11 2018

[buildbucket] Increase build-run-result.json limit

In buildbucket build-run-result.json exists only in memory during
finalizing of a build. Increase its size limit.
Separately, we still have a limit on the size of stored
BuildSteps.step_container.

Bug: 896405
Change-Id: Ib732fd705f50989b8593d5ccf904c4333ae337e7
Reviewed-on: https://chromium-review.googlesource.com/c/1289450
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18454}

[modify] https://crrev.com/8f878b9effe685fce4edd27a516298e7e5ea849a/appengine/cr-buildbucket/swarming/swarming.py
[modify] https://crrev.com/8f878b9effe685fce4edd27a516298e7e5ea849a/appengine/cr-buildbucket/swarming/test/swarming_test.py
,
Oct 18
The change in #c8 was deployed. Please schedule another build.
,
Oct 18
thank you.
,
Oct 18
I'll keep an eye on the current PFQ run (or after) and offer feedback tomorrow; thanks. Current: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932287752066711792
,
Oct 19
Hey, it worked! Yay! Thanks everyone! https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932287752066711792
,
Oct 22
Hey, it looks like this is back with the past two PFQs... https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932574644788841360
,
Oct 24
It looks like the CL list was even larger and so it likely exceeded the new 2MB limit.
,
Oct 24
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932574644788841360 is from Oct 15. The limit was changed on Oct 18.
Comment 1 by jclinton@google.com, Oct 17
Owner: mikenichols@chromium.org
Status: Assigned (was: Untriaged)