
Issue 896405 link


Issue metadata

Status: Fixed
Owner:
Closed: Oct 24
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 850113




LUCI: M71 Chrome PFQ: Misleading Completion Status

Project Member Reported by kbleicher@google.com, Oct 17

Issue description

The M71 Chrome PFQ is showing an odd status in LUCI.

The completed PFQ run is showing a purple 'Internal Error' ('Infra Failure' on the tab):
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932403704197749520

While the top level is showing an outright failure:
https://luci-scheduler.appspot.com/jobs/chromeos/release-R71-11151.B-samus-chrome-pre-flight-branch

These are hiding the fact that the PFQ completed successfully with a good uprev.


 
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>CI
Owner: mikenichols@chromium.org
Status: Assigned (was: Untriaged)
We had a bug open with LUCI about this recently; it's probably a corner case that they hadn't thought of. I would just reuse this bug, label it for Foundation-Troopers (and unassign), and reference the original bug.
Labels: Foundation-Troopers
Owner: ----
Status: Available (was: Assigned)
Hi all --

This looks like perhaps another edge case regarding the completion status. We've seen similar behavior in http://crbug/860508 (I believe this is the right bug), but it has been clean for most cases since.

Tasks show success:  https://chrome-swarming.appspot.com/task?id=409fa6a06bfec410&refresh=10&show_raw=1&wide_logs=true 
Results are not demonstrating any errors:   https://chrome-isolated.appspot.com/browse?namespace=default-gzip&digest=67231b97b01232f76bd571ce17214c46e0d51f9a&as=build-run-result.json

Thanks,
Mike  

Blockedon: 850113
Owner: no...@chromium.org
Status: Assigned (was: Available)
this happens because the build violated the buildbucket-level API limit on the size of the build (1MB). So even though the build succeeded on the bot, it wasn't accepted on the server.

we should have done a better job of surfacing that, though. v2 API doesn't expose that yet.
V1 exposes it in result_details_json
https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8932403704197749520&_h=1&
but it is quite buried: build-run-result.json returned by the swarming task is bad: >= 1 MB

this should be fixed in the v2 API (there is an infra_failure_reason field for that), and then Milo should switch to the v2 API

in the meantime, please use the v1 API if the v2 API doesn't explain the reason for an infra failure
also please limit the number of step links. There are 12K+ links. Is this necessary?

assigning to myself to implement infra_failure_reason
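
A minimal sketch of the size check described above, for anyone who wants to test a build-run-result.json locally before the server rejects it. This is not the buildbucket implementation. The names `ASSUMED_LIMIT_BYTES`, `result_size_bytes`, and `exceeds_limit` are hypothetical, and treating "1 MB" as 1024 * 1024 bytes of compact UTF-8 JSON is an assumption.

```python
import json

# Assumption: the thread only says the result was rejected for being ">= 1 MB".
# Counting 1 MB as 1024 * 1024 bytes is a guess, not the server's definition.
ASSUMED_LIMIT_BYTES = 1024 * 1024


def result_size_bytes(result):
    """Return the UTF-8 size of `result` serialized as compact JSON."""
    return len(json.dumps(result, separators=(",", ":")).encode("utf-8"))


def exceeds_limit(result, limit=ASSUMED_LIMIT_BYTES):
    """Return True if the serialized result is at or above `limit` bytes."""
    return result_size_bytes(result) >= limit
```

With 12K+ step links of around 100 characters each, the serialized result alone passes 1 MB, which is consistent with the failure reported here.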
Apologies if this is implied in #3, but what's the timing on v2?

Will the v1 change keep the overall status from going red / purple?   Really need that resolved for the TPM team building Chrome OS.

Thx
Cc: no...@chromium.org
Owner: ----
Status: Available (was: Assigned)
the change to expose the infra failure reason in v2 won't help unblock the TPMs. It will only enable faster diagnosis, but won't fix the problem. The root cause is that cbuildbot is dumping too many links. This needs to be fixed.

i didn't realize there is a followup needed for this build. I've extracted the buildbucket feature request to a separate bug 896792. Please make the necessary changes in cbuildbot. CCI caps the number of step lines they emit. They used to emit a step line for each failed test, so sometimes tens of thousands of lines.
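
One possible shape for the capping suggested above, as a sketch only. It is not taken from cbuildbot or CCI. The function name `cap_links` and the wording of the summary entry are hypothetical.

```python
def cap_links(links, max_links):
    """Return at most `max_links` entries.

    When the input is longer, keep the first `max_links - 1` links and
    replace the remainder with a single summary entry.
    """
    if max_links < 1:
        raise ValueError("max_links must be at least 1")
    if len(links) <= max_links:
        return list(links)
    kept = list(links[:max_links - 1])
    omitted = len(links) - len(kept)
    # Hypothetical summary text; a real emitter would pick its own wording.
    kept.append("... and %d more (omitted)" % omitted)
    return kept
```

The summary entry keeps the emitted output bounded while still telling the reader how many links were dropped.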
Thoughts on the best owner per #5?
Most of those links are CLs so I don't know how we can reduce it: they are required. Can we raise the limit from 1MB to 2MB?

There's a discussion going on right now of moving to Milo's native support for showing changes but Milo will need support for Repo manifests instead of DEPS/git submodules for that to work. At least, that's what I gathered, so far.

Project Member

Comment 8 by bugdroid1@chromium.org, Oct 18

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/8f878b9effe685fce4edd27a516298e7e5ea849a

commit 8f878b9effe685fce4edd27a516298e7e5ea849a
Author: Nodir Turakulov <nodir@google.com>
Date: Thu Oct 18 21:03:11 2018

[buildbucket] Increase build-run-result.json limit

In buildbucket build-run-result.json exists only in memory
during finalizing of a build.
Increase its size limit. Separately, we still have a limit on the size of
stored BuildSteps.step_container.

Bug:  896405 
Change-Id: Ib732fd705f50989b8593d5ccf904c4333ae337e7
Reviewed-on: https://chromium-review.googlesource.com/c/1289450
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18454}
[modify] https://crrev.com/8f878b9effe685fce4edd27a516298e7e5ea849a/appengine/cr-buildbucket/swarming/swarming.py
[modify] https://crrev.com/8f878b9effe685fce4edd27a516298e7e5ea849a/appengine/cr-buildbucket/swarming/test/swarming_test.py

change in #c8 was deployed. Please schedule another build.
Owner: jclinton@chromium.org
Status: Verified (was: Available)
thank you.
I'll keep an eye on the current PFQ run (or after) and offer feedback tomorrow; thanks.

Current:
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932287752066711792
Status: Assigned (was: Verified)
Hey, it looks like this is back with the past two PFQs....

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932574644788841360
It looks like the CL list was even larger and so it likely exceeded the new 2MB limit.
Status: Fixed (was: Assigned)
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8932574644788841360
is from Oct 15. The limit was changed on Oct 18.
