Milo reports spurious step results for completed builds
Issue description: See e.g.: https://ci.chromium.org/p/v8/builders/luci.v8.try/v8_win64_rel_ng_triggered Screenshot: http://shortn/_Wdoj2ZFBCa The build 8931252555500354848 says "Failure Check" under info. But click on the build: https://ci.chromium.org/p/v8/builders/luci.v8.try/v8_win64_rel_ng_triggered/b8931252555500354848 Screenshot: http://shortn/_jCEkK3hPOr There is no "Check" step yet. Instead, "Test262 - no variants" is purple. But judging by the stdout of that step, it passed (i.e. the swarming shard has exit code 0).
,
Oct 30
Note that in the first example there actually is an error in the "Check" step, but somehow we don't get to see it because of the other purple step.
,
Oct 30
,
Oct 30
Looked at the annotations of the first example: https://chromium-swarm.appspot.com/task?id=40dd11ef765a2210&refresh=10&show_raw=1&wide_logs=true The Check step indeed has a FAILURE annotation, while the purple "Test262 - no variants" is marked with SUCCESS.
,
Oct 30
Similar example on CI: https://ci.chromium.org/p/v8/builders/luci.v8.ci/V8%20Android%20Arm64%20-%20N5X/1526 There's indeed an exception in "Test262 - no variants". Instead the Mozilla step is purple.
,
Oct 30
There are several more cases across V8 CI. Also some with overall green result but spurious purple steps like: https://ci.chromium.org/p/v8/builders/luci.v8.ci/V8%20Linux%20-%20debug/22900
,
Oct 30
This also affects Chromium: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20Tests%20x64/28991 Screenshot: https://screenshot.googleplex.com/4X59dm7GJXG Bot update is purple and no other steps are shown thereafter. But when clicking on the source task, there are lots of steps and the failure is something completely different.
,
Oct 30
,
Oct 30
This is much wider than I first thought. See: https://storage.cloud.google.com/chromium-v8/lkgr-status/v8-lkgr-status.html Screenshot: http://shortn/_vLOp8paW1v All builds shown as yellow actually succeeded, but there's a spurious purple step in the data. Maybe the timestamps of those builds can help to narrow down when this all started.
,
Oct 30
Same for Chromium: https://storage.cloud.google.com/chromium-v8/chromium-lkgr-status/chromium-lkgr-status.html Screenshot: http://shortn/_KJTTU1Wt3A The Chromium lkgr finder also fails now, probably because of the same root cause.
,
Oct 30
There was no update to Milo the last few days it seems. Maybe this problem is somewhere else in the LUCI stack?
,
Oct 30
Issue 900166 has been merged into this issue.
,
Oct 30
into the foundation trooper queue; if it's not in milo, it's definitely somewhere in LUCI land.
,
Oct 30
I think the problem is in LogDog. Task stdout says the test step succeeded but the Check step failed, so the build summary says Check failed. Then Milo reads steps from LogDog (instead of buildbucket, which is a bug), and LogDog returns stale data: no Check step, and the test step still running. The build completed, so Milo marks all running steps as infra failed. hinoka, these are bugs in LogDog and Milo. PTAL
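To make the failure mode above concrete, here is a minimal sketch of how a stale step list turns into a spurious purple step. This is not Milo's actual code; the `Step` type, the status strings, and `finalize_steps` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    # Hypothetical, simplified stand-in for a build step.
    name: str
    status: str  # e.g. "SUCCESS", "FAILURE", "RUNNING", "INFRA_FAILURE"


def finalize_steps(steps, build_completed):
    """If the build is done, treat any step still marked RUNNING as infra failed."""
    if not build_completed:
        return list(steps)
    return [
        replace(s, status="INFRA_FAILURE") if s.status == "RUNNING" else s
        for s in steps
    ]


# Stale data: the test step never got its final SUCCESS annotation,
# and the failing "Check" step is missing entirely.
stale = [
    Step("bot_update", "SUCCESS"),
    Step("Test262 - no variants", "RUNNING"),
]
shown = finalize_steps(stale, build_completed=True)
```

With stale input, the passing test step is displayed as infra failed (purple) and the real failure in "Check" never appears, which matches what the first example shows.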
,
Oct 30
,
Oct 30
Issue 900204 has been merged into this issue.
,
Oct 30
Issue 900155 has been merged into this issue.
,
Oct 30
Adding Dart to CC
,
Oct 30
,
Oct 30
,
Oct 30
,
Oct 30
Removed comment 21 with internal link. Here is the comment with shortened link: here is an example of showing success but having incomplete build stages: http://shortn/_vLIZk5hvQU
,
Oct 30
,
Oct 30
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/c38a76acb9c6f12eace009e8ce7d662a4423a57f commit c38a76acb9c6f12eace009e8ce7d662a4423a57f Author: Ryan Tseng <hinoka@google.com> Date: Tue Oct 30 17:31:51 2018
,
Oct 30
The logdog archivist was archiving streams before they completed. To mitigate this emergency, we've turned off archivist. This should fix the user-visible issues without adding new ones. It means that archivist will stop moving data from fast, expensive storage (BigTable) to slow, cheap storage (GCS). Our pubsub backlog has started growing, and we have enough time to figure this out. Please make it P0 if you see a new build (started after this message is posted) that still has stale/invalid steps.
,
Oct 30
Affected are users of Milo and of any system that loads steps from logdog, as opposed to buildbucket v2. Buildbucket API v2 (including BigQuery) users were not affected. We don't plan to restore lost steps in logdog. If you need to load steps programmatically, please use the buildbucket API: https://cr-buildbucket.appspot.com/rpcexplorer/services/buildbucket.v2.Builds/GetBuild?request={%20%20%20%20%22id%22:%20%228931252555500354848%22,%20%20%20%20%22fields%22:%20%22id,status,builder,steps%22}
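For anyone scripting this, the request in the link above can be built programmatically. The sketch below only constructs the request body and the matching RPC explorer URL; it does not send anything. The helper names `get_build_request` and `explorer_url` are made up for illustration, and the field list is copied from the link.

```python
import json
from urllib.parse import quote

# Base of the RPC explorer link quoted in the comment above.
RPC_EXPLORER = (
    "https://cr-buildbucket.appspot.com/rpcexplorer/services/"
    "buildbucket.v2.Builds/GetBuild"
)


def get_build_request(build_id, fields=("id", "status", "builder", "steps")):
    """Build the GetBuild request body: the build ID as a string plus a field list."""
    return {"id": str(build_id), "fields": ",".join(fields)}


def explorer_url(build_id):
    """Return an RPC explorer URL with the request JSON percent-encoded."""
    body = json.dumps(get_build_request(build_id))
    return RPC_EXPLORER + "?request=" + quote(body)


request = get_build_request(8931252555500354848)
url = explorer_url(8931252555500354848)
```

The build ID is passed as a string here because that is how it appears in the linked request.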
,
Oct 30
Issue 900225 has been merged into this issue.
,
Oct 30
Issue 900227 has been merged into this issue.
,
Oct 30
The remaining work is to figure out what's wrong with archivist. Ryan has started it. This had sufficient impact to require a postmortem. Ryan please start one and record your findings about archivist there too. I've added an entry in go/chops-pm
,
Oct 30
We have https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/PreCQ/b8931211207055002960 which was started after #26 but seems to exhibit the same problem. Maybe I'm taking crazy-pills but I also thought I saw builds https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931213536298796368 https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931213469416854832 showing the same thing, but they have full logs/steps now...
,
Oct 30
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/PreCQ/b8931211207055002960 was canceled. In this case the steps were marked infra failed as intended.
,
Oct 30
I see a yellow step in https://screenshot.googleplex.com/8ha9mmb5EZ2 but it is green on Milo. The Milo RPC returns green too: https://ci.chromium.org/rpcexplorer/services/milo.BuildInfo/Get?request={%20%20%20%20%22buildbucket%22:%20{%20%20%20%20%20%20%20%20%22id%22:%20%228931213536298796368%22%20%20%20%20}}
,
Oct 30
re: 31 - Those are unrelated. Those failed because someone pressed the "cancel" button on swarming. And by someone, it's probably cbuildbot: https://screenshot.googleplex.com/L7UqFLHkWEj
,
Oct 30
,
Oct 31
,
Nov 2
Re 30: I think analyzing archivist is one thing, but isn't there another big problem here? It has been asked elsewhere on the logdog bug, but couldn't we implement the system in a fail-safe way? IIUC what happened here is that many annotation streams were missing or archived incorrectly (root cause). But then Milo should show us _ONE_ purple step saying "Sorry I can't show you any steps" instead of showing random stale and misleading data.
,
Nov 2
Re: 37 - Yes, we can and we should implement logdog in a fail-safe way. Fixing logdog to reduce these classes of failures is going to be a major undertaking, but it is also a major part of the roadmap for the next couple of quarters. As for the Milo case, we currently have two choices when we encounter a stale logdog annotation stream: 1. Show as much as possible, and add an indication that something went wrong. 2. Don't show anything. We currently do (1), and the indication that something went wrong is a purple step at the end (maybe this isn't a good enough signal?). I'm not convinced (2) is the right way to go, since there is a lot of correct data we could be showing.
,
Nov 2
machenbach, the milo problem will be fixed by issue 850113 (by removing logdog from the equation and using same channel for build status and steps)
,
Nov 2
Re 38: Choice 1 sg, but a bit more readable indication than just making the last step purple would be great. E.g. a line at the purple step saying "broken stream" or something similar. Re 39: sg :)
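A rough sketch of what choice 1 with a more readable indication could look like. This is purely illustrative and not how Milo is implemented; the dict layout, the `note` field, and `annotate_stale_steps` are assumptions made for the example.

```python
# Hypothetical note text, following the "broken stream" wording suggested above.
BROKEN_STREAM_NOTE = "broken stream: step data may be stale or incomplete"


def annotate_stale_steps(steps, stream_complete):
    """Keep every step, but attach an explicit note to steps the stream left unfinished."""
    if stream_complete:
        return [dict(s) for s in steps]
    result = []
    for s in steps:
        s = dict(s)  # copy so the caller's data is not mutated
        if s["status"] == "RUNNING":
            s["status"] = "INFRA_FAILURE"
            s["note"] = BROKEN_STREAM_NOTE
        result.append(s)
    return result


steps = [
    {"name": "bot_update", "status": "SUCCESS"},
    {"name": "Test262 - no variants", "status": "RUNNING"},
]
annotated = annotate_stale_steps(steps, stream_complete=False)
```

The correct data is still shown, and the purple step carries a line explaining that the stream, not the step itself, is the problem.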
,
Nov 2
PM at go/chops-pm-98. Archivist is turned off; the tracking bug for turning it back on is crbug.com/901549
Comment 1 by machenb...@chromium.org
, Oct 30