
Issue 900148

Starred by 14 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug-Regression

Blocking:
issue 900409
issue 900497




Milo reports spurious step results for completed builds

Project Member Reported by machenb...@chromium.org, Oct 30

Issue description

See e.g.:
https://ci.chromium.org/p/v8/builders/luci.v8.try/v8_win64_rel_ng_triggered

Screenshot:
http://shortn/_Wdoj2ZFBCa

The build 8931252555500354848 shows "Failure Check" in its info text. But click on the build:
https://ci.chromium.org/p/v8/builders/luci.v8.try/v8_win64_rel_ng_triggered/b8931252555500354848

Screenshot:
http://shortn/_jCEkK3hPOr

There is no step "Check" yet. Instead "Test262 - no variants" is purple. But judging by the stdout of that step, it passed (i.e. the swarming shard has exit code 0).
 
A similar case is https://ci.chromium.org/p/v8/builders/luci.v8.try/v8_linux64_asan_rel_ng_triggered/b8931253648855642256 which is purple in "Test262" although none of the shards failed.
Note that in the first example there actually is an error in the "Check" step, but we don't get to see it because of the other purple step.
Cc: ishell@chromium.org
Looked at the annotations of the first example:
https://chromium-swarm.appspot.com/task?id=40dd11ef765a2210&refresh=10&show_raw=1&wide_logs=true

The Check step indeed has a FAILURE annotation, while the purple "Test262 - no variants" is marked with SUCCESS.
Similar example on CI:
https://ci.chromium.org/p/v8/builders/luci.v8.ci/V8%20Android%20Arm64%20-%20N5X/1526

There's indeed an exception in "Test262 - no variants". Instead the Mozilla step is purple.
There are several more cases across V8 CI. Some have an overall green result but spurious purple steps, like:
https://ci.chromium.org/p/v8/builders/luci.v8.ci/V8%20Linux%20-%20debug/22900
Labels: Infra-Troopers
This also affects Chromium:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20Tests%20x64/28991

Screenshot:
https://screenshot.googleplex.com/4X59dm7GJXG

Bot update is purple and no steps are shown after it. But when clicking through to the source task, there are lots of steps and the failure is something completely different.
Cc: mslekova@chromium.org
Labels: -Pri-1 Pri-0
This is much wider than I first thought. See:
https://storage.cloud.google.com/chromium-v8/lkgr-status/v8-lkgr-status.html

Screenshot:
http://shortn/_vLOp8paW1v

All builds shown as yellow actually succeeded, but there's a spurious purple step in the data. Maybe the timestamps of those builds can help narrow down when this all started.
Same for Chromium:
https://storage.cloud.google.com/chromium-v8/chromium-lkgr-status/chromium-lkgr-status.html

Screenshot:
http://shortn/_KJTTU1Wt3A

The Chromium LKGR finder also fails now, probably because of the same root cause.
Cc: iannucci@chromium.org tandrii@chromium.org
It seems there was no update to Milo in the last few days. Maybe this problem is somewhere else in the LUCI stack?
Issue 900166 has been merged into this issue.
Labels: -Infra-Troopers Foundation-Troopers
Moving into the Foundation trooper queue; if it's not in Milo, it's definitely somewhere in LUCI land.
Components: -Infra Infra>Platform>LogDog
Owner: hinoka@chromium.org
Status: Assigned (was: Untriaged)
I think the problem is in LogDog. Task stdout says the test step succeeded but the Check step failed, so the build summary says Check failed.
Milo then reads steps from LogDog (instead of Buildbucket, which is itself a bug), and LogDog returns stale data: no Check step, and the test step still running. Because the build has completed, Milo marks all running steps as infra failed.

hinoka, these are bugs in LogDog and Milo. PTAL  
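
To make the failure mode above concrete, here is a minimal sketch of how a stale step list plus a "build is complete" signal can produce a spurious purple step. This is an illustration only, not Milo's actual code: the `Step` type, the status strings, and `finalize_steps` are hypothetical names, and the step data is taken from the first example in this bug.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    status: str  # assumed values: "SUCCESS", "FAILURE", "RUNNING", "INFRA_FAILURE"


def finalize_steps(steps: List[Step], build_completed: bool) -> List[Step]:
    """Hypothetical model: after the build completes, any step that is
    still RUNNING in the step source is shown as INFRA_FAILURE (purple)."""
    if not build_completed:
        return list(steps)
    return [
        Step(s.name, "INFRA_FAILURE") if s.status == "RUNNING" else s
        for s in steps
    ]


# What the task actually did, per the annotations in stdout.
actual = [Step("Test262 - no variants", "SUCCESS"), Step("Check", "FAILURE")]

# A stale snapshot: the test step is still running and Check is missing.
stale = [Step("Test262 - no variants", "RUNNING")]

# The stale snapshot yields a purple test step and hides the real failure.
shown = finalize_steps(stale, build_completed=True)
```

With complete data, `finalize_steps(actual, build_completed=True)` returns the steps unchanged, so the rule itself is reasonable; the misleading result comes from applying it to stale input.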


Cc: jcgrego...@google.com
Issue 900204 has been merged into this issue.
Cc: jdufault@chromium.org xiaochu@chromium.org athilenius@chromium.org groeck@chromium.org
Issue 900155 has been merged into this issue.
Cc: whesse@google.com sortie@google.com
Adding Dart to CC
Owner: no...@chromium.org
Status: Started (was: Assigned)
Cc: estaab@chromium.org

Comment 21 Deleted

Cc: cduvall@chromium.org
Removed comment 21 with internal link. Here is the comment with shortened link:

here is an example of showing success but having incomplete build stages:

http://shortn/_vLIZk5hvQU

Summary: Milo reports spurious step results for completed builds (was: Milo reports spurious step results)
Project Member

Comment 25 by bugdroid1@chromium.org, Oct 30

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/c38a76acb9c6f12eace009e8ce7d662a4423a57f

commit c38a76acb9c6f12eace009e8ce7d662a4423a57f
Author: Ryan Tseng <hinoka@google.com>
Date: Tue Oct 30 17:31:51 2018

Labels: -Pri-0 Pri-1
The LogDog archivists were archiving streams before they completed. To mitigate this emergency, we've turned off archivist. This should fix the user-visible issues without adding new ones. It means that archivist will stop moving data from fast, expensive storage (BigTable) to slow, cheap storage (GCS). Our Pub/Sub backlog has started growing, but we have enough time to figure this out.

Please make it P0 if you see a new build (started after this message is posted) that still has stale/invalid steps.
Affected are users of Milo and of any system that loads steps from LogDog, as opposed to Buildbucket v2.
Buildbucket API v2 (including BigQuery) users were not affected.

We don't plan to restore lost steps in logdog. If you need to load steps programmatically, please use buildbucket API:
https://cr-buildbucket.appspot.com/rpcexplorer/services/buildbucket.v2.Builds/GetBuild?request={%20%20%20%20%22id%22:%20%228931252555500354848%22,%20%20%20%20%22fields%22:%20%22id,status,builder,steps%22}
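
For reference, the request embedded in the RPC explorer link above can be decoded and rebuilt as follows. This is only a sketch: the `id` and `fields` keys are taken from that link, the helper names are hypothetical, and nothing here sends a request or handles authentication.

```python
import json
from urllib.parse import quote, unquote, urlsplit

# RPC explorer page for GetBuild, copied from the link in this comment.
RPC_EXPLORER = (
    "https://cr-buildbucket.appspot.com/rpcexplorer/services/"
    "buildbucket.v2.Builds/GetBuild"
)


def get_build_request(build_id: str,
                      fields: str = "id,status,builder,steps") -> dict:
    """Build the JSON request body; keys mirror the link above."""
    return {"id": build_id, "fields": fields}


def explorer_url(request: dict) -> str:
    """Return an RPC explorer URL pre-filled with the given request."""
    return RPC_EXPLORER + "?request=" + quote(json.dumps(request))


def request_from_url(url: str) -> dict:
    """Recover the JSON request from an RPC explorer URL."""
    _, _, raw = urlsplit(url).query.partition("request=")
    return json.loads(unquote(raw))


req = get_build_request("8931252555500354848")
url = explorer_url(req)
```

Opening `url` in a browser should show the same pre-filled request as the link above; `request_from_url` is handy for pulling the build ID back out of such a link.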
Issue 900225 has been merged into this issue.
Issue 900227 has been merged into this issue.
Owner: hinoka@chromium.org
The remaining work is to figure out what's wrong with archivist. Ryan has started it.

This had sufficient impact to require a postmortem. Ryan, please start one and record your findings about archivist there too. I've added an entry in go/chops-pm.
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/PreCQ/b8931211207055002960
was canceled. In this case the steps were marked infra failed as intended.
re: 31 - Those are unrelated.  Those failed because someone pressed the "cancel" button on swarming.

And by someone, it's probably cbuildbot: https://screenshot.googleplex.com/L7UqFLHkWEj
Blocking: 900409
Blocking: 900497
Re 30: I think analyzing archivist is one thing, but isn't there another big problem here? It has been asked elsewhere on the LogDog bug, but couldn't we implement the system in a fail-safe way?

IIUC what happened here is that many annotation streams were missing or archived incorrectly (root cause). But then milo should show us _ONE_ purple step saying "Sorry I can't show you any steps" instead of showing random stale and misleading data.
Re: 37 - Yes we can and we should implement logdog in a failsafe way.  Fixing logdog to reduce these classes of failures is going to be a major undertaking, but is also a major part of the roadmap for the next couple quarters.

As for the Milo case, we currently have two choices when we encounter a stale LogDog annotation stream:

1. Show as much as possible, and add an indication that something went wrong.
2. Don't show anything.

We currently do (1), and the indication that something went wrong is a purple step at the end (maybe this isn't a good enough signal?).
I'm not convinced (2) is the right way to go, since there is a lot of correct data we could be showing.
machenbach, the Milo problem will be fixed by issue 850113 (by removing LogDog from the equation and using the same channel for build status and steps).
Re 38: Choice 1 sg, but a bit more readable indication than just making the last step purple would be great. E.g. a line at the purple step saying "broken stream" or something similar.

Re 39: sg :)
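
A rough sketch of what the more readable variant of choice 1 could look like: keep every step that has a final status, and attach an explicit note to any step that is still running after the build has completed. The `annotate_steps` function, the status strings, and the note text are hypothetical and only illustrate the suggestion; this is not how Milo renders steps.

```python
from typing import List, Tuple

StepIn = Tuple[str, str]         # (name, status)
StepRow = Tuple[str, str, str]   # (name, status, note)

BROKEN_STREAM_NOTE = "broken stream: step data is incomplete"


def annotate_steps(steps: List[StepIn], build_completed: bool) -> List[StepRow]:
    """Show as much as possible, but label steps whose data is stale."""
    rows: List[StepRow] = []
    for name, status in steps:
        if build_completed and status == "RUNNING":
            # Still purple, but with a reason next to it.
            rows.append((name, "INFRA_FAILURE", BROKEN_STREAM_NOTE))
        else:
            rows.append((name, status, ""))
    return rows


rows = annotate_steps(
    [("bot_update", "SUCCESS"), ("Test262 - no variants", "RUNNING")],
    build_completed=True,
)
```

The correct data (here `bot_update`) is still shown, while the purple step carries a note that the stream was cut short rather than implying that the step itself failed.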
Status: Fixed (was: Started)
PM at go/chops-pm-98
Archivist is turned off; the tracking bug for turning it back on is crbug.com/901549.
