
Issue 891196


Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug-Regression




Milo reports spurious purple steps

Project Member Reported by machenb...@chromium.org, Oct 2

Issue description

E.g.:
https://ci.chromium.org/p/v8/builders/luci.v8.ci/V8%20Linux64%20ASAN/27574

On the overview page the build appears green:
http://shortn/_ApSke0lKoX

The underlying swarming task is completed:
https://chromium-swarm.appspot.com/task?id=4048d95948831e10&refresh=10&show_raw=1&wide_logs=true

Also the purple step's shards are all completed. And when going to the step's stdout (which is huge) everything looks like a successful step.

Also the duration of the step seems to never end. Now it's at 16h:
http://shortn/_lOzcsyjf0c

On our lkgr status page, all those builds appear as unfinished, see:
https://storage.cloud.google.com/chromium-v8/lkgr-status/v8-lkgr-status.html
Screenshot:
http://shortn/_nh2re3T9oG
 

Comment 1 Deleted

Comment 2 Deleted

Labels: -Infra-Troopers Foundation-Troopers
Owner: tandrii@chromium.org
Status: Assigned (was: Untriaged)
Sorry, the whole Foundation team had an onsite, which made the trooper (me) not very responsive. Looking today.
Status: Started (was: Assigned)
Very interesting, indeed. Thanks for filing in such detail, Michael!

The buildER view page is sourced purely from buildbucket, so green implies buildbucket says build finished with SUCCESS.
OTOH, the build view uses logdog step data to display steps and their status. A step may indeed be purple even though the build succeeds.

But this makes the lkgr view weird if lkgr takes only the build status. So, maybe lkgr takes individual build steps into account, and hence noticed the purpleness?
Hm, the logdog output for a step looks OK (from the swarming task page):

substep: <
    step: <
      name: "[trigger] Test262"
      status: SUCCESS
      stdout_stream: <
        name: "steps/s__trigger__Test262/0/stdout"
      >
      started: <
        seconds: 1538404589
        nanos: 89256289
      >
      ended: <
        seconds: 1538404591
        nanos: 587541228
      >
      text: "Run on OS: 'Ubuntu-14.04'"
      other_links: <
        label: "json.output"
        logdog_stream: <
          name: "steps/s__trigger__Test262/0/logs/json.output/0"
        >
      >
      other_links: <
        label: "shard #0"
        url: "https://chromium-swarm.appspot.com/user/task/4048e0a888c4ae10"
      >
      other_links: <
        label: "shard #1"
        url: "https://chromium-swarm.appspot.com/user/task/4048e0a9ef85da10"
      >
      other_links: <
        label: "shard #2"
        url: "https://chromium-swarm.appspot.com/user/task/4048e0ab4c7d3210"
      >
      other_links: <
        label: "shard #3"
        url: "https://chromium-swarm.appspot.com/user/task/4048e0aca4043c10"
      >
      other_links: <
        label: "shard #4"
        url: "https://chromium-swarm.appspot.com/user/task/4048e0ae31846410"
      >
    >
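
As a sanity check on the timestamps, here is a minimal sketch that converts the `started`/`ended` pairs quoted above into a wall-clock start time and a duration. The values are copied from the annotation dump; the helper name `to_nanos` is made up for illustration and is not part of logdog.

```python
from datetime import datetime, timezone


def to_nanos(seconds: int, nanos: int) -> int:
    """Combine a protobuf-style (seconds, nanos) pair into integer nanoseconds."""
    return seconds * 10**9 + nanos


# Values copied from the "[trigger] Test262" step in the dump above.
started_ns = to_nanos(1538404589, 89256289)
ended_ns = to_nanos(1538404591, 587541228)

duration_ns = ended_ns - started_ns
started_utc = datetime.fromtimestamp(1538404589, tz=timezone.utc)

print(started_utc.isoformat())               # start of the step in UTC
print(f"{duration_ns / 1e9:.3f} s elapsed")  # step duration in seconds
```

This step has both a start and an end recorded and ran for roughly 2.5 seconds, so it is finalized in the annotation data and cannot account for a duration that keeps growing.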


Components: -Infra Infra>Platform>LogDog
Verified that buildbucket is indeed reporting success for this build:

"build": {
  "status": "COMPLETED",
  "result": "SUCCESS",
}
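
A minimal sketch of the same check done programmatically, assuming the response has already been fetched and has the shape quoted above. The function name `build_succeeded` is hypothetical, and only the two fields shown in the excerpt are inspected.

```python
import json

# Excerpt of the buildbucket response quoted above, reduced to valid JSON.
response = json.loads('{"build": {"status": "COMPLETED", "result": "SUCCESS"}}')


def build_succeeded(resp: dict) -> bool:
    """Return True only if the build is completed and its result is SUCCESS."""
    build = resp.get("build", {})
    return build.get("status") == "COMPLETED" and build.get("result") == "SUCCESS"


print(build_succeeded(response))
```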

So, I think it's logdog that hasn't finalized the state for this step, which Milo then interprets as a purple failure.
Thanks to Ryan,
$ cit logdog cat logdog://logs.chromium.org/v8/buildbucket/cr-buildbucket.appspot.com/8933860088081465920/+/annotations

indeed shows that the logdog server isn't aware of the rest of the log.
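
To illustrate how the two views can disagree, here is a rough sketch under stated assumptions. It is not Milo's actual code: the `Step` type, the color mapping, and the second step name are all hypothetical. The only facts taken from this thread are that the build-level status comes from buildbucket and that a step whose state was never finalized in logdog ends up purple.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    name: str
    status: Optional[str]  # last status seen in the annotation stream, if any
    ended: bool            # True once an end timestamp was recorded


def step_color(step: Step) -> str:
    """Hypothetical color mapping for a step; not Milo's real logic."""
    if not step.ended:
        # The stream never delivered the step's final state.
        return "purple"
    return "green" if step.status == "SUCCESS" else "red"


def build_color(result: str) -> str:
    """Build-level color derived from the buildbucket result alone."""
    return "green" if result == "SUCCESS" else "red"


steps = [
    Step("[trigger] Test262", "SUCCESS", ended=True),
    Step("Test262", "RUNNING", ended=False),  # hypothetical unfinalized step
]
print(build_color("SUCCESS"), [step_color(s) for s in steps])
```

In this model the build is green while one step stays purple, because the two colors are computed from different data sources that are never reconciled.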
Project Member

Comment 9 by bugdroid1@chromium.org, Oct 4

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/17e0185edfa074051e77ae442a3d7481bb5aec8e

commit 17e0185edfa074051e77ae442a3d7481bb5aec8e
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Oct 04 00:44:59 2018

Cc: tandrii@chromium.org
Labels: -Foundation-Troopers
Owner: hinoka@chromium.org
Michael, according to our investigation, the log stream is unfortunately lost for good. AFAIU, lkgr is doing OK now, so I'm removing this from trooper queue.

And assigning to Ryan to deploy a fix to logdog.
