New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 666047 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
away until April
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

isolate_tests step hangs when data dependencies are missing

Project Member Reported by ehmaldonado@chromium.org, Nov 16 2016

Issue description

Cc: katthomas@chromium.org
Labels: -Pri-3 Pri-1
Owner: vadimsh@chromium.org
Status: Assigned (was: Untriaged)
Elevating this to P1 because it is causing infra failures and increasing CQ cycle time.

A couple more examples:

https://luci-milo.appspot.com/buildbot/tryserver.chromium.linux/linux_chromium_chromeos_ozone_rel_ng/272502
https://luci-milo.appspot.com/buildbot/tryserver.chromium.linux/linux_chromium_chromeos_ozone_rel_ng/272669

We saw a lot of these failures on this builder yesterday.

Assigning to @vadimsh because that's my best guess at the moment. @vadimsh, if you're not the right person to own this, please advise who might be more appropriate?
Cc: djd@chromium.org tandrii@chromium.org tansell@chromium.org
As part of this, can we add a timeout to the isolate test step in the recipe?
see  issue 665895  for context. Vadim isn't the right person, me or our SYD colleagues are.
katthomas@ on timeout - please find proposal doc Pawel made about step timeouts. While isolate isn't listed there (afair), the same discussion applies. Maybe file  another bug.
Owner: tandrii@chromium.org
I'm not familiar with 'isolate' tool code. It was written by maruel@ and tandrii@ and currently being picked up by tansell@, djd@ and @mcgreevy. Assigning to tandrii@.
Owner: djd@chromium.org
There is clearly a bug in the luci-go isolate client. It should definitely fail (loudly) if a file referenced by the .isolate file doesn't exist.

I'm going to assign to djd@ who is currently working on an improved archive command in isolate. This will also hopefully solve this bug.
#5 - Discussed offline. Filed crbug.com/666091.

Thanks @tansell!

Comment 9 by djd@chromium.org, Nov 21 2016

Status: Started (was: Assigned)

Comment 10 by djd@chromium.org, Nov 21 2016

My local toy examples haven't been able to repro this bug, so it's not immediately clear why it's hanging indefinitely.

I might need to apply a sledgehammer approach if this is causing pain on the builders.
My experience was similar, djd@. I did see this bug ~1 year ago, but also couldn't repro. Once I did get a reliable repro by running a hung tryjob locally, there were 40 isolate files! The trouble is that there are so many goroutines that GDB is totally useless. Then I tried Golang debugger, but it was unbearably slow as it was parsing the whole stack every time I ran a command inside, but maybe it got better since then?

So, I just ran tryjob locally myself and I can repro it again.
If you want, I can try to put that into gstorage with command I'm executing so you can have it on your side (or you can try running any hung tryjob recipe yourself).
Components: -Infra Infra>Platform>Swarming
(Putting this in the Infra>Platform>Swarming component, since apparently Infra>Platform>Isolate doesn't exist.)

Comment 13 by djd@chromium.org, Nov 23 2016

@tandrii, do you know why the isolate client goes to such lengths to try to close "cleanly"? It seems like it would be much simpler + safer to just kill the binary on errors like this.
Project Member

Comment 14 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/d1e2b7c4a43da90682972424bd0745fa0217a814

commit d1e2b7c4a43da90682972424bd0745fa0217a814
Author: djd <djd@chromium.org>
Date: Wed Nov 30 02:58:15 2016

isolate: give up and die on file unavailability

Don't try to gracefully die if a file is unavailable during the initial
stat / walk; it's much safer to just kill binary.

BUG= 666047 

Review-Url: https://codereview.chromium.org/2535803004

[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/archiver/directory.go
[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/archiver/directory_test.go
[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/isolate/isolate.go

Comment 16 by djd@chromium.org, Dec 5 2016

Status: Fixed (was: Started)
This should be fixed now. @tandrii, how do I go about running a tryjob locally?
Re #13: from my PoV, because I was really trying to do propagate Go errors, but yes, something went wrong with logic inside archiver pipeline.

Re #16: go to any tryjob like [1] find setup_build step inside which there is "run_recipe" [2] which provides copy-pastable repro. One thing to make sure before you run it though is that chromium recipes hardcode "/b" paths on Linux, so first do:
 $ sudo mkdir -p /b && sudo chown $USER /b

[1] https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/273458/
[2] https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/273458/steps/setup_build/logs/run_recipe
Cc: mar...@chromium.org serg...@chromium.org
 Issue 587319  has been merged into this issue.
 Issue 525175  has been merged into this issue.
Cc: vadimsh@chromium.org
 Issue 526728  has been merged into this issue.
Labels: Infra-Failures
Labels: Hotlist-Infra-Failures

Sign in to add a comment