isolate_tests step hangs when data dependencies are missing
Issue description
If a data dependency is missing, the isolate tests step hangs instead of failing.
Examples:
https://build.chromium.org/p/tryserver.webrtc/builders/linux_dbg/builds/10374/steps/isolate%20tests/logs/stdio
https://build.chromium.org/p/tryserver.webrtc/builders/linux_dbg/builds/10375/steps/isolate%20tests/logs/stdio
,
Nov 16 2016
,
Nov 16 2016
As part of this, can we add a timeout to the isolate test step in the recipe?
,
Nov 16 2016
See issue 665895 for context. Vadim isn't the right person; I or our SYD colleagues are.
,
Nov 16 2016
katthomas@, on the timeout: please find the proposal doc Pawel made about step timeouts. While isolate isn't listed there (afair), the same discussion applies. Maybe file another bug.
,
Nov 16 2016
I'm not familiar with the 'isolate' tool code. It was written by maruel@ and tandrii@ and is currently being picked up by tansell@, djd@ and @mcgreevy. Assigning to tandrii@.
,
Nov 17 2016
There is clearly a bug in the luci-go isolate client. It should definitely fail (loudly) if a file referenced by the .isolate file doesn't exist. I'm going to assign to djd@ who is currently working on an improved archive command in isolate. This will also hopefully solve this bug.
,
Nov 17 2016
#5 - Discussed offline. Filed crbug.com/666091. Thanks @tansell!
,
Nov 21 2016
,
Nov 21 2016
My local toy examples haven't been able to repro this bug, so it's not immediately clear why it's hanging indefinitely. I might need to apply a sledgehammer approach if this is causing pain on the builders.
,
Nov 21 2016
My experience was similar, djd@. I did see this bug ~1 year ago, but also couldn't repro it with toy examples. Once I did get a reliable repro by running a hung tryjob locally, there were 40 isolate files! The trouble is that there are so many goroutines that GDB is totally useless. Then I tried the Golang debugger, but it was unbearably slow because it parsed the whole stack every time I ran a command inside it. Maybe it has gotten better since then?
So, I just ran a tryjob locally myself and I can repro it again. If you want, I can try to put that into gstorage along with the command I'm executing so you can have it on your side (or you can try running any hung tryjob recipe yourself).
,
Nov 23 2016
(Putting this in the Infra>Platform>Swarming component, since apparently Infra>Platform>Isolate doesn't exist.)
,
Nov 23 2016
@tandrii, do you know why the isolate client goes to such lengths to try to close "cleanly"? It seems like it would be much simpler + safer to just kill the binary on errors like this.
,
Nov 30 2016
The following revision refers to this bug:
https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/d1e2b7c4a43da90682972424bd0745fa0217a814

commit d1e2b7c4a43da90682972424bd0745fa0217a814
Author: djd <djd@chromium.org>
Date: Wed Nov 30 02:58:15 2016

isolate: give up and die on file unavailability

Don't try to gracefully die if a file is unavailable during the initial stat / walk; it's much safer to just kill binary.

BUG= 666047
Review-Url: https://codereview.chromium.org/2535803004

[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/archiver/directory.go
[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/archiver/directory_test.go
[modify] https://crrev.com/d1e2b7c4a43da90682972424bd0745fa0217a814/client/isolate/isolate.go
,
Dec 5 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/bf2e505f84276ed16cd888cda97026d6151e8169

commit bf2e505f84276ed16cd888cda97026d6151e8169
Author: djd <djd@chromium.org>
Date: Mon Dec 05 04:43:54 2016

Roll isolate binaries generated at infra@12ec732

This contains luci/luci-go@47c9792.

R=tansell@chromium.org
BUG= 666047
Review-Url: https://codereview.chromium.org/2553493002
Cr-Commit-Position: refs/heads/master@{#436234}

[modify] https://crrev.com/bf2e505f84276ed16cd888cda97026d6151e8169/tools/luci-go/linux64/isolate.sha1
[modify] https://crrev.com/bf2e505f84276ed16cd888cda97026d6151e8169/tools/luci-go/mac64/isolate.sha1
[modify] https://crrev.com/bf2e505f84276ed16cd888cda97026d6151e8169/tools/luci-go/win64/isolate.exe.sha1
,
Dec 5 2016
This should be fixed now. @tandrii, how do I go about running a tryjob locally?
,
Dec 5 2016
Re #13: from my PoV, because I was really trying to propagate Go errors, but yes, something went wrong with the logic inside the archiver pipeline.
Re #16: go to any tryjob like [1] and find the setup_build step, inside which there is "run_recipe" [2], which provides a copy-pastable repro. One thing to make sure of before you run it, though, is that chromium recipes hardcode "/b" paths on Linux, so first do:
$ sudo mkdir -p /b && sudo chown $USER /b
[1] https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/273458/
[2] https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/273458/steps/setup_build/logs/run_recipe
,
Dec 5 2016
,
Dec 5 2016
Issue 525175 has been merged into this issue.
,
Dec 5 2016
,
Dec 8 2016
,
Jan 25 2017
Comment 1 by katthomas@chromium.org, Nov 16 2016
Labels: -Pri-3 Pri-1
Owner: vadimsh@chromium.org
Status: Assigned (was: Untriaged)