Issue 601640

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 0
Type: Bug



Multiple GPU Windows bots failing all tests due to missing *.isolated file

Project Member Reported by ajuma@chromium.org, Apr 7 2016

Issue description

Comment 1 by ajuma@chromium.org, Apr 7 2016

Labels: Infra-Swarming

Comment 2 by kbr@chromium.org, Apr 7 2016

Cc: dpranke@chromium.org
Labels: Infra
Was a change just made to the bots' configuration that doesn't show up in the Chromium changelogs? Here is the last successful build/test pair:

https://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder/builds/42223
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516

and the first failing one:

https://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder/builds/42224
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517

All the Windows GPU bots are broken.

I just landed a build-side change that stopped running crash_service on the win bots.

Comment 4 by kbr@chromium.org, Apr 7 2016

That seems OK; it shouldn't have affected the build of .isolated files, right?

Cc: martiniss@chromium.org
Yes, it doesn't look to me like the crash_service change would've affected anything.

The log says "*.isolated file for target telemetry_gpu_test is missing":
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14670/steps/%5Berror%5D%20context_lost_tests%20on%20ATI%20GPU%20on%20Windows/logs/stdio

but it looks like it's there in the extract_build step: 

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14670/steps/extract%20build/logs/stdio

+martiniss - maybe something else changed in infra that's affecting what the 'inline python' is doing?


Comment 6 by kbr@chromium.org, Apr 7 2016

Labels: -Pri-1 Pri-0
Raising to P0. This is urgent.

Comparing these two builds:
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517

An essential difference seems to be that the build property "swarm_hashes" is being set in the first, but not the second.

It looks like the working build was at tools/build rev:
https://chromium.googlesource.com/chromium/tools/build/+/2e837e7171f83e04145b11357ba9686b8e2f3789

and the failing build was at tools/build rev:
https://chromium.googlesource.com/chromium/tools/build/+/4586453c37b8a0978cab1d7fe7e257e0333bc16b

but that may be a red herring, since the problem might be a tools/build upgrade on the builder. I haven't gotten that far yet.

Comment 7 by kbr@chromium.org, Apr 7 2016

In fact, the tools/build change between the working and failing builds on the builder was the recipe roll:

https://chromium.googlesource.com/chromium/tools/build/+log/03c2b4bbe796560dc3b7377750084e4bb1d4f41d..2e837e7171f83e04145b11357ba9686b8e2f3789?pretty=full


The depot_tools change which landed here was written by robbie. It breaks locks when gclient runs; see https://chromium.googlesource.com/chromium/tools/depot_tools/+/bf525dc9be49933435ee51bfeacdb7b3c8b3c1b0

Comment 9 by kbr@chromium.org, Apr 8 2016

Cc: vadimsh@chromium.org
OK, maybe it wasn't that.

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516/steps/steps/logs/stdio

step "find isolated tests" finds the isolated tests.

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517/steps/steps/logs/stdio

step "find isolated tests" finds nothing.

What's happening? What could have changed to affect the behavior here?

The 'swarm_hashes' thing is a red herring: that value is set by the output of the build itself, during the 'find isolated tests' step, which is the first step that failed. So the fact that 'swarm_hashes' is empty is a symptom, not a cause.

Comment 11 by kbr@chromium.org, Apr 8 2016

I don't understand. The JSON output of the "find isolated tests" step (which is an execution of scripts/slave/recipe_modules/isolate/resources/find_isolated_tests.py) is the JSON blob containing the swarm hashes for the *.isolated files. It seems to me that find_isolated_tests is failing for some reason, not something else in the recipe. The fact that this dictionary is empty causes the recipe to take a different, non-swarmed, code path for running the rest of the tests, which doesn't work on these bots.
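
To make the failure mode described above concrete, here is a minimal sketch. It is not the actual find_isolated_tests.py or recipe code. The function names, the SHA-1 hashing, and the top-level-only scan are all assumptions made for illustration. It shows only the shape of the problem: a step that returns an empty dictionary pushes the caller onto the non-swarmed path.

```python
import glob
import hashlib
import os


def find_isolated_hashes(build_dir):
    """Map target name -> hash for each *.isolated file directly in build_dir.

    Hypothetical stand-in for the "find isolated tests" step. The scan looks
    only at the top level of build_dir (an assumption, not confirmed here).
    """
    hashes = {}
    for path in sorted(glob.glob(os.path.join(build_dir, '*.isolated'))):
        target = os.path.splitext(os.path.basename(path))[0]
        with open(path, 'rb') as f:
            # SHA-1 of the file contents is assumed purely for illustration.
            hashes[target] = hashlib.sha1(f.read()).hexdigest()
    return hashes


def choose_test_path(swarm_hashes):
    """Hypothetical stand-in for the recipe's branch on the step output."""
    return 'swarmed' if swarm_hashes else 'non-swarmed'
```

With an empty build directory, `find_isolated_hashes` returns `{}` and `choose_test_path` picks the non-swarmed branch, which matches the symptom of an unset "swarm_hashes" property.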

kbr@: Yes, your understanding is exactly correct. "find isolated tests" is failing (namely, not finding any isolated tests), and that's causing things to fail.

I have no idea why it isn't finding anything though. Not yet.
I suppose we can try reverting the crash_service CL and see if that affects anything?
Here's a difference (likely another red herring):

The passing build deletes the old build directory successfully; the failing build throws an error:
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516/steps/rmtree%20build%20directory/logs/stdio
vs
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517/steps/rmtree%20build%20directory/logs/stdio

The second one has:
e:\b\build\slave\win7_release__nvidia_\build\src\out\release\crash_service.exe - Access is denied.
Actually, that error is directly related to the crash_service CL.
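
As a rough illustration of why a lingering process matters for the rmtree step, the sketch below tries to delete every file under a build directory and records the ones that fail. It is not the bots' actual cleanup script. `remove_files` and its injectable `remove` parameter are hypothetical names, added so the "Access is denied" case can be simulated without a running crash_service.exe.

```python
import os


def remove_files(build_dir, remove=os.remove):
    """Try to delete every file under build_dir.

    Returns a list of (path, error) pairs for files that could not be
    deleted, e.g. an executable that is still running on Windows.
    """
    failures = []
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                remove(path)
            except OSError as err:  # PermissionError is a subclass of OSError
                failures.append((path, err))
    return failures
```

A non-empty result means the directory was only partially cleaned, so later steps may be working with stale or unexpected contents.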

I think we should revert the crash_service CL.

Comment 18 by kbr@chromium.org, Apr 8 2016

Yes, let's do that. Thanks.

Comment 19 by kbr@chromium.org, Apr 8 2016

Cc: d...@chromium.org
Note: dnj@ and I logged on to one of the bots (vm90-m1) and found that the directory structure was:

out/Release/full-build-win32/...

instead of

out/Release/...

after extract_build was done.
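
A quick way to check for this kind of layout difference on a bot might look like the sketch below. It is only an illustration: `locate_isolated_files` is a hypothetical helper, not part of tools/build. It reports which directories under the build directory actually contain *.isolated files, relative to that directory.

```python
import os


def locate_isolated_files(build_dir):
    """Return sorted directories (relative to build_dir) holding *.isolated files."""
    found = set()
    for root, _dirs, files in os.walk(build_dir):
        if any(name.endswith('.isolated') for name in files):
            found.add(os.path.relpath(root, build_dir))
    return sorted(found)
```

A result of `['.']` means the files sit directly in the build directory. Any other entry, such as a wrapper directory name, means they were extracted one or more levels deeper than expected.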

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517

Comment 20 by kbr@chromium.org, Apr 8 2016

I'll change the configuration of these bots to reboot after each run. I'd turned that off a while ago to speed up their cycle time. That's probably why crash_service.exe was still running on them.

Comment 21 by kbr@chromium.org, Apr 8 2016

Owner: kbr@chromium.org
Status: Started (was: Untriaged)
kbr noted in chat that these bots don't auto-reboot. So if we assume that the culprit is the crash_service running, then just rebooting the bots once should cause them to heal. I'm issuing restarts to all the affected bots right now, and then having them rebuild their latest task, to see if that works.

Comment 24 by kbr@chromium.org, Apr 8 2016

Labels: -Infra-Troopers
Thanks Aaron for diagnosing and fixing the problem.

https://codereview.chromium.org/1866403003 will re-enable auto-reboots on these two waterfalls. Sorry for the instability this "optimization" caused.

Comment 25 by bugdroid1@chromium.org, Apr 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/f597ce0781f7268c5da6474edb0af26f466305c5

commit f597ce0781f7268c5da6474edb0af26f466305c5
Author: kbr@chromium.org <kbr@chromium.org>
Date: Fri Apr 08 01:17:41 2016

Auto-reboot the chromium.gpu and chromium.gpu.fyi bots.

Auto-reboots were disabled some time ago in the spirit of efficiency,
but this caused problems like lingering processes on Windows which were
severely detrimental.

The only four slaves that do not auto-reboot are Linux slaves which use
the "subdir" property to run multiple slaves on the same physical host
and can not be auto-rebooted.

BUG= 601640 

Review URL: https://codereview.chromium.org/1866403003

git-svn-id: svn://svn.chromium.org/chrome/trunk/tools/build@299786 0039d316-1c4b-4281-b951-d872f2087c98

[modify] https://crrev.com/f597ce0781f7268c5da6474edb0af26f466305c5/masters/master.chromium.gpu.fyi/slaves.cfg
[modify] https://crrev.com/f597ce0781f7268c5da6474edb0af26f466305c5/masters/master.chromium.gpu/slaves.cfg

Comment 26 by bugdroid1@chromium.org, Apr 8 2016

Comment 27 by bugdroid1@chromium.org, Apr 8 2016

Comment 28 by kbr@chromium.org, Apr 8 2016

Restarted the chromium.gpu and chromium.gpu.fyi waterfalls above to start auto-rebooting the bots.

Everything looks happy and green. Marking fixed!

Thanks everyone for your patience and help.
Status: Fixed (was: Started)
Components: Infra>Platform>Swarming
Labels: -Infra-Swarming
Labels: -Infra