Multiple GPU Windows bots failing all tests due to missing *.isolated file
Issue description
At least four Windows bots on the GPU and GPU fyi waterfalls are failing all tests, with output: "*.isolated file for target telemetry_gpu_test is missing". Sample builds: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517 https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14670 https://build.chromium.org/p/chromium.gpu.fyi/builders/Win7%20Release%20%28NVIDIA%29/builds/23747 https://build.chromium.org/p/chromium.gpu.fyi/builders/Win8%20Release%20%28NVIDIA%29/builds/21462 It's unclear to me whether this is at all related to issue 601564.
,
Apr 7 2016
Was a change just made to the bots' configuration that doesn't show up in the Chromium changelogs? Here was the last successful build/test pair: https://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder/builds/42223 https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516 and the first failing one: https://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder/builds/42224 https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517 All the Windows GPU bots are broken.
,
Apr 7 2016
I just landed a build-side change that stopped running crash_service on the win bots.
,
Apr 7 2016
That seems OK; it shouldn't have affected the build of .isolated files, right?
,
Apr 7 2016
Yes, it doesn't look to me like the crash_service change would've affected anything. The log says "*.isolated file for target telemetry_gpu_test is missing": https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14670/steps/%5Berror%5D%20context_lost_tests%20on%20ATI%20GPU%20on%20Windows/logs/stdio but it looks like the file is there in the extract_build step: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14670/steps/extract%20build/logs/stdio +martiniss - maybe something else changed in infra that's affecting what the 'inline python' is doing?
,
Apr 7 2016
Raising to P0. This is urgent. Comparing these two builds: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516 https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517 An essential difference seems to be that the build property "swarm_hashes" is being set in the first, but not the second. It looks like the working build was at tools/build rev: https://chromium.googlesource.com/chromium/tools/build/+/2e837e7171f83e04145b11357ba9686b8e2f3789 and failing build was at tools/build rev: https://chromium.googlesource.com/chromium/tools/build/+/4586453c37b8a0978cab1d7fe7e257e0333bc16b but that may be a red herring since the problem might be a tools/build upgrade on the builder -- haven't gotten that far yet.
,
Apr 7 2016
In fact, the tools/build change between the working and failing builds on the builder was the recipe roll: https://chromium.googlesource.com/chromium/tools/build/+log/03c2b4bbe796560dc3b7377750084e4bb1d4f41d..2e837e7171f83e04145b11357ba9686b8e2f3789?pretty=full
,
Apr 7 2016
The depot_tools change which landed here was written by robbie. It breaks locks when gclient runs, see https://chromium.googlesource.com/chromium/tools/depot_tools/+/bf525dc9be49933435ee51bfeacdb7b3c8b3c1b0
,
Apr 8 2016
OK, maybe it wasn't that. https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516/steps/steps/logs/stdio step "find isolated tests" finds the isolated tests. https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517/steps/steps/logs/stdio step "find isolated tests" finds nothing. What's happening? What could have changed to affect the behavior here?
,
Apr 8 2016
The 'swarm_hashes' thing is a red herring: that value is set by the output of the build itself, during the 'find isolated tests' step, which is the first step that failed. So the fact that 'swarm_hashes' is empty is a symptom, not a cause.
,
Apr 8 2016
I don't understand. The JSON output of the "find isolated tests" step (which is an execution of scripts/slave/recipe_modules/isolate/resources/find_isolated_tests.py) is the JSON blob containing the swarm hashes for the *.isolated files. It seems to me that find_isolated_tests is failing for some reason, not something else in the recipe. The fact that this dictionary is empty causes the recipe to take a different, non-swarmed, code path for running the rest of the tests, which doesn't work on these bots.
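A minimal sketch of the behavior described in this comment, assuming a simplified stand-in for find_isolated_tests.py. The function names, the non-recursive glob, and the SHA-1 hashing are assumptions for illustration and are not taken from the actual script. It maps each target name to a hash of its *.isolated file and returns an empty dictionary when no such files sit directly in the build directory, which is the case that pushes the caller onto the non-swarmed path.

```python
import glob
import hashlib
import os


def find_isolated_tests(build_dir):
    """Map target name -> SHA-1 of each *.isolated file directly in build_dir.

    Hypothetical stand-in; the real find_isolated_tests.py may differ.
    """
    hashes = {}
    for path in sorted(glob.glob(os.path.join(build_dir, '*.isolated'))):
        target = os.path.splitext(os.path.basename(path))[0]
        with open(path, 'rb') as f:
            hashes[target] = hashlib.sha1(f.read()).hexdigest()
    return hashes


def choose_test_path(swarm_hashes):
    # An empty dictionary sends the caller down the non-swarmed path.
    return 'swarmed' if swarm_hashes else 'non-swarmed'
```

Under these assumptions, a build directory whose *.isolated files are missing or sit one level deeper than expected yields an empty result, with no error raised by the search itself.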
,
Apr 8 2016
Update_scripts: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516/steps/update_scripts/logs/gclient_json vs https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517/steps/update_scripts/logs/gclient_json The only difference between passing and failing in the build/ repo is the crash_service CL: https://chromium.googlesource.com/chromium/tools/build/+log/2e837e7171f83e04145b11357ba9686b8e2f3789..4586453c37b8a0978cab1d7fe7e257e0333bc16b?pretty=full The only other difference between passing and failing is in build_internal/scripts/slave: https://chrome-internal.googlesource.com/chrome/tools/build_limited/scripts/slave/+log/9a81227376e8acb18cfdd9bdcdc944059bc27e63~..25543bc2dd32d0984911a49b85604cef7328dc47?pretty=full depot_tools does not change between passing and failing.
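The comparison in this comment can be sketched as a diff of two revision maps. This assumes each gclient_json log has already been reduced to a plain {checkout_path: revision} dictionary; the real log has more structure, and the helper name `changed_revisions` is hypothetical.

```python
def changed_revisions(passing, failing):
    """Return {path: (passing_rev, failing_rev)} for every path whose revision differs.

    Both arguments are assumed to be plain {checkout_path: revision} mappings;
    the real gclient_json log has more structure than this.
    """
    changed = {}
    for path in sorted(set(passing) | set(failing)):
        before, after = passing.get(path), failing.get(path)
        if before != after:
            changed[path] = (before, after)
    return changed
```

Paths that appear in only one of the two maps are reported with None on the missing side, so an added or removed checkout also shows up as a difference.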
,
Apr 8 2016
kbr@ Yes, your understanding is exactly correct. "find isolated tests" is failing (namely, not finding any isolated tests), and that's causing things to fail. I have no idea why it isn't finding anything though. Not yet.
,
Apr 8 2016
This is the diff in chromium src between passing and failing: https://chromium.googlesource.com/chromium/src/+log/fca9ef5451d1dc700bbf2370e1147b05ad704cfc..aa575cd9100984ba7cf7fc0377d7c5d40abbdb5f?pretty=full
,
Apr 8 2016
I suppose we can try reverting the crash_service CL and see if that affects anything?
,
Apr 8 2016
Here's a difference (likely another red herring): The passing build deletes the old build directory successfully, the failing build throws an error: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47516/steps/rmtree%20build%20directory/logs/stdio vs https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517/steps/rmtree%20build%20directory/logs/stdio The second one has: e:\b\build\slave\win7_release__nvidia_\build\src\out\release\crash_service.exe - Access is denied.
,
Apr 8 2016
Actually, that error is directly related to the crash_service CL. I think we should revert the crash_service CL.
,
Apr 8 2016
Yes, let's do that. Thanks.
,
Apr 8 2016
Note: dnj@ and I logged on to one of the bots (vm90-m1) and found that the directory structure was: out/Release/full-build-win32/... instead of out/Release/... after extract_build was done. https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47517
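One plausible way to end up with this nesting, offered as an assumption since the thread does not confirm the mechanism: if the extraction step moves an unpacked full-build-win32 directory to out/Release, and a failed rmtree has left out/Release in place, Python's shutil.move places the source inside the existing destination instead of replacing it. The helper names `install_build` and `demo` and the file layout below are hypothetical.

```python
import os
import shutil
import tempfile


def install_build(unpacked_dir, target_dir):
    """Move an unpacked build into place, as a hypothetical extract step might."""
    # shutil.move nests the source inside target_dir if target_dir already exists.
    shutil.move(unpacked_dir, target_dir)


def demo(stale_target):
    """Return the sorted contents of out/Release after installing a build."""
    root = tempfile.mkdtemp()
    out = os.path.join(root, 'out')
    unpacked = os.path.join(out, 'full-build-win32')
    os.makedirs(unpacked)
    open(os.path.join(unpacked, 'telemetry_gpu_test.isolated'), 'w').close()
    target = os.path.join(out, 'Release')
    if stale_target:
        # Simulate rmtree failing: the old directory survives with a leftover file.
        os.makedirs(target)
        open(os.path.join(target, 'crash_service.exe'), 'w').close()
    install_build(unpacked, target)
    return sorted(os.listdir(target))
```

With a clean slate, `demo(False)` returns `['telemetry_gpu_test.isolated']`. With a surviving directory, `demo(True)` returns `['crash_service.exe', 'full-build-win32']`, so a search that only looks directly inside out/Release would find no *.isolated files.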
,
Apr 8 2016
I'll change the configuration of these bots to reboot after each run. I'd turned that off a while ago to speed up their cycle time. That's probably why crash_service.exe was still running on them.
,
Apr 8 2016
,
Apr 8 2016
kbr noted in chat that these bots don't auto-reboot. So if we assume that the culprit is the crash_service running, then just rebooting the bots once should cause them to heal. I'm issuing restarts to all the affected bots right now, and then having them rebuild their latest task, to see if that works.
,
Apr 8 2016
We have a successful "find isolated tests" step after a reboot: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28ATI%29/builds/14676/steps/find%20isolated%20tests/logs/stdio
,
Apr 8 2016
Thanks Aaron for diagnosing and fixing the problem. https://codereview.chromium.org/1866403003 will re-enable auto-reboots on these two waterfalls. Sorry for the instability this "optimization" caused.
,
Apr 8 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/f597ce0781f7268c5da6474edb0af26f466305c5
commit f597ce0781f7268c5da6474edb0af26f466305c5
Author: kbr@chromium.org <kbr@chromium.org>
Date: Fri Apr 08 01:17:41 2016
Auto-reboot the chromium.gpu and chromium.gpu.fyi bots.
Auto-reboots were disabled some time ago in the spirit of efficiency, but this caused problems like lingering processes on Windows which were severely detrimental. The only four slaves that do not auto-reboot are Linux slaves which use the "subdir" property to run multiple slaves on the same physical host and can not be auto-rebooted.
BUG= 601640
Review URL: https://codereview.chromium.org/1866403003
git-svn-id: svn://svn.chromium.org/chrome/trunk/tools/build@299786 0039d316-1c4b-4281-b951-d872f2087c98
[modify] https://crrev.com/f597ce0781f7268c5da6474edb0af26f466305c5/masters/master.chromium.gpu.fyi/slaves.cfg
[modify] https://crrev.com/f597ce0781f7268c5da6474edb0af26f466305c5/masters/master.chromium.gpu/slaves.cfg
,
Apr 8 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/54d507d341778d6c2f93d56901be9604f8d3da6d commit 54d507d341778d6c2f93d56901be9604f8d3da6d Author: kbr <kbr@google.com> Date: Fri Apr 08 01:20:20 2016
,
Apr 8 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager.git/+/90f9ef62b3fae91af5848c9a1d6cf5f1cfa49f71 commit 90f9ef62b3fae91af5848c9a1d6cf5f1cfa49f71 Author: kbr <kbr@google.com> Date: Fri Apr 08 01:20:29 2016
,
Apr 8 2016
Restarted the chromium.gpu and chromium.gpu.fyi waterfalls above to start auto-rebooting the bots.
,
Apr 8 2016
Everything looks happy and green. Marking fixed! Thanks everyone for your patience and help.
,
Apr 8 2016
,
Apr 26 2016
,
Apr 27 2016
Comment 1 by ajuma@chromium.org
, Apr 7 2016