windows bots are broken |
||||||||||
Issue descriptionhttps://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng?numbuilds=200 The "steps" phase is failing with subprocess.CalledProcessError: Command '['git.bat', 'fetch']' returned non-zero exit status 1 RemoveDirectory running cmd.exe /c rd /q /s e:\b\build\slave\win\build\.recipe_runtime Also, why isn't this purple?
,
Mar 31 2016
,
Mar 31 2016
adding some folks who know recipe engine better than me, please ping me when you are awake
,
Mar 31 2016
i don't understand comment 1. shouldn't all errors in recipe engine be marked as infra failure?
,
Mar 31 2016
jam: not if the recipe engine, which marks them, is itself failing
,
Mar 31 2016
happening elsewhere too: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng?numbuilds=200
,
Mar 31 2016
Pawel said he'll take over.
,
Mar 31 2016
This is probably related to the recent Git 2.7 update that was pushed. A lot of foundation scripts were change,d and this likely hits an untested case. Ping @iannucci, the mastermind behind the migration.
,
Mar 31 2016
Filed https://bugs.chromium.org/p/chromium/issues/detail?id=599518 to fix indication to purple.
,
Mar 31 2016
PATH=E:\b\depot_tools;E:\b\depot_tools\python276_bin;C:\Windows\system32;C:\Windows\system32\WBEM;C:\Program Files\7-Zip;C:\cmake\bin;E:\b\build_internal\tools
...
@@@STEP_CURSOR setup_properties@@@
@@@STEP_CLOSED@@@
INFO:annotated_run:Master tryserver.chromium.win, builder win_chromium_compile_dbg_ng is not whitelisted for LogDog.
INFO:annotated_run:Not using LogDog. Invoking `recipes.py` directly.
fatal: Needed a single revision
INFO:root:Freshening repository E:\b\build\scripts\slave\.recipe_deps\depot_tools
The system cannot find the path specified.
INFO:root:['git.bat', 'fetch'] (in E:\b\build\scripts\slave\.recipe_deps\depot_tools)
The system cannot find the path specified.
Traceback (most recent call last):
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipes.py", line 397, in <module>
sys.exit(main())
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipes.py", line 356, in main
overrides=args.project_override)
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 600, in create
package_deps._root_package = package_deps._create_package(RootRepoSpec(proto_file))
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 617, in _create_package
return self._create_from_spec(repo_spec, package_spec)
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 634, in _create_from_spec
deps[dep] = self._create_package(dep_repo)
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 606, in _create_package
repo_spec.checkout(self._context)
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 212, in checkout
_run_cmd([self._git, 'fetch'], cwd=dep_dir)
File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 663, in _run_cmd
subprocess.check_call(cmd, cwd=cwd)
File "E:\b\depot_tools\python276_bin\lib\subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git.bat', 'fetch']' returned non-zero exit status 1
This expects git.bat to be in PATH, but for some reason it isn't. It only occurs on a subset of slaves, I may need to investigate on ones where it does occur to get more info.
,
Mar 31 2016
I bet the "depot_tools" recipe pin in "build/" needs to be bumped so that the recent Git script changes are picked up.
,
Mar 31 2016
Can we revert the migration while this is being investigated?
,
Mar 31 2016
I think that is not a good idea. The migration is fairly invasive to a critical dependency (Git).
,
Mar 31 2016
Agreed with dnj@ . Let's investigate more at least for little bit. Since I'm an owner for now let me coordinate this. You can ping me on IM. I connected to vm772-m4 and that slave seems to have had errors updating its code during startup: C:\Windows\system32>chdir /d E:\b E:\b>call E:\b\build\slave\run_slave.bat fatal: unable to access 'https://chromium.googlesource.com/a/chromium/tools/depo t_tools.git/': Couldn't resolve host 'chromium.googlesource.com' fatal: unable to access 'https://chromium.googlesource.com/a/chromium/tools/depo t_tools.git/': Couldn't resolve host 'chromium.googlesource.com' Syncing projects: 100% ( 1/ 1) slave.DEPS slave.DEPS (ERROR) ---------------------------------------- [0:00:00] Started. ---------------------------------------- Error: Command 'git remote update' returned non-zero exit status 1 in E:\b\slave .DEPS Fetching origin fatal: unable to access 'https://chrome-internal.googlesource.com/chrome/tools/b uild/slave.DEPS.git/': Couldn't resolve host 'chrome-internal.googlesource.com' error: Could not fetch origin (E:\b\build\slave\\run_slave.py) `gclient sync` failed; proceeding anyway... Using buildbot_slave_8_4 and twisted_10_2 HotPatchSlaveBuilder(False) Using slave name vm772-m4 Using master host master4a.golo.chromium.org Using master port 8190
,
Mar 31 2016
Actually that bot seems to be more deeply broken: cmd.exe not found?
,
Mar 31 2016
I just tried this on vm246-m4 and hit a similar issue: cmd.exe not found. I had to manually run "c:\Program Files\Windows\System32\cmd.exe", but "gclient sync" in /b worked for me, though. I think the system breakage is indicative of the problem, but your specific slave.DEPS thing is not.
,
Mar 31 2016
Actually, the "cmd.exe" issue suggests a larger default PATH issue, which is possibly affected by the new Git scripts.
,
Mar 31 2016
PATH on this box: e:\b\depot_tools;c:\tools This is notably missing c:\Windows\system32, which explains the cmd.exe issues.
,
Mar 31 2016
I've triggered a reboot of vm772-m4 using https://chrome-infra.appspot.com/vm-power On another slave where this issue occurs, vm330-m4 , this is the result of trying to open cmd.exe (I was getting similar results on vm246-m4).
,
Mar 31 2016
Dan, how did you check the PATH? I don't have a working prompt, and trying to get to the GUI results in this error:
,
Mar 31 2016
I've found two independent problems here, probably from the same cause: 1) Recipe engine uses "git.bat" from its "depot_tools" checkout, not the one in /b. 2) The Git update (or something) broke default user PATH (HKEY_CURRENT_USER\Environment). WRT (1), I don't know the right course of action. If the recipe engine is *supposed* to use its own Git, then it needs to make a point of calling "gclient.bat" in the checkout. I think this is stupid, though, as the recipe engine shouldn't be responsible for installing a parallel toolchain. Consequently, I think the bug here is that recipe engine isn't calling "e:\b\depot_tools\git.bat" for Git operations. WRT (2), this needs to be looked farther into, but is probably not causing the actual bot breakages that we're seeing.
,
Mar 31 2016
FWIW, vm772-m4 either came just fine from the reboot, or didn't reboot at all. I think it's lost connectivity when I requested the reboot, and when I connected again the system also disconnected saying that it's about to reboot, so I think it was real.
,
Mar 31 2016
re: comment #21: 1) why do you think so? it seems to rely on git.bat in PATH 2) that looks like exactly the cause of this outage; do we have any ideas how to further debug and fix this?
,
Mar 31 2016
I'm going to reboot vm246-m4 and see if anything good happens to the PATH. I think (1) is still going to be an issue regardless, though.
,
Mar 31 2016
wrt comment 13 "I think that is not a good idea. The migration is fairly invasive to a critical dependency (Git).". I can't parse what this means. Git is critical but so are the bots working. My cursory googling shows that the on disk files are backward compatible; but if I'm wrong please let me know. In that case, why wouldn't we undo this change immediately? At least 1/4 of the bots are broken now and this will rise as more update afaik.
,
Mar 31 2016
(1) I know this is the case because I CD into E:\b\build\scripts\slave\.recipe_deps\depot_tools, run "git.bat", and see it fail. I then run "gclient.bat", let it do its thing (install the Git toolchain), then run "git.bat" again and see it succeed. Because (2) is NOT a problem once in the command-line environment, it probably isn't a problem to the bot, which runs in the command line environment.
,
Mar 31 2016
jam@, it is not an easy thing to roll back an invasive change on the bots. Rolling back now would likely introduce more problems.
,
Mar 31 2016
To follow up, most of the bots have already successfully updated their Git, these bots included. The change has been active for 12+ hours. If I'm right, this is a problem with the recipe engine, not the Git update.
,
Mar 31 2016
To follow up on #24, rebooting did *not* fix the PATH problem.
,
Mar 31 2016
INFO:root:['git.bat', 'fetch'] (in E:\b\build\scripts\slave\.recipe_deps\depot_tools) This indicates that "git.bat" is being run out of the "depot_tools" CWD, causing it to pick up the recipe engine depot_tools "git.bat" instead of the system one. I think we need to change recipe engine to not change directory, maybe pass "--git-dir" instead?
,
Mar 31 2016
Note that the change did not change anything in the registry, nor should PATH have changed outside of depot tools. I'll be online shortly. Can we get labs folks to comment on it these boys are different somehow than the working ones?
,
Mar 31 2016
Ok, I'm properly online now. For #30: of course it's running git in the depot_tools repo to do the fetch. that's how cwd works. However, there's no git.bat in that directory (it's generated by running `gclient`, which the recipe engine does not do), so it's probably not using the non existent bat file (though if it exists, that would be strange and good to know).
,
Mar 31 2016
,
Mar 31 2016
> Actually that bot seems to be more deeply broken: cmd.exe not found? This is a typical symptom on some bots and I've seen it earlier. I don't think that it's related to current outage.
,
Mar 31 2016
I think Sergiy is right. I've seen the default file association for .exe get hammered on some windows slaves and have fixed as I have come across them while looking at other problems. But I have not seen this affect builds in the past so I think this is a red herring.
,
Mar 31 2016
Trying to debug these bots: they can't open executable files (like cmd.exe). I've tried to restore the exe association, but HKCR\exefile\shell\open\command already has the correct value of `"%1" %*`
,
Mar 31 2016
How are we supposed to believe that ".exe's don't work" is a solid basis for debugging other issues?
,
Mar 31 2016
exes work fine once you're in cmd.exe :)
,
Mar 31 2016
It looks like dnj was right: gclient is being run and does download its binary dependencies, but the way it does so, it ends up failing to update the bat files (for some reason that's still not clear). However, the typical flow of running 'gclient' before the engine, doesn't apply when in that cwd, and so the update doesn't happen, and git is permanently stuck. I suspect this would have broken with any git update: there hasn't been one for over a year, and a lot has changed since then.
,
Mar 31 2016
"the fire isn't so bad once you put on a firefighting jacket" /me is not reassured
,
Mar 31 2016
See issue 599550 for workaround on how to launch cmd.exe on borked bots.
,
Mar 31 2016
Yeah: http://i.imgur.com/c4jt321.png But in this case I think it really is.
,
Mar 31 2016
if it really is fine, we should break it uniformly across all the bots. If not, we should make these bots behave the way that the rest of the bots do. Either way, the bots should behave the same way. The fact that you found a workaround is nice, but mostly irrelevant.
,
Mar 31 2016
I am not by any means actually saying that broken environment is good. Just not related to this P0.
,
Mar 31 2016
Issue 599394 has been merged into this issue.
,
Mar 31 2016
,
Mar 31 2016
Ok, looks like this is roughly fixed... see https://uberchromegw.corp.google.com/i/tryserver.chromium.win/builders/win_clang/builds/676
,
Mar 31 2016
https://uberchromegw.corp.google.com/i/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng/builds/167272 This build was failing before, is now not failing, so this looks like it's fixed. Please re-open if any *new* incidents are seen on builds started after ~12:20PM PST.
,
Mar 31 2016
What was the fix?
,
Mar 31 2016
Yeah no idea where bugdroid is, but fix was: - https://codereview.chromium.org/1849903002 - https://codereview.chromium.org/1849563004/ (b/c I'm stupid)
,
Mar 31 2016
Thanks. Can we have a post-mortem for this please, since many bots were broken for at least half a day? It would be good to know what lessons we can apply so this doesn't reoccur.
,
Mar 31 2016
Assigning to trooper for postmortem.
,
Mar 31 2016
Although I'm not sure a postmortem is particularly useful here. There really isn't a specific lesson to learn - we have bots, some subset of which were in a weird state that our pre-launch testing didn't pick up. I think that's just going to be the case for all invasive deployments, since we have a non-homogenous fleet, and we're already working on solutions to make more deployments non-invasive ones. The summary of this problem was: on bots with a recipe engine checkout that had run "gclient" in its "depot_tools" directory before, it used the wrong Git. Do you think there's value writing a postmortem about that?
,
Mar 31 2016
Issue 599455 has been merged into this issue.
,
Mar 31 2016
I think there's value in writing that plus the description of the steps we're taking to move to a consistently-imaged fleet (w/ tracking bugs, timelines, etc., if they exist). Postmortems don't have to be long.
,
Mar 31 2016
+1 to comment 55, and also for adding monitoring that would catch ~25% of builds suddenly failing. Once bug 599518 is fixed and this is marked as an infra failure, the spike should be something that can be alerted on automatically.
,
Apr 27 2016
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by serg...@chromium.org
, Mar 31 2016