New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 599512 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug



Sign in to add a comment

windows bots are broken

Project Member Reported by jam@chromium.org, Mar 31 2016

Issue description

https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng?numbuilds=200

The "steps" phase is failing with 

subprocess.CalledProcessError: Command '['git.bat', 'fetch']' returned non-zero exit status 1
RemoveDirectory running cmd.exe /c rd /q /s e:\b\build\slave\win\build\.recipe_runtime


Also, why isn't this purple?
 
Labels: Infra-Troopers Infra
this looks like error in the recipe engine itself, which would explain why step wasn't marked as infra failure and thus not colored purple
Owner: serg...@chromium.org
Status: Started (was: Untriaged)
Cc: iannucci@chromium.org phajdan.jr@chromium.org
adding some folks who know recipe engine better than me, please ping me when you are awake

Comment 4 by jam@chromium.org, Mar 31 2016

i don't understand comment 1. shouldn't all errors in recipe engine be marked as infra failure?
jam: not if the recipe engine, which marks them, is itself failing

Comment 6 by jam@chromium.org, Mar 31 2016

Summary: windows bots are broken (was: win_chromium_compile_dbg_ng is broken)
happening elsewhere too: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng?numbuilds=200
Owner: phajdan.jr@chromium.org
Pawel said he'll take over.

Comment 8 by d...@chromium.org, Mar 31 2016

Cc: d...@chromium.org
This is probably related to the recent Git 2.7 update that was pushed. A lot of foundation scripts were change,d and this likely hits an untested case.

Ping @iannucci, the mastermind behind the migration.
Filed https://bugs.chromium.org/p/chromium/issues/detail?id=599518 to fix indication to purple.
PATH=E:\b\depot_tools;E:\b\depot_tools\python276_bin;C:\Windows\system32;C:\Windows\system32\WBEM;C:\Program Files\7-Zip;C:\cmake\bin;E:\b\build_internal\tools

...

@@@STEP_CURSOR setup_properties@@@

@@@STEP_CLOSED@@@
INFO:annotated_run:Master tryserver.chromium.win, builder win_chromium_compile_dbg_ng is not whitelisted for LogDog.
INFO:annotated_run:Not using LogDog. Invoking `recipes.py` directly.
fatal: Needed a single revision
INFO:root:Freshening repository E:\b\build\scripts\slave\.recipe_deps\depot_tools
The system cannot find the path specified.
INFO:root:['git.bat', 'fetch'] (in E:\b\build\scripts\slave\.recipe_deps\depot_tools)
The system cannot find the path specified.
Traceback (most recent call last):
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipes.py", line 397, in <module>
    sys.exit(main())
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipes.py", line 356, in main
    overrides=args.project_override)
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 600, in create
    package_deps._root_package = package_deps._create_package(RootRepoSpec(proto_file))
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 617, in _create_package
    return self._create_from_spec(repo_spec, package_spec)
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 634, in _create_from_spec
    deps[dep] = self._create_package(dep_repo)
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 606, in _create_package
    repo_spec.checkout(self._context)
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 212, in checkout
    _run_cmd([self._git, 'fetch'], cwd=dep_dir)
  File "E:\b\build\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\package.py", line 663, in _run_cmd
    subprocess.check_call(cmd, cwd=cwd)
  File "E:\b\depot_tools\python276_bin\lib\subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git.bat', 'fetch']' returned non-zero exit status 1

This expects git.bat to be in PATH, but for some reason it isn't. It only occurs on a subset of slaves, I may need to investigate on ones where it does occur to get more info.

Comment 11 by d...@chromium.org, Mar 31 2016

I bet the "depot_tools" recipe pin in "build/" needs to be bumped so that the recent Git script changes are picked up.

Comment 12 by jam@chromium.org, Mar 31 2016

Can we revert the migration while this is being investigated?

Comment 13 by d...@chromium.org, Mar 31 2016

I think that is not a good idea. The migration is fairly invasive to a critical dependency (Git).
Agreed with dnj@ . Let's investigate more at least for little bit. Since I'm an owner for now let me coordinate this. You can ping me on IM.

I connected to vm772-m4 and that slave seems to have had errors updating its code during startup:

C:\Windows\system32>chdir /d E:\b

E:\b>call E:\b\build\slave\run_slave.bat
fatal: unable to access 'https://chromium.googlesource.com/a/chromium/tools/depo
t_tools.git/': Couldn't resolve host 'chromium.googlesource.com'
fatal: unable to access 'https://chromium.googlesource.com/a/chromium/tools/depo
t_tools.git/': Couldn't resolve host 'chromium.googlesource.com'
Syncing projects: 100% ( 1/ 1) slave.DEPS

slave.DEPS (ERROR)
----------------------------------------
[0:00:00] Started.
----------------------------------------
Error: Command 'git remote update' returned non-zero exit status 1 in E:\b\slave
.DEPS
Fetching origin

fatal: unable to access 'https://chrome-internal.googlesource.com/chrome/tools/b
uild/slave.DEPS.git/': Couldn't resolve host 'chrome-internal.googlesource.com'
error: Could not fetch origin

(E:\b\build\slave\\run_slave.py) `gclient sync` failed; proceeding anyway...
Using buildbot_slave_8_4 and twisted_10_2
HotPatchSlaveBuilder(False)
Using slave name vm772-m4
Using master host master4a.golo.chromium.org
Using master port 8190

Actually that bot seems to be more deeply broken: cmd.exe not found?
slave.png
36.3 KB View Download

Comment 16 by d...@chromium.org, Mar 31 2016

I just tried this on vm246-m4 and hit a similar issue: cmd.exe not found. I had to manually run "c:\Program Files\Windows\System32\cmd.exe", but "gclient sync" in /b worked for me, though.

I think the system breakage is indicative of the problem, but your specific slave.DEPS thing is not.

Comment 17 by d...@chromium.org, Mar 31 2016

Actually, the "cmd.exe" issue suggests a larger default PATH issue, which is possibly affected by the new Git scripts.

Comment 18 by d...@chromium.org, Mar 31 2016

PATH on this box: e:\b\depot_tools;c:\tools

This is notably missing c:\Windows\system32, which explains the cmd.exe issues.
I've triggered a reboot of vm772-m4 using https://chrome-infra.appspot.com/vm-power

On another slave where this issue occurs, vm330-m4 , this is the result of trying to open cmd.exe (I was getting similar results on vm246-m4).
slave.png
87.9 KB View Download
Dan, how did you check the PATH? I don't have a working prompt, and trying to get to the GUI results in this error:
slave.png
114 KB View Download

Comment 21 by d...@chromium.org, Mar 31 2016

I've found two independent problems here, probably from the same cause:
1) Recipe engine uses "git.bat" from its "depot_tools" checkout, not the one in /b.
2) The Git update (or something) broke default user PATH (HKEY_CURRENT_USER\Environment).

WRT (1), I don't know the right course of action. If the recipe engine is *supposed* to use its own Git, then it needs to make a point of calling "gclient.bat" in the checkout. I think this is stupid, though, as the recipe engine shouldn't be responsible for installing a parallel toolchain. Consequently, I think the bug here is that recipe engine isn't calling "e:\b\depot_tools\git.bat" for Git operations.

WRT (2), this needs to be looked farther into, but is probably not causing the actual bot breakages that we're seeing.
FWIW, vm772-m4 either came just fine from the reboot, or didn't reboot at all. I think it's lost connectivity when I requested the reboot, and when I connected again the system also disconnected saying that it's about to reboot, so I think it was real.
re: comment #21:

1) why do you think so? it seems to rely on git.bat in PATH

2) that looks like exactly the cause of this outage; do we have any ideas how to further debug and fix this?

Comment 24 by d...@chromium.org, Mar 31 2016

I'm going to reboot vm246-m4 and see if anything good happens to the PATH. I think (1) is still going to be an issue regardless, though.

Comment 25 by jam@chromium.org, Mar 31 2016

wrt comment 13 "I think that is not a good idea. The migration is fairly invasive to a critical dependency (Git).". I can't parse what this means.

Git is critical but so are the bots working. My cursory googling shows that the on disk files are backward compatible; but if I'm wrong please let me know. In that case, why wouldn't we undo this change immediately? At least 1/4 of the bots are broken now and this will rise as more update afaik.

Comment 26 by d...@chromium.org, Mar 31 2016

(1) I know this is the case because I CD into E:\b\build\scripts\slave\.recipe_deps\depot_tools, run "git.bat", and see it fail. I then run "gclient.bat", let it do its thing (install the Git toolchain), then run "git.bat" again and see it succeed.

Because (2) is NOT a problem once in the command-line environment, it probably isn't a problem to the bot, which runs in the command line environment.

Comment 27 by d...@chromium.org, Mar 31 2016

jam@, it is not an easy thing to roll back an invasive change on the bots. Rolling back now would likely introduce more problems.

Comment 28 by d...@chromium.org, Mar 31 2016

To follow up, most of the bots have already successfully updated their Git, these bots included. The change has been active for 12+ hours. If I'm right, this is a problem with the recipe engine, not the Git update.

Comment 29 by d...@chromium.org, Mar 31 2016

To follow up on #24, rebooting did *not* fix the PATH problem.

Comment 30 by d...@chromium.org, Mar 31 2016

INFO:root:['git.bat', 'fetch'] (in E:\b\build\scripts\slave\.recipe_deps\depot_tools)

This indicates that "git.bat" is being run out of the "depot_tools" CWD, causing it to pick up the recipe engine depot_tools "git.bat" instead of the system one. I think we need to change recipe engine to not change directory, maybe pass "--git-dir" instead?
Note that the change did not change anything in the registry, nor should PATH have changed outside of depot tools. 

I'll be online shortly. Can we get labs folks to comment on it these boys are different somehow than the working ones? 
Ok, I'm properly online now.

For #30: of course it's running git in the depot_tools repo to do the fetch. that's how cwd works. However, there's no git.bat in that directory (it's generated by running `gclient`, which the recipe engine does not do), so it's probably not using the non existent bat file (though if it exists, that would be strange and good to know).
Owner: d...@chromium.org
> Actually that bot seems to be more deeply broken: cmd.exe not found?
This is a typical symptom on some bots and I've seen it earlier. I don't think that it's related to current outage.
I think Sergiy is right.  I've seen the default file association for .exe get hammered on some windows slaves and have fixed as I have come across them while looking at other problems.   But I have not seen this affect builds in the past so I think this is a red herring.
Trying to debug these bots: they can't open executable files (like cmd.exe). I've tried to restore the exe association, but HKCR\exefile\shell\open\command already has the correct value of `"%1" %*`
How are we supposed to believe that ".exe's don't work" is a solid basis for debugging other issues?

Comment 38 by d...@chromium.org, Mar 31 2016

exes work fine once you're in cmd.exe :)
It looks like dnj was right: gclient is being run and does download its binary dependencies, but the way it does so, it ends up failing to update the bat files (for some reason that's still not clear).

However, the typical flow of running 'gclient' before the engine, doesn't apply when in that cwd, and so the update doesn't happen, and git is permanently stuck.

I suspect this would have broken with any git update: there hasn't been one for over a year, and a lot has changed since then.
"the fire isn't so bad once you put on a firefighting jacket"

/me is not reassured
See issue 599550 for workaround on how to launch cmd.exe on borked bots.

Comment 42 by d...@chromium.org, Mar 31 2016

Yeah: http://i.imgur.com/c4jt321.png

But in this case I think it really is.
if it really is fine, we should break it uniformly across all the bots. If not, we should make these bots behave the way that the rest of the bots do. Either way, the bots should behave the same way. The fact that you found a workaround is nice, but mostly irrelevant.

Comment 44 by d...@chromium.org, Mar 31 2016

I am not by any means actually saying that broken environment is good. Just not related to this P0.
Issue 599394 has been merged into this issue.
Cc: martiniss@chromium.org hinoka@chromium.org
 Issue 599342  has been merged into this issue.

Comment 48 by d...@chromium.org, Mar 31 2016

Status: Fixed (was: Started)
https://uberchromegw.corp.google.com/i/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng/builds/167272

This build was failing before, is now not failing, so this looks like it's fixed. Please re-open if any *new* incidents are seen on builds started after ~12:20PM PST.

Comment 49 by jam@chromium.org, Mar 31 2016

What was the fix?

Comment 50 by d...@chromium.org, Mar 31 2016

Yeah no idea where bugdroid is, but fix was:
- https://codereview.chromium.org/1849903002
- https://codereview.chromium.org/1849563004/ (b/c I'm stupid)

Comment 51 by jam@chromium.org, Mar 31 2016

Thanks.

Can we have a post-mortem for this please, since many bots were broken for at least half a day? It would be good to know what lessons we can apply so this doesn't reoccur.

Comment 52 by d...@chromium.org, Mar 31 2016

Owner: seanmccullough@chromium.org
Assigning to trooper for postmortem.

Comment 53 by d...@chromium.org, Mar 31 2016

Although I'm not sure a postmortem is particularly useful here. There really isn't a specific lesson to learn - we have bots, some subset of which were in a weird state that our pre-launch testing didn't pick up. I think that's just going to be the case for all invasive deployments, since we have a non-homogenous fleet, and we're already working on solutions to make more deployments non-invasive ones.

The summary of this problem was: on bots with a recipe engine checkout that had run "gclient" in its "depot_tools" directory before, it used the wrong Git. Do you think there's value writing a postmortem about that?
Issue 599455 has been merged into this issue.
I think there's value in writing that plus the description of the steps we're taking to move to a consistently-imaged fleet (w/ tracking bugs, timelines, etc., if they exist).

Postmortems don't have to be long.

Comment 56 by jam@chromium.org, Mar 31 2016

+1 to comment 55, and also for adding monitoring that would catch ~25% of builds suddenly failing. Once bug 599518 is fixed and this is marked as an infra failure, the spike should be something that can be alerted on automatically.
Components: Infra
Labels: -Infra

Sign in to add a comment