Windows buildbots failing bot_update and other steps with "fatal error - add_item" from sh.exe, find.exe, mv.exe and others
Issue description

Example builds:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/413
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/418

Excerpts:

0 [main] find 5560 c:\b\rr\tmpvwttjv\w\src\third_party\llvm-build-tools\gnuwin\find.EXE: *** fatal error - add_item ("\??\c:\b\rr\tmpvwttjv\w\src\third_party", "/", ...) failed, errno 1
0 [main] mv 5868 c:\b\rr\tmpvwttjv\w\src\third_party\llvm-build-tools\gnuwin\mv.EXE: *** fatal error - add_item ("\??\c:\b\rr\tmpvwttjv\w\src\third_party", "/", ...) failed, errno 1

This is blocking us from building new Clang packages, so it's pretty bad. It's also very strange because we haven't changed anything on the Clang packaging side.
,
Nov 14
win_upload_clang is served by two builders: build4-m4 and build6-m4.

These are all the builds that show the "add_item" error:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/413
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/414
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/415
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/416
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/418

The first build was performed by build6-m4, and all the rest by build4-m4.

The failures are mostly the same, but not completely. 413, 414 and 415 had four failures:

Failing Tests (4):
    Clang :: Coverage/html-diagnostics.c
    Clang :: Coverage/html-multifile-diagnostics.c
    LLVM :: ExecutionEngine/OrcMCJIT/load-object-a.ll
    LLVM :: Transforms/ThinLTOBitcodeWriter/no-type-md.ll

416 and 418 only have three:

Failing Tests (3):
    Clang :: Coverage/html-diagnostics.c
    Clang :: Coverage/html-multifile-diagnostics.c
    LLVM :: ExecutionEngine/OrcMCJIT/load-object-a.ll

What happened to no-type-md.ll? In the three builds above it failed with

  mv.EXE: *** fatal error - add_item ("\??\c:\b\rr\tmp7ybnff\w\src\third_party", "/", ...) failed

but in these two last builds it didn't fail. Very strange.
,
Nov 14
I verified that the gnuwin package hasn't changed:

$ gsutil.py ls -la gs://chromium-browser-clang/tools/gnuwin-8.zip
   4443986  2018-08-21T16:47:47Z  gs://chromium-browser-clang/tools/gnuwin-8.zip#1534870067321180  metageneration=1
TOTAL: 1 objects, 4443986 bytes (4.24 MiB)

And the bot output shows that they did pull that in each build.
,
Nov 14
Here's the last build that succeeded:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/406

And the first that failed in this mode:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/407

Something must have changed.
,
Nov 14
What does the find invocation in html-diagnostics.c look like?
// RUN: rm -rf %t
// RUN: %clang_cc1 -analyze -analyzer-output=html -analyzer-checker=core -o %t %s
// RUN: find %t -name "*.html" -exec cat "{}" ";" | FileCheck %s
It's unlikely to be hitting a bad find executable; we put our gnuwin tools first in path.
But it would be nice to be able to ssh to the bot and verify that.
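In the meantime, here's roughly how one could check the PATH ordering from the bot once access works again - a sketch in Python, not part of our recipes; the tool names are just the ones from the error messages, and it also lists copies further down PATH to spot shadowing:

# Print which find.exe / mv.exe / sh.exe wins on PATH, plus every other copy
# further down. Run it in the same environment the test step uses, so the
# PATH matches what lit sees.
import os
import shutil

for tool in ('find', 'mv', 'sh'):
    print('%s -> %s' % (tool, shutil.which(tool)))
    for entry in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(entry, tool + '.exe')
        if os.path.isfile(candidate):
            print('    on PATH: %s' % candidate)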
,
Nov 14
The error message comes from the cygwin runtime: https://github.com/mirror/newlib-cygwin/blob/17918cc6a6e6471162177a1125c6208ecce8a72e/winsup/cygwin/mount.cc#L473 This supports the theory that the wrong find.exe and mv.exe are getting run, because the gnuwin tools don't use cygwin, and that maybe something in the chrome infra environment changed to make these binaries appear on the system in a place where they get picked up when executing the test. I don't understand how this happens though, as we put the gnuwin tools first on path...
,
Nov 14
Okay, this is not a Clang packaging problem, this is a Chrome Infra problem.

From this tryjob: https://chromium-review.googlesource.com/c/chromium/src/+/1335581

Already on the bot_update step, git is failing with the same error message:

src/third_party/gperf (Elapsed: 0:00:47)
----------------------------------------
[0:01:08] Started.
_____ src\third_party\gperf at d892d79f64f9449770443fb06da49b5a1e5d33c1
[0:01:08] running "git cat-file -e d892d79f64f9449770443fb06da49b5a1e5d33c1^^^^{commit}" in "C:\b\c\git_cache\chromium.googlesource.com-chromium-deps-gperf"
skipping mirror update, it has rev=d892d79f64f9449770443fb06da49b5a1e5d33c1 already
________ running 'git -c core.deltaBaseCacheLimit=512m clone --no-checkout --progress --shared --verbose C:\b\c\git_cache\chromium.googlesource.com-chromium-deps-gperf C:\b\rr\tmppdtmxd\w\src\third_party\_gclient_gperf_azzpv_' in 'C:\b\rr\tmppdtmxd\w'
[0:01:10] Cloning into 'C:\b\rr\tmppdtmxd\w\src\third_party\_gclient_gperf_azzpv_'...
[0:01:25] 0 [main] sh 6324 C:\b\cipd_path_tools\usr\bin\sh.exe: *** fatal error - add_item ("\??\C:\b\cipd_path_tools", "/", ...) failed, errno 1
[0:01:25] Stack trace:
[0:01:25] Frame            Function    Args
[0:01:25] 000FFFF9BC0      0018005E0DE (0018025366A, 00180230C39, 00600010000, 000FFFF8B40)
[0:01:25] 000FFFF9BC0      001800468F9 (000FFFFABF0, 000FFFF9BC0, 00000000000, 00000000000)
[0:01:25] 000FFFF9BC0      00180046932 (000FFFF9BC0, 00000000001, 00600010000, 625C3A435C3F3F5C)
[0:01:25] 000FFFF9BC0      001800CD2CB (00000000000, 00040000024, 00000000000, 00000000000)
[0:01:25] 1D47C2EF4D07F5B  0018011BF05 (001800B463C, 00000000000, 00000000000, 00000000000)
[0:01:25] 000FFFFCCD0      00180046EF3 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:25] 00000000000      00180045A03 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:25] 000FFFFFFF0      00180045AB4 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:25] End of stack trace
[0:01:25] fatal: Could not read from remote repository.
[0:01:25]
[0:01:25] Please make sure you have the correct access rights
[0:01:25] and the repository exists.
________ running 'git -c core.deltaBaseCacheLimit=512m clone --no-checkout --progress --shared --verbose C:\b\c\git_cache\chromium.googlesource.com-chromium-deps-gperf C:\b\rr\tmppdtmxd\w\src\third_party\_gclient_gperf_azzpv_' in 'C:\b\rr\tmppdtmxd\w'
[0:01:25] Cloning into 'C:\b\rr\tmppdtmxd\w\src\third_party\_gclient_gperf_azzpv_'...
[0:01:40] 0 [main] sh 6172 C:\b\cipd_path_tools\usr\bin\sh.exe: *** fatal error - add_item ("\??\C:\b\cipd_path_tools", "/", ...) failed, errno 1
[0:01:40] Stack trace:
[0:01:40] Frame            Function    Args
[0:01:40] 000FFFF9BC0      0018005E0DE (0018025366A, 00180230C39, 00600010000, 000FFFF8B40)
[0:01:40] 000FFFF9BC0      001800468F9 (000FFFFABF0, 000FFFF9BC0, 00000000000, 00000000000)
[0:01:40] 000FFFF9BC0      00180046932 (000FFFF9BC0, 00000000001, 00600010000, 625C3A435C3F3F5C)
[0:01:40] 000FFFF9BC0      001800CD2CB (00000000000, 00040000024, 00000000000, 00000000000)
[0:01:40] 1D47C2EFE4179AF  0018011BF05 (001800B463C, 00000000000, 00000000000, 00000000000)
[0:01:40] 000FFFFCCD0      00180046EF3 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:40] 00000000000      00180045A03 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:40] 000FFFFFFF0      00180045AB4 (00000000000, 00000000000, 00000000000, 00000000000)
[0:01:40] End of stack trace
[0:01:40] fatal: Could not read from remote repository.
[0:01:40]
[0:01:40] Please make sure you have the correct access rights
[0:01:40] and the repository exists.
,
Nov 14
Bumping to P0 to get infra's attention. Did depot_tools, some cipd package, or whatever else is providing these binaries or their dlls get updated recently? Did the Windows image change, or did some Windows update run?
,
Nov 14
Issue 904922 has been merged into this issue.
,
Nov 14
Here's another add_item failure from a Windows bot completely unrelated to these two: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8929878250662536112/+/steps/bot_update/0/stdout
,
Nov 14
Please file bugs like that with g.co/bugatrooper, and keep them Untriaged. I only noticed it by accident.
These two machines build{4,6}-m4 had trouble yesterday, and were rebooted by Labs (I also couldn't SSH to them). It seems they are down again - something must be killing them hard.
+Infra>Labs - could you reboot the machines again please?
Everyone on this bug: Any idea what might be taking them down? (I have very little Windows knowledge...)
,
Nov 14
Issue 905296 has been merged into this issue.
,
Nov 14
+vhang@ for Labs triage. Thanks!
,
Nov 14
Also +Foundation-Troopers - did we change anything on the Windows buildbot slaves recently re: #c7 ?
,
Nov 14
The machines don't seem to be down completely: they are still connected to a buildbot master, and I am able to rdesktop into them and poke around. Granted, ssh doesn't seem to work for me either. Not sure if they may just be in an unstable state, but they are not totally down. Note we are tracking some general GOLO network issues related to the lavacake migration in issue 905345. Not sure if that's coming into play here or not, but something to be aware of.
,
Nov 14
Going to reboot build4-m4 now, leaving build6-m4 in its current state for comparison/investigation.
,
Nov 14
Thanks, John! Yesterday, I was going to clobber build directories, but didn't get a chance since the machines immediately started a build once rebooted. Maybe I should clobber the build before any new ones start this time and see if it helps.
,
Nov 14
ssh didn't come back up on build4-m4, even after a reboot. Looks like ssh was busted on these, likely for some time. Re-ran ssh config/setup and that seemed to fix it (reboot wasn't needed to fix that, it seems).

build4-m4 was rebooted + ssh fix applied
build6-m4 was not rebooted, just ssh fix applied

chrome-bot@build4-m4's password:
Last login: Wed Jun 6 04:48:36 2018 from ssh-dev-g.golo.chromium.org

chrome-bot@BUILD4-M4 ~

At any rate, these should be available for inspection at this point.
,
Nov 14
No problem. These seem to be very old installs. If clobber doesn't fix, we can look at other options. I guess GCE isn't an option for these?
,
Nov 14
> At any rate, these should be available for inspection at this point.

Excellent, thank you very much!

> These seem to be very old installs. If clobber doesn't fix, we can look at other options. I guess GCE isn't an option for these?

Yeah, maybe we have beefy enough GCE instances now. But note that the '"fatal error - add_item" from sh.exe, find.exe, mv.exe and others' problem isn't confined to build4-m4 and build6-m4, but seems to happen on many Windows bots. E.g. the one linked in #11 was done by build448-m1.
,
Nov 14
Thanks, John! I'll take the bug for clobber, and will ping Foundation trooper for the path problem.
,
Nov 14
+johnw back (removed by accident)
,
Nov 14
SSH'ed to build4-m4: /cygdrive/c/b/build/slave doesn't have any build directory, so I'm guessing it got clobbered already? I'll look into build6-m4.
,
Nov 14
Same story for build6-m4 - I couldn't find a build directory. Marking as Untriaged for Foundation trooper to take a look re: #c7.
,
Nov 14
Please also see #11 which shows the same error from a completely different step on a completely different bot. Starting by looking at the "package clang" step in #0 where we originally saw this might lead down a rabbit hole.
,
Nov 14
I asked iannucci@ (the current Foundation trooper) - he's not aware of any recent changes to our environment setup. However, the fact that cygwin gets in the mix is a big red flag - something is indeed not right. I'll check on #c11 - that's a good find, it's odd that we run into cygwin even on a fairly standard builder.
,
Nov 14
Another curious tidbit: https://github.com/git-for-windows/git/issues/493 Turns out, git for windows uses a forked cygwin under the hood (or at least it used it back in 2016). This may explain the "add_item" crash in #c11. Since it resulted in a green build and is likely not directly related to cygwin on the system, I'm inclined to believe that it's benign, and likely a red herring. Or, maybe git is really old on build{4,6}-m4 and it leads to fatal failures, whereas on the newer systems it reports a crash but recovers? IDK...
,
Nov 15
Realistically, this shouldn't be a trooper P0 - the bot didn't attempt another build since the morning, and it should at least temporarily be fixed. We may look into a longer term fix if it breaks again, e.g. moving it to GCE bots, and ideally migrating to LUCI stack so a Machine Provider can allocate and reboot the bots periodically as needed.
,
Nov 15
The bot is blocking a (minor) clang roll for 3 days already: https://chromium-review.googlesource.com/c/chromium/src/+/1331613
,
Nov 15
I kicked off another build on that CL, let's see what happens. I'm not sure what else to do at the moment.
,
Nov 15
> This may explain the "add_item" crash in #c11.
> Since it resulted in a green build and is likely not directly related to cygwin on the system, I'm inclined to believe that it's benign, and likely a red herring.

The reason the build succeeded is that "gclient sync" re-tries failed "git clone" operations, normally because of network problems. I wouldn't call git crashing benign.

Just looking at the most recent builds on https://ci.chromium.org/p/chromium/g/chromium.win/console, I found another one: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8929802333930120064/+/steps/bot_update/0/stdout

> Or, maybe git is really old on build{4,6}-m4 and it leads to fatal failures

As pointed out, it's not just these two machines. Also, we don't use local git; I believe git comes from depot_tools, which we auto-update? Or maybe from somewhere else these days.

That's why I'm wondering if anything changed in chrome infra recently. Did we push any new software into the build environment? It looks like the cipd client was updated in this date range: https://chromium-review.googlesource.com/c/1327823 - could that be related?

> Realistically, this shouldn't be a trooper P0 - the bot didn't attempt another build since the morning, and it should at least temporarily be fixed.

I agree it's more a P1 than P0. But this is blocking Clang updates, which is a serious problem. Also, *nothing is fixed* (except that ssh access has been restored). I'm not sure what you mean is temporarily fixed. The bot is a trybot that's run on demand as part of pushing new Clang versions. The job Max started last night failed just the same as the others: https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/421
,
Nov 15
I've been confused why we're seeing the "add_item" error in gnuwin's find, because the add_item comes from cygwin (and also msys), but gnuwin doesn't use that. Except, our gnuwin package has msys's find, rm and mv after https://codereview.chromium.org/1917853002 due to long path support :-(
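For reference, a rough way to survey which of the bundled tools are actually msys/cygwin-linked - a sketch, not part of our scripts: it just scans each binary for the runtime DLL names rather than parsing the PE import table, and the default directory is only an example:

# Report which .exe files in a tool directory link against the msys/cygwin
# runtime (and therefore can hit add_item) versus being native Windows tools.
import os
import sys

RUNTIME_DLLS = (b'msys-2.0.dll', b'cygwin1.dll')

def runtime_of(exe_path):
    # Good enough for a quick survey: the imported DLL names appear as ASCII
    # strings in the binary, so a plain substring search finds them.
    with open(exe_path, 'rb') as f:
        data = f.read()
    for dll in RUNTIME_DLLS:
        if dll in data:
            return dll.decode()
    return None

def main(tool_dir):
    for name in sorted(os.listdir(tool_dir)):
        if not name.lower().endswith('.exe'):
            continue
        dll = runtime_of(os.path.join(tool_dir, name))
        print('%-16s %s' % (name, dll or 'native (no cygwin/msys runtime)'))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else
         r'c:\src\third_party\llvm-build-tools\gnuwin')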
,
Nov 15
This is the best explanation I've been able to find for the error: https://sourceforge.net/p/msys2/mailman/message/35070454/ http://cygwin.1069669.n5.nabble.com/spinlock-h-timeout-causing-fatal-error-add-item-abort-td126660.htm
,
Nov 15
Sorry, dropped an l on the last link: http://cygwin.1069669.n5.nabble.com/spinlock-h-timeout-causing-fatal-error-add-item-abort-td126660.html
,
Nov 15
I can repro the problem on the bot now:

- rdp to build4 or build6 (probably works on others too)
- open cmd.exe, cd to \b\cipd_path_tools\usr\bin\, run "find ." - this takes 10+ s to start

If during those 10 s one runs "find ." in another cmd.exe instance, it crashes with the add_item error :-]
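The same race can be provoked without two interactive cmd.exe windows. A small sketch that starts two copies of that find.exe at (nearly) the same time - the path is the one from the bots, and whether it actually crashes depends on hitting the slow-startup window, so treat it as illustrative only:

# Launch two copies of the msys find.exe at once. When the first msys/cygwin
# process is slow to initialize (e.g. stuck on user/group lookups), the second
# one can die with the "fatal error - add_item" message on stderr.
import subprocess

FIND = r'C:\b\cipd_path_tools\usr\bin\find.exe'  # path observed on the bots

procs = [subprocess.Popen([FIND, '.'],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.PIPE)
         for _ in range(2)]

for i, p in enumerate(procs):
    _, err = p.communicate()
    print('process %d exited with %d' % (i, p.returncode))
    if err:
        print(err.decode(errors='replace'))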
,
Nov 15
Hah, someone has been here before, trying to do the thing from https://cygwin.com/faq.html#faq.using.startup-slow to cipd_path_tools:

C:\Users\chrome-bot>type \b\cipd_path_tools\etc\nsswitch.conf
# Begin /etc/nsswitch.conf
passwd: files db
group: files # db
db_enum: cache builtin
db_home: env windows cygwin desc
db_shell: env windows # cygwin desc
db_gecos: env # cygwin desc
# End /etc/nsswitch.conf

That doesn't seem to work though. But if I make it just

passwd: files
group: files

it works. Or at least "find ." works.
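If we end up baking that into a package or setup step, it could be as simple as dropping the two-line file next to the tools. A sketch - the root directory is the one from the transcript above, and generalizing it to <root>\etc\nsswitch.conf for other bundles is my assumption, so treat it as illustrative rather than the actual fix:

# Write a minimal nsswitch.conf so the msys/cygwin runtime resolves users and
# groups from local files only, instead of doing the slow Windows/AD lookups.
import os

MSYS_ROOT = r'C:\b\cipd_path_tools'  # example root; adjust per tool bundle

def write_minimal_nsswitch(root):
    etc_dir = os.path.join(root, 'etc')
    os.makedirs(etc_dir, exist_ok=True)
    conf = os.path.join(etc_dir, 'nsswitch.conf')
    with open(conf, 'w', newline='\n') as f:
        f.write('passwd: files\n')
        f.write('group: files\n')
    return conf

if __name__ == '__main__':
    print('wrote', write_minimal_nsswitch(MSYS_ROOT))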
,
Nov 15
Hm. I can include this change to nsswitch.conf in the latest git package. What's the implied effect? /me goes to read link
,
Nov 15
ah, ok, so I think I understand (slightly more) now. hans, did you have to run `mkpasswd` and `mkgroup` to update the passwd/group files too? Could you go/paste the contents of the passwd/group files on that bot?
,
Nov 15
Assigning to iannucci@ to investigate deeper or triage further. It appears to be a long standing problem that we didn't notice before.
,
Nov 15
> hans, did you have to run `mkpasswd` and `mkgroup` to update the passwd/group files too?

No. In fact, depending on which cygwin/msys instance you're looking at, there aren't even any passwd/group files. The gnuwin tools we use to build & test clang are just a dir full of binaries and dlls.

> It appears to be a long standing problem that we didn't notice before.

It might not be long standing. It went from "we've never seen this before" to "fails every time" with these two builds:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/406
https://ci.chromium.org/buildbot/tryserver.chromium.win/win_upload_clang/407

I suppose the msys/cygwin problem is longstanding, but something must have changed in our environment to trigger it. IIUC, the failure is due to cygwin taking too long - as in 15+ seconds - to get user/group info from the Active Directory service. Perhaps something changed with those servers, or the networking, or the config of the machines.

We now have a crazy workaround for the clang builds that seems to work: https://chromium-review.googlesource.com/c/chromium/src/+/1337614 (fingers crossed)

That unblocks us, but I don't recommend that as a solution to the wider problem of msys processes crashing; it would be better to figure out why they can't reach the AD server or whatever they're trying to do.
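To tell whether a given bot is currently in the slow state without waiting for a build to fail, one could simply time a cold start of one of the msys binaries. A sketch - the tool path and the 10-second threshold are assumptions for illustration, not something our scripts do:

# Time the first invocation of an msys binary. On a healthy machine it starts
# almost immediately; when the runtime is stuck on AD user/group lookups it
# takes 10+ seconds, which is the window in which a second concurrently
# started msys process dies with "add_item".
import subprocess
import time

FIND = r'C:\b\cipd_path_tools\usr\bin\find.exe'
SLOW_THRESHOLD_S = 10.0

start = time.time()
subprocess.run([FIND, '--version'], stdout=subprocess.DEVNULL,
               stderr=subprocess.DEVNULL)
elapsed = time.time() - start
print('first invocation took %.1f s' % elapsed)
if elapsed > SLOW_THRESHOLD_S:
    print('looks like the slow-startup (AD lookup) condition')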
,
Nov 15
Note: The AD communication issue may be related to the recent lavacake network migration. That's being investigated in issue 905826. Thanks.
,
Nov 15
If that's the case, then we have a cause as well as a fix; I don't think there's any reason that the git tools NEED to talk to AD for anything. I can definitely update the git bundle that we ship, but IIUC you're using a separate gnuwin tool collection as well?
,
Nov 15
> I can definitely update the git bundle that we ship, but IIUC you're using a separate gnuwin tool collection as well? Yes; the workaround linked to in comment 44 takes care of isolating that collection from the effects of (presumably) issue 905826.
,
Nov 15
re communicating to AD; we don't use AD in any way that should intersect with the msys tools on our bots; we don't do network mounts that the tools should be touching, the only accounts they need to care about are local. Anything to remove talking to the network from local file operations seems like something we really, really want to do :)
,
Nov 16
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/29a3a80a284ce7fb6fefbbeb6171945006b52e93

commit 29a3a80a284ce7fb6fefbbeb6171945006b52e93
Author: Hans Wennborg <hans@chromium.org>
Date: Fri Nov 16 14:56:41 2018

Roll clang 346388-1:346388-3

This picks up package.py changes #607265 and #608413. It does not change the version of clang.

It also includes a crazy workaround for msys binaries (in our case find.exe and mv.exe, used by lit tests) crashing during some unknown chrome infra problem (see last bug).

Bug: 870331, 905289
Change-Id: Ic1d9fa64d6fcd4b590139c9343bed5bbe4d3faa3
Reviewed-on: https://chromium-review.googlesource.com/c/1337614
Commit-Queue: Hans Wennborg <hans@chromium.org>
Reviewed-by: Max Moroz <mmoroz@chromium.org>
Reviewed-by: Reid Kleckner <rnk@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#608776}

[modify] https://crrev.com/29a3a80a284ce7fb6fefbbeb6171945006b52e93/tools/clang/scripts/package.py
[modify] https://crrev.com/29a3a80a284ce7fb6fefbbeb6171945006b52e93/tools/clang/scripts/update.py
,
Nov 16
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/83630c9e56e1b089c4981608b41f6d2cd83206b5

commit 83630c9e56e1b089c4981608b41f6d2cd83206b5
Author: Reid Kleckner <rnk@chromium.org>
Date: Fri Nov 16 23:49:20 2018

Revert "Roll clang 346388-1:346388-3"

This reverts commit 29a3a80a284ce7fb6fefbbeb6171945006b52e93.

Reason for revert: We still need to package asan for i686-android: https://crbug.com/906246

Original change's description:
> Roll clang 346388-1:346388-3
>
> This picks up package.py changes #607265 and #608413. It does not change
> the version of clang.
>
> It also includes a crazy workaround for msys binaries (in our case
> find.exe and mv.exe, used by lit tests) crashing during some unknown
> chrome infra problem (see last bug).
>
> Bug: 870331, 905289
> Change-Id: Ic1d9fa64d6fcd4b590139c9343bed5bbe4d3faa3
> Reviewed-on: https://chromium-review.googlesource.com/c/1337614
> Commit-Queue: Hans Wennborg <hans@chromium.org>
> Reviewed-by: Max Moroz <mmoroz@chromium.org>
> Reviewed-by: Reid Kleckner <rnk@chromium.org>
> Reviewed-by: Nico Weber <thakis@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#608776}

TBR=thakis@chromium.org,hans@chromium.org,rnk@chromium.org,mmoroz@chromium.org

Change-Id: I225d64bd4e166c0150d63cfa7ab0bad7e0dd92f7
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 870331, 905289, 906246
Reviewed-on: https://chromium-review.googlesource.com/c/1340836
Reviewed-by: Reid Kleckner <rnk@chromium.org>
Commit-Queue: Reid Kleckner <rnk@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609028}

[modify] https://crrev.com/83630c9e56e1b089c4981608b41f6d2cd83206b5/tools/clang/scripts/package.py
[modify] https://crrev.com/83630c9e56e1b089c4981608b41f6d2cd83206b5/tools/clang/scripts/update.py
,
Nov 20
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3e3b34828c23ab8d66267c9d9fccf79fd7324693

commit 3e3b34828c23ab8d66267c9d9fccf79fd7324693
Author: Max Moroz <mmoroz@chromium.org>
Date: Tue Nov 20 01:22:07 2018

Clang scripts: reland workaround for msys binaries by hans@ from https://crrev.com/c/1337614.

Bug: 870331, 905289
Change-Id: I4c5decd8b299d64d2b6c302bcd0049f8f9581f48
Reviewed-on: https://chromium-review.googlesource.com/c/1343258
Commit-Queue: Max Moroz <mmoroz@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609544}

[modify] https://crrev.com/3e3b34828c23ab8d66267c9d9fccf79fd7324693/tools/clang/scripts/update.py
,
Nov 26
Between Hans's workaround and the underlying issue being fixed in issue 905826, I think we can call this done. Maybe we want to revert Hans's workaround now that it's no longer strictly needed (i.e. revert https://chromium-review.googlesource.com/c/1343258); not sure.