generate_build_files flaked a few times on Builder: Android Arm64 Builder (dbg) |
||||||||||
Issue descriptionOn https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20(dbg) , generate_build_files failed once, then passed, then failed 4 times, then passed again, without any related-looking change. The failure each time was: Writing """\ ffmpeg_branding = "Chrome" goma_dir = "/b/c/cipd/goma" is_component_build = false is_debug = true proprietary_codecs = true symbol_level = 1 target_cpu = "arm64" target_os = "android" use_goma = true """ to /b/c/b/Android_Arm64_Builder__dbg_/src/out/Debug/args.gn. /b/c/b/Android_Arm64_Builder__dbg_/src/buildtools/linux64/gn gen //out/Debug --check -> returned 1 ERROR at build arg file (use "gn args <out_dir>" to edit):1:19: Build argument has no effect. ffmpeg_branding = "Chrome" ^------- The variable "ffmpeg_branding" was set as a build argument but never appeared in a declare_args() block in any buildfile. To view all possible args, run "gn args --list <builddir>" ERROR No targets. I could not find any targets to write, so I'm doing nothing. ERROR GN gen failed: 1 step returned non-z This doesn't repro locally. Failing builds: https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38833 https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38834 (this one passed) https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38835 https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38836 https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38837 https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38838 https://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/38839 passed again. cc'ing gn and ffmpeg folks. If it never happens again, then I guess we don't need to worry about it.
,
Dec 14 2016
Hmm, not sure how this could happen, ffmpeg_branding defined here: https://cs.chromium.org/chromium/src/third_party/ffmpeg/ffmpeg_options.gni Always included here: https://cs.chromium.org/chromium/src/third_party/ffmpeg/BUILD.gn Which is always included here: https://cs.chromium.org/chromium/src/media/BUILD.gn
,
Dec 15 2016
I'm increasingly suspecting that something has regressed / broken in GN. We've seen a few flaky reports of things like this that shouldn't be happening lately. I'm suspecting that my fix for bug 542846 (r433944, https://codereview.chromium.org/2509333003) might be the culprit. We could revert it without breaking anything, I think, but I'm reluctant to do so unless we get better luck at reproducing these flakes. brettw@, any thoughts? Maybe you could take another look at that CL and see if I screwed up something? Removing the ffmpeg component, I doubt this is actually an ffmpeg thing at all.
,
Dec 18 2016
,
Dec 18 2016
,
Dec 18 2016
,
Dec 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8d1f54effdc07f29a6fd00411e453ae3d67f9a86 commit 8d1f54effdc07f29a6fd00411e453ae3d67f9a86 Author: dpranke <dpranke@chromium.org> Date: Sun Dec 18 05:13:02 2016 Revert GN declare_args() change. We've seen a bunch of bug reports for flaky build failures complaining about unused or undeclared arguments; it seems possible that this CL might be the cause, so I'm speculatively reverting it and we can see if things get better; reverting the change makes GN more permissive, so this should be reasonably safe. This reverts r433944 (f6fd4d747c77e1c7968eb5af8b195a94a7ed2fac), i.e. https://codereview.chromium.org/2509333003. This CL leaves the build files changes that were in that CL because they are backwards-compatible and improvements regardless. TBR=brettw@chromium.org BUG= 674213 Review-Url: https://codereview.chromium.org/2586073002 Cr-Commit-Position: refs/heads/master@{#439366} [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/docs/reference.md [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/functions.cc [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/functions.h [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/functions_unittest.cc [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/parse_tree.cc [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/scope.cc [modify] https://crrev.com/8d1f54effdc07f29a6fd00411e453ae3d67f9a86/tools/gn/scope.h
,
Dec 19 2016
This is hopefully fixed now; please re-open if you see this again.
,
Jan 4 2017
We started seeing the same issue downstream around gn r436733 (not 100% sure here). Updating to r439377 did not help, builds still occasionally fail with "ERROR No targets" and a reference to a random build argument being unused. Usually retrying the gn run succeeds, so the issue is not easy to track. We run gn with tracing enabled and it looks like the failure happened in such a way that the log was not written at all. We're currently trying to see if rolling back to gn r436326 is enough to make the issues go away.
,
Jan 4 2017
Correction: we're trying r422996 as the "probabaly good" version.
,
Jan 4 2017
,
Jan 4 2017
Hm. If the revert didn't fix things (and I'm not super-surprised by this, since I didn't know what in that CL would've broken it), then I'm not sure what's going on. I'm punting this over to brettw@, who's probably better suited to debug this sort of thing. Brett, can you take a look at this, since it's causing flaky build failures?
,
Jan 5 2017
The issue reproduces more often for me when the machine is under load, I'm able to get a ~10% reproduction rate when running our downstream build script up till the GN point under high background load (and <1% when no load).
,
Jan 5 2017
Reproduces nearly 100% with "stress --cpu 8" running on the builder. I should have enough cycles to try and bisect this today.
,
Jan 5 2017
Bisects down to https://codereview.chromium.org/2515383005 aka "gn: Remove Scheduler::pool()." Confirmed by checking tip of master behavior: fails as-is, works after reverting that CL.
,
Jan 6 2017
Awesome work! After studying this, I understand what's going on. Lemme post a patch.
,
Jan 7 2017
What I think happened is that there's a race condition in noticing there's no work (work count == 0) and adding more work. Previously loading files (the first part of the build) didn't change the work count since they were posted to the pool. Since these never decremented the work count, we never checked for 0 when they completed. Later, writing targets would modify the work count, but there would be lots of these happening so the race condition was exceedingly rare. With the identified change, we started doing work counting for file loading. So whenever a file load finishes and there's no other work, GN will declare that the build is done and exit (resulting in confusing errors). The race condition happens when we post the declared targets back to the main thread for hooking up into the build graph. And mostly it will happen after processing the root BUILD file. There will only be one file load at this point as all other loads will be triggered by the main thread noticing unresolved dependencies. So if the file load finishes before the main thread can hook up any targets and post more loads to the work queue, we'll exit. Most of the time, the main thread wins this race because it doesn't have anything else to do and everything will be fine. But if the file load wins because there is high thread contention, the bug will trigger.
,
Jan 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/717182cd794aae4577fbae651a88e3ba6d973dc8 commit 717182cd794aae4577fbae651a88e3ba6d973dc8 Author: brettw <brettw@chromium.org> Date: Tue Jan 10 00:25:19 2017 Fix a GN race condition. The loader maintains a reference count on the "pending work". But the Setup object was starting the load before incrementing this reference count. It the loader happened to load anything on the background thread before the main thread could increment the reference count, GN would think there is no work left and exit early. There is another possible race condition between defining items and doing work on the thread pool can cause the build to be marked complete before everything is actually done. The race condition is described in more detail in the added comment in setup.cc. BUG= 674213 Review-Url: https://codereview.chromium.org/2617253002 Cr-Commit-Position: refs/heads/master@{#442416} [modify] https://crrev.com/717182cd794aae4577fbae651a88e3ba6d973dc8/tools/gn/setup.cc
,
Jan 10 2017
That CL seems to fix the issue for me. I guess we can now un-revert the speculative declare_args change that was reverted earlier.
,
Jan 15 2017
This is rolled in in https://codereview.chromium.org/2635713002 https://chromium.googlesource.com/chromium/src/+/2592262cc9dc8ea6cbe00a69a9b25b37970dae13 Cr-Commit-Position: refs/heads/master@{#443805} and hopefully fixed. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by scottmg@chromium.org
, Dec 14 2016