Failing ninja-confirm-no-op steps are difficult to detect and debug |
|||||
Issue descriptionSee bug 690090 for where this was noticed. On most builders we invoke ninja twice when we do a compile, in order to make sure that the second invocation doesn't actually do anything (in theory, the first build should make sure everything is up-to-date; if a second invocation of the same targets thinks there's more work to do, we got the dependencies wrong and there's a problem with the build, so that's bad). It looks like we're in a situation now, at least on some configurations, where this is happening, but it's very difficult to look at the failed build and see what happened. Example: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_re l_ng/builds/361033 The normal way to tell what happened with a failed build is to look for red steps on the build page. Here, the only red step is `steps`, and generally we don't want people to ever have to look at the output of `steps` logs for failures (these are infra-side things). In this case, the steps log does say, at the end: @@STEP_LOG_LINE@reason@Failing build because ninja reported work to do.@@@ @@@STEP_LINK@logdog-->reason@https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Flinux_chromium_chromeos_rel_ng%2F361033%2F%2B%2Frecipes%2Fsteps%2FFailure_reason%2F0%2Flogs%2Freason%2F0@@@ @@@STEP_LOG_LINE@reason@ This means that after completing a compile, another was run and@@@ But you don't actually see *what* we thought was out of date. You can then go back and look at the build log, and see that there is a *Green* build step called "failure reason" that also contains this message, and it's very confusing that this is green. In addition, if you look through the whole steps log, it's actually somewhere between difficult and impossible to tell that it's actually the `compile with patch (confirm no-op)` step that failed, and that the `compile without patch (confirm no-op) step was okay, partially because the latter step doesn't appear to have any useful output at all :(. The with patch log is: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/361033/steps/compile%20%28with%20patch%29%20confirm%20no-op/logs/stdio which, at least, does show the real problem. So, I think there are several issues here, most of which probably relate to how the recipes are being used. We need to figure out why the right steps aren't being marked as failing. And, of course, we need to fix the build so that confirm no-op is a no-op :).
,
Feb 9 2017
It doesn't look like this problem is happening on the main waterfalls, so it's unclear to me if there were just a few CLs that underspecified dependencies that we are "correctly" catching, or if this is the result of flakiness already in the committed code (and we know there's some of this) that is just manifesting here. Ikuta-san, since this is perhaps largely fallout from your changes to the way we do compiles w/ goma, can you work on this? The relevant code is here, I think: https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium/api.py?rcl=cb4c17a892538ffea98b5ad4171f06e310a25df1&l=183
,
Feb 9 2017
One obvious question: why doesn't raising self.m.step.StepFailure() result in a red step? Related question: is there a way we can actually mark the step that we want to consider as failed (the call on line 183)?
,
Feb 9 2017
So the way that steps in recipes work is roughly this:
* run the actual command
* check it's exit code
* if the exit code isn't in the `ok_ret` set:
* change the step's presentation.status to FAILURE
* raise StepFailure (or InfraFailure if it's an infra_step)
Non-obviously, that logic lives in the engine, right about here: https://github.com/luci/recipes-py/blob/master/recipe_engine/run.py#L403
(we should probably move this logic into the step module itself so that it's closer to user-space and the behavior is more obvious)
However, realizing that this is all that a failing step actually is, the recipe can be modified to give a better representation of what's actually happening by doing something like:
second_compile = api.step(....)
if not was_no_op(second_compile):
second_compile.presentation.status = api.step.FAILURE
second_compile.presentation.step_text = "This should have been a no-op, but it wasn't!"
raise api.step.StepFailure("no-op compile wasn't a no-op!")
,
Feb 9 2017
I think this used to work fine in the compile.py days. I don't understand recipes, so I can't help, sorry.
,
Feb 9 2017
@thakis - Yes, it did work fine in the compile.py days, and I'm not expecting you to fix this, but mostly just wanted to make you aware of the current issue so you didn't also discover it and re-report it :)
,
Feb 9 2017
The exit code of `ninja -d explain -n` is 0 even if code tree is dirty. So "compile confirm no-op" looks succeeded. Which fix is better? 1. check the output of `ninja -d explain -n` 2. `ninja -d explain -n` returns 1 when dirty
,
Feb 9 2017
We used to do 1; it worked.
,
Feb 9 2017
note: Unfortunately tikuta is ooo today. if it's easy and in hurry, I will pay his debt, though...
,
Feb 9 2017
I don't think this has to be fixed today.
,
Feb 10 2017
I fix this issue following iannucci's solution. https://chromium-review.googlesource.com/c/440744/
,
Feb 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/0f07e03ad937f9e01a1c9f19300623c470b48ab2 commit 0f07e03ad937f9e01a1c9f19300623c470b48ab2 Author: Takuto Ikuta <tikuta@google.com> Date: Fri Feb 10 02:39:18 2017 Make sure to fail if build dependency is broken BUG= 690269 Change-Id: Id13745b85a981fa54bc042c6eb85a757a84b755a Reviewed-on: https://chromium-review.googlesource.com/440744 Commit-Queue: Takuto Ikuta <tikuta@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Reviewed-by: Robbie Iannucci <iannucci@chromium.org> [modify] https://crrev.com/0f07e03ad937f9e01a1c9f19300623c470b48ab2/scripts/slave/recipe_modules/chromium/example.py [add] https://crrev.com/0f07e03ad937f9e01a1c9f19300623c470b48ab2/scripts/slave/recipe_modules/chromium/example.expected/basic_out_dir_ninja_no_op_failure.json [modify] https://crrev.com/0f07e03ad937f9e01a1c9f19300623c470b48ab2/scripts/slave/recipe_modules/chromium/api.py
,
Feb 14 2017
Just saw this failure on a change which just deletes a bunch of files: https://codereview.chromium.org/2694113002 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/365646 Is this check flaky?
,
Feb 14 2017
Not normally, no.
,
Feb 14 2017
Second run of the bot passed, so something flaky is happening :(
,
Feb 14 2017
I think it has false-positive. if build rules miss some dependency, for example obj/chrome/browser/test_support_ui/web_ui_test_handler.obj doesn't have explicit dependency on gen/third_party/WebKit/public/web/window_features.mojom.h there would be race between 1) compiling web_ui_test_handler.obj 2) generating window_features.mojom.h if 1 happens before 2, it will report error like output obj/chrome/browser/test_support_ui/web_ui_test_handler.obj older than most recent input gen/third_party/WebKit/public/web/window_features.mojom.h (508733603 vs 508733645) if 2 happens before 1, nothing will be reported.
,
Feb 14 2017
note: in incremental build, ninja knows such dependency from deps generated by previous build, so such error may not be reported.
,
Feb 14 2017
That's a true positive, right? It means there's a missing dep, and because of that work might happen in the second build. It's not deterministic as you say, but it's due to a real bug (a missing dep) that needs fixing.
,
Feb 14 2017
I agree w/ thakis, identifying those cases is exactly the point.
,
Feb 14 2017
I also agree with thakis to fix the issue. anyway, could ninja have a tool that reports an error when it detects generated targets are found not in explicit (direct or indirect) deps, but found only in implicit deps?
,
Feb 14 2017
iannucci's solution looks worked on buildbot. Thank you for your instruction! Let me update status to fixed. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by dpranke@chromium.org
, Feb 9 2017