New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 690269 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Failing ninja-confirm-no-op steps are difficult to detect and debug

Project Member Reported by dpranke@chromium.org, Feb 9 2017

Issue description

See  bug 690090  for where this was noticed.

On most builders we invoke ninja twice when we do a compile, in order to make sure that the second invocation doesn't actually do anything (in theory, the first build should make sure everything is up-to-date; if a second invocation of the same targets thinks there's more work to do, we got the dependencies wrong and there's a problem with the build, so that's bad).

It looks like we're in a situation now, at least on some configurations, where this is happening, but it's very difficult to look at the failed build and see what happened.

Example:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_re
l_ng/builds/361033

The normal way to tell what happened with a failed build is to look for red steps on the build page. Here, the only red step is `steps`, and generally we don't want people to ever have to look at the output of `steps` logs for failures (these are infra-side things).

In this case, the steps log does say, at the end:

@@STEP_LOG_LINE@reason@Failing build because ninja reported work to do.@@@
@@@STEP_LINK@logdog-->reason@https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Flinux_chromium_chromeos_rel_ng%2F361033%2F%2B%2Frecipes%2Fsteps%2FFailure_reason%2F0%2Flogs%2Freason%2F0@@@
@@@STEP_LOG_LINE@reason@          This means that after completing a compile, another was run and@@@

But you don't actually see *what* we thought was out of date.

You can then go back and look at the build log, and see that there is a *Green* build step called "failure reason" that also contains this message, and it's very confusing that this is green.

In addition, if you look through the whole steps log, it's actually somewhere between difficult and impossible to tell that it's actually the `compile with patch (confirm no-op)` step that failed, and that the `compile without patch (confirm no-op) step was okay, partially because the latter step doesn't appear to have any useful output at all :(.

The with patch log is:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/361033/steps/compile%20%28with%20patch%29%20confirm%20no-op/logs/stdio

which, at least, does show the real problem.

So, I think there are several issues here, most of which probably relate to how the recipes are being used. We need to figure out why the right steps aren't being marked as failing.

And, of course, we need to fix the build so that confirm no-op is a no-op :).


 
Cc: holte@chromium.org
Here's another link to another failing build with the same problem, but on a different builder and from a different CL:

https://build.chromium.org/p/tryserver.chromium.linux/builders/chromeos_amd64-generic_chromium_compile_only_ng/builds/278845
Cc: katthomas@chromium.org
Owner: tikuta@chromium.org
Status: Assigned (was: Available)
It doesn't look like this problem is happening on the main waterfalls, so it's unclear to me if there were just a few CLs that underspecified dependencies that we are "correctly" catching, or if this is the result of flakiness already in the committed code (and we know there's some of this) that is just manifesting here.

Ikuta-san, since this is perhaps largely fallout from your changes to the way we do compiles w/ goma, can you work on this?

The relevant code is here, I think: 

https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium/api.py?rcl=cb4c17a892538ffea98b5ad4171f06e310a25df1&l=183
Cc: martiniss@chromium.org
One obvious question: why doesn't raising self.m.step.StepFailure() result in a red step? Related question: is there a way we can actually mark the step that we want to consider as failed (the call on line 183)?
So the way that steps in recipes work is roughly this:

  * run the actual command
  * check it's exit code
    * if the exit code isn't in the `ok_ret` set:
      * change the step's presentation.status to FAILURE
      * raise StepFailure (or InfraFailure if it's an infra_step)

Non-obviously, that logic lives in the engine, right about here: https://github.com/luci/recipes-py/blob/master/recipe_engine/run.py#L403

(we should probably move this logic into the step module itself so that it's closer to user-space and the behavior is more obvious)


However, realizing that this is all that a failing step actually is, the recipe can be modified to give a better representation of what's actually happening by doing something like:

  second_compile = api.step(....)
  if not was_no_op(second_compile):
    second_compile.presentation.status = api.step.FAILURE
    second_compile.presentation.step_text = "This should have been a no-op, but it wasn't!"
    raise api.step.StepFailure("no-op compile wasn't a no-op!")


I think this used to work fine in the compile.py days. I don't understand recipes, so I can't help, sorry.
@thakis - Yes, it did work fine in the compile.py days, and I'm not expecting you to fix this, but mostly just wanted to make you aware of the current issue so you didn't also discover it and re-report it :)
The exit code of `ninja -d explain -n` is 0 even if code tree is dirty. So "compile confirm no-op" looks succeeded.

Which fix is better?
1. check the output of `ninja -d explain -n`
2. `ninja -d explain -n` returns 1 when dirty

We used to do 1; it worked.
note: Unfortunately tikuta is ooo today.
if it's easy and in hurry, I will pay his debt, though...
I don't think this has to be fixed today.
Status: Started (was: Assigned)
I fix this issue following iannucci's solution.

https://chromium-review.googlesource.com/c/440744/

Just saw this failure on a change which just deletes a bunch of files:

https://codereview.chromium.org/2694113002
https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/365646

Is this check flaky?
Not normally, no.
Second run of the bot passed, so something flaky is happening :(

Comment 16 by ukai@chromium.org, Feb 14 2017

I think it has false-positive.

if build rules miss some dependency, for example
 obj/chrome/browser/test_support_ui/web_ui_test_handler.obj doesn't have explicit dependency on gen/third_party/WebKit/public/web/window_features.mojom.h

there would be race between
 1) compiling web_ui_test_handler.obj
 2) generating window_features.mojom.h

if 1 happens before 2, it will report error like
output obj/chrome/browser/test_support_ui/web_ui_test_handler.obj older than most recent input gen/third_party/WebKit/public/web/window_features.mojom.h (508733603 vs 508733645)

if 2 happens before 1, nothing will be reported.

Comment 17 by ukai@chromium.org, Feb 14 2017

note: in incremental build, ninja knows such dependency from deps generated by previous build, so such error may not be reported.
That's a true positive, right? It means there's a missing dep, and because of that work might happen in the second build. It's not deterministic as you say, but it's due to a real bug (a missing dep) that needs fixing.
I agree w/ thakis, identifying those cases is exactly the point.

Comment 20 by ukai@chromium.org, Feb 14 2017

I also agree with thakis to fix the issue.

anyway, could ninja have a tool that reports an error when it detects generated targets are found not in explicit (direct or indirect) deps, but found only in implicit deps?
Status: Fixed (was: Started)
iannucci's solution looks worked on buildbot.
Thank you for your instruction!

Let me update status to fixed.

Sign in to add a comment