
Issue metadata

Status: Fixed
Owner:
Closed: Jan 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Issue 921967: linux_jumbo_rel didn't recompile ui_jumbo_1.cc when it should have

Reported by brat...@opera.com (Project Member), Jan 15

Issue description

avi reported on a mailing list certain build failures on linux_jumbo_rel that seem to be caused by missing recompilation when sources change.

One example is
https://chromium-review.googlesource.com/c/chromium/src/+/1410004/2
which first failed to build because ui_jumbo_1.cc was not recompiled, so ui_jumbo_1.o still referenced code that no longer exists [1]. The second compilation worked.

It used gce-trusty-e833d7b0-us-east1-b-5qg3 [2], and I would like to see what it had done before this, and whether that bot kept compiling incorrectly, but I don't have access to that information.

The next compilation used gce-trusty-e833d7b0-us-east1-b-w6zm and it worked.

Another example is that https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8924308573690053872/+/steps/compile/0/stdout failed because browser/browser_jumbo_15.o referenced non-existing code. browser/browser_jumbo_15.cc had not been recompiled in that build so it used an object file compiled with an earlier version of the tree.

Looking at builds in general, 120 of the last 400 linux-jumbo-rel builds have failed. Going back further, only 5-10% of builds failed. Could it be corrupt dependency databases on some bots? But why, and how? If that is the case, I'd like to see the last build before the failures started.

[1] 
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel/156092
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8924314679325563984/+/steps/compile__with_patch_/0/stdout
[2] https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc#
 

Comment 1 by brat...@opera.com, Jan 15

Cc: thakis@chromium.org
Going through builds and mapping them to bots is very slow, but I now see enough of a pattern to say that a number of bots seem "broken" (probably bad ninja dependency databases). This includes:

gce-trusty-e833d7b0-us-east1-b-5qg3 
gce-trusty-e833d7b0-us-west1-b-nxzm 
swarm1342-c4
swarm1343-c4
swarm1345-c4
swarm1360-c4 

There might be more as well. Is there anyone with access to one of them who can check when they started failing and whether something special happened right before, or possibly whether a particular code change broke them?

Comment 2 by battre@chromium.org, Jan 15

Labels: -Pri-3 Sheriff-Chromium Pri-1

Comment 3 by thakis@chromium.org, Jan 15

Broken ninja databases seem very unlikely to me given that we haven't touched ninja in a long time. It seems more likely that some deps are missing in jumbo builds.

Who owns jumbo builds?

Comment 4 by brat...@opera.com, Jan 15

If anyone owns jumbo, I do, though dpranke and martiniss created the bots and have access to the information that is hidden from me.

The thing here is that, for ninja, there is nothing special about jumbo builds. The translation units are larger and there is an extra level of indirection, so there is a higher chance of triggering timeouts and the timing might differ, but nothing is fundamentally different from any other build.

Looking at the symptoms, we can see that the bot started the compilation by invoking ninja. Ninja elected to skip recompiling some files, which later resulted in link errors. I assume that ninja knows which files to recompile by looking in .ninja_deps, so my suspicion is that on the 6 bots above, and maybe a couple more, .ninja_deps doesn't contain correct information.
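
For reference, this is roughly how someone with access to one of those bots could check its deps log (just a sketch; the out dir name and the object path here are examples):

# In the bot's build dir, dump what ninja has recorded as the dependencies of one
# of the affected objects. An empty list ("#deps 0") would explain the skipped recompile.
ninja -C out/Release -t deps obj/chrome/browser/ui/ui/ui_jumbo_1.o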

The sudden increase in errors happened 18-20 hours ago, but since I can't check what the bots did at that time, I have no real idea what might have happened and can only speculate. I'm still digging through the public data (very slowly) to see if I can find a pattern. Someone with access to bot logs might see something right away.

Comment 5 by thakis@chromium.org, Jan 15

All other bots seem happy and we haven't touched ninja in a while. Ninja generally does what it's told to do. It's possible that something corrupted .ninja_deps, but that seems somewhat unlikely to me given that this happened on several bots doing jumbo builds and nowhere else (given current knowledge of the situation).

Comment 6 by brat...@opera.com, Jan 15

If I'm moving into the realm of speculation, it could be that some patch overloaded ninja/clang/the bot and made something crash when compiled in jumbo mode, preventing proper storage of data in the ninja deps file. And that someone tried that patch ~5 times, breaking 5 bots, and then gave up. (swarm1345-c4, which I listed above, might be fine, thus 5 instead of 6; the rest still look broken as I wade through the data.)

But that is speculation. I really want to know what happened so that it doesn't happen again.

Comment 7 by brat...@opera.com, Jan 15

I found a similar problem on the "Jumbo Mac" bot but in code that does not use jumbo.

1. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 compiled fine.
2. https://chromium-review.googlesource.com/c/1377616 removed content::MockRenderProcessHost::LockToOrigin(GURL const&) in content/public/test/mock_render_process_host.h
3. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37477 fails to link because vtable for content::(anonymous namespace)::RenderWidgetHostProcess in render_widget_host_unittest.o can't find the removed function.

In step 3, render_widget_host_unittest.o was not recompiled despite containing #include "content/public/test/mock_render_process_host.h".

Since then (another 60 builds), the same error has appeared in every build. ninja just won't recompile render_widget_host_unittest.o, so my guess/speculation is that it thinks that file doesn't depend on anything.

So what made that happen? Well, a couple of hours earlier, that bot had the same problem that avi saw on linux-jumbo-rel, which resolved itself with a revert. That is an interesting coincidence, but it's not an answer. What made ninja forget that render_widget_host_unittest.o depends on content/public/test/mock_render_process_host.h? Did it forget all dependencies? 

I welcome any ideas. I'm trying to reproduce it but have had no success yet (mostly because everything is slow).

Comment 8 by thakis@chromium.org, Jan 15

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 says "use_jumbo_build = true" -- does that not have an effect on mac?

Comment 9 by no...@chromium.org, Jan 15

Components: -Infra>Platform>Buildbot Infra>Client>Chrome

Comment 10 by thakis@chromium.org, Jan 15

Nvm, I misread "in code that does not use jumbo" as something else.

Comment 11 by brat...@opera.com, Jan 15

Jumbo can still be a factor, as in: something happens in jumbo builds that doesn't happen in other builds and that breaks things on the bot. The error I reported above is really just a symptom of the bot having lost track of dependencies, but I don't know when or how.

"Jumbo Mac" did have 2 builds failing purple with Internal Failure/Infra Failure but it did complete a couple of builds after that successfully. Maybe related, maybe not. The builds logs are truncated so no idea what happened to the bots.

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37457 
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37458

Comment 12 by a...@chromium.org, Jan 15

Something feels weird.

The jumbo bots started failing last night about 8pm EST. The Mac bots started generating broken code that crashes on launch starting last night about 8pm EST.

Are they related? Both are strange compilation issues.

Comment 13 by sergeybe...@chromium.org, Jan 15

CCI trooper here.

As an example, https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc shows that the bot started to crash at 1/14/2019, 2:31:38 PM (Pacific Standard Time), followed by lots of failing builds.
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-nxzm&sort_stats=total%3Adesc shows the same crash at exactly the same time, and so do the other bots from #c1.

Maybe the theory that the bot crashed and corrupted some ninja DB has merit.

Comment 14 by thakis@chromium.org, Jan 15

Can a trooper grab build_dir/.ninja_deps and attach it here?

In theory, ninja tries to write checksums to the deps log to make it resilient against things like this, but that code could of course be buggy.

Comment 15 by sergeybe...@chromium.org, Jan 15

Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
Will do - thanks for the specific file request.

Comment 16 by sergeybe...@chromium.org, Jan 15

In the meantime, I did some investigation:

The list of all bots for the builder: https://chromium-swarm.appspot.com/botlist?c=id&c=task&c=os&c=status&d=asc&f=builder%3Alinux-jumbo-rel&f=pool%3Aluci.chromium.try&s=id

Exhibiting problem:

https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5315&sort_stats=total%3Adesc (crashed 2:27:24 PM)
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-nxzm&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm2-c4&sort_stats=total%3Adesc (crashed 2:50:24 PM, never ran a build after that)
https://chromium-swarm.appspot.com/bot?id=swarm3-c4&sort_stats=total%3Adesc (crashed twice since 1/14/2019, 2:39:11 PM, but didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm5-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm7-c4&sort_stats=total%3Adesc (also crashed but didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm10-c4&sort_stats=total%3Adesc (crashed, didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm11-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm12-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1342-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1343-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1344-c4&sort_stats=total%3Adesc (crashed, didn't run a build after)
https://chromium-swarm.appspot.com/bot?id=swarm1345-c4&sort_stats=total%3Adesc

That's 14 bots out of 31 total. All of them crashed around Jan 14 2:27 PM (PST) or later. I'm going to take this bug for further triage - maybe we can clobber the bots to revive them.

Comment 17 by sergeybe...@chromium.org, Jan 15

OK, I have the .ninja_deps file in Drive (it's 190MB): https://drive.google.com/file/d/18W_uaapTFkr6rh24ETvdyYF5m2bv7fN9/view?usp=sharing

This is from swarm1342-c4.

Comment 18 by thakis@chromium.org, Jan 15

sergeyberezin: Is that from https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 or from the Linux bot?

Comment 19 by thakis@chromium.org, Jan 15

Probably the linux bot -- can you get the file off https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 too?

Comment 20 by brat...@opera.com, Jan 15

One possibility regarding the other (Mac) problems is that they are also not recompiling all the source, but the object files are still compatible enough to link, just not compatible enough to actually run.

A 190 MB .ninja_deps file means it's not empty, at least. Looking at a failed build (not the last one) on that machine, obj/content/browser/browser/browser_jumbo_7.o was not recompiled, so it still references symbols from content/browser/site_instance_impl.h
that were removed in the CL that started this thread. There should be a reference between those two in there.

Do we need the date stamp of all the files too? In case the clock on the machines went haywire and started timestamping new files with an old date.

Comment 21 by sergeybe...@chromium.org, Jan 15

Oops, sorry - didn't realize you wouldn't have access... Here's a public link to the .ninja_deps file: https://drive.google.com/file/d/18S3HWXr353DcsYR-Za6CYBtGNIrfrYXj/view?usp=sharing

Comment 22 by a...@chromium.org, Jan 15

Cc: ellyjo...@chromium.org
A new theory from me and Elly:

In comment 12 I found it suspicious that this started happening around the time that the Mac bots started failing. And ninja messing up dependencies doesn't explain my compilation issue of "I *modified* chrome/browser/ui/views/frame/browser_view.cc to remove symbols yet the linker still found the symbols". Could ninja mess things up to the point where a modified file wasn't recompiled?

On the other hand, might this be a Goma issue? Perhaps ninja is just fine, realized that some files needed recompiling, and sent them to Goma, but Goma incorrectly returned stale precompiled object files?

Did anyone change Goma last night? This started early/normal morning Tokyo time, didn't it?

Comment 23 by sergeybe...@chromium.org, Jan 15

thakis: the file is from the swarm1342-c4 bot which runs https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel .
Most likely, it's from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel/156945 (but at this point I can't be too sure)

Comment 24 by sergeybe...@chromium.org, Jan 15

Re #c22 - note that all the crashes on these bots were happening around 2:30 PM PST (5:30 PM EST). I don't know if they are correlated with the 8 PM EST Mac problem. Regardless, both Tokyo and Australia were asleep at that time, so it's unlikely that anyone pushed a new Goma version then.

Comment 25 by thakis@chromium.org, Jan 15

> "I *modified* chrome/browser/ui/views/frame/browser_view.cc to remove symbols yet the linker still found the symbols"

Do you have more details on this? Where was this? The jumbo bots? I don't know what this refers to.


(Whatever this is, it seems pretty unlikely to me that it's due to gn/goma/clang/ninja etc., or else failures would be more widespread. My current guess is that something corrupted a bunch of files on a bunch of bots, and that just clobbering the affected bots (or their builders, for tester bots) will help.)

Comment 26 by sergeybe...@chromium.org, Jan 15

Re #c20 - I'd be very surprised if the clock is a problem, these are GCE machines... But a corrupted checkout due to the bot crash might do all sorts of things. I'm tempted to just clobber all the "bad" machines for now and be done with it. I can leave swarm1342-c4 for a bit to investigate more, but we should take it offline so the builder can move on.

Comment 27 by ellyjo...@chromium.org, Jan 15

Clobbering all the affected machines seems like it's at least worth trying, given that this (or maybe a similar issue) is keeping the tree closed right now.

Comment 28 by brat...@opera.com, Jan 15

I'd like to believe it was a one time event, but https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Linux%20x64 started failing with the same symptoms ~4 hours ago, half a day or more after the others. Though maybe the problem had been latent there since the others broke.

But looking at that .ninja_deps file, it has this section:

----
[...]
    gen/services/service_manager/public/mojom/service_manager.mojom-shared-internal.h

obj/content/browser/browser/browser_jumbo_7.o: #deps 0, deps mtime 1547504867 (STALE)

obj/extensions/browser/api/declarative_net_request/declarative_net_request/ruleset_manager.o: #deps 650, deps mtime 1547548874 (VALID)
[...]
----

So the deps list for browser_jumbo_7.o is empty, which is very wrong (it should have had thousands of entries). I also found empty deps for 3 other content/browser/browser_jumbo*.o files and for 2 content/renderer/renderer_jumbo*.o files, and nothing else.
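
For anyone else digging through that file, this is roughly how the empty entries can be listed (a sketch; it assumes the log has already been dumped with `ninja -t deps > deps.txt`):

# Print every object whose recorded dependency list is empty. Healthy entries
# have hundreds or thousands of deps, so anything showing up here is suspect.
grep '#deps 0,' deps.txt | cut -d: -f1 | sort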

Comment 29 by thakis@chromium.org, Jan 15

How did you get that output? I tried copying it to my build dir and then running `ninja -C out/gn -t deps > deps.txt`, but over here deps.txt doesn't contain the string "jumbo" at all.

Comment 30 by brat...@opera.com, Jan 15

I put it in an existing output tree after running gn gen with

use_jumbo_build = true
jumbo_file_merge_limit = 50

(and some more that I don't think matter).

It seemed to only list files that it also found in build.ninja.
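
Concretely, the procedure was roughly this (the out dir name and the location of the downloaded file are placeholders):

# Regenerate a jumbo-configured build dir, drop the bot's deps log into it,
# and dump what ninja thinks the stored dependencies are.
gn gen out/jumbo --args='use_jumbo_build=true jumbo_file_merge_limit=50'
cp ~/Downloads/.ninja_deps out/jumbo/.ninja_deps
ninja -C out/jumbo -t deps > deps.txt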

Comment 31 by most...@vewd.com, Jan 15

Is the size of browser_jumbo_7.o > 2M, in this configuration, when the build succeeds?  And when the build fails, does it contain large blocks of null bytes (perhaps starting/ending on 2M boundaries)?

Goma has special handling for files greater than this size, and FWIW I have seen (custom) goma backend bugs in this part of the code which cause the goma client to create invalid object files locally (either full of zeros or with 2M chunks of zeros; I can't remember the exact details). These invalid object files then eventually trigger a failed link command due to missing symbols. So perhaps browser_jumbo_7.o crossed that threshold, triggered a latent goma bug, and started causing this issue even though goma wasn't changed around the time these problems appeared?
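
One way to check the second question on a bad object file might be something like this (purely a sketch; the path and the 2M figure come from the guess above):

obj=obj/content/browser/browser/browser_jumbo_7.o
stat -c%s "$obj"              # is the file over the ~2M threshold?
tr -dc '\0' < "$obj" | wc -c  # how many of its bytes are NULs?
# An object that is mostly zeros, or that has whole 2M chunks of zeros, should show
# a NUL count that is a large fraction of the file size.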

Comment 32 by a...@chromium.org, Jan 15

Nico, re comment 25:

This was https://ci.chromium.org/b/8924314679325563984 (the linux-jumbo-rel trybot) on https://crrev.com/c/1410004. The failure was

ld.lld: error: undefined symbol: Browser::HasCompletedUnloadProcessing() const
>>> referenced by ui_jumbo_11.cc
>>>               ui/ui_jumbo_11.o:(BrowserView::CanClose()) in archive obj/chrome/browser/ui/libui.a

where lld was complaining it could not link a symbol (HasCompletedUnloadProcessing) that was _removed_ from the file in the CL.

Comment 33 by brat...@opera.com, Jan 15

That build didn't compile either ui_jumbo_11.cc or ui_jumbo_1.cc, so it seems that bot had already forgotten earlier what ui_jumbo_11 and ui_jumbo_1 contained.

bsep's and your patch, avi, might just have been the first one to change symbols in one of the affected object files after the deps were lost, so the real culprit is one of the patches right before it, or some external event close in time (like goma; I can't judge how likely mostynb's hypothesis is).

I'll sleep on it now. Clobbering bots seems reasonable to me, in case anyone wonders. My guess is that there are no traces left of the event anyway.

Comment 34 by sergeybe...@chromium.org, Jan 15

Clobbering the bots:

bots=(gce-trusty-e833d7b0-us-east1-b-5qg3 gce-trusty-e833d7b0-us-east1-b-5315 gce-trusty-e833d7b0-us-west1-b-nxzm swarm2-c4 swarm3-c4 swarm5-c4 swarm7-c4 swarm10-c4 swarm11-c4 swarm12-c4 swarm1342-c4 swarm1343-c4 swarm1344-c4 swarm1345-c4)

for bot in "${bots[@]}"; do
    ./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.try -d id "$bot" --named-cache builder_2e217ab2339c9327591fea465df3c104db2747030c87cbe731f33ab8030a0bd2_v2 cache/builder --tags=clobber:linux-jumbo-rel --raw-cmd -- /bin/rm -rf cache/builder
done

Tasks: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1547592720000&f=clobber-tag%3Alinux-jumbo-rel&l=50&n=true&s=created_ts%3Adesc&st=1547506320000

Comment 37 by arthursonzogni@google.com, Jan 16

Issue 922392 has been merged into this issue.

Comment 38 by pwnall@chromium.org, Jan 16

Cc: pwnall@chromium.org

Comment 39 by arthurso...@chromium.org, Jan 16

To fix issue 922392, I need to execute the same command as in comment 34 on gce-trusty-e833d7b0-us-east1-b-sfdk (android-jumbo-rel).
I am not allowed to schedule tasks. sergeyberezin@ or jbudorick@, could you please fire the clobbers?

./tools/swarming_client/swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.try -d id "gce-trusty-e833d7b0-us-east1-b-sfdk" --named-cache builder_2e217ab2339c9327591fea465df3c104db2747030c87cbe731f33ab8030a0bd2_v2 cache/builder --tags=clobber:android-jumbo-rel --raw-cmd -- /bin/rm -rf cache/builder

Comment 40 by brat...@opera.com, Jan 16

There were a couple more bots that needed clobbering but it looks like someone did that a couple of hours ago.

One scary thought here is that this kind of error could exist on a build machine for a long time without any obvious signs. I'm assuming that the crashy Mac builds and this had the same root cause, but the symptoms might be very subtle or nearly invisible.

Since we (or at least I) don't have any useful idea of where the bug is, could clobbering be made a normal/automatic follow-up to an "InfraFailure"?

Comment 41 by thakis@chromium.org, Jan 16

Comment 31 kind of explains all the symptoms we're seeing: If goma sends back an empty deps list for a cc file for some reason, then ninja will store that in its deps log and only rebuild the obj file if the cc file itself is touched.
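
A minimal sketch of what that looks like on the ninja side, and of how a build dir recovers (the out dir and object path are examples from earlier comments):

# A poisoned entry records zero header dependencies for the object:
ninja -C out/jumbo -t deps obj/content/browser/browser/browser_jumbo_7.o
#   obj/content/browser/browser/browser_jumbo_7.o: #deps 0, deps mtime ... (STALE)
# With no recorded deps, header edits never dirty the object; only touching the
# jumbo .cc itself would trigger a recompile. Removing the deps log should make
# ninja treat deps-using edges as dirty and re-record their dependencies
# (at the cost of a large rebuild), which is in effect what the clobbers did:
rm out/jumbo/.ninja_deps
ninja -C out/jumbo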

Comment 42 by thakis@chromium.org, Jan 16

Re comment 41: We don't use physical depfiles on Windows, so that kind of gels with us not seeing this on Windows bots (...right? But even if we did, the Windows mechanism could've also seen empty deps for a similar reason. But it's a different mechanism.)

Comment 43 by jbudorick@chromium.org, Jan 16

#39: per the chromium-dev thread, we're firing clobbers for everything.

#40: we'll be writing a postmortem for this, and we'll try to have a public version w/ action items. We've been discussing something along those lines.

Comment 44 by jbudorick@chromium.org, Jan 16

Fired clobbers for all linux and mac bots on luci.chromium.ci. In the middle of firing clobbers for all linux and mac bots on luci.chromium.try. Clobber tasks are visible at http://shortn/_c6WWXhKPbG

#39: android-jumbo-rel clobber was https://chromium-swarm.appspot.com/task?id=42709606b7c99d10

Comment 45 by sergeybe...@chromium.org, Jan 17

Status: Fixed (was: Assigned)
The bot is (mostly) green, closing the bug. If the problem comes back, please reopen or (better yet) file a new trooper bug at https://g.co/bugatrooper.
