linux_jumbo_rel didn't recompile ui_jumbo_1.cc when it should have
Issue description

avi reported on a mailing list certain build failures on linux_jumbo_rel that seem to be caused by missing recompilation when source changes. One example is https://chromium-review.googlesource.com/c/chromium/src/+/1410004/2, which first failed to build because ui_jumbo_1.cc was not recompiled into ui_jumbo_1.o, so the old object file still referenced code that no longer exists [1]. The second compilation worked. The failing build ran on gce-trusty-e833d7b0-us-east1-b-5qg3 [2]; I would like to see what that bot had done before this, and whether it kept compiling incorrectly, but I don't have access to that information. The next compilation used gce-trusty-e833d7b0-us-east1-b-w6zm and it worked.

Another example is https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8924308573690053872/+/steps/compile/0/stdout, which failed because browser/browser_jumbo_15.o referenced non-existing code. browser/browser_jumbo_15.cc had not been recompiled in that build, so it used an object file compiled against an earlier version of the tree.

Looking at builds in general, 120 of the last 400 linux-jumbo-rel builds have failed. Going back further, only 5-10% of the builds failed. Could this be corrupt dependency databases on some bots? But why and how? If that is the case, I'd like to see the last build before the failures started.

[1] https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel/156092 https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8924314679325563984/+/steps/compile__with_patch_/0/stdout
[2] https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc#
,
Jan 15
,
Jan 15
Broken ninja databases seem very unlikely to me given that we haven't touched ninja in a long time. It seems more likely that some deps are missing in jumbo builds. Who owns jumbo builds?
,
Jan 15
If anyone owns jumbo, I do, though dpranke and martiniss created the bots and have access to the information that is hidden from me. The thing is that, as far as ninja is concerned, there is nothing special about jumbo builds. The translation units are larger and there is an extra level of indirection, so there is a higher chance of triggering timeouts and the timing might be different, but nothing fundamentally different from any other build.

Looking at the symptoms, the bot started the compilation by invoking ninja, and ninja elected to skip recompiling some files, which later resulted in link errors. I assume that ninja knows which files to recompile by looking in .ninja_deps, so my suspicion is that on the 6 bots above, and maybe a couple more, .ninja_deps doesn't contain correct information. The sudden increase in errors happened 18-20 hours ago, but since I can't check what the bots did at that time, I have no real idea what might have happened and can only speculate. I'm still digging through the public data (very slow) to see if I can find a pattern; someone with access to bot logs might see something right away.
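For anyone who wants to poke at a bot's deps database directly: ninja has a built-in tool for dumping what it has recorded in .ninja_deps. A minimal sketch, assuming the output directory is out/Release and using a placeholder object path (both are examples, not the bots' real paths):

# Dump the recorded header dependencies for one object file.
ninja -C out/Release -t deps obj/path/to/ui_jumbo_1.o
# Or dump the whole deps log to look for suspicious entries.
ninja -C out/Release -t deps > deps.txt

If the recorded list for an object file is empty even though its source clearly includes headers, ninja will only rebuild that file when its own .cc changes, which would produce exactly the stale-object link errors above.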
,
Jan 15
All other bots seem happy and we haven't touched ninja in a while. Ninja generally does what it's told to do. It's possible that something corrupted .ninja_deps, but that seems somewhat unlikely to me given that this happened on several bots doing jumbo builds and nowhere else (given current knowledge of the situation).
,
Jan 15
Moving into the realm of speculation: it could be that some patch overloaded ninja/clang/the bot and made something crash when compiled in jumbo mode, preventing proper storage of data in the ninja deps file. And that someone tried that patch ~5 times, breaking 5 bots, and then gave up. (swarm1345-c4, which I listed above, might be fine, thus 5 instead of 6; the rest still look broken as I wade through the data.) But that is speculation. I really want to know what happened so that it doesn't happen again.
,
Jan 15
I found a similar problem on the "Jumbo Mac" bot, but in code that does not use jumbo.

1. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 compiled fine.
2. https://chromium-review.googlesource.com/c/1377616 removed content::MockRenderProcessHost::LockToOrigin(GURL const&) in content/public/test/mock_render_process_host.h.
3. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37477 fails to link because the vtable for content::(anonymous namespace)::RenderWidgetHostProcess in render_widget_host_unittest.o can't find the removed function.

In step 3, render_widget_host_unittest.o was not recompiled despite containing #include "content/public/test/mock_render_process_host.h". Since then (another 60 builds) the same error has appeared in every build. ninja just won't recompile render_widget_host_unittest.o, so my guess/speculation is that it thinks that file doesn't depend on anything.

So what made that happen? Well, a couple of hours earlier, that bot had the same problem that avi saw on linux-jumbo-rel, which resolved itself with a revert. That is an interesting coincidence, but it's not an answer. What made ninja forget that render_widget_host_unittest.o depends on content/public/test/mock_render_process_host.h? Did it forget all dependencies? I welcome any ideas. I'm trying to reproduce it but no success yet (mostly because everything is slow).
,
Jan 15
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 says "use_jumbo_build = true" -- does that not have an effect on mac?
,
Jan 15
,
Jan 15
Nvm, I misread "in code that does not use jumbo" as something else.
,
Jan 15
Jumbo can still be a factor, in the sense that something happens in jumbo builds that doesn't happen in other builds and that breaks things on the bot. The error I reported above is really just a symptom of the bot having lost track of dependencies, but I don't know when or how. "Jumbo Mac" did have 2 builds failing purple with Internal Failure/Infra Failure, but it completed a couple of builds successfully after that. Maybe related, maybe not. The build logs are truncated, so there's no telling what happened to the bots. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37457 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37458
,
Jan 15
Something feels weird. The jumbo bots started failing last night about 8pm EST. The Mac bots started generating broken code that crashes on launch starting last night about 8pm EST. Are they related? Both are strange compilation issues.
,
Jan 15
CCI trooper here. As an example, https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc shows that the bot started to crash at 1/14/2019, 2:31:38 PM (Pacific Standard Time), followed by lots of failing builds. https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-nxzm&sort_stats=total%3Adesc shows the same crash at exactly the same time, and so do the other bots from #c1. Maybe the theory that the bot crashed and corrupted some ninja DB has merit.
,
Jan 15
Can a trooper grab build_dir/.ninja_deps and attach it here? In theory, ninja tries to write checksums to the deps log to make it resilient against things like this, but that code could of course be buggy.
,
Jan 15
Will do - thanks for a specific file request.
,
Jan 15
In the meantime, I did some investigation. The list of all bots for the builder:
https://chromium-swarm.appspot.com/botlist?c=id&c=task&c=os&c=status&d=asc&f=builder%3Alinux-jumbo-rel&f=pool%3Aluci.chromium.try&s=id

Exhibiting the problem:
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5qg3&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-east1-b-5315&sort_stats=total%3Adesc (crashed 2:27:24 PM)
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-nxzm&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm2-c4&sort_stats=total%3Adesc (crashed 2:50:24 PM, never ran a build after that)
https://chromium-swarm.appspot.com/bot?id=swarm3-c4&sort_stats=total%3Adesc (crashed twice since 1/14/2019, 2:39:11 PM, but didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm5-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm7-c4&sort_stats=total%3Adesc (also crashed but didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm10-c4&sort_stats=total%3Adesc (crashed, didn't break)
https://chromium-swarm.appspot.com/bot?id=swarm11-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm12-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1342-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1343-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm1344-c4&sort_stats=total%3Adesc (crashed, didn't run a build after)
https://chromium-swarm.appspot.com/bot?id=swarm1345-c4&sort_stats=total%3Adesc

That's 14 bots out of 31 total. All of them crashed around Jan 14 2:27 PM (PST) or later. I'm going to take this bug for further triage - maybe we can clobber the bots to revive them.
,
Jan 15
OK, I have the .ninja_deps file in Drive (it's 190 MB): https://drive.google.com/file/d/18W_uaapTFkr6rh24ETvdyYF5m2bv7fN9/view?usp=sharing This is from swarm1342-c4.
,
Jan 15
sergeyberenzin: Is that from https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 or from the Linux bot?
,
Jan 15
Probably the linux bot -- can you get the file off https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac/37476 too?
,
Jan 15
One possibility regarding the other (Mac) problems is that they are also not recompiling all the source, but that the object files are still compatible enough to link, just not compatible enough to actually run. A 190 MB .ninja_deps file means it's not empty, at least. Looking at a failed build (not the last one) on that machine, obj/content/browser/browser/browser_jumbo_7.o was not recompiled, so it still references symbols from content/browser/site_instance_impl.h that were removed in the CL that started this thread. There should be a dependency recorded in .ninja_deps between those two files. Do we need the timestamps of all the files too, in case the clock on the machines went haywire and started stamping new files with an old date?
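If someone wants to rule out the clock theory, one cheap check on the bot is to compare the modification times of the removed header and the stale object file. A rough sketch, assuming a Linux bot and that the build directory is out/Release under the checkout (both paths are guesses):

stat -c '%y  %n' content/browser/site_instance_impl.h out/Release/obj/content/browser/browser/browser_jumbo_7.o

If the object file's mtime is newer than the header's, plain timestamps can't explain the missed rebuild, which points back at the deps log rather than the clock.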
,
Jan 15
Oops, sorry - didn't realize you wouldn't have access... Here's a public link to the .ninja_deps file: https://drive.google.com/file/d/18S3HWXr353DcsYR-Za6CYBtGNIrfrYXj/view?usp=sharing
,
Jan 15
A new theory from me and Elly: In comment 12 I found it suspicious that this started happening around the time that the Mac bots started failing. And ninja messing up dependencies doesn't explain my compilation issue of "I *modified* chrome/browser/ui/views/frame/browser_view.cc to remove symbols yet the linker still found the symbols". Could ninja mess things up to the point where a modified file wasn't recompiled? On the other hand, might this be a Goma issue? Perhaps ninja is just fine, realized that some files needed recompiling, and sent them to Goma, but Goma incorrectly returned stale precompiled object files? Did anyone change Goma last night? This started early/normal morning Tokyo time, didn't it?
,
Jan 15
thakis: the file is from the swarm1342-c4 bot which runs https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel . Most likely, it's from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux-jumbo-rel/156945 (but at this point I can't be too sure)
,
Jan 15
Re #c22 - note that all the crashes on this bot were happening around 2:30pm PST (5:30pm EST). I don't know if they are correlated with the 8pm EST Mac problem. Regardless, both Tokyo and Australia were asleep at that time, so it's unlikely that anyone pushed a new Goma version then.
,
Jan 15
> "I *modified* chrome/browser/ui/views/frame/browser_view.cc to remove symbols yet the linker still found the symbols" Do you have more details on this? Where was this? The jumbo bots? I don't know what this refers to. (Whatever this is, it's pretty unlikely to me that this is something due to gn/goma/clang/ninja etc, else failures would be more widespread. My current guess is that something corrupted a bunch of files on a bunch of bots, and that just clobbering the affected bots (or their builders, for tester bots) will help.)
,
Jan 15
Re #c20 - I'd be very surprised if the clock is a problem; these are GCE machines... But a corrupted checkout due to the bot crash might do all sorts of things. I'm tempted to just clobber all the "bad" machines for now and be done with it. I can leave swarm1342-c4 for a bit to investigate more, but we should take it offline so the builder can move on.
,
Jan 15
Clobbering all the affected machines seems like it's at least worth trying, given that this (or maybe a similar issue) is keeping the tree closed right now.
,
Jan 15
I'd like to believe it was a one-time event, but https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Linux%20x64 started failing with the same symptoms ~4 hours ago, half a day or more after the others. Though maybe the problem had been latent there since the others broke. Looking at that .ninja_deps file, it has this section:

----
[...]
gen/services/service_manager/public/mojom/service_manager.mojom-shared-internal.h
obj/content/browser/browser/browser_jumbo_7.o: #deps 0, deps mtime 1547504867 (STALE)
obj/extensions/browser/api/declarative_net_request/declarative_net_request/ruleset_manager.o: #deps 650, deps mtime 1547548874 (VALID)
[...]
-------

So the deps list for browser_jumbo_7.o is empty, which is very wrong (it should contain thousands of entries). I also found empty deps for 3 other content/browser/browser_jumbo*.o files and for 2 content/renderer/renderer_jumbo*.o files, and nothing else.
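Given a copy of .ninja_deps dropped into an output directory with a matching build.ninja, a quick way to look for more entries like this is to dump the whole deps log and filter for targets with zero recorded dependencies. A sketch, assuming the output directory is out/jumbo (a placeholder name):

# List every target whose recorded dependency list is empty.
ninja -C out/jumbo -t deps | grep '#deps 0,'

Any .o file that shows up here will only be rebuilt when its own source file is touched, never when a header it includes changes, which matches the symptoms in this bug.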
,
Jan 15
How did you get that output? I tried copying it to my build dir and then running `ninja -C out/gn -t deps > deps.txt`, but over here deps.txt doesn't contain the string "jumbo" at all.
,
Jan 15
I put it in an existing output tree after running gn gen with use_jumbo_build = true and jumbo_file_merge_limit = 50 (and some more args that I don't think matter). It seemed to only list files that it also found in build.ninja.
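For anyone else trying to reproduce the dump: ninja -t deps only prints entries for targets present in the current build.ninja, so the output directory has to be generated with jumbo enabled before the copied deps log shows the jumbo targets. A rough recipe - the gn args are the ones mentioned above; the directory name and file locations are made up:

gn gen out/jumbo-debug --args='use_jumbo_build=true jumbo_file_merge_limit=50'
# Overwrite the fresh deps log with the one downloaded from the bot.
cp ~/Downloads/.ninja_deps out/jumbo-debug/.ninja_deps
ninja -C out/jumbo-debug -t deps > deps.txt
grep jumbo deps.txt | head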
,
Jan 15
Is the size of browser_jumbo_7.o > 2M, in this configuration, when the build succeeds? And when the build fails, does it contain large blocks of null bytes (perhaps starting/ending on 2M boundaries)? Goma has special handling for files greater than this size, and FWIW I have seen (custom) goma backend bugs in this part of the code which cause the goma client to create invalid object files locally (either full of zeros, or 2M chunks of zeros, I can't remember the exact details). These invalid object files then eventually trigger a failed link command due to missing symbols. So perhaps browser_jumbo_7.o crossed that threshold, triggered a latent goma bug, and started causing this issue even though goma wasn't changed around the time these problems appeared?
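If someone with access to a bad bot wants to test this theory, checking the suspect object file for zero-filled regions is cheap. A sketch for a Linux bot - the out/Release prefix is a guess, the obj/ path is the one from the comments above, and 2M is assumed to mean 2*1024*1024 bytes:

obj=out/Release/obj/content/browser/browser/browser_jumbo_7.o
# Is the file above the 2 MiB threshold?
stat -c '%s bytes  %n' "$obj"
# od collapses runs of identical lines into a single '*', so a big block of zero
# bytes shows up as an all-zero line followed by '*'. Print a line of context.
od -A d -t x1 "$obj" | grep -B 1 -A 1 '^\*'

Long zero runs starting or ending near multiples of 2097152 would support the goma chunking theory; a normal-looking file would point back at the empty deps entries.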
,
Jan 15
Nico, re comment 25: This was https://ci.chromium.org/b/8924314679325563984 (the linux-jumbo-rel trybot) on https://crrev.com/c/1410004. The failure was

ld.lld: error: undefined symbol: Browser::HasCompletedUnloadProcessing() const
>>> referenced by ui_jumbo_11.cc
>>>               ui/ui_jumbo_11.o:(BrowserView::CanClose()) in archive obj/chrome/browser/ui/libui.a

where lld was complaining it could not link a symbol (HasCompletedUnloadProcessing) that was _removed_ from the file in the CL.
,
Jan 15
That build didn't compile either ui_jumbo_11.cc or ui_jumbo_1.cc, so it seems the bot had already forgotten earlier what ui_jumbo_11 and ui_jumbo_1 contained. bsep's and your patch, avi, might just have been the first one to change symbols in one of the affected object files after the deps were lost, so the real culprit is one of the patches right before it, or some external event close in time (like goma; I can't judge how likely mostynb's hypothesis is). I'll sleep on it now. Clobbering bots seems reasonable to me, in case anyone wonders. My guess is that there are no traces left of the event anyway.
,
Jan 15
Clobbering the bots:
bots=(gce-trusty-e833d7b0-us-east1-b-5qg3 gce-trusty-e833d7b0-us-east1-b-5315 gce-trusty-e833d7b0-us-west1-b-nxzm swarm2-c4 swarm3-c4 swarm5-c4 swarm7-c4 swarm10-c4 swarm11-c4 swarm12-c4 swarm1342-c4 swarm1343-c4 swarm1344-c4 swarm1345-c4)
for bot in "${bots[@]}"; do
  ./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.try -d id "$bot" --named-cache builder_2e217ab2339c9327591fea465df3c104db2747030c87cbe731f33ab8030a0bd2_v2 cache/builder --tags=clobber:linux-jumbo-rel --raw-cmd -- /bin/rm -rf cache/builder
done
Tasks: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1547592720000&f=clobber-tag%3Alinux-jumbo-rel&l=50&n=true&s=created_ts%3Adesc&st=1547506320000
,
Jan 15
These have similar problems:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Linux%20x64 (swarm1859-c4)
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Jumbo%20Mac (build225-m9)
,
Jan 16
#35: fired similar clobbers for both: https://chromium-swarm.appspot.com/task?id=426d2cb065f9ba10 https://chromium-swarm.appspot.com/task?id=426d2e3895c1d710
,
Jan 16
Issue 922392 has been merged into this issue.
,
Jan 16
,
Jan 16
To fix issue 922392, I need to execute the same command as in comment 34 on gce-trusty-e833d7b0-us-east1-b-sfdk (android-jumbo-rel), but I am not allowed to schedule tasks. sergeyberezin@ or jbudorick@, could you please fire the clobber?

./tools/swarming_client/swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.try -d id "gce-trusty-e833d7b0-us-east1-b-sfdk" --named-cache builder_2e217ab2339c9327591fea465df3c104db2747030c87cbe731f33ab8030a0bd2_v2 cache/builder --tags=clobber:android-jumbo-rel --raw-cmd -- /bin/rm -rf cache/builder
,
Jan 16
There were a couple more bots that needed clobbering, but it looks like someone did that a couple of hours ago. One scary thought here is that this kind of error could exist on a build machine for a long time without any obvious signs. I'm assuming that the crashy Mac builds and this had the same root cause, but the symptoms might be very subtle or nearly invisible. Since we don't (at least I don't) have any useful idea of where the bug is, could clobbering be made a normal/automatic followup to an "InfraFailure"?
,
Jan 16
Comment 31 kind of explains all the symptoms we're seeing: If goma sends back an empty deps list for a cc file for some reason, then ninja will store that in its deps log and only rebuild the obj file if the cc file itself is touched.
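For anyone unfamiliar with the mechanism being described here: with deps = gcc, ninja reads the depfile the compiler writes, records the header list in .ninja_deps, deletes the depfile, and from then on that recorded list alone decides whether a header change triggers a rebuild. A minimal local sketch of that behavior - nothing chromium-specific, it just assumes gcc and ninja are on PATH:

mkdir -p /tmp/depslog-demo && cd /tmp/depslog-demo
printf '#define GREETING "hi"\n' > foo.h
printf '#include "foo.h"\nconst char* greeting() { return GREETING; }\n' > foo.c
cat > build.ninja <<'EOF'
rule cc
  command = gcc -MMD -MF $out.d -c $in -o $out
  depfile = $out.d
  deps = gcc
build foo.o: cc foo.c
EOF
ninja                 # compiles foo.o and records foo.h in .ninja_deps
ninja -t deps foo.o   # shows foo.c and foo.h as the recorded deps
touch foo.h && ninja  # rebuilds foo.o because of the recorded header dep

If the recorded list were empty, as it is on the broken bots, the last step would say "no work to do" and only touching foo.c itself would trigger a rebuild - exactly the behavior seen in this bug.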
,
Jan 16
Re comment 41: We don't use physical depfiles on Windows, so that kind of gels with us not seeing this on Windows bots (...right? But even if we did, the Windows mechanism could've also seen empty deps for a similar reason. But it's a different mechanism.)
,
Jan 16
#39: per the chromium-dev thread, we're firing clobbers for everything. #40: we'll be writing a postmortem for this, and we'll try to have a public version w/ action items. We've been discussing something along those lines.
,
Jan 16
Fired clobbers for all linux and mac bots on luci.chromium.ci. In the middle of firing clobbers for all linux and mac bots on luci.chromium.try. Clobber tasks are visible at http://shortn/_c6WWXhKPbG #39: android-jumbo-rel clobber was https://chromium-swarm.appspot.com/task?id=42709606b7c99d10
,
Jan 17
The bot is (mostly) green, closing the bug. If the problem comes back, please reopen or (better yet) file a new trooper bug at https://g.co/bugatrooper.