False rejects and widespread test failures on macOS.
Issue description

Assigning this bug to haraken for further triage/investigation.

Symptoms: Our test binaries are non-deterministically experiencing widespread renderer crashes. The non-determinism happens at compile time: either all test binaries [content_browsertests, webkit_unit_tests, browser_tests, etc.] crash with the same error, or they all pass with no error. The problematic binaries also crash when run on a local device.

The most likely explanation is that there is at least one translation unit in blink with undefined behavior. Sometimes this translation unit compiles into functional code, and all test binaries linked with this TU work. Sometimes it compiles into non-functional code, and all test binaries linked with this TU crash.

This error has recently spiked, causing massive test failures on the public waterfall and false rejects on the CQ. However, I observed the exact same problem a month ago: https://bugs.chromium.org/p/chromium/issues/detail?id=915285#c1. This suggests the root problem has been around for at least a month.

=================== Analysis ===================

This build experienced the same crash across many binaries [content_browsertests, webkit_unit_tests, GPU tests, etc.]: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/225418

"""
Received signal 11 <unknown> 000000000000
 0 browser_tests 0x000000010809231f base::debug::StackTrace::StackTrace(unsigned long) + 31
 1 browser_tests 0x0000000108092171 base::debug::(anonymous namespace)::StackDumpSignalHandler(int, __siginfo*, void*) + 2385
 2 libsystem_platform.dylib 0x00007fff73794f5a _sigtramp + 26
 3 ??? 0x0000000000000000 0x0 + 0
 4 browser_tests 0x000000010cd6f9c9 blink::WorkletAnimationController::~WorkletAnimationController() + 41
 5 browser_tests 0x00000001077bb450 blink::NormalPage::Sweep() + 544
 6 browser_tests 0x00000001077b6651 blink::BaseArena::LazySweepWithDeadline(base::TimeTicks) + 385
 7 browser_tests 0x00000001077ab40a blink::ThreadHeap::AdvanceLazySweep(base::TimeTicks) + 74
 8 browser_tests 0x00000001077c6e0b blink::ThreadState::PerformIdleLazySweep(base::TimeTicks) + 491
...
"""

However, previous and subsequent builds had no problems at all. See the mac_chromium_rel_ng builds on https://chromium-review.googlesource.com/c/chromium/src/+/1388054/2.

I can repro the crash locally:

"""
python ~/projects/chromium/src/tools/swarming_client/isolateserver.py download -I https://isolateserver.appspot.com --namespace default-gzip -s f88886e32e6db7958024477e401e7000beb9d54c --target foo
cd /Users/erikchen/temp/foo/out/Release
../../testing/test_env.py ./browser_tests --test-launcher-bot-mode --asan=0 --msan=0 --tsan=0 --cfi-diag=0 --gtest_filter=SecureOriginWhitelistBrowsertest/SecureOriginWhitelistBrowsertest.SecurityIndicators/0
"""

The 10.13 public waterfall bot has also been failing across many test suites [with a different error] since build 8912: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.13%20Tests/8912

None of the landed patches look immediately relevant. There are other CLs on the CQ experiencing the same symptoms: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng?cursor=Ci0SJ2oQc35jci1idWlsZGJ1Y2tldHITCxIFQnVpbGQY8JC37NTT2ex7DBgAIAA%3D&limit=200 e.g.
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/225418
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/225421

We've seen similar symptoms a month ago, although the exact crash was different: https://bugs.chromium.org/p/chromium/issues/detail?id=915285#c1
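One way to sanity-check the "sometimes compiles into non-functional code" theory is to rebuild a suspect TU twice from identical sources and compare the outputs. A minimal sketch, assuming a GN/ninja out/Release dir; the object and source paths are placeholders, not taken from this bug:

"""
# Hypothetical sketch; obj/path/to/suspect.o and ../../path/to/suspect.cc
# are placeholders for whichever blink TU is under suspicion.
cd out/Release
ninja obj/path/to/suspect.o
cp obj/path/to/suspect.o /tmp/suspect.1.o
touch ../../path/to/suspect.cc        # force ninja to recompile the same source
ninja obj/path/to/suspect.o
cp obj/path/to/suspect.o /tmp/suspect.2.o
cmp /tmp/suspect.1.o /tmp/suspect.2.o && echo "deterministic" || echo "objects differ"
"""

If the two objects differ, compilation itself is nondeterministic; if they are identical, the variation is more likely introduced later, e.g. at link time.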
Jan 15
The public waterfall problem in 922025 found and reverted a culprit: https://bugs.chromium.org/p/chromium/issues/detail?id=922025#c5 I went through the history of the CQ failures, and the failures do start slightly after the culprit CL landed. This suggests that the failures were likely caused by the culprit CL. That being said, the symptoms on the CQ appear so different from the public waterfall problem that I wonder if we're seeing two different problems [especially since we saw this particular problem a month ago].
Jan 15
The GPU CL was reverted in bug 922025, but that only made the GPU-related ASan failures go away; we're now seeing a different crash (in ~GraphicsContext while painting).
Jan 15
I could not reproduce the problem locally, but I got these error messages: https://bugs.chromium.org/p/chromium/issues/detail?id=289453
Jan 15
Can we grab the build dir from the bot? Most of the build is pretty deterministic by now. The one thing we know isn't is ld64. If we have some ODR violation somewhere and ld64 parallelism makes it nondeterministic which version is pulled in, that might result in what we're seeing.
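To make that concrete, here is a minimal standalone sketch of an ODR violation (illustrative file names and a plain clang++ invocation, nothing from the Chromium build): two TUs define the same inline function differently, each emits a weak copy, and whichever copy the linker happens to keep wins.

"""
# Hypothetical repro of the ODR idea, not Chromium code.
cat > odr_a.cc <<'EOF'
inline int limit() { return 10; }
int limit_from_a() { return limit(); }
EOF
cat > odr_b.cc <<'EOF'
#include <cstdio>
inline int limit() { return 20; }   // conflicting definition: ODR violation
int limit_from_a();
int main() { std::printf("%d %d\n", limit_from_a(), limit()); }
EOF
clang++ -O0 -c odr_a.cc odr_b.cc
# The linker keeps exactly one weak copy of limit(); which one can depend
# on the order (or, with a parallel linker, the timing) in which inputs
# are processed, so the two binaries below may disagree:
clang++ odr_a.o odr_b.o -o odr_1 && ./odr_1   # often prints "10 10"
clang++ odr_b.o odr_a.o -o odr_2 && ./odr_2   # often prints "20 20"
"""

If blink had something like this with, say, two class definitions of different layout, a destructor such as ~WorkletAnimationController could plausibly read the wrong offsets only in the builds where the "other" copy was kept.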
Jan 15
CCI trooper here - what do you want to pull from which bot? Let me know and I'll see what I can do.
Jan 15
From the chat - I'm shutting down https://chromium-swarm.appspot.com/bot?id=vm73-m9&sort_stats=total%3Adesc which will stop https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.13%20Tests temporarily. The bot will finish https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.13%20Tests/8931 and then terminate. I'll grab the resulting build dir and restart the bot.
Jan 15
The bot finished the build and terminated. I'll be downloading the files shortly.
Jan 15
The build dir is 28GB. That'll take a while...
Jan 15
Created 11GB tarball. Downloading it from Golo is another ~30min... I'll probably then upload it to an internal Google Drive or something.
Jan 15
Tree closures should be P-0, no?
Jan 15
This reminds me of a previous issue (https://bugs.chromium.org/p/chromium/issues/detail?id=900405#c13) that dpranke@, tbansal@, horo@, thakis@, and I did a bunch of debugging on. The only conclusion we came to at the time was that there were stale .o files that ninja was failing to rebuild, even though their dependencies had changed. Unfortunately, we never figured out the root cause, and we 'resolved' the issue by clobbering the affected bots.
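For reference, one way to probe that "stale .o" theory directly on a build dir is to ask ninja itself; this is a generic sketch (the target name is just an example), not something we ran on the bots:

"""
# Hypothetical sketch; run from the build directory on the affected bot.
cd out/Release
# Dry-run the suspect target with explanations: ninja logs why it does or
# does not consider each edge dirty, which shows whether it even notices
# the changed dependencies.
ninja -d explain -n browser_tests 2>&1 | head -n 50
"""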
Jan 15
Clobbering Mac Builder:

./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.ci -d id vm74-m9 --named-cache builder_ea778128bf936f6571437fe2d8c833f98214957b2868d01f3dd50f6fcbe3d309_v2 cache/builder --raw-cmd -- /bin/rm -rf cache/builder

Task: https://chromium-swarm.appspot.com/user/task/426c6fb59137ef10
Jan 15
Mac Builder completed its build: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20Builder/94314 Mac testers are starting, e.g.: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac10.13%20Tests/8933
Jan 16
Clobbering "Mac FYI GPU ASAN Release":

./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.ci -d id build91-m9 --named-cache builder_cab593cb012d77b5769967b178802a9590f1f6cab32db18f3fd9b34757dcee62_v2 cache/builder --raw-cmd -- /bin/rm -rf cache/builder

https://chromium-swarm.appspot.com/user/task/426ccc1691d75d10
Jan 16
Clobbering "Mac Asan 64 Builder" (2 bots):

./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.ci -d id vm259-m9 --tags clobber:mac-asan-64-builder --named-cache builder_f6e6bada9972e10bc0841fcf0461958d34fecfc7e713a2f894c66d00b195e6ab_v2 cache/builder --raw-cmd -- /bin/rm -rf cache/builder

./swarming.py trigger -S chromium-swarm.appspot.com -d pool luci.chromium.ci -d id vm262-m9 --tags clobber:mac-asan-64-builder --named-cache builder_f6e6bada9972e10bc0841fcf0461958d34fecfc7e713a2f894c66d00b195e6ab_v2 cache/builder --raw-cmd -- /bin/rm -rf cache/builder

Triggered task: sergeyberezin@google.com/id=vm262-m9_pool=luci.chromium.ci

Tasks: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1547597640000&f=clobber-tag%3Amac-asan-64-builder&l=50&n=true&s=created_ts%3Adesc&st=1547511240000
Jan 16
For posterity, the theory was that infra failures on the Builder bot last night left partial, stale, or corrupted .o files, or messed up the ninja dependencies in a way that kept them from being rebuilt. The Tester bots just kept picking up those broken builds and crashing with them. Elly noted: "yesterday night, two back-to-back 'infra failure's on this bot, where the build output simply stops partway through; first complete build after that: 2019-01-14 7:28 PM (EST); first dead mac10.13 tests run: 2019-01-14 8:10 PM (EST)." Similarly, the "Mac ASan 64 Tests" and "Mac FYI GPU ASAN Release" bots had infra failures yesterday, just prior to going flaky and then solid red, so they've been clobbered as well.
Jan 16
I was a sheriff at the time. Yup, there was a massive LUCI outage right before these failures started. It would be good to write a postmortem.
Jan 16
If we do write a postmortem, I've saved the Hangouts Chat log here: https://docs.google.com/document/d/1CdJ_8RXaWVn2oKTc0TK1JQzI7p3gatuKd4aMh05eWvc/edit?usp=sharing (we didn't turn history on, so it would otherwise have been purged after 24 hours)
Jan 16
In the "jumbo builds didn't build correctly" bug 921967, we found 6 empty dependency slots in ninja's dependency database on the bot we investigated. The object files didn't even depend on their .cc files, and nothing could make those object files rebuild. It matches the theory you have, but we have no idea how it happened.
Jan 17
FTR, if still needed, a copy of the build dir from #c10 is here: https://drive.google.com/file/d/13GUYGUbjSluvNIgNpM5MoUD_n7EP6GVu/view?usp=sharing (I'll likely delete it from my drive in a few weeks, so please make a copy if you need it).