Protect the Autotest lab from Chrome core files
Reported by
jrbarnette@chromium.org,
Apr 25 2017
Issue description
We've got evidence of ongoing chrome crashes in R59 Chrome OS
release builds. Here's one example:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=113937947
Although the test didn't fail, the results captured multiple
chrome crashes.
Based on aggregate data, we believe that the crashes are probably
affecting many/most boards, and probably more than just cheets
tests.
,
Apr 25 2017
Albert, can someone from your team look into symbolizing these crashes and hopefully getting to the root cause? These crashes took out the lab last week and we thought this cl fixed it: https://codereview.chromium.org/2833243002 But we're seeing the crashes again on M59.
,
Apr 25 2017
,
Apr 25 2017
Although there's no solid evidence, my best guess is that this problem is also happening on ToT.
,
Apr 25 2017
,
Apr 25 2017
Seeing this on celes as well:
https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=test_detail_view&object_id=470613871
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113924605-chromeos-test/chromeos2-row6-rack10-host13/cheets_CTSHelper.stress/sysinfo/iteration.1/var/spool/crash/?authuser=1
,
Apr 25 2017
,
Apr 25 2017
Investigating.
,
Apr 25 2017
I keep trying to update this issue and keep getting interrupted. I started taking a look as well.
,
Apr 25 2017
Also, it's worth mentioning: Although tests reporting the crashes are generally cheets tests, that's likely an artifact of the way Autotest is collecting the crashes. It's very likely that the crashes are _not_ caused by cheets tests; they're caused by other, non-cheets tests that ran before.
,
Apr 25 2017
Symbolizing is a bit tedious and I may have gotten a bad trace, as there is lots of gdb unhappiness. But the crash may be in
https://cs.chromium.org/chromium/src/content/zygote/zygote_main_linux.cc?q=zygote_main_linux.cc+package:%5Echromium$&dr&l=620
#0 0x000059dd2b90a6ae in content::ZygoteMain(content::MainFunctionParams const&, std::vector<std::unique_ptr<content::ZygoteForkDelegate, std::default_delete<content::ZygoteForkDelegate> >, std::allocator<std::unique_ptr<content::ZygoteForkDelegate, std::default_delete<content::ZygoteForkDelegate> > > >) () at ../../../../../../../home/chrome-bot/chrome_root/src/content/zygote/zygote_main_linux.cc:620
#1 0x000059dd2cc001e1 in content::RunZygote(content::MainFunctionParams const&, content::ContentMainDelegate*) () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:337
#2 0x000059dd2cc01168 in content::ContentMainRunnerImpl::Run() () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:740
#3 0x000059dd2e15afab in service_manager::Main(service_manager::MainParams const&) () at ../../../../../../../home/chrome-bot/chrome_root/src/services/service_manager/embedder/main.cc:179
#4 0x000059dd2cc00112 in content::ContentMain(content::ContentMainParams const&) () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main.cc:19
#5 0x000059dd2b4894e4 in ChromeMain () at ../../../../../../../home/chrome-bot/chrome_root/src/chrome/app/chrome_main.cc:123
I will check more dumps.
,
Apr 26 2017
I haven't been able to get a trace, but I did notice this pointer on the stack:
blink::InputMethodController::SetComposition(WTF::String const&, WTF::Vector<blink::CompositionUnderline, 0ul, WTF::PartitionAllocator> const&, int, int)
And this edit to the file 30 hours ago merged into 3071:
https://chromium.googlesource.com/chromium/src/+/60cfc544d6a46df2a54cb222f12fdf12a48d73f0
Is... is that anything? :-\
,
Apr 26 2017
I am having problems with other dumps. Why don't you do a speculative revert?
,
Apr 26 2017
,
Apr 26 2017
Actually celes looks the same (I made some mistakes as I kept getting distracted):
#0 0x00005b3a4f1e96ae in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/content/zygote/zygote_main_linux.cc:620
#1 0x00005b3a504df1d1 in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:337
#2 0x00005b3a504e0158 in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:740
#3 0x00005b3a51a39f9b in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/services/service_manager/embedder/main.cc:179
#4 0x00005b3a504df102 in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main.cc:19
#5 0x00005b3a4ed684f4 in ?? () at ../../../../../../../home/chrome-bot/chrome_root/src/chrome/app/chrome_main.cc:123
#6 0x00007988a8f64816 in __libc_start_main (main=warning: Could not find DWO CU obj/chrome/chrome_initial/chrome_exe_main_aura.dwo(0x383488f00bec892f) referenced by CU at offset 0x0 [in module /mnt/host/source/src/scripts/tmp_celes/debug/opt/google/chrome/chrome.debug]
,
Apr 26 2017
I haven't seen that Blink function in any other dump, so it's likely not related. None of the dumps I've gotten have common stack pointers -- I must be doing something wrong, but I'm following the breakpad instructions [1]. The symbol file and the minidump_stackwalk output hashes match, but minidump_stackwalk also sees a second chrome instance with a hash of 0 that it complains about not having symbols for.
[1] http://www.chromium.org/developers/decoding-crash-dumps
The //content files in your stack trace haven't been touched recently from 59.0.3071.25. I've also manually looked through the commit log between 59.0.3071.25 and the fix for issue 713968 but saw nothing particularly suspicious.
I've started running autotests locally on 9460.11.0 with samus-cheets but no crashes yet.
,
Apr 26 2017
Try running cheets_CTSHelper, both crashes listed were from that.
,
Apr 26 2017
#17: Doing that, seems to take quite a while to run all the stress tests.
Unrelatedly: not sure what I'm doing differently but I'm now able to get a stack trace from the core file in gdb. Shows the same as what ihf@ posted:
#0 0x00005b3a4f1e96ae in content::ZygoteMain(content::MainFunctionParams const&, std::vector<std::unique_ptr<content::ZygoteForkDelegate, std::default_delete<content::ZygoteForkDelegate> >, std::allocator<std::unique_ptr<content::ZygoteForkDelegate, std::default_delete<content::ZygoteForkDelegate> > > >) () at ../../../../../../../home/chrome-bot/chrome_root/src/content/zygote/zygote_main_linux.cc:620
#1 0x00005b3a504df1d1 in content::RunZygote(content::MainFunctionParams const&, content::ContentMainDelegate*) ()
at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:337
#2 0x00005b3a504e0158 in content::ContentMainRunnerImpl::Run() ()
at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main_runner.cc:740
#3 0x00005b3a51a39f9b in service_manager::Main(service_manager::MainParams const&) ()
at ../../../../../../../home/chrome-bot/chrome_root/src/services/service_manager/embedder/main.cc:179
#4 0x00005b3a504df102 in content::ContentMain(content::ContentMainParams const&) ()
at ../../../../../../../home/chrome-bot/chrome_root/src/content/app/content_main.cc:19
#5 0x00005b3a4ed684f4 in ChromeMain () at ../../../../../../../home/chrome-bot/chrome_root/src/chrome/app/chrome_main.cc:123
#6 0x00007988a8f64816 in ?? ()
#7 0x00007fffa10c4780 in ?? ()
#8 0x00007fffa10c4828 in ?? ()
#9 0x00000009a10c4780 in ?? ()
#10 0x00005b3a4ed68410 in frame_dummy ()
#11 0x0000000000000000 in ?? ()
,
Apr 26 2017
Adding rockot@ in case any of the ServiceManager and content changes could be surfacing these crashes. Wondering if the crash is really happening there or if this backtrace is just a byproduct of how the process is created.
,
Apr 26 2017
jrbarnette: could you update us with the scope of this bug, now that the fix for issue 713968 is verified? Are we still looking at a large number of crashes with immediate potential for network overload?
,
Apr 26 2017
This is a duplicate of https://bugs.chromium.org/p/chromium/issues/detail?id=692227
,
Apr 26 2017
,
Apr 26 2017
Issue 692227 has been merged into this issue.
,
Apr 26 2017
This crash has been on all builds in the history of wmatrix, including M56. Not exactly a recent regression.
,
Apr 26 2017
Reproducibility varies. I would say that in about 40% of all cheets_CTSHelper runs, a 41MB core file is generated. Notice this is much less than the other crash, which was 3 cores of 300MB each. This is one of the reasons it stayed under the radar so long.
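To put those numbers side by side, here is a back-of-the-envelope calculation. It takes the figures quoted in this comment at face value (the 40% rate is an eyeball estimate, and the sizes are treated as exact, which is an assumption); the function name is made up for illustration.

```python
# Figures quoted in this thread; treating them as exact is an assumption.
ZYGOTE_CORE_MB = 41          # size of the single zygote core file
ZYGOTE_CORE_RATE = 0.40      # rough share of cheets_CTSHelper runs producing it
OTHER_CORES_PER_CRASH = 3    # the other crash dumped three cores...
OTHER_CORE_MB = 300          # ...of about 300MB each


def expected_mb_per_run(rate, size_mb, count=1):
    """Average core-file volume per test run, in MB (hypothetical helper)."""
    return rate * size_mb * count


zygote_avg_mb = expected_mb_per_run(ZYGOTE_CORE_RATE, ZYGOTE_CORE_MB)
other_per_crash_mb = OTHER_CORES_PER_CRASH * OTHER_CORE_MB

print("zygote crash: ~%.1f MB per run on average" % zygote_avg_mb)
print("other crash:  %d MB per occurrence" % other_per_crash_mb)
print("size ratio per occurrence: ~%.0fx" % (other_per_crash_mb / ZYGOTE_CORE_MB))
```

The thread does not say how often the other crash occurred, so only the per-occurrence sizes are directly comparable.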
,
Apr 26 2017
,
Apr 26 2017
When I was seeing issue 692227 it seemed to be related to shutdown. It seemed like the main process was exiting and around that time zygote crashed. I thought it was mustash-only (chrome --mash) and it seemed to go away so I stopped pursuing it.
Theories at the time:
* session_manager trying to respawn chrome (and hence conflicting with an existing zygote somehow?)
* Shutdown accidentally triggering an attempt to spawn another process (like some code trying to connect to a mojo service during shutdown)
,
Apr 26 2017
> Try running cheets_CTSHelper, both crashes listed were from that.
It's worth saying again: I don't believe any of the crashes were caused by the cheets tests. The cheets tests are merely the only tests to be gathering the crashes in test results.
,
Apr 26 2017
> When I was seeing issue 692227 it seemed to be related
> to shutdown. It seemed like the main process was exiting
> and around that time zygote crashed. I thought it was
> mustash-only (chrome --mash) and it seemed to go away so
> I stopped pursuing it.
Ah. One way in which the test environment is different from
the user environment is that we kill and restart Chrome in
between every test. If there's a bug in Chrome occurring
during shutdown, it would explain a number of the observed
behaviors:
* We're not seeing test failures because the chrome restarts
happen outside of the tests.
* Tests were generally gathering crashes as part of their
results, so they still consumed bandwidth.
* The problem appears to be ubiquitous, not board related,
not cheets related.
If all the theories thus far hold up, it would mean that this
problem has a critical lab impact, but _possibly_ a limited
user impact.
As for the obvious question "why now?", my best guess would be
that some unrelated change made this bug much more frequent.
It's quite likely that the triggering change is innocent, and
that we should just go after the root cause.
,
Apr 26 2017
I'd propose using this bug to track the infra problems arising from the crash, and use issue 692227 for investigating the cause of the crash since it does have the same symptoms and stack trace. Is there a way to safeguard the lab while we work out the crash? I don't think we should expect to find an immediate resolution :-\
,
Apr 26 2017
> I'd propose using this bug to track the infra problems
> arising from the crash, and use issue 692227 for
> investigating the cause of the crash since it does have
> the same symptoms and stack trace.
Good enough.
> Is there a way to safeguard the lab while we work out
> the crash? I don't think we should expect to find an
> immediate resolution :-\
Maybe...
ATM, I think the lab is muddling through all right, but if
this problem is still happening on ToT (the current theory
says it should be), then we're only protected by a code
change that's blocking the gathering of all chrome crash
information (except that this bug shows that it's not stopping
gathering during cheets tests on the R59 builders, <sigh>).
That's not sustainable. This change is meant to produce a
more sustainable solution:
https://chromium-review.googlesource.com/#/c/486968/
That change needs to be tested, but if it works the way
I think it will, the lab should be protected in a more
permanent fashion.
,
Apr 26 2017
,
Apr 26 2017
,
Apr 26 2017
I'm wondering why the Chrome OS test team don't see any crashes during manual testing. They kill and restart Chrome between tests too.
,
Apr 26 2017
As we had the same problem with telemetry creating lots of PNG files before, focusing on core files only is too narrow. Infra needs a general way to sanitize results/sizes, not play a game of whack-a-mole.
Let me quote "Comment 122 by jrbarnette@chromium.org, Sep 3 2015"
https://bugs.chromium.org/p/chromium/issues/detail?id=524814#c122
The essence of the summary:
* Bandwidth increased because test result sizes increased.
* Test result sizes increased because of large numbers of screen shot files in the results.
* The screenshots were taken because of telemetry timing out trying to log in to chrome.
* We don't know exactly why login is failing, except that it's a bug in chrome.
We're going to disable the screenshot-taking code in telemetry, then unpin chrome and let 'er rip.
,
Apr 26 2017
Before we switch to issue 692227 - Does anyone have examples of non-cheets tests that triggered the zygote crash? And which boards they seem to happen on? If we have a simple test that repros locally we can bisect.
,
Apr 26 2017
... This change is specifically only about core files. Anything that causes problems beyond core files is a different bug.
,
Apr 26 2017
Fixing the problem only for core files means that the problem will happen again in the future, in another form. That doesn't sound like sustainable practices to me.
,
Apr 26 2017
> Before we switch to issue 692227 - Does anyone have
> examples of non-cheets tests that triggered the zygote
> crash? And which boards they seem to happen on?
I think I can't say this enough: These core files likely aren't coming from cheets tests. I believe that they're being gathered by cheets tests, because those are the only tests that still gather core files. We shouldn't assume that these crashes are in any way related to cheets. In fact, the evidence is that cheets is a bad bet, because the network volume is too large for it to be only cheets.
As for finding more examples: It's hard. Currently, what I do is to pick a shard at random, log in, and look for core files in recent tests that haven't yet been offloaded. If we need more examples, I can do some more of that.
We don't have a reliable way of finding these core files by searching the AFE database, and searching results in googlestorage is a high-bandwidth needle-in-a-haystack problem that we want to avoid, at least until we're more desperate.
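The manual search described above (log in to a shard, look for core files in results not yet offloaded) could be scripted roughly as follows. This is a minimal sketch, not an existing lab tool: the function names are hypothetical, the results root has to be supplied by the caller, and the only thing taken from this bug is that the files of interest match '*.core'.

```python
import os


def find_core_files(results_root, suffix=".core"):
    """Yield (path, size_in_bytes) for each file under results_root
    whose name ends with suffix. Hypothetical helper."""
    for dirpath, _dirnames, filenames in os.walk(results_root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    # The file vanished (e.g. offloaded) while we walked.
                    continue
                yield path, size


def summarize(results_root):
    """Return (matches sorted largest first, total bytes)."""
    found = sorted(find_core_files(results_root),
                   key=lambda item: item[1], reverse=True)
    total = sum(size for _path, size in found)
    return found, total
```

Calling `summarize()` on a local results directory returns the largest core files first, which makes it quick to see which test's results are carrying them.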
,
Apr 26 2017
> Fixing the problem only for core files means that
> the problem will happen again in the future, in
> another form. That doesn't sound like sustainable
> practices to me.
I agree, but that's a _different_ bug. This is a P0 bug that's going to stay focused on a narrow, short-term goal.
,
Apr 26 2017
@40: then can you please file a bug and assign to yourself for this milestone to fix the *real* problem once this P0 is addressed? Otherwise things will fall through the cracks once again, and create more P0s.
,
Apr 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4127a8a81751425e392c8e041e7f8f583ae683d6
commit 4127a8a81751425e392c8e041e7f8f583ae683d6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Apr 27 17:08:32 2017
[autotest] Block '*.core' from test results.
When Chrome crashes, we care about the minidump files, but not the core files. The core files are large, and expensive to include in test results, so we want to exclude them.
There's a config option that allows selecting whether core files are uploaded in sysinfo for test results. By default, that option is true. This changes the option to be false in global_config.ini.
This also re-enables code to include /var/spool/crash, since with core files excluded, that directory should be safe to include.
BUG= chromium:715228
TEST=Run tests against at DUT with dumps; see non-*.core files collected
Change-Id: I4be856c45df1456893347cf3a48179c92d5f8330
Reviewed-on: https://chromium-review.googlesource.com/486968
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Richard Barnette <jrbarnette@chromium.org>
[modify] https://crrev.com/4127a8a81751425e392c8e041e7f8f583ae683d6/client/bin/site_sysinfo.py
,
Apr 27 2017
,
Apr 27 2017
,
Apr 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/737ff457dd446f1397a52e607a472b30cb054e75
commit 737ff457dd446f1397a52e607a472b30cb054e75
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Apr 27 23:05:20 2017
[autotest] Block '*.core' from test results.
When Chrome crashes, we care about the minidump files, but not the core files. The core files are large, and expensive to include in test results, so we want to exclude them.
There's a config option that allows selecting whether core files are uploaded in sysinfo for test results. By default, that option is true. This changes the option to be false in global_config.ini.
BUG= chromium:715228
TEST=Run tests against at DUT with dumps; see non-*.core files collected
Change-Id: I4be856c45df1456893347cf3a48179c92d5f8330
Reviewed-on: https://chromium-review.googlesource.com/489685
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
[modify] https://crrev.com/737ff457dd446f1397a52e607a472b30cb054e75/client/bin/site_sysinfo.py
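For readers who don't want to open the CL, the effect described in the commit message can be illustrated with a short sketch. This is not the code in client/bin/site_sysinfo.py: the function name and the `collect_corefiles` parameter are hypothetical stand-ins for the config option the commit message mentions, and only the '*.core' pattern is taken from the commit itself.

```python
import shutil


def collect_crash_dir(src, dst, collect_corefiles=False):
    """Copy a crash directory (e.g. /var/spool/crash) into test results.

    Hypothetical illustration: when collect_corefiles is False, anything
    matching '*.core' is skipped, so minidumps and other small files are
    kept while the large core files stay on the DUT.
    """
    ignore = None if collect_corefiles else shutil.ignore_patterns("*.core")
    # copytree creates dst; it must not already exist.
    shutil.copytree(src, dst, ignore=ignore)
```

With the default of False, a directory holding both a minidump and a core file ends up in results with only the minidump.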
,
Apr 28 2017
Issue 715223 has been merged into this issue.
,
Apr 28 2017
As satorux@ wrote in issue 715223, core files are not generated for chrome unless /mnt/stateful_partition/etc/collect_chrome_crashes is present.
http://www.chromium.org/chromium-os/packages/crash-reporting/faq#TOC-Why-would-Chrome-crashes-not-generate-a-core-file-on-dev-builds-
How about we just stop creating /mnt/stateful_partition/etc/collect_chrome_crashes for autotests, instead of adding new code to block core files? Also, core files for processes other than chrome shouldn't be huge, so uploading them should be harmless.
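To make the alternative concrete: a minimal sketch of checking for (and clearing) that flag file on a DUT. The path comes from this comment; the helper names are hypothetical, and whether Autotest should remove the file or simply never create it is exactly the open question here.

```python
import os

# Path quoted in this comment; chrome core files are only generated
# when this file exists.
COLLECT_CHROME_CRASHES_FLAG = (
    "/mnt/stateful_partition/etc/collect_chrome_crashes")


def chrome_core_collection_enabled(flag_path=COLLECT_CHROME_CRASHES_FLAG):
    """Return True if the flag file is present (hypothetical helper)."""
    return os.path.exists(flag_path)


def disable_chrome_core_collection(flag_path=COLLECT_CHROME_CRASHES_FLAG):
    """Remove the flag file if present; do nothing otherwise."""
    try:
        os.remove(flag_path)
    except FileNotFoundError:
        pass
```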
,
Apr 28 2017
,
Apr 28 2017
Your change meets the bar and is auto-approved for M59. Please go ahead and merge the CL to branch 3071 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), gkihumba@(ChromeOS), Abdul Syed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 28 2017
> How about just stop creating
> /mnt/stateful_partition/etc/collect_chrome_crashes
> for autotests, instead of adding new code to block core files?
There's ongoing discussion about the right strategy for handling
crashes, including crashes for programs other than Chrome. There
will eventually be some new bugs filed once we sort out a bit more
about what we're doing.
Meantime, all the changes associated with this bug are in and
cherry-picked.
The post-mortem with bugs and other pending actions to address
the root causes of the overload is here:
https://docs.google.com/document/d/1qofdsfDsds7h32fiAlrKyY6l_bwWKJmlO1r09DeNsjA/
,
May 1 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 1 2017
,
Jul 27 2017
Closing this bug. Please reopen if still not fixed.
Comment 1 by gkihumba@google.com
, Apr 25 2017