Crashes not showing up in chrome:crashes |
|||||||
Issue descriptionChrome Version: 69.0.3475.0 OS: Android 8.1.0 (on Pixel 2 XL) Forked from https://bugs.chromium.org/p/chromium/issues/detail?id=859386 See that bug for a repro for sad tab, but this bug is specifically about the crash not showing up in chrome:crashes. That seems like an even more serious problem. What steps will reproduce the problem? (1) See bug #859386 for a reproducible crash (2) Note that it does not show up in chrome:crashes I have tried manually causing a crash using chrome:crash, and that one correctly appears in chrome:crashes. I'm very worried that we have unreported crashes in the wild. +cc amineer@ as mobile TPM For m69.
,
Jul 11
Also FWIW, I can repro the lack of a crash report on my device as well.
,
Jul 11
If it's not showing up, it's *probably* not a crash because our inbound flow hasn't disappeared (it's actually currently quite high :). There are a few bigger changes happening right now which could explain it. Have we enabled crashpad yet? +jperaza Were you using @google accounts - it could then be OOPIF and a renderer kill (so no stack) if we're not getting bindings right? +bo
,
Jul 11
Crashpad isn't enabled yet.
,
Jul 11
Hmm, interesting. chrome://crash generates a crash report (and the "chromium has crashed" dialog, since I'm using a userdebug android build). Neither happens for the repro in crbug.com/859386 . This is all without OOPIF
,
Jul 11
Nothing interesting from logcat: 07-11 21:06:13.106 23067 23067 I cr_TabWebContentsObs: renderProcessGone() for tab id: 3, oom protected: true, already needs reload: false 07-11 21:06:13.177 23067 23117 W cr_ChildProcessConn: onServiceDisconnected (crash or killed by oom): pid=25203 07-11 21:06:13.197 796 1979 I ActivityManager: Process com.chrome.canary:sandboxed_process15 (pid 25203) has died 07-11 21:06:13.197 796 1979 D ActivityManager: cleanUpApplicationRecord -- 25203 07-11 21:06:13.197 796 1979 W ActivityManager: Scheduling restart of crashed service com.chrome.canary/org.chromium.content.app.SandboxedProcessService15 in 1000ms 07-11 21:06:13.200 23067 23117 W cr_ChildProcessConn: The connection is not bound for 25203
,
Jul 11
Yeah.. I suspect this is not actually a crash. Renderer side bug generated a mojo pipe error, which on the browser side is treated as renderer process dying: https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_process_host_impl.cc?rcl=1d9f316a0ba47bb4be158db0de5edc596272dabe&l=3127 Waiting on rockot on the bug to confirm
,
Jul 11
If that is the case though, then I imagine the problem of not having a crash report affects desktop as well. Right solution is to mark that mojo channel to crash on error on the client side I guess.
,
Jul 11
Hmm, looks that theory is not correct.
,
Jul 11
Guessed 50%. It generates a channel error on the renderer side as well, which triggers this: https://cs.chromium.org/chromium/src/content/child/child_thread_impl.cc?rcl=351a925fcc3c0a950711126c9f4dab65cbee1ed9&l=186 Why _exit(0) I wonder, instead of something louder like LOG(FATAL). Anyway, _exit won't trigger generating a crash report, so that's why there is no crash report. So channel errors caused by child are treated right now as ooms on android...
,
Jul 11
Origin of that code predates mojo: https://chromiumcodereview.appspot.com/10834068/diff/14002/content/common/child_thread.cc I guess in the old IPC days, channel error can only happen when the browser dies, hence normal termination of child processes. But now it could has be caused by bug in the renderer side as well. I don't imagine there is any easy way to tell these apart though?
,
Jul 12
Sorry, I think I missed something. What exactly is generating a channel error on the renderer side?
,
Jul 12
This is the forked bug from https://bugs.chromium.org/p/chromium/issues/detail?id=859386#c14 to check why the sad tab didn't generate a crash report. I'm assuming that scenario is expected to generate channel errors on both ends, since that's what's actually happening.
,
Jul 12
The only detail I'm aware of in that bug is the DCHECK, which will just crash the renderer. That has nothing to do with triggering a channel error in the renderer when DCHECKs are disabled. If there's a repro of said channel error, can we get a stack trace from whatever is triggering it?
,
Jul 12
Stack that triggers OnChannelError on the renderer side? Yeah I can get that tomorrow.
,
Jul 12
stack for the OnChannelError on the renderer side
,
Jul 12
Ah. So that means the browser is first closing its end of the channel for some reason, which it should never do. I guess now we need to figure out why *that* is happening.
,
Jul 12
I think I can see how it's possible, which we hadn't really considered previously. If a Channel-associated Mojo interface sends a malformed mojom message (from renderer to browser), that can trigger the browser to close its end of the underlying channel pipe. Can you confirm/deny that during repro, the browser hits this line? https://cs.chromium.org/chromium/src/ipc/ipc_mojo_bootstrap.cc?rcl=abb887f1fe0523ed4ab633530a3a55fd3e479731&l=322
,
Jul 12
Yes RaiseError does happen on browser side. Looks like from sending a message though, not receiving one? Logging does not necessarily indicate order, but at least the RaiseError seemed to happen in the renderer side first in logs.
,
Jul 12
Ahhhh. OK. The renderer stack from comment #19 is actually the source of clarity here. I was wrong, we do end up closing our local endpoint if we drop a response callback without invoking it. I was looking in the wrong place. So that navigation bug would indeed explain this behavior. I wonder if we should maybe just disable this auto-closing behavior for IPC Channel and associated interfaces, given the extremely subtle effect it apparently has on things like crash reporting.
,
Jul 12
I am still unable to repro this locally. I have a CL[1] which should address both issues in release builds, though I think bug 859386 should remain open to track the work of avoiding DCHECKs from various dropped callbacks. boliu@: Any chance you could test the CL since you seem to be able to consistently repro? [1] https://chromium-review.googlesource.com/c/chromium/src/+/1135712
,
Jul 12
Yep, crash goes away as expected. Fwiw, I checked out 69.0.3475.0 locally to reproduce consistently. Never actually tried trunk
,
Jul 12
Thanks for confirming. I'll get that landed and we can see about merging if needed.
,
Jul 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/138153b186554c2077e867efbeb71e0342b70159 commit 138153b186554c2077e867efbeb71e0342b70159 Author: Ken Rockot <rockot@chromium.org> Date: Fri Jul 13 23:31:57 2018 IPC: Avoid closing Channel endpoints on error We never want to explicitly close a Channel endpoint until after the corresponding child process is dead. This prevents that from happening in some cases where it's still possible today. TBR=dcheng@chromium.org Bug: 859386 ,861607, 863430 Change-Id: I2858b6e011e06a000ae976c1e6ae96595cf67a85 Reviewed-on: https://chromium-review.googlesource.com/1135712 Commit-Queue: Ken Rockot <rockot@chromium.org> Reviewed-by: Ken Rockot <rockot@chromium.org> Reviewed-by: John Abd-El-Malek <jam@chromium.org> Cr-Commit-Position: refs/heads/master@{#575115} [modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/DEPS [modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_channel_mojo_unittest.cc [modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_mojo_bootstrap.cc [modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_test.mojom
,
Jul 16
@boliu, @rockot - THANK YOU for persisting and investigating this! Looks like a genuine bug was fixed here. Can we measure/check if there are other sad-tab-that's-not-really-a-crash going on? The user perception will definitely be that 'Chrome crashed' - and if we're showing a sad tab, I reckon we should know about it.
,
Jul 23
That's a pretty open-ended question, and I don't think there's any way to check for these things.
,
Jul 24
Yeah, it was pretty open ended since I'm not sure what the current state of the art is. I did little digging, and from https://cs.chromium.org/chromium/src/chrome/browser/ui/sad_tab.cc we do log UMA when we create a sad tab (Tabs.SadTab.CrashCreated). So at minimum, we are gathering these stats, and the TPMs can be monitoring them. Unfortunately, it looks like we don't gather that particular stat for Android though. Filed issue 866734 to track this.
,
Jul 24
in case you were thinking of comparing number of sad tabs vs number of crash reports to see how many reports are missed, you'll be disappointed to know that that won't work on android. android os killing a process to free memory does not generate crash reports because it's a sigkill, which is why I said in #10 above that these were treated as oom before
,
Jul 24
Does the sigkill in that case still show a sad tab? We could log that as Tabs.SadTab.OomDisplayed as we on other platforms, and then account for those as 'don't expect a crash report for these'.
,
Jul 24
> Does the sigkill in that case still show a sad tab? Yes > We could log that as Tabs.SadTab.OomDisplayed as we on other platforms, and then account for those as 'don't expect a crash report for these'. OOM metric for android is in Stability.Android.ProcessedCrashCounts. It's a bit more complicated than desktop I imagine, and checks a bunch of other conditions that affects which process android decides to kill to free memory. Google-internal doc explaining it in more detail: https://docs.google.com/document/d/1yrJJSapCSh1zTpeVF2qopQuSUg2ltkRs6FiV7sQC-rk/edit?usp=sharing
,
Jul 24
> > Does the sigkill in that case still show a sad tab? > Yes Err, actually, more complicated than that. It shows a sad tab if the tab is visible to the user. The tab is reloaded if it's dies in the background.
,
Jul 30
I don't think this needs an RBS tag any longer.
,
Sep 8
No longer on the Chrome team, e-mail me @google.com if any attention still required from me here, otherwise good luck!
,
Oct 17
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by amineer@chromium.org
, Jul 11Labels: -Pri-3 ReleaseBlock-Stable M-69 Target-69 Pri-1
Owner: yfried...@chromium.org
Status: Assigned (was: Untriaged)