New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 861607 link

Starred by 3 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug



Sign in to add a comment

Crashes not showing up in chrome:crashes

Project Member Reported by mikelawther@chromium.org, Jul 8

Issue description

Chrome Version: 69.0.3475.0
OS: Android 8.1.0 (on Pixel 2 XL)

Forked from https://bugs.chromium.org/p/chromium/issues/detail?id=859386

See that bug for a repro for sad tab, but this bug is specifically about the crash not showing up in chrome:crashes. That seems like an even more serious problem.

What steps will reproduce the problem?
(1) See  bug #859386  for a reproducible crash
(2) Note that it does not show up in chrome:crashes

I have tried manually causing a crash using chrome:crash, and that one correctly appears in chrome:crashes.

I'm very worried that we have unreported crashes in the wild.

+cc amineer@ as mobile TPM For m69.
 
Cc: ranj@chromium.org wnwen@chromium.org benmason@chromium.org
Labels: -Pri-3 ReleaseBlock-Stable M-69 Target-69 Pri-1
Owner: yfried...@chromium.org
Status: Assigned (was: Untriaged)
Concur that not having crash reports for a reproducible sad tab is not a state we want to be in, tentatively marking this as RB-S for M69.  yfriedman@, thoughts on who'd be best to investigate?
Also FWIW, I can repro the lack of a crash report on my device as well.
Cc: boliu@chromium.org jperaza@chromium.org
Owner: jperaza@chromium.org
If it's not showing up, it's *probably* not a crash because our inbound flow hasn't disappeared (it's actually currently quite high :).

There are a few bigger changes happening right now which could explain it. 

Have we enabled crashpad yet? +jperaza
Were you using @google accounts - it could then be OOPIF and a renderer kill (so no stack) if we're not getting bindings right? +bo

Crashpad isn't enabled yet.
Hmm, interesting. chrome://crash generates a crash report (and the "chromium has crashed" dialog, since I'm using a userdebug android build). Neither happens for the repro in  crbug.com/859386 . This is all without OOPIF
Nothing interesting from logcat:
07-11 21:06:13.106 23067 23067 I cr_TabWebContentsObs: renderProcessGone() for tab id: 3, oom protected: true, already needs reload: false
07-11 21:06:13.177 23067 23117 W cr_ChildProcessConn: onServiceDisconnected (crash or killed by oom): pid=25203
07-11 21:06:13.197   796  1979 I ActivityManager: Process com.chrome.canary:sandboxed_process15 (pid 25203) has died
07-11 21:06:13.197   796  1979 D ActivityManager: cleanUpApplicationRecord -- 25203
07-11 21:06:13.197   796  1979 W ActivityManager: Scheduling restart of crashed service com.chrome.canary/org.chromium.content.app.SandboxedProcessService15 in 1000ms
07-11 21:06:13.200 23067 23117 W cr_ChildProcessConn: The connection is not bound for 25203

Yeah.. I suspect this is not actually a crash. Renderer side bug generated a mojo pipe error, which on the browser side is treated as renderer process dying:
https://cs.chromium.org/chromium/src/content/browser/renderer_host/render_process_host_impl.cc?rcl=1d9f316a0ba47bb4be158db0de5edc596272dabe&l=3127

Waiting on rockot on the bug to confirm
If that is the case though, then I imagine the problem of not having a crash report affects desktop as well.

Right solution is to mark that mojo channel to crash on error on the client side I guess.
Hmm, looks that theory is not correct.
Guessed 50%. It generates a channel error on the renderer side as well, which triggers this:
https://cs.chromium.org/chromium/src/content/child/child_thread_impl.cc?rcl=351a925fcc3c0a950711126c9f4dab65cbee1ed9&l=186

Why _exit(0) I wonder, instead of something louder like LOG(FATAL).

Anyway, _exit won't trigger generating a crash report, so that's why there is no crash report. So channel errors caused by child are treated right now as ooms on android...
Cc: roc...@chromium.org jam@chromium.org
Origin of that code predates mojo:
https://chromiumcodereview.appspot.com/10834068/diff/14002/content/common/child_thread.cc

I guess in the old IPC days, channel error can only happen when the browser dies, hence normal termination of child processes. But now it could has be caused by bug in the renderer side as well. I don't imagine there is any easy way to tell these apart though?
Sorry, I think I missed something. What exactly is generating a channel error on the renderer side?
This is the forked bug from https://bugs.chromium.org/p/chromium/issues/detail?id=859386#c14 to check why the sad tab didn't generate a crash report.

I'm assuming that scenario is expected to generate channel errors on both ends, since that's what's actually happening.
The only detail I'm aware of in that bug is the DCHECK, which will just crash the renderer. That has nothing to do with triggering a channel error in the renderer when DCHECKs are disabled.

If there's a repro of said channel error, can we get a stack trace from whatever is triggering it?
Stack that triggers OnChannelError on the renderer side? Yeah I can get that tomorrow.
stack for the OnChannelError on the renderer side
renderer_stack.txt
24.0 KB View Download
Ah. So that means the browser is first closing its end of the channel for some reason, which it should never do. I guess now we need to figure out why *that* is happening.
I think I can see how it's possible, which we hadn't really considered previously.

If a Channel-associated Mojo interface sends a malformed mojom message (from renderer to browser), that can trigger the browser to close its end of the underlying channel pipe.

Can you confirm/deny that during repro, the browser hits this line? https://cs.chromium.org/chromium/src/ipc/ipc_mojo_bootstrap.cc?rcl=abb887f1fe0523ed4ab633530a3a55fd3e479731&l=322
Yes RaiseError does happen on browser side. Looks like from sending a message though, not receiving one? Logging does not necessarily indicate order, but at least the RaiseError seemed to happen in the renderer side first in logs.
browser_RaiseError.txt
18.9 KB View Download
renderer_RaiseError.txt
44.4 KB View Download
Ahhhh. OK.

The renderer stack from comment #19 is actually the source of clarity here. I was wrong, we do end up closing our local endpoint if we drop a response callback without invoking it. I was looking in the wrong place. So that navigation bug would indeed explain this behavior.

I wonder if we should maybe just disable this auto-closing behavior for IPC Channel and associated interfaces, given the extremely subtle effect it apparently has on things like crash reporting.
I am still unable to repro this locally. I have a CL[1] which should address both issues in release builds, though I think  bug 859386  should remain open to track the work of avoiding DCHECKs from various dropped callbacks.

boliu@: Any chance you could test the CL since you seem to be able to consistently repro?

[1] https://chromium-review.googlesource.com/c/chromium/src/+/1135712
Yep, crash goes away as expected.

Fwiw, I checked out 69.0.3475.0 locally to reproduce consistently. Never actually tried trunk
Cc: -roc...@chromium.org
Owner: roc...@chromium.org
Status: Started (was: Assigned)
Thanks for confirming. I'll get that landed and we can see about merging if needed.
Project Member

Comment 24 by bugdroid1@chromium.org, Jul 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/138153b186554c2077e867efbeb71e0342b70159

commit 138153b186554c2077e867efbeb71e0342b70159
Author: Ken Rockot <rockot@chromium.org>
Date: Fri Jul 13 23:31:57 2018

IPC: Avoid closing Channel endpoints on error

We never want to explicitly close a Channel endpoint until after the
corresponding child process is dead. This prevents that from happening
in some cases where it's still possible today.

TBR=dcheng@chromium.org

Bug:  859386 ,861607, 863430 
Change-Id: I2858b6e011e06a000ae976c1e6ae96595cf67a85
Reviewed-on: https://chromium-review.googlesource.com/1135712
Commit-Queue: Ken Rockot <rockot@chromium.org>
Reviewed-by: Ken Rockot <rockot@chromium.org>
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Cr-Commit-Position: refs/heads/master@{#575115}
[modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/DEPS
[modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_channel_mojo_unittest.cc
[modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_mojo_bootstrap.cc
[modify] https://crrev.com/138153b186554c2077e867efbeb71e0342b70159/ipc/ipc_test.mojom

@boliu, @rockot - THANK YOU for persisting and investigating this! Looks like a genuine bug was fixed here.

Can we measure/check if there are other sad-tab-that's-not-really-a-crash going on? The user perception will definitely be that 'Chrome crashed' - and if we're showing a sad tab, I reckon we should know about it. 
That's a pretty open-ended question, and I don't think there's any way to check for these things.
Yeah, it was pretty open ended since I'm not sure what the current state of the art is. 

I did little digging, and from https://cs.chromium.org/chromium/src/chrome/browser/ui/sad_tab.cc we do log UMA when we create a sad tab (Tabs.SadTab.CrashCreated). So at minimum, we are gathering these stats, and the TPMs can be monitoring them.

Unfortunately, it looks like we don't gather that particular stat for Android though. Filed issue 866734 to track this.
in case you were thinking of comparing number of sad tabs vs number of crash reports to see how many reports are missed, you'll be disappointed to know that that won't work on android. android os killing a process to free memory does not generate crash reports because it's a sigkill, which is why I said in #10 above that these were treated as oom before
Does the sigkill in that case still show a sad tab? We could log that as Tabs.SadTab.OomDisplayed as we on other platforms, and then account for those as 'don't expect a crash report for these'.
> Does the sigkill in that case still show a sad tab?

Yes

> We could log that as Tabs.SadTab.OomDisplayed as we on other platforms, and then account for those as 'don't expect a crash report for these'.

OOM metric for android is in Stability.Android.ProcessedCrashCounts. It's a bit more complicated than desktop I imagine, and checks a bunch of other conditions that affects which process android decides to kill to free memory.

Google-internal doc explaining it in more detail:
https://docs.google.com/document/d/1yrJJSapCSh1zTpeVF2qopQuSUg2ltkRs6FiV7sQC-rk/edit?usp=sharing
> > Does the sigkill in that case still show a sad tab?

> Yes

Err, actually, more complicated than that. It shows a sad tab if the tab is visible to the user. The tab is reloaded if it's dies in the background.
Labels: -ReleaseBlock-Stable
I don't think this needs an RBS tag any longer.
Cc: -amineer@chromium.org
No longer on the Chrome team, e-mail me @google.com if any attention still required from me here, otherwise good luck!
Owner: rockot@google.com

Sign in to add a comment