New issue
Advanced search Search tips

Issue 900324 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 405732



Sign in to add a comment

crash-reporter marks a spool dir as full based on # of files, not # of reports

Project Member Reported by xiy...@chromium.org, Oct 30

Issue description

Chrome version: 70.0.3538.76 beta
Platform: 11021.56.0 (Official Build) beta-channel eve

From the feedback report:
https://listnr.corp.google.com/product/208/report/85748347375

The syslog of the feedback shows there are quite some chrome crashes and they are not received.

When the crash happens, there are 32 reports alreadying and crash_reporter does not upload new ones immediately. The report is on Oct 27. 3 days has passed and we still don't see any reports on the server.

2018-10-27T10:22:52.872565-07:00 WARNING crash_reporter[9152]: Crash directory /home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash already full with 32 pending reports
 
Labels: Restrict-View-Google
RVG for crash report of the particular device:

https://crash.corp.google.com/browse?q=ClientID%3D%271007aa27505449269bf8a5c381aac1be%27

Cc: elijahtaylor@chromium.org
Labels: -Pri-2 Pri-1
Owner: vapier@chromium.org
Status: Assigned (was: Untriaged)
Summary: Device with large number of crashes has very few crashes uploaded (was: Crash report not reported)
vapier, can you help look at this or route it to the right person?

I can repro some odd crash behavior on an eve with a test image.

# Delete any existing reports.
rm /home/chronos/user/crash/*
rm /home/chronos/crash/*
rm /var/spool/crash/*

# Reset the crash sender's time stamp history.
rm /var/lib/crash_sender/*

# Set up test image for crash testing
metrics_client -C
touch /run/crash_reporter/crash-test-in-progress

Now crash chrome 11 times, with about:inducebrowsercrashforrealz and about:crash

For the last one /var/log/messages will show:

2018-10-30T14:55:38.023912-07:00 INFO kernel: [12460.561756] chrome[20886]: segfault at 0 ip 00005d08f847c060 sp 00007ffdded632a0 error 6 in chrome[5d08f4400000+6225000]
2018-10-30T14:55:38.028614-07:00 INFO crash_reporter[20903]: libminijail[20903]: mount '/dev/log' -> '/dev/log' type '' flags 0x1001
2018-10-30T14:55:38.029906-07:00 NOTICE kernel: [12460.567780] audit: type=1400 audit(1540936538.028:3273): avc:  granted  { execute } for  pid=20905 comm="renderer_crash_" name="crash_reporter" dev="nvme0n1p5" ino=114797 scontext=u:r:cros_session_manager:s0 tcontext=u:object_r:cros_crash_reporter_exec:s0 tclass=file
2018-10-30T14:55:38.032911-07:00 NOTICE kernel: [12460.571486] audit: type=1400 audit(1540936538.031:3274): avc:  granted  { execute } for  pid=20905 comm="crash_reporter" path="/sbin/crash_reporter" dev="nvme0n1p5" ino=114797 scontext=u:r:cros_session_manager:s0 tcontext=u:object_r:cros_crash_reporter_exec:s0 tclass=file
2018-10-30T14:55:38.039455-07:00 WARNING crash_reporter[20905]: Received crash notification for chrome[20886] user 1000 (called directly)
2018-10-30T14:55:38.041164-07:00 INFO crash_reporter[20905]: Accessing crash dir '/home/user/382de3446528694112ec98e00dc6f3e700da4c03/crash' via symlinked handle '/proc/self/fd/6'
2018-10-30T14:55:38.041309-07:00 WARNING crash_reporter[20905]: Crash directory /home/user/382de3446528694112ec98e00dc6f3e700da4c03/crash already full with 32 pending reports
^^^^^^^^^^^^^
2018-10-30T14:55:38.041326-07:00 ERR crash_reporter[20905]: Can't create crash directory for uid 1000
2018-10-30T14:55:38.044403-07:00 WARNING crash_reporter[20903]: [user] Received crash notification for chrome[20886] sig 11, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-30T14:55:38.044618-07:00 WARNING crash_reporter[20903]: [ARC] Received crash notification for chrome[20886] sig 11, user 1000 group 1000 (ignoring - crash origin is not ARC)

But we don't have 32 pending reports, only 11:

localhost ~ # ls /home/chronos/user/crash/
chrome.20181030.143723.8595.chrome.txt.gz             chrome.20181030.143942.18472.i915_error_state.log.xz
chrome.20181030.143723.8595.dmp                       chrome.20181030.143942.18472.meta
chrome.20181030.143723.8595.i915_error_state.log.xz   chrome.20181030.144003.18543.chrome.txt.gz
chrome.20181030.143723.8595.meta                      chrome.20181030.144003.18543.dmp
chrome.20181030.143738.10668.chrome.txt.gz            chrome.20181030.144003.18543.i915_error_state.log.xz
chrome.20181030.143738.10668.dmp                      chrome.20181030.144003.18543.meta
chrome.20181030.143738.10668.i915_error_state.log.xz  chrome.20181030.144012.18597.chrome.txt.gz
chrome.20181030.143738.10668.meta                     chrome.20181030.144012.18597.dmp
chrome.20181030.143819.12012.chrome.txt.gz            chrome.20181030.144012.18597.i915_error_state.log.xz
chrome.20181030.143819.12012.dmp                      chrome.20181030.144012.18597.meta
chrome.20181030.143819.12012.i915_error_state.log.xz  chrome.20181030.144019.18682.chrome.txt.gz
chrome.20181030.143819.12012.meta                     chrome.20181030.144019.18682.dmp
chrome.20181030.143835.14185.chrome.txt.gz            chrome.20181030.144019.18682.i915_error_state.log.xz
chrome.20181030.143835.14185.dmp                      chrome.20181030.144019.18682.meta
chrome.20181030.143835.14185.i915_error_state.log.xz  chrome.20181030.144025.18743.chrome.txt.gz
chrome.20181030.143835.14185.meta                     chrome.20181030.144025.18743.dmp
chrome.20181030.143911.18127.chrome.txt.gz            chrome.20181030.144025.18743.i915_error_state.log.xz
chrome.20181030.143911.18127.dmp                      chrome.20181030.144025.18743.meta
chrome.20181030.143911.18127.i915_error_state.log.xz  chrome.20181030.145501.20766.chrome.txt.gz
chrome.20181030.143911.18127.meta                     chrome.20181030.145501.20766.dmp
chrome.20181030.143942.18472.chrome.txt.gz            chrome.20181030.145501.20766.i915_error_state.log.xz
chrome.20181030.143942.18472.dmp                      chrome.20181030.145501.20766.meta

localhost ~ # ls /home/chronos/user/crash/*.dmp | wc -l
11

The code in crash_reporter counts *files*, not dmps, to decide when to stop recording crashes:

https://cs.corp.google.com/chromeos_public/src/platform2/crash-reporter/crash_collector.cc?rcl=4b6d68f0f9fe6d0e015a4c2b1c6725ff7a27e3dc&l=600

I don't know how ARC handles things -- I wonder if WTFs could add files to that directory that would prevent chrome crashes from being recorded.

However, this still doesn't explain why we see so few crash reports. From the system log in listnr we see many chrome process crashes, probably renderers and service workers (since we don't see session manager restarting the browser). I would expect at least a few of these crashes to get uploaded.

2018-10-27T10:22:11.828897-07:00 INFO kernel: [  545.176407] traps: chrome[8558] trap invalid opcode ip:57125212e8d2 sp:7ffdf1d30400 error:0 in chrome[57124ca00000+7965000]
2018-10-27T10:22:11.834595-07:00 INFO crash_reporter[8571]: libminijail[8571]: mount /dev/log -> /dev/log type ''
2018-10-27T10:22:11.850837-07:00 WARNING crash_reporter[8571]: [user] Received crash notification for chrome[8558] sig 4, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-27T10:22:11.851163-07:00 WARNING crash_reporter[8571]: [ARC] Received crash notification for chrome[8558] sig 4, user 1000 group 1000 (ignoring - crash origin is not ARC)
2018-10-27T10:22:11.855429-07:00 WARNING crash_reporter[8574]: Received crash notification for chrome[8558] user 1000 (called directly)
2018-10-27T10:22:11.856887-07:00 INFO crash_reporter[8574]: Accessing crash dir '/home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash' via symlinked handle '/proc/self/fd/6'
2018-10-27T10:22:12.477883-07:00 INFO kernel: [  545.825015] traps: ServiceWorker t[8606] trap invalid opcode ip:57125212e8d2 sp:7807ed84d6c0 error:0 in chrome[57124ca00000+7965000]
2018-10-27T10:22:12.483516-07:00 INFO crash_reporter[8609]: libminijail[8609]: mount /dev/log -> /dev/log type ''
2018-10-27T10:22:12.500923-07:00 WARNING crash_reporter[8609]: [user] Received crash notification for chrome[8595] sig 4, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-27T10:22:12.501165-07:00 WARNING crash_reporter[8609]: [ARC] Received crash notification for chrome[8595] sig 4, user 1000 group 1000 (ignoring - crash origin is not ARC)
2018-10-27T10:22:12.521341-07:00 WARNING crash_reporter[8619]: Received crash notification for chrome[8595] user 1000 (called directly)
2018-10-27T10:22:12.524118-07:00 INFO crash_reporter[8619]: Accessing crash dir '/home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash' via symlinked handle '/proc/self/fd/6'
...

Still, it might be worth counting .dmp files instead of all files to decide when to stop recording crashes.

Components: -Internals>CrashReporting OS>Systems>CrashReporting
Labels: -Restrict-View-Google
Summary: crash-reporter marks a spool dir as full based on # of files, not # of reports (was: Device with large number of crashes has very few crashes uploaded)
i had noticed this in the code recently, but assumed i must have been reading it wrong as someone surely would have noticed, and we should have test coverage to make sure it is working correctly.

the code in question (CrashCollector::CheckHasCapacity) hasn't really changed since its initial implementation.  the logic in 8e9591e7e33115616a236c0e72e924f5948003fd from Sep 2010 is the same.

that said, updating the check to operate per-reports shouldn't be too hard.
Blocking: 405732
oh, i see the bug.  it's specifically that the counter chops off the last dotted extension.  so these files count as one:
  chrome.20181030.143723.8595.dmp
  chrome.20181030.143723.8595.log
  chrome.20181030.143723.8595.meta
  (they all resolve to "chrome.20181030.143723.8595")
but these files cause problems:
  chrome.20181030.143942.18472.i915_error_state.log.xz
  chrome.20181030.144025.18743.chrome.txt.gz
the xz/gz are chopped yielding unique names.

which means this has been broken since Jan 2015:
  https://chromium-review.googlesource.com/242457
Ah, that explains why it took me 11 crashes (* 3 = 33) instead of only 8 before it stopped recording.

It still doesn't fully explain the problem in the email thread, though. I wonder if there are ARC-y files that cause counting problems.

Cc: domlasko...@chromium.org
+domlaskowski, I don't know enough about crash_collector re: ARC and what's stored on disk

wrt the WTFs, we limit them on the ARC side to 8/day (b/31709045), so it's not some sort of DoS issue, and in the issue we're talking about on the internal thread, there were no WTFs reported by that user.

Status: Started (was: Assigned)
https://chromium-review.googlesource.com/1308499 should fix things
Project Member

Comment 8 by bugdroid1@chromium.org, Nov 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025

commit ceaa2875cf541fb99fc8a4f589d62a1dc9b71025
Author: Mike Frysinger <vapier@chromium.org>
Date: Thu Nov 01 04:59:37 2018

crash: change report counting in spool dir

The CheckHasCapacity logic is supposed to count unique crash reports.
It decides what crash report a file is part of by chopping off just
the last extension of the file.  This worked in the past when we only
had files like foo.dmp, foo.core, foo.meta, and foo.log.  But when we
started adding files like foo.log.gz, the logic broke, and we started
double counting things thus reducing the number of reports we'd allow
to be queued.

Rework the function to just ignore files that don't end in dmp, core,
and meta.  This will let us add any type of supplemental file without
making this func more complicated.

BUG= chromium:900324 
TEST=unittests & autotests pass

Change-Id: I110d6c12ad39e2597f0cff514a87bf00086a6b4c
Reviewed-on: https://chromium-review.googlesource.com/1308499
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Ben Chan <benchan@chromium.org>

[modify] https://crrev.com/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025/crash-reporter/crash_collector_test.cc
[modify] https://crrev.com/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025/crash-reporter/crash_collector.cc

this should be fixed in M72 now.  can someone give the next canary spin a check before cherry picking back ?

i assume we'll want M71.  people also want M70 ?
Regarding comment 6, ARC reports are stored as foo.{meta,dmp} or foo.{meta,log}, so they don't throw off counting. Looking at the logs in the feedback report, the crash dir mostly filled up with reports from Chrome. Only one is from ARC (Sonos app crash).
Labels: Merge-Request-71
Project Member

Comment 12 by sheriffbot@chromium.org, Nov 6

Labels: -Merge-Request-71 Hotlist-Merge-Review Merge-Review-71
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-71 Merge-Approved-71
Approved for M71 ChromeOS
Project Member

Comment 14 by bugdroid1@chromium.org, Nov 7

Labels: merge-merged-release-R71-11151.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/f85f1b440eb523b508066f8cdb4133620bd02f24

commit f85f1b440eb523b508066f8cdb4133620bd02f24
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Nov 07 20:46:10 2018

crash: change report counting in spool dir

The CheckHasCapacity logic is supposed to count unique crash reports.
It decides what crash report a file is part of by chopping off just
the last extension of the file.  This worked in the past when we only
had files like foo.dmp, foo.core, foo.meta, and foo.log.  But when we
started adding files like foo.log.gz, the logic broke, and we started
double counting things thus reducing the number of reports we'd allow
to be queued.

Rework the function to just ignore files that don't end in dmp, core,
and meta.  This will let us add any type of supplemental file without
making this func more complicated.

BUG= chromium:900324 
TEST=unittests & autotests pass

Change-Id: I110d6c12ad39e2597f0cff514a87bf00086a6b4c
(cherry picked from commit ceaa2875cf541fb99fc8a4f589d62a1dc9b71025)
Reviewed-on: https://chromium-review.googlesource.com/c/1324249
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Commit-Queue: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/f85f1b440eb523b508066f8cdb4133620bd02f24/crash-reporter/crash_collector_test.cc
[modify] https://crrev.com/f85f1b440eb523b508066f8cdb4133620bd02f24/crash-reporter/crash_collector.cc

Labels: -Merge-Approved-71 Merge-Merged
Status: Fixed (was: Started)
this should be fixed in R71+.  i haven't heard people clamoring for R70, so calling it finished for now.

Sign in to add a comment