crash-reporter marks a spool dir as full based on # of files, not # of reports |
||||||||||
Issue descriptionChrome version: 70.0.3538.76 beta Platform: 11021.56.0 (Official Build) beta-channel eve From the feedback report: https://listnr.corp.google.com/product/208/report/85748347375 The syslog of the feedback shows there are quite some chrome crashes and they are not received. When the crash happens, there are 32 reports alreadying and crash_reporter does not upload new ones immediately. The report is on Oct 27. 3 days has passed and we still don't see any reports on the server. 2018-10-27T10:22:52.872565-07:00 WARNING crash_reporter[9152]: Crash directory /home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash already full with 32 pending reports
,
Oct 30
vapier, can you help look at this or route it to the right person?
I can repro some odd crash behavior on an eve with a test image.
# Delete any existing reports.
rm /home/chronos/user/crash/*
rm /home/chronos/crash/*
rm /var/spool/crash/*
# Reset the crash sender's time stamp history.
rm /var/lib/crash_sender/*
# Set up test image for crash testing
metrics_client -C
touch /run/crash_reporter/crash-test-in-progress
Now crash chrome 11 times, with about:inducebrowsercrashforrealz and about:crash
For the last one /var/log/messages will show:
2018-10-30T14:55:38.023912-07:00 INFO kernel: [12460.561756] chrome[20886]: segfault at 0 ip 00005d08f847c060 sp 00007ffdded632a0 error 6 in chrome[5d08f4400000+6225000]
2018-10-30T14:55:38.028614-07:00 INFO crash_reporter[20903]: libminijail[20903]: mount '/dev/log' -> '/dev/log' type '' flags 0x1001
2018-10-30T14:55:38.029906-07:00 NOTICE kernel: [12460.567780] audit: type=1400 audit(1540936538.028:3273): avc: granted { execute } for pid=20905 comm="renderer_crash_" name="crash_reporter" dev="nvme0n1p5" ino=114797 scontext=u:r:cros_session_manager:s0 tcontext=u:object_r:cros_crash_reporter_exec:s0 tclass=file
2018-10-30T14:55:38.032911-07:00 NOTICE kernel: [12460.571486] audit: type=1400 audit(1540936538.031:3274): avc: granted { execute } for pid=20905 comm="crash_reporter" path="/sbin/crash_reporter" dev="nvme0n1p5" ino=114797 scontext=u:r:cros_session_manager:s0 tcontext=u:object_r:cros_crash_reporter_exec:s0 tclass=file
2018-10-30T14:55:38.039455-07:00 WARNING crash_reporter[20905]: Received crash notification for chrome[20886] user 1000 (called directly)
2018-10-30T14:55:38.041164-07:00 INFO crash_reporter[20905]: Accessing crash dir '/home/user/382de3446528694112ec98e00dc6f3e700da4c03/crash' via symlinked handle '/proc/self/fd/6'
2018-10-30T14:55:38.041309-07:00 WARNING crash_reporter[20905]: Crash directory /home/user/382de3446528694112ec98e00dc6f3e700da4c03/crash already full with 32 pending reports
^^^^^^^^^^^^^
2018-10-30T14:55:38.041326-07:00 ERR crash_reporter[20905]: Can't create crash directory for uid 1000
2018-10-30T14:55:38.044403-07:00 WARNING crash_reporter[20903]: [user] Received crash notification for chrome[20886] sig 11, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-30T14:55:38.044618-07:00 WARNING crash_reporter[20903]: [ARC] Received crash notification for chrome[20886] sig 11, user 1000 group 1000 (ignoring - crash origin is not ARC)
But we don't have 32 pending reports, only 11:
localhost ~ # ls /home/chronos/user/crash/
chrome.20181030.143723.8595.chrome.txt.gz chrome.20181030.143942.18472.i915_error_state.log.xz
chrome.20181030.143723.8595.dmp chrome.20181030.143942.18472.meta
chrome.20181030.143723.8595.i915_error_state.log.xz chrome.20181030.144003.18543.chrome.txt.gz
chrome.20181030.143723.8595.meta chrome.20181030.144003.18543.dmp
chrome.20181030.143738.10668.chrome.txt.gz chrome.20181030.144003.18543.i915_error_state.log.xz
chrome.20181030.143738.10668.dmp chrome.20181030.144003.18543.meta
chrome.20181030.143738.10668.i915_error_state.log.xz chrome.20181030.144012.18597.chrome.txt.gz
chrome.20181030.143738.10668.meta chrome.20181030.144012.18597.dmp
chrome.20181030.143819.12012.chrome.txt.gz chrome.20181030.144012.18597.i915_error_state.log.xz
chrome.20181030.143819.12012.dmp chrome.20181030.144012.18597.meta
chrome.20181030.143819.12012.i915_error_state.log.xz chrome.20181030.144019.18682.chrome.txt.gz
chrome.20181030.143819.12012.meta chrome.20181030.144019.18682.dmp
chrome.20181030.143835.14185.chrome.txt.gz chrome.20181030.144019.18682.i915_error_state.log.xz
chrome.20181030.143835.14185.dmp chrome.20181030.144019.18682.meta
chrome.20181030.143835.14185.i915_error_state.log.xz chrome.20181030.144025.18743.chrome.txt.gz
chrome.20181030.143835.14185.meta chrome.20181030.144025.18743.dmp
chrome.20181030.143911.18127.chrome.txt.gz chrome.20181030.144025.18743.i915_error_state.log.xz
chrome.20181030.143911.18127.dmp chrome.20181030.144025.18743.meta
chrome.20181030.143911.18127.i915_error_state.log.xz chrome.20181030.145501.20766.chrome.txt.gz
chrome.20181030.143911.18127.meta chrome.20181030.145501.20766.dmp
chrome.20181030.143942.18472.chrome.txt.gz chrome.20181030.145501.20766.i915_error_state.log.xz
chrome.20181030.143942.18472.dmp chrome.20181030.145501.20766.meta
localhost ~ # ls /home/chronos/user/crash/*.dmp | wc -l
11
The code in crash_reporter counts *files*, not dmps, to decide when to stop recording crashes:
https://cs.corp.google.com/chromeos_public/src/platform2/crash-reporter/crash_collector.cc?rcl=4b6d68f0f9fe6d0e015a4c2b1c6725ff7a27e3dc&l=600
I don't know how ARC handles things -- I wonder if WTFs could add files to that directory that would prevent chrome crashes from being recorded.
However, this still doesn't explain why we see so few crash reports. From the system log in listnr we see many chrome process crashes, probably renderers and service workers (since we don't see session manager restarting the browser). I would expect at least a few of these crashes to get uploaded.
2018-10-27T10:22:11.828897-07:00 INFO kernel: [ 545.176407] traps: chrome[8558] trap invalid opcode ip:57125212e8d2 sp:7ffdf1d30400 error:0 in chrome[57124ca00000+7965000]
2018-10-27T10:22:11.834595-07:00 INFO crash_reporter[8571]: libminijail[8571]: mount /dev/log -> /dev/log type ''
2018-10-27T10:22:11.850837-07:00 WARNING crash_reporter[8571]: [user] Received crash notification for chrome[8558] sig 4, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-27T10:22:11.851163-07:00 WARNING crash_reporter[8571]: [ARC] Received crash notification for chrome[8558] sig 4, user 1000 group 1000 (ignoring - crash origin is not ARC)
2018-10-27T10:22:11.855429-07:00 WARNING crash_reporter[8574]: Received crash notification for chrome[8558] user 1000 (called directly)
2018-10-27T10:22:11.856887-07:00 INFO crash_reporter[8574]: Accessing crash dir '/home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash' via symlinked handle '/proc/self/fd/6'
2018-10-27T10:22:12.477883-07:00 INFO kernel: [ 545.825015] traps: ServiceWorker t[8606] trap invalid opcode ip:57125212e8d2 sp:7807ed84d6c0 error:0 in chrome[57124ca00000+7965000]
2018-10-27T10:22:12.483516-07:00 INFO crash_reporter[8609]: libminijail[8609]: mount /dev/log -> /dev/log type ''
2018-10-27T10:22:12.500923-07:00 WARNING crash_reporter[8609]: [user] Received crash notification for chrome[8595] sig 4, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-10-27T10:22:12.501165-07:00 WARNING crash_reporter[8609]: [ARC] Received crash notification for chrome[8595] sig 4, user 1000 group 1000 (ignoring - crash origin is not ARC)
2018-10-27T10:22:12.521341-07:00 WARNING crash_reporter[8619]: Received crash notification for chrome[8595] user 1000 (called directly)
2018-10-27T10:22:12.524118-07:00 INFO crash_reporter[8619]: Accessing crash dir '/home/user/e3c2f458c9b5c8b72513ea812f96d786748cf363/crash' via symlinked handle '/proc/self/fd/6'
...
Still, it might be worth counting .dmp files instead of all files to decide when to stop recording crashes.
,
Oct 30
i had noticed this in the code recently, but assumed i must have been reading it wrong as someone surely would have noticed, and we should have test coverage to make sure it is working correctly. the code in question (CrashCollector::CheckHasCapacity) hasn't really changed since its initial implementation. the logic in 8e9591e7e33115616a236c0e72e924f5948003fd from Sep 2010 is the same. that said, updating the check to operate per-reports shouldn't be too hard.
,
Oct 30
oh, i see the bug. it's specifically that the counter chops off the last dotted extension. so these files count as one: chrome.20181030.143723.8595.dmp chrome.20181030.143723.8595.log chrome.20181030.143723.8595.meta (they all resolve to "chrome.20181030.143723.8595") but these files cause problems: chrome.20181030.143942.18472.i915_error_state.log.xz chrome.20181030.144025.18743.chrome.txt.gz the xz/gz are chopped yielding unique names. which means this has been broken since Jan 2015: https://chromium-review.googlesource.com/242457
,
Oct 30
Ah, that explains why it took me 11 crashes (* 3 = 33) instead of only 8 before it stopped recording. It still doesn't fully explain the problem in the email thread, though. I wonder if there are ARC-y files that cause counting problems.
,
Oct 31
+domlaskowski, I don't know enough about crash_collector re: ARC and what's stored on disk wrt the WTFs, we limit them on the ARC side to 8/day (b/31709045), so it's not some sort of DoS issue, and in the issue we're talking about on the internal thread, there were no WTFs reported by that user.
,
Oct 31
,
Nov 1
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025 commit ceaa2875cf541fb99fc8a4f589d62a1dc9b71025 Author: Mike Frysinger <vapier@chromium.org> Date: Thu Nov 01 04:59:37 2018 crash: change report counting in spool dir The CheckHasCapacity logic is supposed to count unique crash reports. It decides what crash report a file is part of by chopping off just the last extension of the file. This worked in the past when we only had files like foo.dmp, foo.core, foo.meta, and foo.log. But when we started adding files like foo.log.gz, the logic broke, and we started double counting things thus reducing the number of reports we'd allow to be queued. Rework the function to just ignore files that don't end in dmp, core, and meta. This will let us add any type of supplemental file without making this func more complicated. BUG= chromium:900324 TEST=unittests & autotests pass Change-Id: I110d6c12ad39e2597f0cff514a87bf00086a6b4c Reviewed-on: https://chromium-review.googlesource.com/1308499 Commit-Ready: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> [modify] https://crrev.com/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025/crash-reporter/crash_collector_test.cc [modify] https://crrev.com/ceaa2875cf541fb99fc8a4f589d62a1dc9b71025/crash-reporter/crash_collector.cc
,
Nov 1
this should be fixed in M72 now. can someone give the next canary spin a check before cherry picking back ? i assume we'll want M71. people also want M70 ?
,
Nov 2
Regarding comment 6, ARC reports are stored as foo.{meta,dmp} or foo.{meta,log}, so they don't throw off counting. Looking at the logs in the feedback report, the crash dir mostly filled up with reports from Chrome. Only one is from ARC (Sonos app crash).
,
Nov 6
,
Nov 6
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review Please contact the milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Nov 7
Approved for M71 ChromeOS
,
Nov 7
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/f85f1b440eb523b508066f8cdb4133620bd02f24 commit f85f1b440eb523b508066f8cdb4133620bd02f24 Author: Mike Frysinger <vapier@chromium.org> Date: Wed Nov 07 20:46:10 2018 crash: change report counting in spool dir The CheckHasCapacity logic is supposed to count unique crash reports. It decides what crash report a file is part of by chopping off just the last extension of the file. This worked in the past when we only had files like foo.dmp, foo.core, foo.meta, and foo.log. But when we started adding files like foo.log.gz, the logic broke, and we started double counting things thus reducing the number of reports we'd allow to be queued. Rework the function to just ignore files that don't end in dmp, core, and meta. This will let us add any type of supplemental file without making this func more complicated. BUG= chromium:900324 TEST=unittests & autotests pass Change-Id: I110d6c12ad39e2597f0cff514a87bf00086a6b4c (cherry picked from commit ceaa2875cf541fb99fc8a4f589d62a1dc9b71025) Reviewed-on: https://chromium-review.googlesource.com/c/1324249 Reviewed-by: Mike Frysinger <vapier@chromium.org> Commit-Queue: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/f85f1b440eb523b508066f8cdb4133620bd02f24/crash-reporter/crash_collector_test.cc [modify] https://crrev.com/f85f1b440eb523b508066f8cdb4133620bd02f24/crash-reporter/crash_collector.cc
,
Nov 7
this should be fixed in R71+. i haven't heard people clamoring for R70, so calling it finished for now. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by xiy...@chromium.org
, Oct 30