crash_sender cleaning up saved OS version info |
||||||||||||||
Issue descriptionI was digging through kernel crash reports for Kevin and found a crash report 8f1e8b7c5a9c1325. That contains: <3>[ 7060.531837] mali ff9a0000.gpu: Issueing GPU soft-reset because jobs failed to be killed (within 1000 ms) as part of context termination (e.g. process exit) <3>[ 7060.531854] mali ff9a0000.gpu: Preparing to soft-reset GPU: Waiting (upto 3000 ms) for all jobs to complete soft-stop <3>[ 7060.531964] mali ff9a0000.gpu: Resetting GPU (allowing up to 500 ms) <3>[ 7060.531972] mali ff9a0000.gpu: Register state: <3>[ 7060.531983] mali ff9a0000.gpu: GPU_IRQ_RAWSTAT=0x00000200 GPU_STATUS=0x00000000 <3>[ 7060.531993] mali ff9a0000.gpu: JOB_IRQ_RAWSTAT=0x00000000 JOB_IRQ_JS_STATE=0x00000000 JOB_IRQ_THROTTLE=0x00000000 <3>[ 7060.532002] mali ff9a0000.gpu: JS0_STATUS=0x00000000 JS0_HEAD_LO=0xe47f9f00 <3>[ 7060.532010] mali ff9a0000.gpu: JS1_STATUS=0x00000000 JS1_HEAD_LO=0xe47f9d40 <3>[ 7060.532019] mali ff9a0000.gpu: JS2_STATUS=0x00000000 JS2_HEAD_LO=0x00000000 <3>[ 7060.532027] mali ff9a0000.gpu: MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000 <3>[ 7060.532035] mali ff9a0000.gpu: GPU_IRQ_MASK=0x00000000 JOB_IRQ_MASK=0x00000000 MMU_IRQ_MASK=0x00000000 <3>[ 7060.532043] mali ff9a0000.gpu: PWR_OVERRIDE0=0x00000000 PWR_OVERRIDE1=0x00000000 <3>[ 7060.532051] mali ff9a0000.gpu: SHADER_CONFIG=0x00010000 L2_MMU_CONFIG=0x00000000 <3>[ 7060.532117] mali ff9a0000.gpu: Reset complete Those were supposed to go away with <https://chromium-review.googlesource.com/655538> (CHROMIUM: MALI: fix async wait for fence). --- At first I was a bit worried that we still hand't fixed this, but then I looked more carefully. I began to wonder if perhaps we were doing our old favorite thing and reporting the crash with the wrong version. AKA: maybe the fix for bug #590757 is somehow not working properly? Here's my evidence. When I look at 8f1e8b7c5a9c1325 and look at stack crawls, I see ones that look like: <0>[ 7200.115883] Call trace: <4>[ 7200.115902] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8 <4>[ 7200.115916] [<ffffffc00093ace8>] __schedule+0x504/0x740 <4>[ 7200.115927] [<ffffffc00093afac>] schedule+0x88/0xa8 <4>[ 7200.115941] [<ffffffc0005eb7c0>] kbase_jm_wait_for_zero_jobs+0xd8/0x228 <4>[ 7200.115955] [<ffffffc0005d19ac>] kbase_jd_zap_context+0xe0/0xf8 <4>[ 7200.115967] [<ffffffc0005d63cc>] kbase_destroy_context+0x48/0x19c <4>[ 7200.115979] [<ffffffc0005e050c>] kbase_release+0x134/0x16c <4>[ 7200.115990] [<ffffffc00035ce28>] __fput+0x104/0x1cc <4>[ 7200.115999] [<ffffffc00035cf64>] ____fput+0x20/0x2c <4>[ 7200.116009] [<ffffffc00023d4cc>] task_work_run+0xa4/0xd4 <4>[ 7200.116022] [<ffffffc0002224a8>] do_exit+0x494/0x8f8 <4>[ 7200.116033] [<ffffffc000223a48>] do_group_exit+0x50/0xb0 <4>[ 7200.116045] [<ffffffc00022f430>] get_signal+0x50c/0x548 <4>[ 7200.116056] [<ffffffc000207c5c>] do_signal+0xb0/0x514 <4>[ 7200.116067] [<ffffffc0002082b4>] do_notify_resume+0x28/0x6c <4>[ 7200.116077] [<ffffffc000203d28>] work_pending+0x1c/0x20 Now if I look for other crashes that are supposed to have the same version number, like fd37062ecb8358ef: <0>[11160.151002] Call trace: <4>[11160.151017] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8 <4>[11160.151028] [<ffffffc00093ae28>] __schedule+0x504/0x740 <4>[11160.151035] [<ffffffc00093b0ec>] schedule+0x88/0xa8 <4>[11160.151043] [<ffffffc00093ed40>] schedule_timeout+0x44/0x27c <4>[11160.151051] [<ffffffc00093a8e4>] io_schedule_timeout+0x7c/0xbc <4>[11160.151058] [<ffffffc00093ba44>] bit_wait_io+0x1c/0x6c <4>[11160.151066] [<ffffffc00093b8b8>] __wait_on_bit_lock+0x70/0xcc <4>[11160.151074] [<ffffffc000303134>] __lock_page+0x7c/0x9c <4>[11160.151082] [<ffffffc00030db50>] write_cache_pages+0x1a4/0x3e0 <4>[11160.151089] [<ffffffc00030dde8>] generic_writepages+0x5c/0x8c <4>[11160.151099] [<ffffffc0004260e4>] jbd2_journal_commit_transaction+0x6a4/0x183c <4>[11160.151108] [<ffffffc00042c65c>] kjournald2+0xe4/0x278 <4>[11160.151115] [<ffffffc00023f31c>] kthread+0x124/0x12c <4>[11160.151123] [<ffffffc000203dd0>] ret_from_fork+0x10/0x40 Notice that the address of "__schedule+0x504/0x740" is different. In the first it was ffffffc00093ace8 and in the second ffffffc00093ae28. Even more telling is this difference: <3>[ 7200.115845] Not tainted 4.4.70-11339-gb1947ccd8d12 #1 vs. <3>[11160.150977] Not tainted 4.4.70-11357-gcb8919185fe3 #1 To me that means that these are different builds and the crashing device had a kernel from 9/1 (based on the git log of hash b1947ccd8d12). --- The first explanation that would come to mind would be that perhaps the old version was "too old" and didn't log the kernel version. ...but that falls apart because the Mali crash was _introduced_ in R61 and the fix for bug #590757 was supposed to be in R60 sometime... --- So I guess the question is: what's happening here? Why are we not getting the right version?
,
Sep 26 2017
From above, problematic crash report was: 8f1e8b7c5a9c1325 ...so link to crash report is: http://crash.corp/8f1e8b7c5a9c1325 From above, crash report that was on the same version but _did_ seem to have the right kernel version was: fd37062ecb8358ef ...so link to crash report is: http://crash.corp/fd37062ecb8358ef
,
Oct 16 2017
the CL that should have fixed this landed here: https://chromium-review.googlesource.com/473608 which should be part of R60+. the way the fix works is that it needs to boot the system, collect the data, and only crashes that happened after that point would have the right version. if a crash occurred before the upgrade (or was otherwise queued), and then the update happened, it's possible for the wrong version to be attached to any reports not yet uploaded. so there might be a bug, or we might be seeing a long tail of upgrades & unuploaded crashes.
,
Oct 16 2017
> so there might be a bug, or we might be seeing a long tail of upgrades & unuploaded crashes. Right, except (I think) that's impossible given the data above. Specifically, the two git hashes mentioned above: <3>[ 7200.115845] Not tainted 4.4.70-11339-gb1947ccd8d12 #1 vs. <3>[11160.150977] Not tainted 4.4.70-11357-gcb8919185fe3 #1 --- commit b1947ccd8d12b477fda9de3330ac5da0d2d0274b Author: Rodrigo Vivi <rodrigo.vivi@intel.com> AuthorDate: Wed Aug 9 15:50:21 2017 -0700 Commit: ChromeOS Commit Bot <chromeos-commit-bot@chromium.org> CommitDate: Fri Sep 1 23:49:45 2017 +0000 --- commit cb8919185fe3e996ad1cc480b0c74fe57b046813 Author: oder_chiou@realtek.com <oder_chiou@realtek.com> AuthorDate: Wed Aug 30 13:08:31 2017 +0800 Commit: ChromeOS Commit Bot <chromeos-commit-bot@chromium.org> CommitDate: Wed Sep 20 05:51:31 2017 +0000 --- The only way that the device could have reported a git hash of a kernel commit that landed on Sep 1 is if it had a new build. ...and that means it should have had the fix. --- That's why I filed this bug and believe that we might have a bug.
,
Oct 16 2017
gotcha (and you already explained that in your initial comment) we have the logging_KernelCrash autotest, but i don't think that's part of any suite, and i don't think it was updated to check for this. so i guess we should start there ? :)
,
Oct 16 2017
As further examples: --- http://crash.corp/fed8797e6a924d93 Claims to be: 9901.46.0 ...date: 2017-10-12 21:11 ...commit of top kernel in that build: cedb2868 ...date of top kernel commit: Wed Oct 11 19:16:05 2017 +0000 String from the console-ramoops: Not tainted 4.4.79-11626-g960c8bae43ca #1 ...commit from that string: 960c8bae43ca ...date of that commit: Tue Sep 26 21:07:47 2017 +0000 (consistent with the fact that the crashing system must have been running the previous pushed build, 9901.35.1) --- Updating the autotest to check for this seems sane. Ben: do you want to do that, or should I?
,
Oct 16 2017
I can take a look at the autotest later this week. One possibility I've thought of: How early in boot are these crashes happening? The current running version isn't written to /var/spool/crash until the boot process gets far enough to start crash_reporter, so a very early crash on a new kernel could still potentially pick up the old version.
,
Oct 16 2017
the first number in the kernel log tells you how many seconds since boot: <0>[ 7200.115883] Call trace: so that's ~2 hours. if it took us ~2 hours to get to the crash-reporter init i think we've got a bigger problem :D. also i think the fail safe logic would have kicked in by then. i *think* these are crashes being collected via pstore (i.e. they trigger a panic/reboot and we collect them on the next boot). the autotest in question should be able to reproduce this scenario.
,
Oct 16 2017
@7: Thanks! The crashes happen very late. Also if we were running on the old version and rebooted to the new version, we definitely had enough time to fully download and apply an auto update.
,
Oct 18 2017
Still working on the autotests, but here's a new possibility: this code removes files that don't have a .meta once they're 24 hours old: http://cs/chromeos_public/src/platform2/crash-reporter/crash_sender?l=552&rcl=d3af8130a36aa1fd6af2d8b880eb6f309617905b Since os-release and lsb-release are copied at boot, I'm not sure how they'd be 24 hours old by the time that first crash happened at 7200s, but I'll keep digging.
,
Oct 18 2017
7200s is the kernel uptime, not the wall clock time. If the device was suspended then kernel uptime doesn't tick, but wall clock time does...
,
Oct 19 2017
none of those crashes include logs, so can't check if "Removing old orphaned file" exists there. but seems likely that string would show up in other reports. is there a way to search feedback reports or archives in crash reports ? or do we just have to download a ton of files and write a script to unpack & grep ? :(
,
Oct 19 2017
> is there a way to search feedback reports or archives in crash > reports ? or do we just have to download a ton of files and > write a script to unpack & grep ? :( Someone did that, but I'm drawing a blank for who. I'll ask around at the office today. ...but easier is for me to just open file:///var/log/messages.* on the kevin I'm typing on right now. On file:///var/log/messages.2 I found: 2017-10-17T15:03:10.584663-07:00 NOTICE crash_sender[25627]: Removing old orphaned file: /var/spool/crash/lsb-release. 2017-10-17T15:03:10.596982-07:00 NOTICE crash_sender[25632]: Removing old orphaned file: /var/spool/crash/os-release. ...so yeah, seems like the problem.
,
Oct 19 2017
,
Oct 19 2017
crash_sender is so helpful! gold star!
,
Oct 19 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/20a23362c557f400e221de9b3b39200b3b546e4c commit 20a23362c557f400e221de9b3b39200b3b546e4c Author: Benjamin Gordon <bmgordon@chromium.org> Date: Thu Oct 19 19:56:50 2017 crash: Preserve cached os version files /etc/os-release and /etc/lsb-release are cached in /var/spool/crash so that kernel crashes can be correctly matched up to the running version at the time of the crash (details in crbug.com/590757 ). The logic that cleans up orphaned crash reports is picking up these files and deleting them 24 hours after boot. Fix this by explicitly preserving the two cached files. BUG= chromium:768467 TEST=Manually touched files in /var/spool/crash and verified that crash_sender doesn't delete them. Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f Reviewed-on: https://chromium-review.googlesource.com/728299 Commit-Ready: Benjamin Gordon <bmgordon@chromium.org> Tested-by: Benjamin Gordon <bmgordon@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/20a23362c557f400e221de9b3b39200b3b546e4c/crash-reporter/crash_sender
,
Oct 19 2017
I think we want this in M-62 to help make our lives easier, and the change above is super safe.
,
Oct 19 2017
This bug requires manual review: Request affecting a post-stable build Please contact the milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 20 2017
,
Oct 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/18692bcca06d6d8868e99770f79185a8e9617290 commit 18692bcca06d6d8868e99770f79185a8e9617290 Author: Benjamin Gordon <bmgordon@chromium.org> Date: Fri Oct 20 22:47:50 2017 crash: Preserve cached os version files /etc/os-release and /etc/lsb-release are cached in /var/spool/crash so that kernel crashes can be correctly matched up to the running version at the time of the crash (details in crbug.com/590757 ). The logic that cleans up orphaned crash reports is picking up these files and deleting them 24 hours after boot. Fix this by explicitly preserving the two cached files. BUG= chromium:768467 TEST=Manually touched files in /var/spool/crash and verified that crash_sender doesn't delete them. Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f Reviewed-on: https://chromium-review.googlesource.com/728299 Commit-Ready: Benjamin Gordon <bmgordon@chromium.org> Tested-by: Benjamin Gordon <bmgordon@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> (cherry picked from commit 20a23362c557f400e221de9b3b39200b3b546e4c) Reviewed-on: https://chromium-review.googlesource.com/730776 Reviewed-by: Benjamin Gordon <bmgordon@chromium.org> Commit-Queue: Benjamin Gordon <bmgordon@chromium.org> [modify] https://crrev.com/18692bcca06d6d8868e99770f79185a8e9617290/crash-reporter/crash_sender
,
Oct 22 2017
Merged to M63 (thanks!), so just waiting on M-62 decision.
,
Oct 23 2017
,
Oct 23 2017
Approved for 62.
,
Oct 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/790759393832017a20b7af663241bff6954f11ff commit 790759393832017a20b7af663241bff6954f11ff Author: Benjamin Gordon <bmgordon@chromium.org> Date: Mon Oct 23 17:26:15 2017 crash: Preserve cached os version files /etc/os-release and /etc/lsb-release are cached in /var/spool/crash so that kernel crashes can be correctly matched up to the running version at the time of the crash (details in crbug.com/590757 ). The logic that cleans up orphaned crash reports is picking up these files and deleting them 24 hours after boot. Fix this by explicitly preserving the two cached files. BUG= chromium:768467 TEST=Manually touched files in /var/spool/crash and verified that crash_sender doesn't delete them. Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f Reviewed-on: https://chromium-review.googlesource.com/728299 Commit-Ready: Benjamin Gordon <bmgordon@chromium.org> Tested-by: Benjamin Gordon <bmgordon@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> (cherry picked from commit 20a23362c557f400e221de9b3b39200b3b546e4c) Reviewed-on: https://chromium-review.googlesource.com/733880 Commit-Queue: Benjamin Gordon <bmgordon@chromium.org> Reviewed-by: Benjamin Gordon <bmgordon@chromium.org> [modify] https://crrev.com/790759393832017a20b7af663241bff6954f11ff/crash-reporter/crash_sender
,
Oct 23 2017
Merge into M62 is done.
,
Oct 23 2017
Adding merge tag and branch tags so we know where it was fixed in quickly. :) Thanks!
,
Oct 23 2017
what about /var/lib/ follow up work ?
,
Oct 23 2017
@27: Good point. I'll file a new bug.
,
Oct 23 2017
Moved to bug #777490 |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by vapier@chromium.org
, Sep 26 2017