New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 768467 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

crash_sender cleaning up saved OS version info

Project Member Reported by diand...@chromium.org, Sep 25 2017

Issue description

I was digging through kernel crash reports for Kevin and found a crash report 8f1e8b7c5a9c1325.

That contains:

<3>[ 7060.531837] mali ff9a0000.gpu: Issueing GPU soft-reset because jobs failed to be killed (within 1000 ms) as part of context termination (e.g. process exit)
<3>[ 7060.531854] mali ff9a0000.gpu: Preparing to soft-reset GPU: Waiting (upto 3000 ms) for all jobs to complete soft-stop
<3>[ 7060.531964] mali ff9a0000.gpu: Resetting GPU (allowing up to 500 ms)
<3>[ 7060.531972] mali ff9a0000.gpu: Register state:
<3>[ 7060.531983] mali ff9a0000.gpu:   GPU_IRQ_RAWSTAT=0x00000200 GPU_STATUS=0x00000000
<3>[ 7060.531993] mali ff9a0000.gpu:   JOB_IRQ_RAWSTAT=0x00000000 JOB_IRQ_JS_STATE=0x00000000 JOB_IRQ_THROTTLE=0x00000000
<3>[ 7060.532002] mali ff9a0000.gpu:   JS0_STATUS=0x00000000      JS0_HEAD_LO=0xe47f9f00
<3>[ 7060.532010] mali ff9a0000.gpu:   JS1_STATUS=0x00000000      JS1_HEAD_LO=0xe47f9d40
<3>[ 7060.532019] mali ff9a0000.gpu:   JS2_STATUS=0x00000000      JS2_HEAD_LO=0x00000000
<3>[ 7060.532027] mali ff9a0000.gpu:   MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000
<3>[ 7060.532035] mali ff9a0000.gpu:   GPU_IRQ_MASK=0x00000000    JOB_IRQ_MASK=0x00000000     MMU_IRQ_MASK=0x00000000
<3>[ 7060.532043] mali ff9a0000.gpu:   PWR_OVERRIDE0=0x00000000   PWR_OVERRIDE1=0x00000000
<3>[ 7060.532051] mali ff9a0000.gpu:   SHADER_CONFIG=0x00010000   L2_MMU_CONFIG=0x00000000
<3>[ 7060.532117] mali ff9a0000.gpu: Reset complete

Those were supposed to go away with <https://chromium-review.googlesource.com/655538> (CHROMIUM: MALI: fix async wait for fence).

---

At first I was a bit worried that we still hand't fixed this, but then I looked more carefully.  I began to wonder if perhaps we were doing our old favorite thing and reporting the crash with the wrong version.  AKA: maybe the fix for  bug #590757  is somehow not working properly?

Here's my evidence.  When I look at 8f1e8b7c5a9c1325 and look at stack crawls, I see ones that look like:

<0>[ 7200.115883] Call trace:
<4>[ 7200.115902] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8
<4>[ 7200.115916] [<ffffffc00093ace8>] __schedule+0x504/0x740
<4>[ 7200.115927] [<ffffffc00093afac>] schedule+0x88/0xa8
<4>[ 7200.115941] [<ffffffc0005eb7c0>] kbase_jm_wait_for_zero_jobs+0xd8/0x228
<4>[ 7200.115955] [<ffffffc0005d19ac>] kbase_jd_zap_context+0xe0/0xf8
<4>[ 7200.115967] [<ffffffc0005d63cc>] kbase_destroy_context+0x48/0x19c
<4>[ 7200.115979] [<ffffffc0005e050c>] kbase_release+0x134/0x16c
<4>[ 7200.115990] [<ffffffc00035ce28>] __fput+0x104/0x1cc
<4>[ 7200.115999] [<ffffffc00035cf64>] ____fput+0x20/0x2c
<4>[ 7200.116009] [<ffffffc00023d4cc>] task_work_run+0xa4/0xd4
<4>[ 7200.116022] [<ffffffc0002224a8>] do_exit+0x494/0x8f8
<4>[ 7200.116033] [<ffffffc000223a48>] do_group_exit+0x50/0xb0
<4>[ 7200.116045] [<ffffffc00022f430>] get_signal+0x50c/0x548
<4>[ 7200.116056] [<ffffffc000207c5c>] do_signal+0xb0/0x514
<4>[ 7200.116067] [<ffffffc0002082b4>] do_notify_resume+0x28/0x6c
<4>[ 7200.116077] [<ffffffc000203d28>] work_pending+0x1c/0x20


Now if I look for other crashes that are supposed to have the same version number, like fd37062ecb8358ef:

<0>[11160.151002] Call trace:
<4>[11160.151017] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8
<4>[11160.151028] [<ffffffc00093ae28>] __schedule+0x504/0x740
<4>[11160.151035] [<ffffffc00093b0ec>] schedule+0x88/0xa8
<4>[11160.151043] [<ffffffc00093ed40>] schedule_timeout+0x44/0x27c
<4>[11160.151051] [<ffffffc00093a8e4>] io_schedule_timeout+0x7c/0xbc
<4>[11160.151058] [<ffffffc00093ba44>] bit_wait_io+0x1c/0x6c
<4>[11160.151066] [<ffffffc00093b8b8>] __wait_on_bit_lock+0x70/0xcc
<4>[11160.151074] [<ffffffc000303134>] __lock_page+0x7c/0x9c
<4>[11160.151082] [<ffffffc00030db50>] write_cache_pages+0x1a4/0x3e0
<4>[11160.151089] [<ffffffc00030dde8>] generic_writepages+0x5c/0x8c
<4>[11160.151099] [<ffffffc0004260e4>] jbd2_journal_commit_transaction+0x6a4/0x183c
<4>[11160.151108] [<ffffffc00042c65c>] kjournald2+0xe4/0x278
<4>[11160.151115] [<ffffffc00023f31c>] kthread+0x124/0x12c
<4>[11160.151123] [<ffffffc000203dd0>] ret_from_fork+0x10/0x40


Notice that the address of "__schedule+0x504/0x740" is different.  In the first it was ffffffc00093ace8 and in the second ffffffc00093ae28.

Even more telling is this difference:

<3>[ 7200.115845]       Not tainted 4.4.70-11339-gb1947ccd8d12 #1

vs. 

<3>[11160.150977]       Not tainted 4.4.70-11357-gcb8919185fe3 #1


To me that means that these are different builds and the crashing device had a kernel from 9/1 (based on the git log of hash b1947ccd8d12).

---

The first explanation that would come to mind would be that perhaps the old version was "too old" and didn't log the kernel version.  ...but that falls apart because the Mali crash was _introduced_ in R61 and the fix for  bug #590757  was supposed to be in R60 sometime...

---

So I guess the question is: what's happening here?  Why are we not getting the right version?
 

Comment 1 by vapier@chromium.org, Sep 26 2017

link to the crash report ?
From above, problematic crash report was: 8f1e8b7c5a9c1325

...so link to crash report is: http://crash.corp/8f1e8b7c5a9c1325

From above, crash report that was on the same version but _did_ seem to have the right kernel version was: fd37062ecb8358ef

...so link to crash report is: http://crash.corp/fd37062ecb8358ef

Comment 3 by vapier@chromium.org, Oct 16 2017

the CL that should have fixed this landed here:
  https://chromium-review.googlesource.com/473608
which should be part of R60+.

the way the fix works is that it needs to boot the system, collect the data, and only crashes that happened after that point would have the right version.  if a crash occurred before the upgrade (or was otherwise queued), and then the update happened, it's possible for the wrong version to be attached to any reports not yet uploaded.

so there might be a bug, or we might be seeing a long tail of upgrades & unuploaded crashes.
> so there might be a bug, or we might be seeing a long tail of upgrades & unuploaded crashes.

Right, except (I think) that's impossible given the data above.  Specifically, the two git hashes mentioned above:

<3>[ 7200.115845]       Not tainted 4.4.70-11339-gb1947ccd8d12 #1

vs.

<3>[11160.150977]       Not tainted 4.4.70-11357-gcb8919185fe3 #1

---

commit b1947ccd8d12b477fda9de3330ac5da0d2d0274b
Author:     Rodrigo Vivi <rodrigo.vivi@intel.com>
AuthorDate: Wed Aug 9 15:50:21 2017 -0700
Commit:     ChromeOS Commit Bot <chromeos-commit-bot@chromium.org>
CommitDate: Fri Sep 1 23:49:45 2017 +0000

---

commit cb8919185fe3e996ad1cc480b0c74fe57b046813
Author:     oder_chiou@realtek.com <oder_chiou@realtek.com>
AuthorDate: Wed Aug 30 13:08:31 2017 +0800
Commit:     ChromeOS Commit Bot <chromeos-commit-bot@chromium.org>
CommitDate: Wed Sep 20 05:51:31 2017 +0000

---

The only way that the device could have reported a git hash of a kernel commit that landed on Sep 1 is if it had a new build.  ...and that means it should have had the fix.

---

That's why I filed this bug and believe that we might have a bug.

Comment 5 by vapier@chromium.org, Oct 16 2017

gotcha (and you already explained that in your initial comment)

we have the logging_KernelCrash autotest, but i don't think that's part of any suite, and i don't think it was updated to check for this.  so i guess we should start there ? :)
As further examples:

---

http://crash.corp/fed8797e6a924d93

Claims to be:                          9901.46.0
...date:                               2017-10-12 21:11
...commit of top kernel in that build: cedb2868
...date of top kernel commit:          Wed Oct 11 19:16:05 2017 +0000


String from the console-ramoops:       Not tainted 4.4.79-11626-g960c8bae43ca #1
...commit from that string:            960c8bae43ca
...date of that commit:                Tue Sep 26 21:07:47 2017 +0000

(consistent with the fact that the crashing system must have been running the previous pushed build, 9901.35.1)

---

Updating the autotest to check for this seems sane.  Ben: do you want to do that, or should I?



I can take a look at the autotest later this week.

One possibility I've thought of: How early in boot are these crashes happening?  The current running version isn't written to /var/spool/crash until the boot process gets far enough to start crash_reporter, so a very early crash on a new kernel could still potentially pick up the old version.

Comment 8 by vapier@chromium.org, Oct 16 2017

the first number in the kernel log tells you how many seconds since boot:
  <0>[ 7200.115883] Call trace:
so that's ~2 hours.

if it took us ~2 hours to get to the crash-reporter init i think we've got a bigger problem :D.  also i think the fail safe logic would have kicked in by then.

i *think* these are crashes being collected via pstore (i.e. they trigger a panic/reboot and we collect them on the next boot).  the autotest in question should be able to reproduce this scenario.
@7: Thanks!

The crashes happen very late.  Also if we were running on the old version and rebooted to the new version, we definitely had enough time to fully download and apply an auto update.
Still working on the autotests, but here's a new possibility: this code removes files that don't have a .meta once they're 24 hours old:

http://cs/chromeos_public/src/platform2/crash-reporter/crash_sender?l=552&rcl=d3af8130a36aa1fd6af2d8b880eb6f309617905b

Since os-release and lsb-release are copied at boot, I'm not sure how they'd be 24 hours old by the time that first crash happened at 7200s, but I'll keep digging.

7200s is the kernel uptime, not the wall clock time.  If the device was suspended then kernel uptime doesn't tick, but wall clock time does...
none of those crashes include logs, so can't check if "Removing old orphaned file" exists there.  but seems likely that string would show up in other reports.

is there a way to search feedback reports or archives in crash reports ?  or do we just have to download a ton of files and write a script to unpack & grep ? :(
> is there a way to search feedback reports or archives in crash 
> reports ?  or do we just have to download a ton of files and 
> write a script to unpack & grep ? :(

Someone did that, but I'm drawing a blank for who.  I'll ask around at the office today.

...but easier is for me to just open file:///var/log/messages.* on the kevin I'm typing on right now.  On file:///var/log/messages.2 I found:

2017-10-17T15:03:10.584663-07:00 NOTICE crash_sender[25627]: Removing old orphaned file: /var/spool/crash/lsb-release.
2017-10-17T15:03:10.596982-07:00 NOTICE crash_sender[25632]: Removing old orphaned file: /var/spool/crash/os-release.

...so yeah, seems like the problem.
Status: Started (was: Untriaged)
Summary: crash_sender cleaning up saved OS version info (was: Possibly still getting wrong version in kernel crash reporting?)
crash_sender is so helpful!  gold star!
Project Member

Comment 16 by bugdroid1@chromium.org, Oct 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/20a23362c557f400e221de9b3b39200b3b546e4c

commit 20a23362c557f400e221de9b3b39200b3b546e4c
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Thu Oct 19 19:56:50 2017

crash: Preserve cached os version files

/etc/os-release and /etc/lsb-release are cached in /var/spool/crash so
that kernel crashes can be correctly matched up to the running version
at the time of the crash (details in  crbug.com/590757 ).

The logic that cleans up orphaned crash reports is picking up these
files and deleting them 24 hours after boot.  Fix this by explicitly
preserving the two cached files.

BUG= chromium:768467 
TEST=Manually touched files in /var/spool/crash and verified that
     crash_sender doesn't delete them.

Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f
Reviewed-on: https://chromium-review.googlesource.com/728299
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/20a23362c557f400e221de9b3b39200b3b546e4c/crash-reporter/crash_sender

Cc: bhthompson@chromium.org diand...@chromium.org
Labels: Merge-Request-63 Merge-Request-62
I think we want this in M-62 to help make our lives easier, and the change above is super safe.  
Project Member

Comment 18 by sheriffbot@chromium.org, Oct 19 2017

Labels: -Merge-Request-62 Merge-Review-62 Hotlist-Merge-Review
This bug requires manual review: Request affecting a post-stable build
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Request-63 Merge-Approved-63
Project Member

Comment 20 by bugdroid1@chromium.org, Oct 20 2017

Labels: merge-merged-release-R63-10032.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/18692bcca06d6d8868e99770f79185a8e9617290

commit 18692bcca06d6d8868e99770f79185a8e9617290
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Fri Oct 20 22:47:50 2017

crash: Preserve cached os version files

/etc/os-release and /etc/lsb-release are cached in /var/spool/crash so
that kernel crashes can be correctly matched up to the running version
at the time of the crash (details in  crbug.com/590757 ).

The logic that cleans up orphaned crash reports is picking up these
files and deleting them 24 hours after boot.  Fix this by explicitly
preserving the two cached files.

BUG= chromium:768467 
TEST=Manually touched files in /var/spool/crash and verified that
     crash_sender doesn't delete them.

Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f
Reviewed-on: https://chromium-review.googlesource.com/728299
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
(cherry picked from commit 20a23362c557f400e221de9b3b39200b3b546e4c)
Reviewed-on: https://chromium-review.googlesource.com/730776
Reviewed-by: Benjamin Gordon <bmgordon@chromium.org>
Commit-Queue: Benjamin Gordon <bmgordon@chromium.org>

[modify] https://crrev.com/18692bcca06d6d8868e99770f79185a8e9617290/crash-reporter/crash_sender

Labels: -Merge-Approved-63
Merged to M63 (thanks!), so just waiting on M-62 decision.
Cc: -bhthompson@chromium.org bmgordon@chromium.org
Owner: bhthompson@chromium.org
Status: Assigned (was: Started)
Labels: -Hotlist-Merge-Review -Merge-Review-62 Merge-Approved-62
Approved for 62. 
Project Member

Comment 24 by bugdroid1@chromium.org, Oct 23 2017

Labels: merge-merged-release-R62-9901.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/790759393832017a20b7af663241bff6954f11ff

commit 790759393832017a20b7af663241bff6954f11ff
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Mon Oct 23 17:26:15 2017

crash: Preserve cached os version files

/etc/os-release and /etc/lsb-release are cached in /var/spool/crash so
that kernel crashes can be correctly matched up to the running version
at the time of the crash (details in  crbug.com/590757 ).

The logic that cleans up orphaned crash reports is picking up these
files and deleting them 24 hours after boot.  Fix this by explicitly
preserving the two cached files.

BUG= chromium:768467 
TEST=Manually touched files in /var/spool/crash and verified that
     crash_sender doesn't delete them.

Change-Id: I63bebaa56632e17756ef777b51e58188ccc88e4f
Reviewed-on: https://chromium-review.googlesource.com/728299
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
(cherry picked from commit 20a23362c557f400e221de9b3b39200b3b546e4c)
Reviewed-on: https://chromium-review.googlesource.com/733880
Commit-Queue: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Benjamin Gordon <bmgordon@chromium.org>

[modify] https://crrev.com/790759393832017a20b7af663241bff6954f11ff/crash-reporter/crash_sender

Labels: -Merge-Approved-62
Status: Fixed (was: Assigned)
Merge into M62 is done.
Labels: M-62 M-63 Merge-Merged
Adding merge tag and branch tags so we know where it was fixed in quickly.  :)  Thanks!
what about /var/lib/ follow up work ?
@27: Good point.  I'll file a new bug.
Cc: -bmgordon@chromium.org bhthompson@chromium.org
Owner: bmgordon@chromium.org
Moved to  bug #777490 

Sign in to add a comment