New issue
Advanced search Search tips

Issue 792562 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Dec 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

[Stumpy] Kernel crashes are not creating reports in /var/spool/crash directory.

Project Member Reported by sontis@chromium.org, Dec 6 2017

Issue description

Chrome Version: 10176.5.0 / 64.0.3282.11
OS: ChromeOS
What steps will reproduce the problem?
(1) Sign in to the device.
(2) Initiate kernel crash.
    sudo echo BUG > /sys/kernel/debug/provoke-crash/DIRECT
         ( or )
    Use keyboard "Alt + F10 + X + X + X" 


What is the expected result?
Device should reboot immediately and   /var/spool/crash directory should display kcrash dmp files.


What happens instead?
Device hanging for 30 seconds before reboot and Kcrash dmp files are missing.



 
Cc: vapier@chromium.org snanda@chromium.org
just to be clear, you don't mean literally a ".dmp" file right ?  kernel crashes don't produce those.  they produce ".kcrash" files which are copies of the kernel log that came from pstore.
yes, .Kcrash and .meta files are missing.
Summary: [Stumpy] Kernel crashes are not creating reports in /var/spool/crash directory. (was: [Stumpy] Kernel crashes are not creating dmp files in /var/spool/crash directory. )
Cc: semenzato@chromium.org
Owner: semenzato@chromium.org
Luigi, could you PTAL?

sontis: is console-ramoops file present under /sys/fs/pstore or /dev/pstore after the panic-reboot?
console-ramoops is NOT present after the panic-reboot.
No files in both  /sys/fs/pstore and /sys/pstore directories 
Sure, will take a look shortly.
Status: Started (was: Untriaged)
#7 sontis: please read #6 more carefully.  Sameer is asking whether /dev/pstore/console-ramoops was present, but your comment does not reference /dev/pstore.

Stumpy is on kernel 3.8, that's probably the location of console-ramoops.  (I'll now check sources to see if that change has been backported.)

Also, I'd like to ssh to this device, but from the logs it looks like eth0 never gets an IP address, although wlan0 does connect to GoogleGuest 100.110.200.6.  Can you put it on the test network so I can take a look?

Thanks!

@ Comment #10
Sorry for the typo, onsole-ramoops file is not present at /dev/pstore after the panic-reboot.

Currently I have recovery image installed on my stumpy device.(IP: 172.22.19.166). 




Mmm.  I got a stumpy and built and installed a ToT test image, and I cannot reproduce this problem.  I am seeing a normal /dev/pstore/console-ramoops as expected.  I may have to take another look at your system.

Cc: tbroch@chromium.org
I asked Todd to try this on his stumpy, where he has an R59 image.  He can reproduce this, there is no /dev/pstore/console-ramoops after testing as described in comment #0.

Comment 14 Deleted

More precisely, R59-9460.73.0 test image.
Still able to reproduce this issue on ToT recovery image (M65 10194.0.0)
/dev/pstore and /sys/fs/pstore directories are empty. 

Will try with ToT Test image.
Fails on R57-9202.64.0 as well.
Facing same issue on ToT Test image(M65 10194.0.0)
console-ramoops file is not present at /dev/pstore or /sys/fs/pstore after the panic-reboot.
I just ran a recovery using the R65-10193.0.0 recovery image.  Then I rebooted as per #0, and I have a proper /dev/pstore/console-ramoops.  So I cannot repro this at all.

Next, we're going to swap hardware.

So we swapped hardware, I now have tag 460700, and the problem reproduces with ToT test image, so this is likely a HW or FW problem.  The HW could be flaky or different.

crossystem gives me this info:

fwid                   = Google_Stumpy.2.102.0  (ro_fwid is the same)
hwid                   = STUMPY YELLOW A-C 7520

Another thing that's a little strange with this hardware is that when I crash the system, the shell stops echoing characters immediately, but the screen stays on for several seconds before the system reboots.
Not able to reproduce this issue on stumpy device with Ass. Tag C020673

fwid: Google_Stumpy.2.102.0(ro_fwid is the same)
hwid: STUMPY DOGFOOD GREEN A-A 8242

My dev has,
localhost ~ # crossystem | grep wid
fwid                   = Google_Stumpy.2.102.0          # Active firmware ID
hwid                   = STUMPY DOGFOOD GREEN A-A 8242  # Hardware ID
ro_fwid                = Google_Stumpy.2.102.0          # Read-only firmware ID
Another unusual symptom is that when I turn off the device with the power button, the screen goes into whiteout and then off, but the LED in the power button stays on.  Pushing the power button again does nothing.  A long press of the power button finally turns the system off.


Same (or very similar) behavior with R52 test image from July 20 2016.
Same behavior in R32 (February 2014).

Sridhar, would you be so kind to check that the working unit (the one currently in your hands) is showing the correct behavior with respect to #23?  I would be happier if the LED in the power button went off.

(In fact there are many other things that would make me happier, but let's not get into that.)
@ Comment #25
The working unit is showing correct behavior with power button.
LED in the Power button went off immediately after shutdown. 

Comment 27 Deleted

the device does not reboot properly, or a power down is forced, then it's conceivable that the memory was lost.  pstore only works across reboots, and any time the system powers off, the contents of RAM are going to degrade.
If you arrived here from CL 811984:

https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/811984/

chromeos-base/swap-init: make min_filelist default dynamic

that was a mistake---that CL should have pointed to (unrelated)  issue 792152 .  Sorry for the inconvenience.
I can get kernel output on a serial line by replacing the wifi NIC with a 2-port RS232 Mini PCI Express Serial Card with 16950 UART by StarTech.com.  The card has a different format from the wifi NIC, so I use a spoon from the cafeteria to hold it in place (see picture).

Now I can see the messages at shutdown, but there isn't much.  This is the output after executing "halt".  The LED in the power button stays on, and the power button no longer responds to regular presses.

[  747.704581] EXT4-fs (sda3): re-mounted. Opts: (null)
[  747.711715] init: crash-sender main process (1430) killed by TERM signal
[  747.711871] init: log-rotate main process (1456) killed by TERM signal
[  747.742889] tpm_tis tpm_tis: command 0xba (size 18) returned code 0x0
[  747.780821] init: chapsd main process (1186) killed by PIPE signal
[  747.790770] tpm_tis tpm_tis: command 0xba (size 18) returned code 0xc
[  747.819957] init: recover_duts main process (1852) killed by TERM signal
[  747.842565] init: cras main process (1540) terminated with status 143
[  747.862574] init: tlsdated main process (2113) killed by TERM signal
[  747.882370] init: anomaly-collector main process (2122) killed by TERM signal
[  747.904466] init: powerd main process (1088) killed by TERM signal
[  747.923703] init: cros-machine-id-regen-periodic main process (2169) killed by TERM signal
[  747.949311] init: activate_date main process (2225) killed by TERM signal
[  747.970296] init: netfilter-queue main process (2238) killed by TERM signal
[  748.004855] init: cryptohomed main process (1280) killed by TERM signal
[  748.039433] tpm_tis tpm_tis: command 0xba (size 18) returned code 0x0
[  748.147320] tpm_tis tpm_tis: command 0x98 (size 10) returned code 0x0
[  748.373103] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[  748.755365] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  748.774367] sd 0:0:0:0: [sda] Stopping disk
[  748.803455] ACPI: Preparing to enter system sleep state S5
[  748.820541] Disabling non-boot CPUs ...
[  748.838365] Power down.
[  748.846097] acpi_power_off called

To get this to work, there are some instructions here:

https://www.chromium.org/chromium-os/how-tos-and-troubleshooting/serial-debugging-howto

For the kernel command line, I used 

console=uart8250,mmio,0xe0801000 loglevel=9

The mmio address was taken from /var/log/messages (search for ttyS0).  Simply using /dev/ttyS0, as the instructions suggest, did not work.

On the workstation side, I used

cu -l /dev/ttyUSB0 -s 38400

It may be a good idea to up the speed on both sides because booting is pretty slow.

serial.jpg
6.8 MB View Download
Triggering a kernel panic with the command in #0 produces this output:

[  114.490866] lkdtm: Performing direct entry BUG
[  114.504549] ------------[ cut here ]------------
[  114.518707] kernel BUG at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-3_8-3.8.11-r651/work/chromeos-kernel-3_8-3.8.11/drivers/misc/lkdtm.c:375!
[  114.560855] invalid opcode: 0000 [#1] SMP 
[  114.573435] Modules linked in: cmac rfcomm uinput i2c_dev snd_hda_codec_hdmi snd_hda_codec_cirrus aesni_intel ath3k snd_hda_intel snd_hda_codec snd_hwdep btusb btrtl xts btbcm btintel aes_x86_64 lrw ablk_helper cryptd bluetooth memconsole r8169 zram(C) snd_pcm fuse zsmalloc(C) snd_page_alloc cfg80211 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer cdc_ether usbnet
[  114.696709] CPU 2 
[  114.702331] Pid: 2587, comm: bash Tainted: G         C   3.8.11 #1
[  114.721827] RIP: 0010:[<ffffffff93f16af4>]  [<ffffffff93f16af4>] lkdtm_do_action+0x3f/0x3a6
[  114.747473] RSP: 0018:ffff88012a537e70  EFLAGS: 00010293
[  114.763759] RAX: 0000000000000000 RBX: ffff880121c56000 RCX: 0000000000003837
[  114.785644] RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000001
[  114.807529] RBP: ffff88012a537ed0 R08: 000000000000d5fa R09: 000000000000d5fa
[  114.829421] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
[  114.851315] R13: ffff88012a537f48 R14: 0000000000000002 R15: 00007f814cb956c0
[  114.873208] FS:  00007f814cd1d700(0000) GS:ffff88014fb00000(0000) knlGS:0000000000000000
[  114.898034] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.915661] CR2: 0000052d950aa010 CR3: 000000011d751000 CR4: 00000000000407e0
[  114.937550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  114.959440] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  114.981338] Process bash (pid: 2587, threadinfo ffff88012a536000, task ffff880146a2d280)
[  115.006158] Stack:
[  115.012315]  ffffffff940d8f64 0000000000000010 ffff88012a537ee0 ffff88012a537e98
[  115.035067]  00000000b94cabb8 ffff88012a537eb8 ffffffff9437a4ab ffff88014fb14c30
[  115.057815]  ffffea00048715a0 00000000b94cabb8 ffff880121c56000 0000000000000004
[  115.080567] Call Trace:
[  115.088071]  [<ffffffff940d8f64>] ? printk+0x5e/0x7a
[  115.103304]  [<ffffffff93f16f26>] direct_entry+0xcb/0xdb
[  115.119614]  [<ffffffff93cf64d6>] vfs_write+0xa6/0xca
[  115.135117]  [<ffffffff93cf65f0>] sys_write+0x5c/0x9a
[  115.150622]  [<ffffffff940dde42>] system_call_fastpath+0x16/0x1b
[  115.169045] Code: 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 83 ff 15 0f 87 5f 03 00 00 ff 24 fd 00 2e 26 94 48 c7 c7 f3 a0 37 94 31 c0 e8 e4 21 1c 00 <0f> 0b be 7a 01 00 00 48 c7 c7 fc a0 37 94 e8 a2 c7 d1 ff e9 32 
[  115.227868] RIP  [<ffffffff93f16af4>] lkdtm_do_action+0x3f/0x3a6
[  115.246318]  RSP <ffff88012a537e70>
[  115.257065] ---[ end trace 33a6ebe90ab195b6 ]---
[  115.280260] Kernel panic - not syncing: Fatal exception
[  115.296286] Kernel Offset: 0x12c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  115.338318] ACPI MEMORY or I/O RESET_REG.

so this can explain the different behavior.  The "reboot" command results in similar output:

[  167.434175] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[  167.811320] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  167.831523] Disabling non-boot CPUs ...
[  167.850023] Restarting system.
[  167.859377] reboot: machine restart
[  167.870293] ACPI MEMORY or I/O RESET_REG.


Todd and Sridhar, if it's not too much trouble, could you try this on your stumpies:

flashrom -r /tmp/x
sum /tmp/x

I would like to make sure we're running the same firmware.  Thanks!

I get this:


localhost ~ # flashrom -r /tmp/x   
flashrom v0.9.9  : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64)
flashrom v0.9.9  : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0xaceddc00.
Block protection could not be disabled!
Reading flash... SUCCESS
localhost ~ # sum /tmp/x
32538  8192

Google Chrome	65.0.3287.0 (Official Build) unknown (64-bit)
Revision	0
Platform	10194.0.0 (Official Build) dev-channel stumpy test
Firmware Version	Google_Stumpy.2.102.0
Customization ID	STUMPY




localhost / # flashrom -r /tmp/x
flashrom v0.9.9  : 831c609 : Nov 30 2017 02:04:47 UTC on Linux 3.8.11 (x86_64)
flashrom v0.9.9  : 831c609 : Nov 30 2017 02:04:47 UTC on Linux 3.8.11 (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7c6ddc00.
Reading flash... SUCCESS
localhost / # sum /tmp/x
07049  8192
doesn't that flashrom command dump the entire spi flash and thus will contain device-specific things like VPD ?  you'd have to use cbfstool to extract/check specific regions.
#34 Thank you Mike.

But why would the coreboot table be at different addresses?  Oh oh look.  And the device in #33 had the write-protect removed... so maybe someone reflashed it with firmware THAT WORKS...

So maybe I got lucky.

if disabling write protect and updating the firmware to current versions makes this issue go away, that's acceptable imo.  pstore cannot work correctly w/out functioning firmware support.
#35 no actually I wasn't lucky.  The coreboot table is a table of function pointers in RAM.  Its location depends on the amount of physical memory.

After removing one DIMM (I had two), I get the same location as #33:

localhost ~ # flashrom -r /tmp/x
flashrom v0.9.9  : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64)
flashrom v0.9.9  : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7c6ddc00.
Block protection could not be disabled!
Reading flash... SUCCESS

The shutdown behavior remains the same (i.e. faulty).

It may be wiser to pass this on to the firmware team.

OK one more try.  Sridhar would you be so kind to run flashrom -r /tmp/x again, then upload x to Drive and post the link?  The FW team will help me compare the images.

Thanks!
Mine is here:

https://drive.google.com/file/d/1oKcVP1DHyITHEsBMgyClPp5PuhjZSvlT

contains tarball which has
stumpy_tbroch_fw/image.bin

result of flashrom -r image.bin

sumpy_tbroch_fw/<REGION>

result of flashrom -x

I believe you can also generate those by providing the binary (image.bin) on the flashrom command line (--image=<file> perhaps?

With Stefan's help I compared the various sections of the dumps.  The firmware in all three devices is identical.

Next, let's compare the memory modules.
We swapped memory modules between my stumpy (bad) and Sridhar's (good).  The bad stumpy stayed bad, and the good one stayed good.  So it's not the memory.

Just in case: my stumpy (bad) has 2x2GB Samsung modules (1Rx8 PC3 - 10600S - 09 - 10 - ZZZ).  I also tried running with a single module (nothing changed).  Sridhar's (good) has 2x1GB Samsung (I didn't write down the exact model since it shouldn't matter).
Sadly my stumpy stopped working over the week-end :/
I am searching cautotest/ for generic_RebootTest, then selecting the stumpy jobs (pretty slow) and looking at the reboot times.  It looks like a couple of the DUTs require a longer time, similarly to what we see in the bad DUTs in our hands.

Further testing along this line would give us an approximate ratio of DUTs with faulty hardware.

4-2-8-15   163500220 release 49:54 -> 50.17 == 23s
4-2-9-4    163499241 paladin 31:21 -> 31.44 == 23s
4-2-9-1    163464395 paladin 14:57 -> 15:28 == 31s  !!!
4-2-8-15   163429387 release 21:13 -> 21:35 == 22s
4-2-9-15   163409297 paladin 00:49 -> 01:12 == 23s
4-2-8-13   163395594 paladin 52:53 -> 53:11 == 18s
4-2-9-3    163360741 release 06:51 -> 07:09 == 17s
4-2-9-8    163342853 paladin 40:01 -> 40:32 == 31s  !!!
4-9-2-11   163325729 release 48:25 -> 48:45 == 20s
4-9-2-11   163311227 release 09:46 -> 10:06 == 20s
4-2-9-7    163309991 paladin 24:07 -> 24:29 == 22s
4-2-9-7    163272444 paladin 20:01 -> 20:23 == 22s
4-2-9-1    163263933 paladin 37:40 -> 38:11 == 31s  !!! repeat
4-2-8-13   163239679 release 39:23 -> 39.41 == 18s
4-2-8-13   163224869 paladin 39:01 -> 39.23 == 22s
4-2-9-4    163184345 paladin 09:23 -> 09:45 == 22s
4-2-9-6    163176454 release 58:35 -> 58:57 == 22s
4-2-9-5    163145392 release 12:10 -> 12:32 == 22s
4-2-9-3    163089908 release 00:49 -> 01:07 == 18s
4-2-9-5    163069960 paladin 42:46 -> 43:09 == 23s
FWIW here's SPD related params for my 'bad' stumpy which you're welcome to borrow,

mosys memory spd print all
0 | DDR3 | SO-DIMM
1 | DDR3 | SO-DIMM
0 | 1-78: Samsung | 794aaf4c | M471B2873FHS-CH9
1 | 1-78: Samsung | 794aaf4d | M471B2873FHS-CH9
0 | 1024 | 1 | 64
1 | 1024 | 1 | 64
0 | DDR3-800, DDR3-1066, DDR3-1333
1 | DDR3-800, DDR3-1066, DDR3-1333


I also have a 2nd stumpy now in working order.  It appears to be 'good' (reboots fast).  Its SPD is,

mosys memory spd print all
0 | DDR3 | SO-DIMM
1 | DDR3 | SO-DIMM
0 | 1-78: Samsung | 613ea982 | M471B5773CHS-CH9
1 | 1-78: Samsung | 613ea709 | M471B5773CHS-CH9
0 | 2048 | 1 | 64
1 | 2048 | 1 | 64
0 | DDR3-800, DDR3-1066, DDR3-1333
1 | DDR3-800, DDR3-1066, DDR3-1333

I'll send you IP addresses should remote access prove useful for your testing.

Consequently I think this bug and  crbug.com/782329  may be dupe's as the 2nd stumpy resumes successfully as well.
Ping on this one since open for a while and a stable blocker.  Seems we should work to get this in..
Labels: -ReleaseBlock-Stable
Remove RBS since this issues appears to be HW related.
Status: Archived (was: Started)
stumpy shall be missed

Sign in to add a comment