[Stumpy] Kernel crashes are not creating reports in /var/spool/crash directory. |
|||||||
Issue description
Chrome Version: 10176.5.0 / 64.0.3282.11
OS: ChromeOS
What steps will reproduce the problem?
(1) Sign in to the device.
(2) Initiate kernel crash.
sudo echo BUG > /sys/kernel/debug/provoke-crash/DIRECT
( or )
Use keyboard "Alt + F10 + X + X + X"
What is the expected result?
Device should reboot immediately and /var/spool/crash directory should display kcrash dmp files.
What happens instead?
Device hanging for 30 seconds before reboot and Kcrash dmp files are missing.
,
Dec 6 2017
,
Dec 6 2017
just to be clear, you don't mean literally a ".dmp" file right ? kernel crashes don't produce those. they produce ".kcrash" files which are copies of the kernel log that came from pstore.
,
Dec 6 2017
yes, .Kcrash and .meta files are missing.
,
Dec 6 2017
,
Dec 6 2017
Luigi, could you PTAL? sontis: is console-ramoops file present under /sys/fs/pstore or /dev/pstore after the panic-reboot?
,
Dec 6 2017
console-ramoops is NOT present after the panic-reboot. No files in both /sys/fs/pstore and /sys/pstore directories
,
Dec 6 2017
Sure, will take a look shortly.
,
Dec 6 2017
,
Dec 7 2017
#7 sontis: please read #6 more carefully. Sameer is asking whether /dev/pstore/console-ramoops was present, but your comment does not reference /dev/pstore. Stumpy is on kernel 3.8, that's probably the location of console-ramoops. (I'll now check sources to see if that change has been backported.) Also, I'd like to ssh to this device, but from the logs it looks like eth0 never gets an IP address, although wlan0 does connect to GoogleGuest 100.110.200.6. Can you put it on the test network so I can take a look? Thanks!
,
Dec 7 2017
@ Comment #10 Sorry for the typo, onsole-ramoops file is not present at /dev/pstore after the panic-reboot. Currently I have recovery image installed on my stumpy device.(IP: 172.22.19.166).
,
Dec 7 2017
Mmm. I got a stumpy and built and installed a ToT test image, and I cannot reproduce this problem. I am seeing a normal /dev/pstore/console-ramoops as expected. I may have to take another look at your system.
,
Dec 8 2017
I asked Todd to try this on his stumpy, where he has an R59 image. He can reproduce this, there is no /dev/pstore/console-ramoops after testing as described in comment #0.
,
Dec 8 2017
More precisely, R59-9460.73.0 test image.
,
Dec 8 2017
Still able to reproduce this issue on ToT recovery image (M65 10194.0.0) /dev/pstore and /sys/fs/pstore directories are empty. Will try with ToT Test image.
,
Dec 8 2017
Fails on R57-9202.64.0 as well.
,
Dec 8 2017
Facing same issue on ToT Test image(M65 10194.0.0) console-ramoops file is not present at /dev/pstore or /sys/fs/pstore after the panic-reboot.
,
Dec 8 2017
I just ran a recovery using the R65-10193.0.0 recovery image. Then I rebooted as per #0, and I have a proper /dev/pstore/console-ramoops. So I cannot repro this at all. Next, we're going to swap hardware.
,
Dec 8 2017
So we swapped hardware, I now have tag 460700, and the problem reproduces with ToT test image, so this is likely a HW or FW problem. The HW could be flaky or different. crossystem gives me this info: fwid = Google_Stumpy.2.102.0 (ro_fwid is the same) hwid = STUMPY YELLOW A-C 7520 Another thing that's a little strange with this hardware is that when I crash the system, the shell stops echoing characters immediately, but the screen stays on for several seconds before the system reboots.
,
Dec 8 2017
Not able to reproduce this issue on stumpy device with Ass. Tag C020673 fwid: Google_Stumpy.2.102.0(ro_fwid is the same) hwid: STUMPY DOGFOOD GREEN A-A 8242
,
Dec 8 2017
My dev has, localhost ~ # crossystem | grep wid fwid = Google_Stumpy.2.102.0 # Active firmware ID hwid = STUMPY DOGFOOD GREEN A-A 8242 # Hardware ID ro_fwid = Google_Stumpy.2.102.0 # Read-only firmware ID
,
Dec 8 2017
Another unusual symptom is that when I turn off the device with the power button, the screen goes into whiteout and then off, but the LED in the power button stays on. Pushing the power button again does nothing. A long press of the power button finally turns the system off.
,
Dec 8 2017
Same (or very similar) behavior with R52 test image from July 20 2016.
,
Dec 8 2017
Same behavior in R32 (February 2014). Sridhar, would you be so kind to check that the working unit (the one currently in your hands) is showing the correct behavior with respect to #23? I would be happier if the LED in the power button went off. (In fact there are many other things that would make me happier, but let's not get into that.)
,
Dec 9 2017
@ Comment #25 The working unit is showing correct behavior with power button. LED in the Power button went off immediately after shutdown.
,
Dec 9 2017
the device does not reboot properly, or a power down is forced, then it's conceivable that the memory was lost. pstore only works across reboots, and any time the system powers off, the contents of RAM are going to degrade.
,
Dec 11 2017
If you arrived here from CL 811984: https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/811984/ chromeos-base/swap-init: make min_filelist default dynamic that was a mistake---that CL should have pointed to (unrelated) issue 792152 . Sorry for the inconvenience.
,
Dec 13 2017
I can get kernel output on a serial line by replacing the wifi NIC with a 2-port RS232 Mini PCI Express Serial Card with 16950 UART by StarTech.com. The card has a different format from the wifi NIC, so I use a spoon from the cafeteria to hold it in place (see picture). Now I can see the messages at shutdown, but there isn't much. This is the output after executing "halt". The LED in the power button stays on, and the power button no longer responds to regular presses. [ 747.704581] EXT4-fs (sda3): re-mounted. Opts: (null) [ 747.711715] init: crash-sender main process (1430) killed by TERM signal [ 747.711871] init: log-rotate main process (1456) killed by TERM signal [ 747.742889] tpm_tis tpm_tis: command 0xba (size 18) returned code 0x0 [ 747.780821] init: chapsd main process (1186) killed by PIPE signal [ 747.790770] tpm_tis tpm_tis: command 0xba (size 18) returned code 0xc [ 747.819957] init: recover_duts main process (1852) killed by TERM signal [ 747.842565] init: cras main process (1540) terminated with status 143 [ 747.862574] init: tlsdated main process (2113) killed by TERM signal [ 747.882370] init: anomaly-collector main process (2122) killed by TERM signal [ 747.904466] init: powerd main process (1088) killed by TERM signal [ 747.923703] init: cros-machine-id-regen-periodic main process (2169) killed by TERM signal [ 747.949311] init: activate_date main process (2225) killed by TERM signal [ 747.970296] init: netfilter-queue main process (2238) killed by TERM signal [ 748.004855] init: cryptohomed main process (1280) killed by TERM signal [ 748.039433] tpm_tis tpm_tis: command 0xba (size 18) returned code 0x0 [ 748.147320] tpm_tis tpm_tis: command 0x98 (size 10) returned code 0x0 [ 748.373103] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required. [ 748.755365] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 748.774367] sd 0:0:0:0: [sda] Stopping disk [ 748.803455] ACPI: Preparing to enter system sleep state S5 [ 748.820541] Disabling non-boot CPUs ... [ 748.838365] Power down. [ 748.846097] acpi_power_off called To get this to work, there are some instructions here: https://www.chromium.org/chromium-os/how-tos-and-troubleshooting/serial-debugging-howto For the kernel command line, I used console=uart8250,mmio,0xe0801000 loglevel=9 The mmio address was taken from /var/log/messages (search for ttyS0). Simply using /dev/ttyS0, as the instructions suggest, did not work. On the workstation side, I used cu -l /dev/ttyUSB0 -s 38400 It may be a good idea to up the speed on both sides because booting is pretty slow.
,
Dec 13 2017
Triggering a kernel panic with the command in #0 produces this output: [ 114.490866] lkdtm: Performing direct entry BUG [ 114.504549] ------------[ cut here ]------------ [ 114.518707] kernel BUG at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-3_8-3.8.11-r651/work/chromeos-kernel-3_8-3.8.11/drivers/misc/lkdtm.c:375! [ 114.560855] invalid opcode: 0000 [#1] SMP [ 114.573435] Modules linked in: cmac rfcomm uinput i2c_dev snd_hda_codec_hdmi snd_hda_codec_cirrus aesni_intel ath3k snd_hda_intel snd_hda_codec snd_hwdep btusb btrtl xts btbcm btintel aes_x86_64 lrw ablk_helper cryptd bluetooth memconsole r8169 zram(C) snd_pcm fuse zsmalloc(C) snd_page_alloc cfg80211 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer cdc_ether usbnet [ 114.696709] CPU 2 [ 114.702331] Pid: 2587, comm: bash Tainted: G C 3.8.11 #1 [ 114.721827] RIP: 0010:[<ffffffff93f16af4>] [<ffffffff93f16af4>] lkdtm_do_action+0x3f/0x3a6 [ 114.747473] RSP: 0018:ffff88012a537e70 EFLAGS: 00010293 [ 114.763759] RAX: 0000000000000000 RBX: ffff880121c56000 RCX: 0000000000003837 [ 114.785644] RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000001 [ 114.807529] RBP: ffff88012a537ed0 R08: 000000000000d5fa R09: 000000000000d5fa [ 114.829421] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004 [ 114.851315] R13: ffff88012a537f48 R14: 0000000000000002 R15: 00007f814cb956c0 [ 114.873208] FS: 00007f814cd1d700(0000) GS:ffff88014fb00000(0000) knlGS:0000000000000000 [ 114.898034] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 114.915661] CR2: 0000052d950aa010 CR3: 000000011d751000 CR4: 00000000000407e0 [ 114.937550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 114.959440] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 114.981338] Process bash (pid: 2587, threadinfo ffff88012a536000, task ffff880146a2d280) [ 115.006158] Stack: [ 115.012315] ffffffff940d8f64 0000000000000010 ffff88012a537ee0 ffff88012a537e98 [ 115.035067] 00000000b94cabb8 ffff88012a537eb8 ffffffff9437a4ab ffff88014fb14c30 [ 115.057815] ffffea00048715a0 00000000b94cabb8 ffff880121c56000 0000000000000004 [ 115.080567] Call Trace: [ 115.088071] [<ffffffff940d8f64>] ? printk+0x5e/0x7a [ 115.103304] [<ffffffff93f16f26>] direct_entry+0xcb/0xdb [ 115.119614] [<ffffffff93cf64d6>] vfs_write+0xa6/0xca [ 115.135117] [<ffffffff93cf65f0>] sys_write+0x5c/0x9a [ 115.150622] [<ffffffff940dde42>] system_call_fastpath+0x16/0x1b [ 115.169045] Code: 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 83 ff 15 0f 87 5f 03 00 00 ff 24 fd 00 2e 26 94 48 c7 c7 f3 a0 37 94 31 c0 e8 e4 21 1c 00 <0f> 0b be 7a 01 00 00 48 c7 c7 fc a0 37 94 e8 a2 c7 d1 ff e9 32 [ 115.227868] RIP [<ffffffff93f16af4>] lkdtm_do_action+0x3f/0x3a6 [ 115.246318] RSP <ffff88012a537e70> [ 115.257065] ---[ end trace 33a6ebe90ab195b6 ]--- [ 115.280260] Kernel panic - not syncing: Fatal exception [ 115.296286] Kernel Offset: 0x12c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 115.338318] ACPI MEMORY or I/O RESET_REG. so this can explain the different behavior. The "reboot" command results in similar output: [ 167.434175] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required. [ 167.811320] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 167.831523] Disabling non-boot CPUs ... [ 167.850023] Restarting system. [ 167.859377] reboot: machine restart [ 167.870293] ACPI MEMORY or I/O RESET_REG.
,
Dec 13 2017
Todd and Sridhar, if it's not too much trouble, could you try this on your stumpies: flashrom -r /tmp/x sum /tmp/x I would like to make sure we're running the same firmware. Thanks! I get this: localhost ~ # flashrom -r /tmp/x flashrom v0.9.9 : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64) flashrom v0.9.9 : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64) Calibrating delay loop... OK. coreboot table found at 0xaceddc00. Block protection could not be disabled! Reading flash... SUCCESS localhost ~ # sum /tmp/x 32538 8192
,
Dec 13 2017
Google Chrome 65.0.3287.0 (Official Build) unknown (64-bit) Revision 0 Platform 10194.0.0 (Official Build) dev-channel stumpy test Firmware Version Google_Stumpy.2.102.0 Customization ID STUMPY localhost / # flashrom -r /tmp/x flashrom v0.9.9 : 831c609 : Nov 30 2017 02:04:47 UTC on Linux 3.8.11 (x86_64) flashrom v0.9.9 : 831c609 : Nov 30 2017 02:04:47 UTC on Linux 3.8.11 (x86_64) Calibrating delay loop... OK. coreboot table found at 0x7c6ddc00. Reading flash... SUCCESS localhost / # sum /tmp/x 07049 8192
,
Dec 13 2017
doesn't that flashrom command dump the entire spi flash and thus will contain device-specific things like VPD ? you'd have to use cbfstool to extract/check specific regions.
,
Dec 13 2017
#34 Thank you Mike. But why would the coreboot table be at different addresses? Oh oh look. And the device in #33 had the write-protect removed... so maybe someone reflashed it with firmware THAT WORKS... So maybe I got lucky.
,
Dec 13 2017
if disabling write protect and updating the firmware to current versions makes this issue go away, that's acceptable imo. pstore cannot work correctly w/out functioning firmware support.
,
Dec 13 2017
#35 no actually I wasn't lucky. The coreboot table is a table of function pointers in RAM. Its location depends on the amount of physical memory. After removing one DIMM (I had two), I get the same location as #33: localhost ~ # flashrom -r /tmp/x flashrom v0.9.9 : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64) flashrom v0.9.9 : 41f2f3d : Nov 29 2017 23:45:47 UTC on Linux 3.8.11 (x86_64) Calibrating delay loop... OK. coreboot table found at 0x7c6ddc00. Block protection could not be disabled! Reading flash... SUCCESS The shutdown behavior remains the same (i.e. faulty). It may be wiser to pass this on to the firmware team.
,
Dec 13 2017
OK one more try. Sridhar would you be so kind to run flashrom -r /tmp/x again, then upload x to Drive and post the link? The FW team will help me compare the images. Thanks!
,
Dec 13 2017
@comment #38 File is present at https://storage.cloud.google.com/chromiumos-test-logs/bugfiles/cros/792562/x?_ga=2.18807024.-859875196.1426784702
,
Dec 14 2017
Thank you! Mine is at https://drive.google.com/open?id=13gT4mlAJmGikrmQWUJ94brFvXdmhftDd
,
Dec 14 2017
Mine is here: https://drive.google.com/file/d/1oKcVP1DHyITHEsBMgyClPp5PuhjZSvlT contains tarball which has stumpy_tbroch_fw/image.bin result of flashrom -r image.bin sumpy_tbroch_fw/<REGION> result of flashrom -x I believe you can also generate those by providing the binary (image.bin) on the flashrom command line (--image=<file> perhaps?
,
Dec 14 2017
With Stefan's help I compared the various sections of the dumps. The firmware in all three devices is identical. Next, let's compare the memory modules.
,
Dec 14 2017
We swapped memory modules between my stumpy (bad) and Sridhar's (good). The bad stumpy stayed bad, and the good one stayed good. So it's not the memory. Just in case: my stumpy (bad) has 2x2GB Samsung modules (1Rx8 PC3 - 10600S - 09 - 10 - ZZZ). I also tried running with a single module (nothing changed). Sridhar's (good) has 2x1GB Samsung (I didn't write down the exact model since it shouldn't matter).
,
Dec 18 2017
Sadly my stumpy stopped working over the week-end :/
,
Dec 19 2017
I am searching cautotest/ for generic_RebootTest, then selecting the stumpy jobs (pretty slow) and looking at the reboot times. It looks like a couple of the DUTs require a longer time, similarly to what we see in the bad DUTs in our hands. Further testing along this line would give us an approximate ratio of DUTs with faulty hardware. 4-2-8-15 163500220 release 49:54 -> 50.17 == 23s 4-2-9-4 163499241 paladin 31:21 -> 31.44 == 23s 4-2-9-1 163464395 paladin 14:57 -> 15:28 == 31s !!! 4-2-8-15 163429387 release 21:13 -> 21:35 == 22s 4-2-9-15 163409297 paladin 00:49 -> 01:12 == 23s 4-2-8-13 163395594 paladin 52:53 -> 53:11 == 18s 4-2-9-3 163360741 release 06:51 -> 07:09 == 17s 4-2-9-8 163342853 paladin 40:01 -> 40:32 == 31s !!! 4-9-2-11 163325729 release 48:25 -> 48:45 == 20s 4-9-2-11 163311227 release 09:46 -> 10:06 == 20s 4-2-9-7 163309991 paladin 24:07 -> 24:29 == 22s 4-2-9-7 163272444 paladin 20:01 -> 20:23 == 22s 4-2-9-1 163263933 paladin 37:40 -> 38:11 == 31s !!! repeat 4-2-8-13 163239679 release 39:23 -> 39.41 == 18s 4-2-8-13 163224869 paladin 39:01 -> 39.23 == 22s 4-2-9-4 163184345 paladin 09:23 -> 09:45 == 22s 4-2-9-6 163176454 release 58:35 -> 58:57 == 22s 4-2-9-5 163145392 release 12:10 -> 12:32 == 22s 4-2-9-3 163089908 release 00:49 -> 01:07 == 18s 4-2-9-5 163069960 paladin 42:46 -> 43:09 == 23s
,
Dec 19 2017
FWIW here's SPD related params for my 'bad' stumpy which you're welcome to borrow, mosys memory spd print all 0 | DDR3 | SO-DIMM 1 | DDR3 | SO-DIMM 0 | 1-78: Samsung | 794aaf4c | M471B2873FHS-CH9 1 | 1-78: Samsung | 794aaf4d | M471B2873FHS-CH9 0 | 1024 | 1 | 64 1 | 1024 | 1 | 64 0 | DDR3-800, DDR3-1066, DDR3-1333 1 | DDR3-800, DDR3-1066, DDR3-1333 I also have a 2nd stumpy now in working order. It appears to be 'good' (reboots fast). Its SPD is, mosys memory spd print all 0 | DDR3 | SO-DIMM 1 | DDR3 | SO-DIMM 0 | 1-78: Samsung | 613ea982 | M471B5773CHS-CH9 1 | 1-78: Samsung | 613ea709 | M471B5773CHS-CH9 0 | 2048 | 1 | 64 1 | 2048 | 1 | 64 0 | DDR3-800, DDR3-1066, DDR3-1333 1 | DDR3-800, DDR3-1066, DDR3-1333 I'll send you IP addresses should remote access prove useful for your testing. Consequently I think this bug and crbug.com/782329 may be dupe's as the 2nd stumpy resumes successfully as well.
,
Jan 4 2018
Ping on this one since open for a while and a stable blocker. Seems we should work to get this in..
,
Jan 8 2018
Remove RBS since this issues appears to be HW related.
,
Dec 18
stumpy shall be missed |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by sontis@chromium.org
, Dec 6 2017