Implement watchdog for cyan and other braswell devices
Issue description
cyan is experiencing shutdown issues (crbug.com/639301) that prevent clean reboots, which in turn causes Provision failures in the lab. Attempts to reproduce the issue and gather ramoops are hampered by the fact that the ramoops is often not recoverable via power-refresh after a hang. Enabling the watchdog accomplishes two things:
- allows us to get a ramoops on next boot
- works around the shutdown issue
,
Oct 19 2017
    localhost ~ # daisydog
    HW watchdog interval is 30 seconds
    /dev/watchdog reported boot status: normal-boot

So 30 seconds was a good guess. "normal-boot" is wrong given my previous poking at /dev/watchdog. I'm not sure the BIOS leaves enough info (chipset state) for the kernel to determine the cause of the reboot.

Duncan, Aaron, can either of you comment on braswell (or other Intel chipsets) regarding iTCO watchdog support?
,
Oct 19 2017
Looks like the watchdog timeout setting is also not correct. It claimed 10 seconds but took ~15 seconds to time out.

    localhost ~ # daisydog -c
    HW watchdog interval is 10 seconds
    /dev/watchdog reported boot status: normal-boot
    localhost ~ # echo 0 > /dev/watchdog ; while : ; do date ; sleep 1 ; done
    Wed Oct 18 19:36:18 PDT 2017
    Wed Oct 18 19:36:19 PDT 2017
    Wed Oct 18 19:36:20 PDT 2017
    Wed Oct 18 19:36:21 PDT 2017
    Wed Oct 18 19:36:22 PDT 2017
    Wed Oct 18 19:36:23 PDT 2017
    Wed Oct 18 19:36:24 PDT 2017
    Wed Oct 18 19:36:25 PDT 2017
    Wed Oct 18 19:36:26 PDT 2017
    Wed Oct 18 19:36:27 PDT 2017
    Wed Oct 18 19:36:28 PDT 2017
    Wed Oct 18 19:36:29 PDT 2017
    Wed Oct 18 19:36:30 PDT 2017
    Wed Oct 18 19:36:31 PDT 2017
    Wed Oct 18 19:36:32 PDT 2017
    Wed Oct 18 19:36:33 PDT 2017
    Wed Oct 18 19:36:34 PDT 2017

and after the system rebooted:

    localhost ~ # daisydog -c
    HW watchdog interval is 30 seconds
    /dev/watchdog reported boot status: normal-boot

(The reported boot status should have been something about a watchdog timeout.)
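One way to cross-check the interval that daisydog prints is to ask the driver directly through the standard Linux watchdog ioctl WDIOC_GETTIMEOUT. The following is a minimal sketch, not daisydog's actual code; the function name query_watchdog_timeout and its dev_path parameter are made up for illustration. It only reports what the driver believes the timeout is, so it cannot by itself explain the gap between the claimed 10 seconds and the observed reset time.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the watchdog driver behind dev_path for its configured timeout.
 * Returns 0 and stores the timeout (in seconds) on success, -1 on failure.
 *
 * Caution: opening a watchdog device normally arms it, so this sketch
 * writes the magic close character 'V' before closing. Drivers loaded
 * with "nowayout" ignore that request and keep the timer running.
 */
int query_watchdog_timeout(const char *dev_path, int *timeout_sec)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, WDIOC_GETTIMEOUT, timeout_sec);

    /* Magic close: request that the driver disarm the timer. */
    if (write(fd, "V", 1) != 1)
        rc = -1;
    close(fd);

    return rc < 0 ? -1 : 0;
}
```

A caller would pass "/dev/watchdog" and compare the returned value with a wall-clock measurement like the `date` loop above.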
,
Oct 19 2017
iTCO support should be available on all Intel platforms. However, how that's utilized is up to the kernel. If it shuts down the timer on 'reboot' then it won't help with your shutdown/reboot issues for kicking the machine. However, it also means the firmware should disable it once it's come up when the reboot didn't hang. IIRC the hang on cyan was because of xhci duplicate init, which is super late in the firmware. So I suspect there's no panacea for the problem plaguing cyan without larger changes to the firmware.

What does 'mosys eventlog list' show when the TCO resets the system?
,
Oct 20 2017
TCO is probably disabled on newer core platforms (skylake/kabylake) because it is tied with ACPI PM timer which needed to be disabled to allow s0ix entry: https://chromium-review.googlesource.com/c/chromiumos/third_party/coreboot/+/319361
,
Oct 20 2017
Aaron - thanks!

David, Duncan: Should we start requiring "test_that $IP platform_HWwatchdog" to pass for new chipsets? This test shouldn't require any changes to work on Intel since it's using a public API (ioctl WDIOC_GETBOOTSTATUS) to query the kernel. See ~/trunk/src/third_party/daisydog/daisydog.c. But the kernel will require some changes (see below).

This feature is going to be directly at odds with the change Duncan pointed out. :( But if the NMI watchdog isn't catching particular hangs, then my gut feeling is we should additionally be deploying the iTCO watchdog and counting how often it triggers.

I have no clue if making this work will require Coreboot to clear the iTCO (and record if iTCO fired). I kind of expect that any system reset would clear iTCO to default values - but that could just be wishful thinking.

"mosys eventlog list" output for the machine I was testing with on Wednesday (hasn't been touched since then):

    ...
    265 | 2000-00-00 00:00:00 | System Reset
    266 | 2000-00-00 00:00:00 | Wake Source | Power Button | 0
    267 | 2000-00-00 00:00:00 | Chrome OS Developer Mode
    268 | 2017-10-18 19:07:20 | Kernel Event | Clean Shutdown
    269 | 2017-10-18 19:07:22 | System boot | 3
    270 | 2017-10-18 19:07:22 | System Reset
    271 | 2017-10-18 19:07:22 | Chrome OS Developer Mode
    272 | 2017-10-18 19:12:36 | System boot | 4
    273 | 2017-10-18 19:12:36 | System Reset
    274 | 2017-10-18 19:12:36 | Chrome OS Developer Mode
    275 | 2017-10-18 19:18:58 | Kernel Event | Clean Shutdown
    276 | 2017-10-18 19:19:00 | System boot | 5
    277 | 2017-10-18 19:19:00 | System Reset
    278 | 2017-10-18 19:19:01 | Chrome OS Developer Mode
    279 | 2017-10-18 19:36:36 | System boot | 6
    280 | 2017-10-18 19:36:36 | System Reset
    281 | 2017-10-18 19:36:36 | Chrome OS Developer Mode

So it looks like the iTCO firing isn't getting recorded.

The iTCO driver (in all kernel versions) currently ignores the "bootstatus" field, so the kernel(s) would still need a few small changes similar to this: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/334235

    ~/trunk/src/third_party/kernel $ fgrep -n bootstatus v*/drivers/watchdog/iTCO_wdt.c
    v3.10/drivers/watchdog/iTCO_wdt.c:487: iTCO_wdt_watchdog_dev.bootstatus = 0;
    v3.14/drivers/watchdog/iTCO_wdt.c:485: iTCO_wdt_watchdog_dev.bootstatus = 0;
    v3.18/drivers/watchdog/iTCO_wdt.c:512: iTCO_wdt_watchdog_dev.bootstatus = 0;
    v3.8/drivers/watchdog/iTCO_wdt.c:487: iTCO_wdt_watchdog_dev.bootstatus = 0;
    v4.12/drivers/watchdog/iTCO_wdt.c:530: p->wddev.bootstatus = 0;
    v4.4/drivers/watchdog/iTCO_wdt.c:531: iTCO_wdt_watchdog_dev.bootstatus = 0;

And something like this is needed:

    v3.14/drivers/watchdog/ixp4xx_wdt.c:117: case WDIOC_GETBOOTSTATUS:
    v3.14/drivers/watchdog/ixp4xx_wdt.c-118- ret = put_user(boot_status, (int *)arg);

I'm not sure where, but I'm guessing the current /dev/watchdog is doing this instead:

    v4.12/drivers/watchdog/intel_scu_watchdog.c:386: case WDIOC_GETBOOTSTATUS:
    v4.12/drivers/watchdog/intel_scu_watchdog.c-387- return put_user(0, p);

I'll be out the next two weeks. Please feel free to reassign to a different owner if this is urgent enough.
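For reference, the user-space side of WDIOC_GETBOOTSTATUS receives a word of WDIOF_* flags from <linux/watchdog.h>, with WDIOF_CARDRESET conventionally meaning the last reboot was caused by the watchdog. The sketch below shows how such a word might be turned into a label. It is an illustration only: describe_bootstatus is a made-up name, and every label other than "normal-boot" (which appears in the daisydog output earlier in this thread) is hypothetical rather than taken from daisydog.c.

```c
#include <linux/watchdog.h>

/*
 * Map the flag word returned by ioctl(fd, WDIOC_GETBOOTSTATUS, &flags)
 * to a short label. Labels other than "normal-boot" are hypothetical,
 * chosen for this sketch only.
 */
const char *describe_bootstatus(int flags)
{
    if (flags & WDIOF_CARDRESET)
        return "watchdog-timeout";  /* last reboot caused by the watchdog */
    if (flags & WDIOF_OVERHEAT)
        return "overheat";          /* reset due to CPU overheat */
    if (flags == 0)
        return "normal-boot";       /* driver reported no special cause */
    return "other";
}
```

With a driver that hard-codes bootstatus to 0, as in the grep output above, this would always yield "normal-boot" no matter why the machine reset.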
,
Nov 13 2017
,
Nov 20 2017
Duncan, Aaron, not recording the cause of reboot (TCO timer) seems like an issue with the BIOS. Once that is fixed, I can look at kernel changes to support WDIOC_GETBOOTSTATUS. Who on the BIOS team should own this? Or is this WONTFIX?

BTW, Aaron is correct that daisydog (or any watchdog daemon) currently gets shut down cleanly and won't address the original problem. The easiest "fix" is to add the equivalent of "echo 0 > /dev/watchdog" to one of the shutdown scripts. That would arm the watchdog, and then either power goes off or the BIOS clears the TCO on reboot. Should I add that even though we won't have a clue how often it fires?
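To make the "echo 0 > /dev/watchdog" idea concrete, here is a rough C equivalent, offered as a sketch rather than the change that would actually land in a shutdown script. The function name arm_watchdog and its dev_path parameter are invented for this example. The key point, assuming the driver follows the usual Linux watchdog convention, is that the device is closed without first writing the magic character 'V', so the timer is left running.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Rough equivalent of `echo 0 > /dev/watchdog`: open the device, write
 * something other than the magic close character 'V', and close it.
 * On a real watchdog device this is expected to leave the timer armed.
 * Returns 0 on success, -1 on failure. dev_path is a parameter so the
 * function can be exercised against an ordinary file.
 */
int arm_watchdog(const char *dev_path)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, "0\n", 2);

    if (close(fd) < 0)
        return -1;
    return written == 2 ? 0 : -1;
}
```

If nothing pets the watchdog afterwards, a hung shutdown should end in a hardware reset once the timeout expires, which is the workaround being proposed.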
,
Nov 21 2017
I'm okay with that if it makes shutdowns more reliable.
,
Aug 1
Comment 1 by grundler@chromium.org, Oct 19 2017