
Issue 767966


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Implement watchdog for cyan and other braswell devices

Project Member Reported by davidri...@chromium.org, Sep 22 2017

Issue description

cyan is experiencing shutdown issues (crbug.com/639301) that prevent clean reboots, which in turn cause Provision failures in the lab.

Attempts to reproduce the issue and gather ramoops are hampered by the fact that the ramoops is often not recoverable via power-refresh after a hang.

Enabling the watchdog accomplishes two things:
- allows us to get a ramoops on next boot
- works around the shutdown issue
 
cyan currently has two watchdogs enabled:
[    0.080850] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    0.779088] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11

I believe the latter is what implements /dev/watchdog. "daisydog" is the code on ARM machines that opens and "pets" /dev/watchdog, and this package is NOT included on cyan. I'm trying to build and deploy the package and then run:
   test_that 100.127.1.243 platform_HWwatchdog

to see if all that fails.

I tried "echo 0 > /dev/watchdog" and the system rebooted about 30 seconds later. I'll see if I can make that timeout shorter (10 or 15 seconds is typical).
Cc: adurbin@chromium.org davidri...@chromium.org dlaurie@chromium.org
localhost ~ # daisydog
HW watchdog interval is 30 seconds
/dev/watchdog reported boot status: normal-boot

So 30 seconds was a good guess.

"normal-boot" is wrong given my previous poking at /dev/watchdog.  I'm not sure the BIOS leaves enough info (chipset state) for the kernel to determine the cause of the reboot.

Duncan, Aaron, can either of you comment on braswell (or other Intel chipsets) regarding iTCO watchdog support?
Looks like the watchdog timeout setting is also not correct. It claimed 10 seconds but took ~15 seconds to time out.

localhost ~ # daisydog -c   
HW watchdog interval is 10 seconds
/dev/watchdog reported boot status: normal-boot
localhost ~ # echo 0 > /dev/watchdog ; while : ; do date ; sleep 1 ; done
Wed Oct 18 19:36:18 PDT 2017
Wed Oct 18 19:36:19 PDT 2017
Wed Oct 18 19:36:20 PDT 2017
Wed Oct 18 19:36:21 PDT 2017
Wed Oct 18 19:36:22 PDT 2017
Wed Oct 18 19:36:23 PDT 2017
Wed Oct 18 19:36:24 PDT 2017
Wed Oct 18 19:36:25 PDT 2017
Wed Oct 18 19:36:26 PDT 2017
Wed Oct 18 19:36:27 PDT 2017
Wed Oct 18 19:36:28 PDT 2017
Wed Oct 18 19:36:29 PDT 2017
Wed Oct 18 19:36:30 PDT 2017
Wed Oct 18 19:36:31 PDT 2017
Wed Oct 18 19:36:32 PDT 2017
Wed Oct 18 19:36:33 PDT 2017
Wed Oct 18 19:36:34 PDT 2017

and after system rebooted:

localhost ~ # daisydog -c
HW watchdog interval is 30 seconds
/dev/watchdog reported boot status: normal-boot

(reported boot status should have been something about watchdog timeout)
iTCO support should be available on all Intel platforms. However, how that's utilized is up to the kernel. If it shuts down the timer on 'reboot' then it won't help with your shutdown/reboot issues for kicking the machine. However, it also means the firmware should disable it once it has come up when the reboot didn't hang. IIRC the hang on cyan was caused by a duplicate xhci init, which happens very late in the firmware. So I suspect there's no panacea for the problem plaguing cyan without larger changes to the firmware.

What does 'mosys eventlog list' show when the TCO resets the system?
TCO is probably disabled on newer core platforms (skylake/kabylake) because it is tied to the ACPI PM timer, which had to be disabled to allow s0ix entry: https://chromium-review.googlesource.com/c/chromiumos/third_party/coreboot/+/319361
Aaron - thanks!

David, Duncan: Should we start requiring "test_that $IP platform_HWwatchdog" pass for new chipsets?

This test shouldn't require any changes to work on Intel since it's using a public API (ioctl WDIOC_GETBOOTSTATUS) to query the kernel. See ~/trunk/src/third_party/daisydog/daisydog.c.

But the kernel will require some changes (see below).
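
For reference, a minimal userspace sketch of that query, assuming the standard Linux watchdog ioctl interface from <linux/watchdog.h>. This is not the daisydog implementation; the function names are made up for illustration, and only the "normal-boot" label comes from the daisydog output earlier in this thread (the "watchdog-timeout" label is hypothetical).

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Map WDIOC_GETBOOTSTATUS flags to a label. "normal-boot" matches the
 * daisydog output quoted in this thread; "watchdog-timeout" is a
 * hypothetical label for the WDIOF_CARDRESET case.
 */
const char *boot_status_label(int flags)
{
    if (flags & WDIOF_CARDRESET)
        return "watchdog-timeout";
    return "normal-boot";
}

/*
 * Ask the driver behind `path` for its boot status flags.
 * Returns 0 and fills *flags on success, -1 on error.
 *
 * Caution: opening a watchdog device starts the timer. The magic 'V'
 * write asks the driver to stop it again on close, but that only works
 * if the driver supports magic close and nowayout is not set.
 */
int query_boot_status(const char *path, int *flags)
{
    int fd, ret;
    ssize_t w;

    assert(path != NULL && flags != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ret = ioctl(fd, WDIOC_GETBOOTSTATUS, flags);

    w = write(fd, "V", 1);   /* request a clean stop on close */
    (void)w;
    close(fd);

    return ret < 0 ? -1 : 0;
}
```

A caller would pass "/dev/watchdog" and print boot_status_label(flags). With a driver that always stores 0 in bootstatus, this can only ever report "normal-boot", whatever actually caused the reset.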

This feature is going to be directly at odds with the change Duncan pointed out. :(

But if the NMI watchdog isn't catching particular hangs, then my gut feeling is that we should also deploy the iTCO watchdog and count how often it triggers.

I have no clue if making this work will require Coreboot to clear the iTCO (and record if iTCO fired). I kind of expect that any system reset would clear iTCO to default values - but that could just be wishful thinking.


"mosys eventlog list" output for the machine I was testing with on Wednesday (hasn't been touched since then):
...
265 | 2000-00-00 00:00:00 | System Reset
266 | 2000-00-00 00:00:00 | Wake Source | Power Button | 0
267 | 2000-00-00 00:00:00 | Chrome OS Developer Mode
268 | 2017-10-18 19:07:20 | Kernel Event | Clean Shutdown
269 | 2017-10-18 19:07:22 | System boot | 3
270 | 2017-10-18 19:07:22 | System Reset
271 | 2017-10-18 19:07:22 | Chrome OS Developer Mode
272 | 2017-10-18 19:12:36 | System boot | 4
273 | 2017-10-18 19:12:36 | System Reset
274 | 2017-10-18 19:12:36 | Chrome OS Developer Mode
275 | 2017-10-18 19:18:58 | Kernel Event | Clean Shutdown
276 | 2017-10-18 19:19:00 | System boot | 5
277 | 2017-10-18 19:19:00 | System Reset
278 | 2017-10-18 19:19:01 | Chrome OS Developer Mode
279 | 2017-10-18 19:36:36 | System boot | 6
280 | 2017-10-18 19:36:36 | System Reset
281 | 2017-10-18 19:36:36 | Chrome OS Developer Mode


So it looks like the iTCO firing isn't getting recorded.
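
Until the reset cause is recorded, one rough way to count suspicious resets is to look for "System boot" entries that are not preceded by a "Clean Shutdown" kernel event. The sketch below assumes the line format shown in the eventlog excerpt above; count_unclean_boots is a hypothetical helper, not part of mosys. It cannot tell a watchdog reset from a power-button reset, and a boot at the very start of a truncated log is counted as unclean.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count "System boot" entries that were not preceded by a
 * "Kernel Event | Clean Shutdown" entry since the previous boot.
 * `log` is the text printed by "mosys eventlog list".
 */
int count_unclean_boots(const char *log)
{
    int unclean = 0;
    int saw_clean_shutdown = 0;
    const char *line = log;

    assert(log != NULL);

    while (*line != '\0') {
        size_t len = strcspn(line, "\n");
        char buf[256];
        size_t n = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;

        /* Copy one line so strstr() cannot match into later lines. */
        memcpy(buf, line, n);
        buf[n] = '\0';

        if (strstr(buf, "| Kernel Event | Clean Shutdown") != NULL) {
            saw_clean_shutdown = 1;
        } else if (strstr(buf, "| System boot |") != NULL) {
            if (!saw_clean_shutdown)
                unclean++;
            saw_clean_shutdown = 0;
        }

        line += len;
        if (*line == '\n')
            line++;
    }
    return unclean;
}
```

Run over entries 268 to 281 above, this flags boots 4 and 6. Boot 6 at 19:36:36 lines up with the watchdog test that stopped printing at 19:36:34.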

The iTCO driver (in all kernel versions) currently ignores the "bootstatus" field, so the kernel(s) would still need a few small changes similar to this:
   https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/334235


 ~/trunk/src/third_party/kernel $ fgrep -n bootstatus v*/drivers/watchdog/iTCO_wdt.c
v3.10/drivers/watchdog/iTCO_wdt.c:487:  iTCO_wdt_watchdog_dev.bootstatus = 0;
v3.14/drivers/watchdog/iTCO_wdt.c:485:  iTCO_wdt_watchdog_dev.bootstatus = 0;
v3.18/drivers/watchdog/iTCO_wdt.c:512:  iTCO_wdt_watchdog_dev.bootstatus = 0;
v3.8/drivers/watchdog/iTCO_wdt.c:487:   iTCO_wdt_watchdog_dev.bootstatus = 0;
v4.12/drivers/watchdog/iTCO_wdt.c:530:  p->wddev.bootstatus = 0;
v4.4/drivers/watchdog/iTCO_wdt.c:531:   iTCO_wdt_watchdog_dev.bootstatus = 0;


And something like this is needed:
v3.14/drivers/watchdog/ixp4xx_wdt.c:117:        case WDIOC_GETBOOTSTATUS:
v3.14/drivers/watchdog/ixp4xx_wdt.c-118-                ret = put_user(boot_status, (int *)arg);

I'm not sure where, but I'm guessing the current /dev/watchdog is doing this instead:

v4.12/drivers/watchdog/intel_scu_watchdog.c:386:        case WDIOC_GETBOOTSTATUS:
v4.12/drivers/watchdog/intel_scu_watchdog.c-387-                return put_user(0, p);

I'll be out the next two weeks. Please feel free to reassign to a different owner if this is urgent enough.
Status: Available (was: Untriaged)
Owner: dlaurie@chromium.org
Duncan, Aaron, not recording the cause of reboot (TCO timer) seems like an issue with BIOS. Once that is fixed, I can look at kernel changes to support WDIOC_GETBOOTSTATUS.  Who on the BIOS team should own this? Or is this WONTFIX?

BTW, Aaron is correct that daisydog (or any watchdog daemon) currently gets shut down cleanly and so won't address the original problem.

The easiest "fix" is to add the equivalent of "echo 0 > /dev/watchdog" to one of the shutdown scripts. That would arm the watchdog, after which either power goes off or the BIOS clears the TCO on reboot. Should I add that even though we won't have a clue how often it fires?
I'm okay with that if it makes shutdowns more reliable.
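
A sketch of what such a shutdown hook could do if written in C instead of shell, assuming the standard Linux watchdog interface. arm_watchdog is a hypothetical helper and the timeout is only a request. Whether the timer keeps running after close depends on the driver; the "echo 0" test earlier in this thread suggests it does for iTCO_wdt on cyan.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Arm the hardware watchdog and leave it running, like
 * "echo 0 > /dev/watchdog". Returns 0 on success, -1 on error.
 * timeout_secs is a request; the driver may round or reject it.
 */
int arm_watchdog(const char *path, int timeout_secs)
{
    int fd;
    ssize_t written;

    assert(path != NULL);

    /* Opening the device starts the timer. */
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Best effort: keep the driver default if the request is rejected. */
    if (timeout_secs > 0)
        (void)ioctl(fd, WDIOC_SETTIMEOUT, &timeout_secs);

    /* Any byte other than 'V' pets the timer without requesting a stop. */
    written = write(fd, "0", 1);

    /* No magic 'V' was written, so a magic-close driver stays armed. */
    close(fd);
    return written == 1 ? 0 : -1;
}
```

Called as arm_watchdog("/dev/watchdog", 10) late in shutdown, this gives the machine a bounded time to finish before the hardware resets it. Do not run it on a machine you are not prepared to have rebooted.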
Status: Assigned (was: Available)
