SKL: on some hard hangs warm reset doesn't reboot the system
Issue description
Spawning this bug from crbug.com/712526 to track why a warm reset (triggered via the alt-volup-r key sequence) doesn't reboot the system when it is hard hung due to display FIFO underruns. To reproduce the hard hang, use an older M59 build such as 59.0.3071.35 on Caroline and follow these steps (from https://bugs.chromium.org/p/chromium/issues/detail?id=712526#c57), which should result in a hard hang within a few minutes:
1. Select the address bar and hold a key (preferably a letter).
2. Without letting up, circle your cursor in and out of the address bar.
,
May 12 2017
I have added Wei Shun to the cc: list but couldn't add him as an owner since he has probably not registered here. Getting back next week on this sounds good. Thanks!
,
May 13 2017
,
May 15 2017
Ryan will take a look and provide an update by the end of the day.
,
May 15 2017
Attaching the EC log. According to the log, the EC issued a warm reset ([87.221528 KB warm reboot], [87.221800 chipset_reset(0)]) to the CPU when I pressed alt+vol_up+r, but the EC was unable to bring the CPU out of the hung state.
,
May 15 2017
This could be because on Skylake systems the EC pulses RCIN# to generate INIT# to the CPU on a 'warm' reset:
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-glados-7820.B/power/skylake.c#106
If that were changed to pulse SYS_RESET# (which is currently done on a cold reset), it might have more luck actually resetting the system when something goes wrong.
,
May 16 2017
SYS_RESET# could reboot the device while it was hung.
apreset
Issuing AP cold reset...
[789.835057 chipset_reset(1)]
> [797.267639 Battery 14% / 3h:2 to full]
[806.710214 Port 80: 0x1002][806.710506 LPC RESET# asserted]
[806.711020 power state 3 = S0, in 0x0019]
[806.711560 power state 8 = S0->S3, in 0x0019]
[806.712157 event set 0x00200000]
[806.712946 chipset -> S3]
[806.713230 power state 2 = S3, in 0x0019]
[806.713644 power state 9 = S3->S5, in 0x0019]
[806.714239 chipset -> S5]
[806.714529 power state 1 = S5, in 0x0019]
[811.000444 power state 1 = S5, in 0x001f]
[811.000982 power state 6 = S5->S3, in 0x001f]
[811.001871 chipset -> S3]
[811.002164 power state 2 = S3, in 0x001f]
[811.002582 power state 7 = S3->S0, in 0x001f]
,
May 16 2017
The hung device could be rebooted with the warm reset button (WARM_RST_L) on the Servo board. However, it seems that WARM_RST_L is connected to SYS_RST# as well.
,
May 16 2017
Can you check if SYS_RST# retained RAM contents of interest? The way to check would be to look for /dev/pstore/console-ramoops file and verify that it has contents from the previous boot's dmesg log.
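The check described above can be scripted. This is a minimal sketch, assuming the pstore path given in the comment; the function name `previous_boot_log_preserved` and the `min_bytes` threshold are hypothetical and not part of any Chrome OS tool. It only tests that the file exists and is non-empty, so the contents still need to be read by hand to confirm they come from the hung boot.

```python
def previous_boot_log_preserved(path="/dev/pstore/console-ramoops", min_bytes=1):
    """Return True if the pstore console file exists and holds at least min_bytes.

    The default path is the one named in this thread; min_bytes is an
    illustrative threshold, not a documented value.
    """
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        # Missing or unreadable file: treat as "nothing was preserved".
        return False
    return len(data) >= min_bytes


if __name__ == "__main__":
    if previous_boot_log_preserved():
        print("console-ramoops has content from a previous boot")
    else:
        print("console-ramoops is missing or empty")
```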
,
May 17 2017
Re #9, SYS_RESET# retains RAM content when the device is running normally, but it does not retain RAM content every time when the device is hung.
,
May 18 2017
,
May 30 2017
Any updates here on either:
- why SYS_RESET# doesn't retain RAM contents when the device is hung?
- why warm reset doesn't reboot a hung device?
,
May 30 2017
@snanda Is your question addressed to Ryan from Intel? If so, he may be gone for the next 1-2 days due to TWN holidays. Is that okay?
,
May 31 2017
Re #13, @snanda, please see my comments below.
> - why SYS_RESET# doesn't retain RAM contents when the device is hung?
From the EC log in comment #7, SYS_RESET# caused the device to power off to S5. Memory context does not need to be preserved in the S5 state, so power to the memory is also cut.
> - why warm reset doesn't reboot a hung device?
As explained in comment #6, when the PCH detects the assertion of RCIN#, it generates INIT# to the CPU, which resets the CPU only. In this hung case (crbug.com/712526), RCIN# doesn't work, which may indicate that the CPU is in a bad state and unable to act on INIT#. Also, the hang on SKL devices should be fixed by https://chromium-review.googlesource.com/#/c/503788/ .
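One way to tell from an EC console log whether a reset took the chipset through S5 (and so most likely lost RAM contents) is to look for a `chipset -> S5` line, as seen in the EC log quoted earlier in this thread. The following is a minimal sketch, assuming log entries of the form `[<timestamp> chipset -> Sx]`; the function names are hypothetical and not part of the EC tooling.

```python
import re

# Matches entries such as "[806.714239 chipset -> S5]".
_CHIPSET_STATE = re.compile(r"\[(\d+\.\d+) chipset -> (S\d)\]")


def chipset_transitions(ec_log):
    """Return (timestamp, state) pairs for every 'chipset -> Sx' entry, in order."""
    return [(float(ts), state) for ts, state in _CHIPSET_STATE.findall(ec_log)]


def reset_went_through_s5(ec_log):
    """True if the log shows the chipset dropping to S5 at any point."""
    return any(state == "S5" for _, state in chipset_transitions(ec_log))


if __name__ == "__main__":
    sample = (
        "[806.712946 chipset -> S3] [806.713230 power state 2 = S3, in 0x0019] "
        "[806.714239 chipset -> S5] [811.001871 chipset -> S3]"
    )
    print(chipset_transitions(sample))
    print(reset_went_through_s5(sample))
```

A log with no `chipset -> S5` entry does not prove RAM was preserved; it only shows that the EC did not record a transition to S5.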
,
Jun 1 2017
The sys_reset# pin on the SoC should be a formal warm reset request. It is not an INIT#-generating pin (RCIN#). Here's the doc: "System Reset: This pin forces an internal reset after being de-bounced. The PCH will reset immediately if the SMBus is idle; otherwise, it will wait up to 25 ms ±2 ms for the SMBus to idle before forcing a reset on the system." But resets are not asynchronous in these SoCs. Lots of things need to align for a warm reset to complete properly. I suspect things get into a bad enough state that the PMC times out on its warm reset sequence and forces a cold reset to clear whatever fault exists. There should be nuggets of information in PMC SRAM, but I'm not sure how to get that information out. Intel should know.
,
Jun 1 2017
But Duncan is correct. The EC pulses EC_PCH_RCIN_L (RCIN#) and not sys_reset. Sorry for the confusion. Didn't read this full bug. INIT# not working is a similar issue to an actual warm reset failing.
,
Jun 1 2017
#7 is interesting in that it shouldn't be a cold reset when sys_reset# is pulsed. When a device is functional does 'apreset' just do a warm reset? It should. In which case that should be good confirmation of state within the SoC being super screwed up to not be able to complete any of those sequences.
,
Jun 1 2017
,
Jun 2 2017
Re #18 @adurbin, sys_reset# triggers a soft reset in normal cases. The following log is what I captured while my device was running. I am not sure why it did a hard reset with crbug.com/712526.
> apreset cold
Issuing AP cold reset...
[61.037024 chipset_reset(1)]
> [62.170093 Port 80: 0x1002][62.186347 Port 80: 0x10][62.192347 Port 80: 0x24][62.199347 Port 80: 0x28][62.200347 Port 80: 0x90][62.203347 Port 80: 0x2a][62.409347 Port 80: 0x32][62.410347 Port 80: 0x33][62.411678 Port 80: 0x34][62.412347 Port 80: 0x92][62.416347 Port 80: 0x43][62.417347 Port 80: 0x5d][62.434347 Port 80: 0x55][62.522347 Port 80: 0x37][62.529347 Port 80: 0x38][62.531347 Port 80: 0x3b][62.532347 Port 80: 0x11][62.567691 Port 80: 0x80][62.568347 Port 80: 0x71][62.630347 Port 80: 0x93][62.665347 Port 80: 0x72][62.666347 Port 80: 0x24][62.667347 Port 80: 0x55][62.668347 Port 80: 0x73][62.683348 Port 80: 0x74][62.691347 Port 80: 0x75][62.692347 Port 80: 0x75][62.701347 Port 80: 0x93][62.717347 Port 80: 0x9b][62.718347 Port 80: 0x9b][62.719347 Port 80: 0x75][62.720347 Port 80: 0x75][62.721347 Port 80: 0x75][62.722347 Port 80: 0x75][62.723347 Port 80: 0x75][62.724347 Port 80: 0x75][62.725347 Port 80: 0x75][62.726347 Port 80: 0x75][62.727347 Port 80: 0x75][62.728347 Port 80: 0x75][62.729347 Port 80: 0x75][62.730347 Port 80: 0x75][62.732347 Port 80: 0x75][62.733347 Port 80: 0x75][62.735354 KB IRQ disable]
[62.738354 KS disable]
[62.740396 KB scancode set to 2]
[62.742474 KB IRQ enable]
[62.743438 KS enable]
[62.747347 Port 80: 0x77][62.762347 Port 80: 0x9c][62.768347 Port 80: 0x7a][62.805347 Port 80: 0x95][62.992347 Port 80: 0xaa][63.064773 Executing host reboot command 5]
[63.436288 KB disable]
[63.436573 KS disable]
[64.114931 HC 0x28 err 1]
[64.115531 HC 0x28 err 1]
,
Jun 2 2017
,
Jun 2 2017
I believe that's the PMC giving up trying to complete the warm reset sequence because the state in the SoC is bad. The giving up consists of bringing everything down to S5 and back up (cold reset) to clear whatever fault was in the SoC.
,
Jun 5 2017
In this case, the SoC is in a bad state and needs a full reset to recover. Since the original issue (crbug.com/712526) is closed, can we also close this one?
,
Jun 6 2017
So are we saying that there is no way to do a warm reset when the SoC is stuck in this state? How are we supposed to debug what caused the hard hang if we have to cold reset the EC - we will never be able to see the stack trace from the previous boot.
,
Jun 7 2017
Re #24, Hi Yung, sorry for the confusion. I wasn't saying that warm reset fails for every SoC freeze, only that it can't work for this one.
From Documentation/ramoops.txt: "Ramoops also supports software ECC protection of persistent memory regions. This might be useful when a hardware reset was used to bring the machine back to life (i.e. a watchdog triggered). In such cases, RAM may be somewhat corrupt, but usually it is restorable."
ECC protection is not enabled for Skylake devices. After I enabled it (ramoops.ecc=1) and tested with F3 + power button three times while the device was hung (crbug.com/712526), console-ramoops was preserved. Please refer to the attachments. Maybe we can enable this feature. I also submitted a CL, https://chromium-review.googlesource.com/#/c/526675/ .
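To confirm whether ECC was requested on a given boot, one option is to look for the `ramoops.ecc` parameter named above in the kernel command line (for example, the contents of /proc/cmdline). This is a minimal sketch with hypothetical helper names. It assumes that any non-zero value enables ECC, and it only inspects the command line, so ECC configured by other means would not be detected.

```python
def ramoops_ecc_value(cmdline):
    """Return the integer value of the last 'ramoops.ecc=' token, or None.

    None means the parameter is absent or its value is not a valid integer.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "ramoops.ecc" and sep:
            try:
                value = int(raw, 0)
            except ValueError:
                value = None
    return value


def ramoops_ecc_enabled(cmdline):
    """True if the command line requests ramoops ECC (assumed: non-zero value)."""
    value = ramoops_ecc_value(cmdline)
    return value is not None and value != 0


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(ramoops_ecc_enabled(f.read()))
    except OSError:
        print("kernel command line not readable on this system")
```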
,
Jul 14 2017
We are hearing reports of hard hangs where alt-volup-r couldn't warm reboot KBL systems either. Aaron, Duncan, Furquan, should we point alt-volup-r to pulse sys_reset# instead? We can start with the KBL platform and perhaps take that change back to SKL too.
,
Jul 14 2017
Yes, I think we should.
,
Jul 14 2017
Furquan, mind taking this on?
,
Jul 14 2017
Yes, I will take this up next.
,
Jul 18 2017
,
Jul 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/ec/+/ef73893a4b73e9181382530dd98dd28b1de52c38
commit ef73893a4b73e9181382530dd98dd28b1de52c38
Author: Furquan Shaikh <furquan@chromium.org>
Date: Wed Jul 19 04:50:33 2017
skylake: Use SYS_RESET signal to trigger warm and cold reset
RCIN# signal is known to not work properly for performing a warm reset when the CPU is in a bad state. This results in the common key combo (Alt-Volup-r) not working to reset the host. Thus, use SYS_RESET signal instead to trigger both cold and warm chipset reset.
BUG= chromium:721853
BRANCH=None
TEST=make -j buildall
Change-Id: I38663db96767d0aa03cd1aea0fe2a0cc5b771cd2
Signed-off-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/575947
Reviewed-by: Duncan Laurie <dlaurie@google.com>
[modify] https://crrev.com/ef73893a4b73e9181382530dd98dd28b1de52c38/power/skylake.c
,
Aug 4 2017
Change was pushed in #19. Marking as fixed.
,
Aug 7 2017
+dchan since we will need to schedule a firmware qual to push this out.
,
Aug 7 2017
,
Sep 8 2017
I'm sorry to ask such a basic question but was this issue pushed out in 60 or will it be included in 62?
,
Sep 8 2017
I don't find that question so basic: https://bugs.chromium.org/p/chromium/issues/detail?id=762325
When the Change-Id is preserved:
https://chromium-review.googlesource.com/q/Ib86fcd3afc75e623
When there's only one match you can't tell whether there was no cherry-pick versus the Change-Id was changed:
https://chromium-review.googlesource.com/q/I70c7491f52afa408a
https://chromium-review.googlesource.com/q/I38663db96767d0
Comment 1 by shyam.su...@intel.com
, May 12 2017