EC: .bss.slow currently has a race condition after loading the lfw loader. |
|||
Issue descriptionCONFIG_REPLACE_LOADER_WITH_BSS_SLOW was introduced as a way to regain some space by replacing the little firmware loader, which is only used to load the current image, with some extra variables that would be in the .bss section. However, in the event of a sysjump, the EC needs to reload the little firmware loader back from flash into RAM, in order to perform the jump. This tries to happen as late as possible in the sysjump process. The problem is such that once the loader is loaded back into RAM, occasionally some code may modify some variables that were in that region, thereby corrupting the loader which then leads to an exception when we try to jump to the new image. This is because we currently disable interrupts AFTER loading the loader. We should fix this in a minimal way that prevents any other code path to modify any variables in the .bss.slow section, while also not modifying several files in the EC code base. See previous attempts: https://chromium-review.googlesource.com/#/c/336870/ https://chromium-review.googlesource.com/#/c/336894/ One proposal is to essentially disable task switching and maybe just add a watchdog_reload() in place of sleeps in the related SPI transactions.
,
May 4 2016
I am attaching the logs for my flashrom continuous loop test described in https://code.google.com/p/chrome-os-partner/issues/detail?id=51438. This test was done on kunimitsu and the failure was seen after 82 iterations (164 flashrom writes). The exception seems to occur when loader is running. In each iteration, two flashrom write operations are invoked using two different EC images. The EC images are very different as verified by cmp -l and both images are built with the patch that disables task switching when reloading LFW.
,
May 4 2016
Well, the good thing is, is that is the same failure signature I've seen and it appears to be a watchdog reset in the loader. I was testing a new fix for that which seemed promising, but now I'm encountering failures in the new image. My current guess is that the hooks task is getting starved out. Although, I would have expected the auxiliary timer to fire before the watchdog did. I'll continue to investigate tomorrow.
,
May 6 2016
I've been continuing my investigation some more. With the latest patch series that I've uploaded to gerrit, I've been able to confine the watchdog resets to happening after we start executing the new image. (See https://chromium-review.googlesource.com/#/c/343021/ for the HEAD of the patch series). Now here's the part that I can't make sense of yet. When we sysjump to the new image, we encounter a watchdog reset without our auxiliary timer firing. I find that strange because the AUX timer is set to fire 500ms before the watchdog timer should fire. The watchdog is configured to fire after 1.6s. However, it seems that the watchdog is firing way sooner than the value that we programmed. This then causes no watchdog trace to be printed because our aux timer isn't even getting a chance to fire yet. Here are some snippets of the failures from my logs. These were all performed on a sentry. I'll place comments inline with "#". My test procedure is to sysjump between RO and RW while waiting 2 seconds in between. [1739.714190 ACPI query = 22] sysjump RW [1740.151762 Jumping to image RW] littlefw sentry_v1.9.179-82e8b89-dirty lfw-RW load [1740.154946 UART init'd after sysjump] [Image: RW, sentry_v1.9.179-82e8b89-dirty 2016-05-04 13:26:58 aaboagye@lithium.mtv.corp.google.com] [1740.155968 Inits done] [1740.156373 already in S0] [1740.156876 low power idle task started] [1740.173456 hash start 0x00040000 0x0001946c] [1740.176779 KB IRQ enable] [1740.176840 KB enable] [1740.1768littlefw sentry_v1.9.179-82e8b89-dirty lfw-RO load --- UART init'd after reboot --- [Reset cause: reset-pin watchdog] [Image: RO, sentry_v1.9.179-82e8b89-dirty 2016-05-04 13:26:58 aaboagye@lithium.mtv.corp.google.com] [0.001064 KB boot key 0] [0.001116 Inits done] Restarting system with PMIC. <B8>sy.<FB>littlefw sentry_v1.9.179-82e8b89-dirty lfw-RO load --- UART init'd after reboot --- There you can see that the EC reset right after init but not even 100ms had passed by since we completed our inits. I then thought that maybe the value that we were loading into the watchdog timer was wrong or getting corrupted somehow. Therefore, I wanted to repeat my test but print out the value of the count registers for both the AUX and watchdog timer after we had initialized them. Here is the failure from that test: [1138.320192 HC 0x117] DevId:1.1 SW:539 RW:1 sysjump RW [1138.734779 Jumping to image RW] littlefw sentry_v1.9.179-82e8b89-dirty lfw-RW load [1138.738017 UART init'd after sysjump] [Image: RW, sentry_v1.9.179-82e8b89-dirty 2016-05-04 14:48:02 aaboagye@lithium.mtv.corp.google.com] AUX CNT: 0x044c WDG CNT: 0x0634 [1138.739101 Inits done] [1138.739519 already in S0] [1138.739623 event set 0x00002000] [1138.739701 hostcmd init 0x2000] [1138.739776 HC 0x117] [1138.740273 low power idllittlefw sentry_v1.9.179-82e8b89-dirty lfw-RO load --- UART init'd after reboot --- [Reset cause: reset-pin watchdog] [Image: RO, sentry_v1.9.179-82e8b89-dirty 2016-05-04 14:48:02 aaboagye@lithium.mtv.corp.google.com] AUX CNT: 0x044c WDG CNT: 0x0634 [0.001125 KB boot key 0] [0.001177 Inits done] Restarting system with PMIC. Again, you see that the watchdog timer reset the EC very quickly. But, we also see that the values of the count registers are what we programmed them to be. At this point, I was wondering if it's perhaps some hardware issue. That maybe when we kick the watchdog, it doesn't actually reload the count, or maybe the watchdog just fires early for some reason. I decided to test the watchdog timer in isolation, that is not performing a sysjump. For that test, in a tight loop, I kick the watchdog and wait to see if we reset earlier that the value I configured. Unfortunately, for a total of ~20 hours, I was unable to reproduce a failure. I tried both longer watchdog times and the exact times that we are configuring. I also tried reloading the watchdog in 2 different ways: writing to the kick register and writing to the load value register. But, that seemed to not make a difference. I'm currently running out of ideas. Maybe the clock is glitching when we sysjump to the new image and that's causing the watchdog timer to count down much faster than expected? Or perhaps, the watchdog event signal gets randomly asserted when we sysjump which causes our reset. Just wanted to update you all with my progress. All things considered, this patchset is better than the existing situation, but the problem isn't solved yet. I've attached my full logs in case you're interested.
,
May 6 2016
> There you can see that the EC reset right after init but not even 100ms > had passed by since we completed our inits. Are you fully convinced that you are sure of the timeline here. Let's imagine the following scenario: less than 100ms after init, the system 'freezes' either with interrupts disabled or in some high priority interrupt loop. then the watchdog resets the system after its normal expiration at 1.6s. The trace would be exactly the same (but when looking at it interactively you might notice the 1s difference).
,
May 6 2016
Yeah, I'm convinced of the timeline because I'm pretty sure, I've seen it occur in this manner (the quick reset). I'll do it once more just to confirm. I have seen what you described in the past though, but this failure is different.
,
May 6 2016
Actually, I take my comment in #6 back. I reproduced the failure again (it took a longer time), and there *was* a pause. So maybe Vincent's right. The AUX timer interrupt is set to the highest configurable priority. If interrupts are enabled, I would have expected the early watchdog warning IRQ handler to preempt whatever interrupt was currently running (unless it has the same priority, which as far as I can tell is only the SVC handler?).
,
May 6 2016
> The AUX timer interrupt is set to the highest configurable priority. > If interrupts are enabled, I would have expected the early watchdog warning IRQ > handler to preempt whatever interrupt was currently running (unless it has the > same priority, which as far as I can tell is only the SVC handler?). Yes unless your stack pointer is corrupted (that's why the previous watchdog code was using a stackless implementation) or your UART is in a funky state. Overall, all bugs which could trigger a fault in exception context will end up in this state.
,
May 8 2016
I tried with the three patches. The exception occured after 20 iterations (40 flashrom writes) and during the execution of lfw. I don't understand the new approach to disabling task switching so not sure why it is happening. I do think that the original approach in https://chromium-review.googlesource.com/#/c/336894/2 is worth re-considering. When I tried a similar approach, I crossed 500 iterations with my flashrom script.
,
Aug 11 2016
These devices already shipped without a real fix for it... This approach had stalled out. In production, I think we only jump from RO -> RW when booting the system from G3. So I think the only side-effect is that the device may boot 1 second slower when this situation occurs (which is pretty rare). Maybe if there's time later, we can fix it. I don't have time to dig into this right now.
,
Aug 11 2017
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot |
|||
►
Sign in to add a comment |
|||
Comment 1 by aaboagye@chromium.org
, Apr 22 2016