Issue 908981

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Corrupted VBNV can cause bad behavior on Kevin

Project Member Reported by jwer...@chromium.org, Nov 27

Issue description

We got reports from multiple people that they can get into a weird state on Kevin where crossystem claims that dev_boot_usb=1, but the firmware screen Tab display shows dev_boot_usb=0. One of them sent me the attached contents of his RW_NVRAM section.

A hexdump shows that the RW_NVRAM region is corrupted in a way where we got interleaved used and empty NVRAM slots. RW_NVRAM is designed to always fill up from start to end, and the "current" slot is always the last one that is not empty (a slot is empty when all 16 of its bytes are 0xFF). When mosys (which powers crossystem) searches for the current slot, it loads the whole RW_NVRAM section from flash and does a linear search from the start. However, when coreboot or depthcharge try to find the slot, they only load single slots at a time and perform a binary search to find it faster and save boot time. This explains how they can have different opinions about which slot is currently active in a partition that is corrupted in this way.
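To illustrate the disagreement, here is a minimal Python sketch of the two search strategies. It is not the actual mosys, coreboot or depthcharge code; the function names, the synthetic 16-slot region and the exact stopping rules are assumptions made for illustration only.

```python
SLOT_SIZE = 16
EMPTY = b"\xff" * SLOT_SIZE


def make_region(used_flags):
    """Build a synthetic region: True -> used slot (zeros), False -> empty slot."""
    return b"".join(b"\x00" * SLOT_SIZE if used else EMPTY for used in used_flags)


def is_empty(region, index):
    off = index * SLOT_SIZE
    return region[off:off + SLOT_SIZE] == EMPTY


def linear_current(region):
    """Assumed mosys-style search: walk from the start, stop at the first empty slot."""
    count = len(region) // SLOT_SIZE
    for i in range(count):
        if is_empty(region, i):
            return i - 1 if i > 0 else None
    return count - 1


def binary_current(region):
    """Assumed firmware-style search: bisect for the used/empty boundary.

    Only correct if every used slot precedes every empty slot.
    """
    lo, hi = 0, len(region) // SLOT_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        if is_empty(region, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo - 1 if lo > 0 else None


# Healthy region: slots 0-5 used, the rest empty. Both searches agree.
healthy = make_region([True] * 6 + [False] * 10)

# Corrupted region: used 0-3, empty 4-5, used 6-9, empty 10-15.
corrupted = make_region([True] * 4 + [False] * 2 + [True] * 4 + [False] * 6)
```

On the healthy region both functions return slot 5. On the corrupted one the linear search stops at slot 3 while the bisection lands on slot 9, which is the kind of mismatch reported between crossystem and the firmware screen.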

First of all, we need to be able to detect and recover from this effectively. Since mosys already reads and writes the whole area every time anyway, it shouldn't be a problem to just have it scan through all slots and check if there are any more used ones behind the first empty one found. If so, it should decide which one to make "official" and erase/rewrite the whole flash. It should probably also log a UMA event so that we can track how often this happens.
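The detection step could look roughly like the following Python sketch. The helper names are hypothetical, and the repair policy shown (keep the last used slot and rewrite it into slot 0 of an otherwise erased region) is only a placeholder assumption, since which slot should become "official" is still an open question here. UMA logging is left out.

```python
SLOT_SIZE = 16
EMPTY = b"\xff" * SLOT_SIZE


def split_slots(region):
    """Split the region into SLOT_SIZE-byte slots (length assumed to be a multiple)."""
    return [region[i:i + SLOT_SIZE] for i in range(0, len(region), SLOT_SIZE)]


def find_stray_slots(region):
    """Return indices of used slots located behind the first empty slot."""
    slots = split_slots(region)
    try:
        first_empty = slots.index(EMPTY)
    except ValueError:
        return []  # region is completely full, nothing can be stray
    return [i for i in range(first_empty + 1, len(slots)) if slots[i] != EMPTY]


def repair(region):
    """Return a rewritten region if corruption is found, else the region unchanged.

    Placeholder policy (an assumption): the last used slot wins.
    """
    stray = find_stray_slots(region)
    if not stray:
        return region
    slots = split_slots(region)
    official = slots[stray[-1]]
    return official + EMPTY * (len(slots) - 1)


def slot(value):
    """Synthetic used slot filled with a single byte value."""
    return bytes([value]) * SLOT_SIZE


clean = slot(1) + EMPTY * 2
corrupted = slot(1) + slot(2) + EMPTY + slot(3) + EMPTY
```

With these inputs, `find_stray_slots(corrupted)` reports slot 3 and `repair(corrupted)` yields a region whose only used slot is the former slot 3, while `clean` passes through untouched.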

The second question is how the devices could get into this state in the first place, and whether we have a bug somewhere. It's possible that this just happens due to sudden power loss while NVRAM is being erased, which seems unavoidable but should be pretty rare. Kevin is the first device that used more than 4K (i.e. one page) of RW_NVRAM, and it is notable that all the corrupted parts in the sample file are located early in the second page.

However, it's also possible that one of our NVRAM-writers (coreboot, depthcharge and mosys... I think those are the only ones, right?) has a bug with interpreting areas larger than 4K. I also find the pattern in the sample file pretty odd... slots are used from 0x0 through 0x11B0, from 0x1200 through 0x1220 and from 0x1300 through 0x1430. 0x1200 and 0x1300 are slots that would come up pretty early during binary search, so a transient SPI read error that makes these look as if they had a bit cleared when they actually haven't could explain that pattern pretty well. It would probably be a good idea if somebody could do some testing on Kevin in different situations to see if we can reproduce either a systematic bug in finding the right slot, or a transient SPI read error. (I'd particularly look at depthcharge, since coreboot SPI read problems would probably be more visible, and we did a bunch of optimizations late in the game on Kevin to be able to increase the clock rate further... maybe we forgot to port something important to depthcharge?)
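For reference, the gaps implied by those offsets can be worked out with a few lines of Python. This only restates the ranges quoted above, assuming the quoted bounds are the first and last used slot of each run, 16-byte slots and 4K pages; it does not inspect the attached file.

```python
SLOT_SIZE = 16
PAGE_SIZE = 0x1000

# Inclusive (first slot offset, last slot offset) of each used run, as quoted above.
USED_RANGES = [(0x0, 0x11B0), (0x1200, 0x1220), (0x1300, 0x1430)]


def gaps(used_ranges):
    """Return inclusive (first, last) slot offsets of the empty runs between used runs."""
    result = []
    for (_, prev_last), (next_first, _) in zip(used_ranges, used_ranges[1:]):
        result.append((prev_last + SLOT_SIZE, next_first - SLOT_SIZE))
    return result


def describe(first, last):
    """Summarize one inclusive run of slots by index, length and 4K page."""
    return {
        "first_slot": first // SLOT_SIZE,
        "slot_count": (last - first) // SLOT_SIZE + 1,
        "page": first // PAGE_SIZE,
    }


for first, last in gaps(USED_RANGES):
    print(hex(first), hex(last), describe(first, last))
```

Under those assumptions the empty runs are 0x11C0 through 0x11F0 (4 slots) and 0x1230 through 0x12F0 (13 slots), both in the second 4K page.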

Assigning to Stefan for triage.
 
weird_nvram.bin
64.0 KB Download
Status: Assigned (was: Untriaged)
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.