mosys/flashrom are super unhappy w/ erased SPI flash |
|||
Issue description
If you happen to manage to boot up a system and your SPI flash exists but is erased then flashrom churns and churns and churns. It also blocks bootup in a bunch of cases.
---
Specifically my flash looks like this:
# flashrom -r /tmp/foo.bin
flashrom v0.9.9 : 622128c : Jul 05 2018 14:19:12 UTC on Linux 4.14.55 (aarch64)
Reading flash... SUCCESS
# hexdump -C /tmp/foo.bin
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
01000000
In my particular case I'm working on a board that can boot the BIOS from a source other than SPI flash which is why I'm able to boot but my SPI flash is empty.
---
If I look at "top" on my system I almost always see "flashrom" pegged at 100%. Running "ps auxf", I see things like this:
/bin/sh -ue /usr/bin/dev_debug_vboot --cleanup
\_ crossystem --all
\_ /usr/sbin/mosys nvram vboot read
\_ flashrom -p host -i RW_NVRAM:/tmp/flashrom_WQ2qL2 -r
---
There appear to be at least two problems here:
1. It sure seems like flashrom shouldn't be invoked quite so many times.
2. It appears that using the "-i RW_NVRAM" in the case of fully erased flash is much worse than just reading flash. It eats up 10x the CPU time and takes 20x as long. Specifically:
--
2a) During the following time CPU usage for flashrom hovers at ~8% on my system.
# time flashrom -r /tmp/foo.bin
flashrom v0.9.9 : 622128c : Jul 05 2018 14:19:12 UTC on Linux 4.14.55 (aarch64)
Reading flash... SUCCESS
real 0m16.914s
user 0m0.036s
sys 0m1.535s
2b) During the following time CPU usage for flashrom hovers at ~80% on my system.
# time flashrom -p host -i RW_NVRAM:/tmp/flashrom_WQ2qL2 -r
flashrom v0.9.9 : 622128c : Jul 05 2018 14:19:12 UTC on Linux 4.14.55 (aarch64)
No layout info is available.
FAILED
real 6m36.582s
user 0m5.736s
sys 4m55.899s
,
Jul 20
yes, when layout option is given and no map is present in the flash, flashrom ends up doing binary search for the fmap pattern in the flash, reading it again and again and again, reducing granularity to 1 byte. I should at least read it once and then search in the file instead, but it needs to decide when to give up searching in the chip.
,
Jul 20
@2: It seems like you could just keep the binary search algorithm but: - Always read a block at a time - Always cache the results That would at least not make it take 6 minutes to run. I wonder if you have any ideas of the other part of this bug: making it so we don't call flashrom over and over again in this case?
,
Jul 20
Truth be told, system running with the bios flash empty has never happened before, is this something we need to account for now? Also, I am not quite following: where do we call flashrom all over again? You mean crossystem --all results in multiple invocations of flashrom?
,
Jul 20
> is this something we need to account for now It's not a terribly critical bug, but it was pretty annoying to track down and the fix would be easy. If nobody else wants to do it I might just spend the hour to code it up so I don't have to deal with this again. > where do we call flashrom all over again? Right, it seems like crossystem -all keep reading flashrom over again over again. I guess maybe it's not a big deal if flashrom is fast, but when it's slow like this it causes problems... I guess the question is: where is the binary search algorithm useful? It seems like any time we do a significant amount of binary searching flashrom will be slow and thus crossystem will also be slow. Could this be something affecting real users?
,
Jul 20
the binary search is slow because the assumption is that fmap can be placed anywhere in the flash. We (chrome os) pretty much always align it at a few KB boundary (citation needed!), so we can just either give up if fmap is not found at say 4K granularity, or read the entire flashrom at that point. We could generate the layout map on first flashrom invocation and then have the rest use the generated layout instead of looking for it again and again. Maybe the layout file name could be put in an environment variable and then flashrom would use that file if the variable is set. I have been trying to improve flashrom for CCD use, I'll look into these enhancements too.
,
Jul 20
> We (chrome os) pretty much always align it at a few KB boundary (citation needed!) Maybe we could make a parameter for flashrom to disable the binary search?
,
Jul 26
> We could generate the layout map on first flashrom invocation and then have the rest use the generated layout instead of looking for it again and again. What would you store if you have *no* FMAP (per $subject)? Also, you'd need some validation/expiration policy, if you want to support a change in FMAP (e.g., eventually we program a proper FMAP to the flash). BTW, the funny thing is that flashrom already has a sort of caching methodology, but it only works when an FMAP is actually present. We check the output of 'crossystem fmap_base' to see if there's a *suggested* offset, and we check that first. If the FMAP is there, we're done; if not, only *then* do we fall back to the expensive binary search. 'crossystem fmap_base' gets its info from /proc/device-tree/firmware/chromeos/fmap-offset or similar (which is stashed by the BIOS). > Maybe we could make a parameter for flashrom to disable the binary search? And do what instead? Just look at 4KB strides? That probably would work. It's probably worth checking if that is fast enough for you.
,
Jul 26
> What would you store if you have *no* FMAP (per $subject)? an invocation requesting to use a certain section should fail in this case (and is probably failing today, I have not checked). > Maybe we could make a parameter for flashrom to disable the binary search? or maybe make a parameter to specify the finest granularity for this search. Also, note that I made a flashrom change recently where the fmap is looked up in the supplied binary image first, and only if not found falls back to looking in the resident image. This helps a lot in case of write operations when the supplied image is a proper chrome os bios blob.
,
Jul 26
> an invocation requesting to use a certain section should fail in this case (and is probably failing today, I have not checked). Yes...but only after failing to find an FMAP. The problem is that we want to optimize the currently-glacial process used to find that there is no FMAP. I'm just saying that I don't think your suggestion (as I understand it) would really improve things. > or maybe make a parameter to specify the finest granularity for this search. Sure. > Also, note that I made a flashrom change recently where the fmap is looked up in the supplied binary image first, and only if not found falls back to looking in the resident image. Doesn't really affect the original report, which is just about reading regions via mosys/crossystem. Those wouldn't provide a binary image to flash.
,
Sep 28
Is there anything we could do abou this. This looks like a popular reason to cause chrome to fail liveness ping and get killed (issue 883029). http://crash/26e3770dd0a3d093 And "top" output when session_manager kills chrome: ==== top - 18:30:19 up 0 min, 0 users, load average: 5.54, 1.39, 0.46 Tasks: 139 total, 6 running, 132 sleeping, 0 stopped, 1 zombie %Cpu0 : 64.9 us, 32.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 2.7 si, 0.0 st %Cpu1 : 36.1 us, 63.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 1877.1 total, 1069.4 free, 363.9 used, 443.8 buff/cache MiB Swap: 2749.7 total, 2749.7 free, 0.0 used. 1155.6 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1334 root 20 0 13224 1232 856 R 82.6 0.1 0:01.40 flashrom 895 chronos 20 0 594008 92596 24132 R 52.2 4.8 0:00.74 chrome 658 chronos 12 -8 848404 232488 86908 R 39.1 12.1 0:01.65 chrome 877 chronos 12 -8 318912 37712 15112 R 21.7 2.0 0:00.27 chrome 24 root 20 0 0 0 0 S 13.0 0.0 0:00.03 kworker/1:1 10 root 20 0 0 0 0 R 8.7 0.0 0:00.05 rcu_sched 353 syslog 20 0 297212 2480 1576 S 8.7 0.1 0:00.05 rsyslogd 817 root 20 0 0 0 0 S 4.3 0.0 0:00.12 kworker/0:2 1457 root 20 0 11400 1228 936 S 4.3 0.1 0:00.02 mosys 1 root 20 0 12024 2504 1340 S 0.0 0.1 0:00.29 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0 4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0 5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u:0 7 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/u:0H 8 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0 9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
,
Sep 28
without looking into much details, when flashrom is invoked to read a section from the chip, it tries to read the fmap from the chip to find the section boundary. If the chip is erased, looking for fmap takes very long time, because it is done as a binary search ending up reading the entire flash. is it known what exact flashrom invocation this is caused by?
,
Sep 28
How to find that out? Would full command line to flashrom help?
,
Sep 28
yes, this is what I meant to ask: full flash command line.
,
Sep 28
I could not repro it locally. The "top" output is collected when the crash happens in the field. I will update the collection code. Last time, it took about 1 week to get the data. I will ping back once I get something.
,
Nov 29
Got data back. The flashrom command line looks like the following: flashrom -p host -i FMAP -i RW_VPD:/tmp/vpd.flashrom.7eUpZE -r /tmp/vpd.flashrom.kyK2kO in http://crash/45ba8a29fc21f977. Trimmed top output: ==== op - 10:59:15 up 0 min, 0 users, load average: 5.71, 1.42, 0.48 Tasks: 145 total, 3 running, 141 sleeping, 0 stopped, 1 zombie %Cpu(s): 68.5 us, 31.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 1877.1 total, 1024.4 free, 298.4 used, 554.4 buff/cache MiB Swap: 2749.7 total, 2749.7 free, 0.0 used. 1142.0 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1358 root 20 0 21480 3440 932 R 44.0 0.2 0:03.15 flashrom -p host -i FMAP -i RW_VPD:/tmp/vpd.flashrom.7eUpZE -r /tmp/vpd.flashrom.kyK2kO 677 chronos 12 -8 753840 138540 65704 S 32.0 7.2 0:01.54 /opt/google/chrome/chrome --ppapi-flash-path=/opt/google/chrome/pepper/libpepflashplayer.so --ppapi-flash-version=31.0.0.148 --ui-prioritize-in-gpu-process --use-gl=egl --enable-native-gpu-memory-buffers --gpu-sandbox-failures-fatal=yes --enable-logging --log-level=1 --use-cras --enable-wayland-server --user-data-dir=/home/chronos --enable-features=MachineLearningService --login-profile=user --has-chromeos-keyboard --aura-legacy-power-bu+ 914 chronos 20 0 587096 88120 21644 R 24.0 4.6 0:00.51 /opt/google/chrome/chrome --type=renderer --enable-logging --log-level=1 --use-gl=egl --vmodule=nss_cert_database_chromeos=1,*/assistant/*=1,*chromeos/login/*=1,auto_enrollment_controller=1,*/ui/ozone/*=1,*/ui/display/manager/chromeos/*=1,*night_light*=1,update_engine=1,component_updater_service=1,power_button_observer=2,webui_login_view=2,lock_state_controller=2,webui_screen_locker=2,screen_locker=2 --field-trial-handle=1766717298446023+ 954 chronos 20 0 595264 77936 19020 S 24.0 4.1 0:00.50 /opt/google/chrome/chrome --type=renderer --enable-logging --log-level=1 --use-gl=egl --vmodule=nss_cert_database_chromeos=1,*/assistant/*=1,*chromeos/login/*=1,auto_enrollment_controller=1,*/ui/ozone/*=1,*/ui/display/manager/chromeos/*=1,*night_light*=1,update_engine=1,component_updater_service=1,power_button_observer=2,webui_login_view=2,lock_state_controller=2,webui_screen_locker=2,screen_locker=2 --field-trial-handle=1766717298446023+ 10 root 20 0 0 0 0 S 4.0 0.0 0:00.04 [rcu_sched] 15 root 20 0 0 0 0 S 4.0 0.0 0:00.21 [kworker/1:0] 61 root 0 -20 0 0 0 S 4.0 0.0 0:00.06 [kworker/0:1H] 884 chronos 12 -8 373480 78956 58816 S 4.0 4.1 0:00.51 /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=17667172984460237150,14811314750318420328,131072 --enable-features=MachineLearningService --gpu-sandbox-failures-fatal=yes --enable-logging --log-level=1 --vmodule=nss_cert_database_chromeos=1,*/assistant/*=1,*chromeos/login/*=1,auto_enrollment_controller=1,*/ui/ozone/*=1,*/ui/display/manager/chromeos/*=1,*night_light*=1,update_engine=1,component_updater_service=1,power_bu+ 1580 root 20 0 9544 1404 980 R 4.0 0.1 0:00.01 top -b -c -n1 -w512 1 root 20 0 12208 2628 1372 S 0.0 0.1 0:00.30 /sbin/init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kthreadd] 3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [ksoftirqd/0] 4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:0] 5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:0H]
,
Nov 29
Seems like the command line is constructed in flashromPartialRead with partition_name "RW_VPD" ? https://cs.corp.google.com/chromeos_public/src/platform/vpd/lib/flashrom.c?rcl=4649d42d10a7fa32f18da7e77e019891d1d7d940&l=63
,
Nov 29
Presumably a firmware image could be found on the target in the chromeos-firmwareupdate self extracting tarball, dump_fmap could be used to extract fmap from it, it can be converted into a layout file which flashrom can use. In this case it would not try retrieving layout from the currently installed flash image. We could also have the layout generated at build time and added to the chrome os image.
,
Nov 29
Thanks for the quick response. Is there anything we could do about it? Or we need more info the identify where the problem is? This has been around a while and now contributing to the current top #1 crash of beta release.
,
Nov 29
Pardon my ignorance, but how come there are many devices with their onboard SPI flash not programmed? Otherwise this problem would not show up, right?
,
Nov 29
Sorry, I knew almost nothing about firmware and flash update so not sure why we start to hit this problem. I was looking at the spike of SIGABRT crash of chrome and found part of the crash is due to chrome fails to respond to the liveness ping. However, the chrome stack in the crash reports shows that chrome is idle at the time of the kill. After I added "top" dump when we kill chrome, I found flashrom is pretty common when it happens. I am here in the hope that people cc'ed on the bug could shed some light on the situation.
,
Nov 29
re #21 but the problem this bug started with was specific for chrome os devices where the firmware SPI flash is empty, which is extremely unlikely to happen anywhere but in development environment. maybe the crashes you are looking ag are happening for a different reason.
,
Nov 30
Re #22: Fair enough. Filed issue 910411 and cc'ed you. Please help to cc relevant people and even better to find an owner for the new issue. And let me know what I should look at to find more clues for you guys. Thanks.
,
Nov 30
|
|||
►
Sign in to add a comment |
|||
Comment 1 by sjg@chromium.org
, Jul 20