nvmem ec test crashes
Project Member Reported by firstname.lastname@example.org, Apr 25
Seen on https://uberchromegw.corp.google.com/i/chromeos/builders/peach_pit-release/builds/2521/steps/UnitTest/logs/stdio

Looks like an intermittent failure; the test passed on retry. Emulator output from chromeos-ec-0.0.1-r3794:

====== Emulator output ======
No flash storage found. Initializing to 0xff.
No RAM data found. Initializing to 0x00.

--- Emulator initialized after reboot ---

[Reset cause: power-on]

Console input initialized
[0.000269 SW 0x04]

Running test_corrupt_nvmem...nvmem_find_partition:300 partiton 0 verification FAILED
nvmem_find_partition:300 partiton 1 verification FAILED
[0.009161 nvmem_find_partition: No Valid Partition found, will reinitialize!]
[0.053319 Active Nvmem partition set to 1]
OK
Running test_fully_erased_nvmem...nvmem_find_partition:300 partiton 0 verification FAILED
nvmem_find_partition:300 partiton 1 verification FAILED
[0.129293 nvmem_find_partition: No Valid Partition found, will reinitialize!]
[0.155206 Active Nvmem partition set to 1]
OK
Running test_configured_nvmem...[0.155226 Active Nvmem partition set to 1]
OK
Running test_write_read_sequence...OK
Running test_write_full_multi...Stack trace of task 6 (<< test runner >>):
#0  /lib64/libc.so.6(__open64+0x2d) [0x7fb31518ea2d]
    ??:0
#1  /lib64/libc.so.6(_IO_file_open+0x8e) [0x7fb31511fdfe]
    ??:0
#2  /lib64/libc.so.6(_IO_file_fopen+0xf0) [0x7fb31511ff60]
    ??:0
#3  /lib64/libc.so.6(+0x69e55) [0x7fb315113e55]
    ??:0
#4  build/host/nvmem/nvmem.exe(get_persistent_storage+0x81) [0x4070c3]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/chip/host/persistence.c:38
#5  build/host/nvmem/nvmem.exe() [0x406806]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/chip/host/flash.c:39
#6  build/host/nvmem/nvmem.exe(flash_physical_write+0x85) [0x406974]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/chip/host/flash.c:79
#7  build/host/nvmem/nvmem.exe() [0x40c78d]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/common/nvmem.c:148
#8  build/host/nvmem/nvmem.exe(nvmem_commit+0xf5) [0x40d045]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/common/nvmem.c:602
#9  build/host/nvmem/nvmem.exe() [0x412b5d]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/test/nvmem.c:134
#10 build/host/nvmem/nvmem.exe() [0x413246]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/test/nvmem.c:341
#11 build/host/nvmem/nvmem.exe(run_test+0x184) [0x413e58]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/test/nvmem.c:706
#12 build/host/nvmem/nvmem.exe(_run_test+0x11) [0x411a4d]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/core/host/task.c:77
#13 build/host/nvmem/nvmem.exe(_task_start_impl+0x79) [0x412301]
    /build/peach_pit/tmp/portage/chromeos-base/chromeos-ec-0.0.1-r3794/work/platform/ec/core/host/task.c:406 (discriminator 1)
#14 /lib64/libpthread.so.0(+0x74b4) [0x7fb3154594b4]
    ??:0
#15 /lib64/libc.so.6(clone+0x6d) [0x7fb31519c69d]
    ??:0

=============================
make: *** [Makefile.rules:213: run-nvmem] Error 1
Observed on daisy_spring too. In both logs, the failing subtest is nvmem_find_partition.
drinkcat, you're the last person who touched src/platform/ec. Tag, you're it.
cc-ing relevant people, from the output of `git log test/nvmem*`.
nvmem_find_partition isn't a subtest. In the log posted in the OP, the failure was in test_write_full_multi(). Do you have links to the other failures?
Found the daisy_spring instance: https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-release/builds/2400 The test failure there was in a different subtest, test_buffer_overflow().
However, in both cases, they seem to be failing on an fopen() when emulating flash on the host. Not sure why...  - https://chromium.googlesource.com/chromiumos/platform/ec/+/59bd55c41e02eab2ac01cd3adbdda1950f7fb09e/chip/host/persistence.c#38
Ah, there were some intermittent failures in these tests, so at one point Scott (scollyer@) removed a test from the set (I don't remember which one and can't check now) to avoid the problem. Then, when adding some more tests to the set, I found and fixed a bug in the harness, which I thought could have been the reason for the intermittent failures. I could not reproduce the problem on my setup, so I re-enabled the test. It has been a few months since then; I'm not sure if the problem just resurfaced or has been going on unnoticed. I will have to confirm that this is the same problem, though.
Issue 726371 has been merged into this issue.
Another instance occurred on daisy_spring today: http://shortn/_ykBOJK1Bty I think there's something still lurking here. As mentioned in the other bug, I think there might be something funky with the emulated flash backing store. Oh look it's already assigned to me... vbendeb@, were you able to confirm if the issue was the same in c#7?
This is the original bug I was talking about: https://b.corp.google.com/issues/35577053
Just a reminder that we are still seeing this. I don't have the logs in front of me, but it's still causing build failures.
I don't want to bump the priority because I'm not sure how the EC folks handle priorities, but this is still happening and is, at the very least, confusing sheriffs. If this kills another build, I'll strongly argue for immediately removing this test from the build.
Issue 730397 has been merged into this issue.
Is there a way to reproduce a builder environment at someone's desk? The error never happens when running on the desktop: every single repo upload requires the tests to be run, and they never fail on the desktop.
If you cannot reproduce with the same build on a desktop (building for the exact boards that are failing), then you should be able to:
- spawn a release trybot
- log into a release builder when a canary is not running and manually try to emerge that package/test
If it's the same test that is always failing, can you add extra debugging so that when it happens you have information to narrow it down?
This test is not running on any board; it is running on the workstation. Is there a description of this process ("spawn a release trybot")? Is it the same as cbuildbot? How do I log into it? Is this described anywhere?
Sorry, what I mean is that the test is failing when doing builds for certain boards. Is it failing across all boards, or just some of them? If just some of them, then make sure you're building for those exact boards.
1. cbuildbot --remote xyz-release (or cbuildbot --local xyz-release)
2. Logging in depends on the type of machine (either bare metal *-m2 or GCE instances, e.g. wimpy/standard/beefy). https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md is the documentation I know of.
I'd definitely try #1 first because #2 is a bit more work to get going.
Here is another similar instance (attaching log so it isn't lost to log rotation): https://uberchromegw.corp.google.com/i/chromeos/builders/chell-paladin/builds/1704/steps/UnitTest/logs/stdio In this case, it looks like fclose() is deadlocked or hung. I'm not sure what could be causing that. The man page mentions that calling fclose() twice on the same stream results in undefined behavior, but checking the code, that doesn't seem possible. I guess another task calling fclose() on our stream is within the realm of possibility. There are also some interesting interactions with 'dup' I found mentioned on stackoverflow, but I don't think those apply here.
Issue 755917 has been merged into this issue.
I see that we have "Test nvmem timed out after 10 seconds!" in #18, and the duplicated #19 (we lost the logs for #0). In chip/host/persistence.c, we actually try to open/close a file on the filesystem (build/host/nvmem/nvmem.exe_persist_flash). Could it be that CPU+I/O on the builder is just that busy (building other things in parallel, including other EC tests), that it can't complete the test in 10 seconds?! Looking at the latest example: https://uberchromegw.corp.google.com/i/chromeos/builders/poppy-release/builds/657/steps/UnitTest/logs/stdio chromeos-ec starts around 20:02:08, and fails a little before 20:11:31. Builder stats: https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?duration=6659s&hostname=cros-beefy386-c2&refresh=-1&utc_end=1502856659#_VG_XL6WGsQd CPU is almost at 100% during that time (but I/O is not at peak).
you mean the thread running open/close does not get to run, but the thread limiting execution time still runs?
Right, it might make more sense if this was an I/O issue (so open/close are very slow, but the timeout thread manages to run)...
I don't know how EC unit tests are executed, but wouldn't a timeout thread just sleep most of the time, without needing much CPU time, while the actual test would need a lot of CPU time?