nocturne provision failure: /dev/kvm is missing |
|||||||||||||||||||||||||||
Issue descriptionhttps://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926675800237399424 breaks 3 nocturne DUTs due to error: 12/19 15:20:37.793 INFO | repair:0354| Verifying this condition: /dev/kvm should exist if device supports Linux VMs 12/19 15:20:37.798 DEBUG| ssh_host:0310| Running (ssh) '[ ! -e /dev/kvm -a -f /usr/bin/vm_concierge ]' from '_verify_dependencies|_verify_list|_verify_host|verify|run|run_very_slowly' 12/19 15:20:38.209 ERROR| repair:0359| Failed: /dev/kvm should exist if device supports Linux VMs Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 356, in _verify_host self.verify(host) File "/usr/local/autotest/server/hosts/cros_repair.py", line 416, in verify raise hosts.AutoservVerifyError('/dev/kvm is missing') AutoservVerifyError: /dev/kvm is missing It also cause the following repair job fail: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1461297-repair/
,
Dec 20
repair ticket: b/121323310
,
Dec 20
,
Dec 20
,
Dec 20
,
Dec 20
wez: Shouldn't issue 902607 keep this from breaking the build?
,
Dec 20
jhawkins: wez and jkardatzke are OOO this week. Can you recommend someone to work on this?
,
Dec 20
Similar to issue 905696
,
Dec 20
,
Dec 20
issue 902607 doesn't seem related; that's about fuchsia, not crostini (aka "Linux VMs"). issue 905696 does seem relevant, but I don't know anything about this test framework or what recent changes could have affected this
,
Dec 20
,
Dec 20
,
Dec 20
Was the firmware updated recently? missing /dev/kvm has to be firmware, kernel, or kernel command line changes.
,
Dec 20
This looks like a cold reboot did not fix the problem with a missing /dev/kvm. But I have no idea why that's happening.
,
Dec 20
Did these systems ever work with /dev/kvm? Can we confirm they are running new-ish firmware?
,
Dec 21
+bleung, nvacarro on expected nocturne kvm support. /v/l/m has 2018-12-19T15:22:37.765249-08:00 ERR kernel: [ 0.188180] kvm: disabled by bios FW on the system is: Google_Nocturne.10984.67.0 Kernel: Linux version 4.4.167-15572-g5df2dcaf967c (chrome-bot@swarm-cros-457) Command line: cros_secure console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4710400 verity payload=PARTUUID=f875aa00-64f6-8743-b0fa-f0e84d8bee93/PARTNROFF=1 hashtree=PARTUUID=f875aa00-64f6-8743-b0fa-f0e84d8bee93/PARTNROFF=1 hashstart=4710400 alg=sha1 root_hexdigest=759bfecc5c980031f15bc853d8aab8a0d9b2786b salt=0634b15d228c5c21cfa707bfb80139b788ef3e436c65305e0bcba07f6db3246c" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=f875aa00-64f6-8743-b0fa-f0e84d8bee93 add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic snd_soc_dmic.modeswitch_delay_ms=35
,
Dec 26
FTR, nocturne DUTs are still dying in the lab due to this ToT bug. So, we don't have enough working nocturne DUTs for CQ / release testing. TreeCloser ToT bugs in the product should be dealt with as P0
,
Dec 27
Resetting components based on best guess from the assigned owner.
,
Dec 27
Nocturne supports KVM; this should go to firmware team.
,
Dec 27
There was no firmware update between failing runs and passing runs. Assign to dtor to root cause. pprabhu@: Why is provision running test code? This causes failures but doesn't assign them to a changelist. Can you remove the kvm check from provision?
,
Dec 27
Re #20, that check was added in https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1262395/ But that CL tells me that this verifier failure should simply lead to a specific repair step being attempted (repair.coldreboot). The problem here is that the repair action failed, because of: 12/19 15:38:09.646 ERROR| cros_repair:0448| Unexpected failure in log collection during coldboot. Traceback (most recent call last): File "/usr/local/autotest/server/hosts/cros_repair.py", line 441, in _check_reset_success self._collect_logs(host) File "/usr/local/autotest/server/hosts/cros_repair.py", line 433, in _collect_logs crashcollect.get_crashinfo(host, None) File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 492, in wrapper return fn(*args, **kwargs) File "/usr/local/autotest/server/crashcollect.py", line 232, in get_crashinfo os.makedirs(log_path) File "/usr/lib/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 17] File exists: '/usr/local/autotest/results/hosts/chromeos6-row5-rack16-host1/1461297-repair/20181912152100/crashinfo.chromeos6-row5-rack16-host1/var' And after that, the verifier failed *again*, i.e., /dev/kvm was not restored by the coldboot. So a few leads here: [1] repair.coldboot shouldn't fail because of crash collection failure (crash collection as a separate step should be blocking, but as a part of repair should according to me not even be attempted for each repair action) [2] Why did repair.coldboot not succeed in restoring /dev/kvm [2.1] either: repair.coldboot failed before doing the ec reset [2.2] or: a coldboot wasn't enough to resotre /dev/kvm: so there is still a problem, either with nocturne (which refuses to enable VM when it should) or with the verify-repair pair added in CL:1262395. I'll chase [1] and try to determine which of [2.1 / 2.2] is the case, then reassign this bug accordingly.
,
Dec 27
Re #20[1]: The repair did ignore the non-critical log collection failure, but didn't log it as such. Uploaded https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1391574 to clarify logs. So, repair passed, but /dev/kvm refused to be restored. @jkardatzke: Why was this verify-repair pair added, and what does it mean if the cold reboot fails to restore /dev/kvm? Is this is an issue with the nocturne DUTs / nocturne OS / the verify-repair pair? My guess is that we were on the right track initially: something in the nocturne OS image changed so that the VM isn't getting enabled, even after a cold reboot. Also, the question raised in #20 is still valid: Why should repair care about VM being enabled. Provision / repair should only ensure a basic sane DUT -- most OS features should be tested by specific tests for those features. If all OS features start blocking repair, we'll end up with broken DUTs for a large variety of reasons and a poor signal for what exactly is wrong with the DUT. I don't see any discussion of this on the relevant bug ( issue 892782 ), so I ask it here. Why should all ChromeOS DUTs in the lab have a working VM setup? The answer has to be one off: - DUT without working VM makes no sense. - All autotest tests need the working VM. - Tell me more. If there is no good reason for this, the verify-repair pair should be removed in favor of a specific autotest test that belongs in suite:bvt-cq That said, for this bug, we still need to figure out what's broke on nocturne right now, which is making this verify-repair pair fail.
,
Dec 27
I tries the first failing build on my nocturne (R73-11432.0.0-rc1). /dev/kvm was present on the first build, but when I updated the firmware (chromeos-firmwareupdate --factory) from 10984.67.0 to 10984.69.0 /deev/kvm disappeared.
,
Dec 27
Hmm, I might have been too quick to blame, investigating more.
,
Dec 27
OK, so my setup: - 72.0.3626.37 (latest beta) on SSD - 73.0.3644.0 (the fist failing build of Nocturne with KVM not found) on USB Results: - Latest beta always detects KVM - If I simply reboot into 73 all is fine as well - KVM is NOT there if I try to cold reboot into 73 (via ssh <host> ectool reboot_ec cold) - KVM is detected just fine on 72 (both normal and cold reboot) if I update the kernel to ToT This does not look to me like kernel issue given that newer kernels work fine on the beta build and to me that leaves firmware...
,
Jan 3
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/597a31cd288576e183113d290ab591a659b4bdb8 commit 597a31cd288576e183113d290ab591a659b4bdb8 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Jan 03 23:04:14 2019 autotest: Explicitly note a non-critical exception. BUG= chromium:917053 Change-Id: Ia9caf83f284a82afaf1ceb55ccb5d112b14281f4 Reviewed-on: https://chromium-review.googlesource.com/1391574 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/597a31cd288576e183113d290ab591a659b4bdb8/server/hosts/cros_repair.py
,
Jan 4
Can you try just changing the firmware rather than the entire image except kernel? You can do this by copying the chromeos-firmwareupdate from the older one onto the newer one and running it. If the old firmware on the new system always passes that's a good indication. Can you comment on what firmware would need to do to make /dev/kvm appear, and what the reasons might be that kernel wouldn't create it?
,
Jan 4
Nick, I am OOO till mid next week, you'll have to try experimenting yourself in the mean time. As far as I know the kernel checks CPUID-related facilities to see if virtualization is enabled and if BIOS/firmware disables VT-x it will not create the KVM node.
,
Jan 7
There weren't any changes in firmware to disable KVM, so I think it's unlikely a fw issue given comment #25 states that KVM is not disabled (re: "KVM is detected just fine on 72 (both normal and cold reboot) if I update the kernel to ToT", but maybe something else / unrelated changed in fw that somehow effected this). I verified nocturne (still) has the flag set to enable KVM, making me further doubt that this is a fw issue : nocturne/devicetree.cb: register "VmxEnable" = "1"
,
Jan 7
I will look further into why the kernel logs that it is disabled by bios.
,
Jan 9
,
Jan 9
Issue 920055 has been merged into this issue.
,
Jan 9
,
Jan 9
,
Jan 9
I have a nocturne here that is running R73-11425.0.0 and /dev/kvm was present. I was running firmware 10984.66.0. I then updated to 10984.69.0 and did a cold reboot and /dev/kvm is still present.
I'll try running a newer OS image and see if /dev/kvm remains or not. See below for my logs.
☻ dssh 100.90.31.22
SSHin' to DUT @ 100.90.31.22...
localhost ~ # crossystem fwid
Google_Nocturne.10984.66.0localhost ~ #
localhost ~ #
localhost ~ #
localhost ~ # cat /etc/lsb-release
CHROMEOS_RELEASE_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_BOARD_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=5187506
CHROMEOS_ARC_ANDROID_SDK_VERSION=28
GOOGLE_RELEASE=11425.0.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=nocturne-release/R73-11425.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=11425
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=73
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=11425.0.0 (Official Build) dev-channel nocturne test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=nocturne
CHROMEOS_RELEASE_VERSION=11425.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
localhost ~ # ls -l /dev/kvm
crw-rw----. 1 root crosvm 10, 232 Jan 8 16:13 /dev/kvm
localhost ~ #
localhost ~ # chromeos-firmwareupdate -m autoupdate
Machine Model: nocturne
Write Protect: HW=0/0 SW=1
Last Boot Version: RO=Google_Nocturne.10984.66.0 ACT/A=Google_Nocturne.10984.66.0
Firmware Updater: RO=Google_Nocturne.10984.28.0 RW=Google_Nocturne.10984.69.0
>> Starting firmware updater.
>> Target image: bios.bin (RO:Google_Nocturne.10984.28.0, RW/A:Google_Nocturne.10984.69.0, RW/B:Google_Nocturne.10984.69.0).
INFO: Loading current system firmware...
>> Current system: /tmp/fwupdater.u2f0dx (RO:Google_Nocturne.10984.66.0, RW/A:Google_Nocturne.10984.66.0, RW/B:Google_Nocturne.10984.66.0).
>> Write protection: 0 (disabled; HW=0, SW=1).
WARNING: RO changed and no WP. Need full update.
>> FULL UPDATE: Updating whole firmware image(s), RO+RW.
INFO: Checking compatibility...
INFO: Executing: flashrom -w /tmp/fwupdater.bK6xVF -p host
flashrom v0.9.9 : 2c7bca0 : Dec 09 2018 06:59:50 UTC on Linux 4.4.167-15564-gb6a09eb6c05a (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7aabf000.
WARNING: SPI Configuration Lockdown activated.
Erasing and writing flash chip... Verifying flash... VERIFIED.
SUCCESS
INFO: Executing: flashrom -w /tmp/fwupdater.XAo3Eo -p ec
flashrom v0.9.9 : 2c7bca0 : Dec 09 2018 06:59:50 UTC on Linux 4.4.167-15564-gb6a09eb6c05a (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7aabf000.
WARNING: SPI Configuration Lockdown activated.
Skip jumping to RO
Erasing and writing flash chip... Verifying flash... VERIFIED.
SUCCESS
>> DONE: Firmware updater exits successfully.
***reboot the EC***
~
☻ dssh 100.90.31.22
SSHin' to DUT @ 100.90.31.22...
localhost ~ # crossystem fwid
Google_Nocturne.10984.69.0localhost ~ #
localhost ~ # ls -l /dev/kvm
crw-rw----. 1 root crosvm 10, 232 Jan 9 13:51 /dev/kvm
localhost ~ # cat /etc/lsb-release
CHROMEOS_RELEASE_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_BOARD_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=5187506
CHROMEOS_ARC_ANDROID_SDK_VERSION=28
GOOGLE_RELEASE=11425.0.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=nocturne-release/R73-11425.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=11425
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=73
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=11425.0.0 (Official Build) dev-channel nocturne test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=nocturne
CHROMEOS_RELEASE_VERSION=11425.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
localhost ~ #
,
Jan 10
I tried rolling forward to R73-11552.0.0, and found that /dev/kvm was missing. This was with firmware 10984.69.0. I then tried rolling the firmware back to 10984.67.0 and /dev/kvm was still missing. Therefore, I think it's something within the image and not necessarily the firmware. Here's the kernel command line on this test image: cros_secure console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4710400 verity payload=PARTUUID=f37bb06d-605c-134f-801c-b81b548140c4/PARTNROFF=1 hashtree=PARTUUID=f37bb06d-605c-134f-801c-b81b548140c4/PARTNROFF=1 hashstart=4710400 alg=sha1 root_hexdigest=35bd93152901f9a1c2573fcd94be49a4d345c415 salt=9c3eb737e5e26e66a2d72e16524727a854b4a6fc5d0f7cca9e20b52fabd7db17" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=f37bb06d-605c-134f-801c-b81b548140c4 add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic snd_soc_dmic.modeswitch_delay_ms=35
,
Jan 10
Since nocturne is a 4.4 device it needs a kernel command line option of the form "disablevmx=off". I don't see that in command line from comment 36.
,
Jan 10
Could it be related to this? https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1381297/
,
Jan 10
Oh, that's a good tip chirantan@, I think I know what happened then. The poppy baseboard defines that in build_kernel_image.sh, but when intel added the change to add that new snd_soc_dmic.modeswitch_delay_ms parameter, they defined it in nocturne, and therefore we lost that? Attached is a dmesg from my DUT as well.
,
Jan 10
drinkcat@, yep :) bleung@ has since reverted that here: https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1404356 So maybe the next canary which has that change will just work.
,
Jan 10
Yes, it was almost certainly caused by https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1381297/ That patch is being reverted in https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1404356 because apparently the name of the option is wrong. So this will likely end up being fixed but for unrelated reasons.
,
Jan 10
Great investigation aaboagye. It looks like there was a kernel command line modification landed that overrode the base board's modify_kernel_command_line function, so it is no longer told to enable VMs. Yeah, that one drinkcat. It lost a bunch of other args too. good thing this one had a test! I'll copy the rest from the baseboard I guess, I don't want to disturbe atlas that shares the krabbylake parent.
,
Jan 10
So the other question is why did we mark the nocturne builder experimental instead of reverting the bad change in the first place?
,
Jan 10
I think we didn't figure out this was the bad change until now ,-(
,
Jan 10
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/852b89749fdb2618c2cf357d1d81f9b64cb805d7 commit 852b89749fdb2618c2cf357d1d81f9b64cb805d7 Author: Benson Leung <bleung@google.com> Date: Thu Jan 10 01:49:31 2019 Revert "nocturne: set modeswitch_delay_ms=35 in kernel commandline" This reverts commit 27e2929e4e86ef05e8f0ac5b64cce063baa1876a. Reason for revert: Two problems. 1. This change landed without updating the whitelist in signer (see chromium:920408), so this broke signing recovery images for 3 weeks. 2. modeswitch_delay_ms doesn't seem to be the right module param name. The one introduced in https://patchwork.kernel.org/patch/10702955/ is "modeswitch_delay" so I'm convinced this change does nothing. Original change's description: > nocturne: set modeswitch_delay_ms=35 in kernel commandline > > Set snd_soc_dmic.modeswitch_delay_ms=35 in kernel command line to enable > a delay between DMIC mode change. This is to enable the new module param > introduced in patch https://patchwork.kernel.org/patch/10702955/ > > BUG=b:117679624 > TEST=Verify DMIC capture on nocturne after cold reboot > > Change-Id: Ia4d40a92c497152f63fe22de1328a6bbe6cf2646 > Signed-off-by: Jenny TC <jenny.tc@intel.com> > Reviewed-on: https://chromium-review.googlesource.com/1381297 > Commit-Ready: Nick Vaccaro <nvaccaro@google.com> > Tested-by: Nick Vaccaro <nvaccaro@google.com> > Reviewed-by: Nick Vaccaro <nvaccaro@google.com> > Reviewed-by: Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Benson Leung <bleung@chromium.org> Bug=b:117679624, chromium:920408 , chromium:917053 TEST=SignerTest passes on nocturne canary (M-73). Change-Id: I4bb6b64919992a8800fa89367166cdc5dcfa11d0 Reviewed-on: https://chromium-review.googlesource.com/c/1404356 Commit-Queue: Nicolas Boichat <drinkcat@chromium.org> Tested-by: Nicolas Boichat <drinkcat@chromium.org> Trybot-Ready: Nicolas Boichat <drinkcat@chromium.org> Reviewed-by: Chirantan Ekbote <chirantan@chromium.org> Reviewed-by: Bob Moragues <moragues@chromium.org> [delete] https://crrev.com/51648c27ccfc458b5b53c9b0d1824f05b7a59180/overlay-nocturne/scripts/build_kernel_image.sh
,
Jan 10
nocturne-paladin is fine now: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924726364516887296 |
|||||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||||
Comment 1 by xixuan@chromium.org
, Dec 20