New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 917053 link

Starred by 7 users

Issue metadata

Status: Verified
Owner:
Closed: Jan 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: ----



Sign in to add a comment

nocturne provision failure: /dev/kvm is missing

Project Member Reported by xixuan@chromium.org, Dec 20

Issue description

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926675800237399424 breaks 3 nocturne DUTs due to error:

12/19 15:20:37.793 INFO |            repair:0354| Verifying this condition: /dev/kvm should exist if device supports Linux VMs
12/19 15:20:37.798 DEBUG|          ssh_host:0310| Running (ssh) '[ ! -e /dev/kvm -a -f /usr/bin/vm_concierge ]' from '_verify_dependencies|_verify_list|_verify_host|verify|run|run_very_slowly'
12/19 15:20:38.209 ERROR|            repair:0359| Failed: /dev/kvm should exist if device supports Linux VMs
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 356, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 416, in verify
    raise hosts.AutoservVerifyError('/dev/kvm is missing')
AutoservVerifyError: /dev/kvm is missing

It also cause the following repair job fail:

https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1461297-repair/


 
Assign this to sheriff to investigate.

In the meanwhile, I will send these broken DUTs to lab for manually repair and mark nocturne as experimental.
repair ticket: b/121323310
Cc: moragues@chromium.org matth...@chromium.org
Cc: nsanders@chromium.org
Cc: jkardatzke@chromium.org
Cc: w...@chromium.org
Owner: w...@chromium.org
wez: Shouldn't  issue 902607  keep this from breaking the build?
Owner: jhawkins@chromium.org
jhawkins: wez and jkardatzke are OOO this week.  Can you recommend someone to work on this?

Similar to  issue 905696 
Owner: mutexlox@chromium.org
Owner: dgreid@chromium.org
 issue 902607  doesn't seem related; that's about fuchsia, not crostini (aka "Linux VMs").  issue 905696  does seem relevant, but I don't know anything about this test framework or what recent changes could have affected this
Cc: -jkardatzke@chromium.org mutexlox@chromium.org
Cc: jkardatzke@chromium.org
Was the firmware updated recently?

missing /dev/kvm has to be firmware, kernel, or kernel command line changes.
This looks like a cold reboot did not fix the problem with a missing /dev/kvm. But I have no idea why that's happening.
Did these systems ever work with /dev/kvm? Can we confirm they are running new-ish firmware?
Cc: nvaccaro@chromium.org bleung@chromium.org
+bleung, nvacarro on expected nocturne kvm support.

/v/l/m has
2018-12-19T15:22:37.765249-08:00 ERR kernel: [    0.188180] kvm: disabled by bios

FW on the system is:
Google_Nocturne.10984.67.0

Kernel:
Linux version 4.4.167-15572-g5df2dcaf967c (chrome-bot@swarm-cros-457)

Command line: cros_secure console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4710400 verity payload=PARTUUID=f875aa00-64f6-8743-b0fa-f0e84d8bee93/PARTNROFF=1 hashtree=PARTUUID=f875aa00-64f6-8743-b0fa-f0e84d8bee93/PARTNROFF=1 hashstart=4710400 alg=sha1 root_hexdigest=759bfecc5c980031f15bc853d8aab8a0d9b2786b salt=0634b15d228c5c21cfa707bfb80139b788ef3e436c65305e0bcba07f6db3246c" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=f875aa00-64f6-8743-b0fa-f0e84d8bee93 add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic snd_soc_dmic.modeswitch_delay_ms=35 
Labels: -Pri-1 Hotlist-Deputy Pri-0
Status: Assigned (was: Untriaged)
FTR, nocturne DUTs are still dying in the lab due to this ToT bug.

So, we don't have enough working nocturne DUTs for CQ / release testing.
TreeCloser ToT bugs in the product should be dealt with as P0
Components: -Infra>Client>ChromeOS>Test OS>Systems>Containers
Resetting components based on best guess from the assigned owner.
Cc: furquan@chromium.org
Components: -OS>Systems>Containers OS>Firmware>BIOS
Nocturne supports KVM; this should go to firmware team.
Components: -OS>Firmware>BIOS OS>Systems>Containers
Owner: dtor@chromium.org
There was no firmware update between failing runs and passing runs. Assign to dtor to root cause.

pprabhu@: Why is provision running test code? This causes failures but doesn't assign them to a changelist. Can you remove the kvm check from provision?


Components: Infra>Client>ChromeOS>Test>Platform
Owner: pprabhu@chromium.org
Re #20, that check was added in https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1262395/

But that CL tells me that this verifier failure should simply lead to a specific repair step being attempted (repair.coldreboot).

The problem here is that the repair action failed, because of:
12/19 15:38:09.646 ERROR|       cros_repair:0448| Unexpected failure in log collection during coldboot.
Traceback (most recent call last):
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 441, in _check_reset_success
    self._collect_logs(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 433, in _collect_logs
    crashcollect.get_crashinfo(host, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 492, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 232, in get_crashinfo
    os.makedirs(log_path)
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/usr/local/autotest/results/hosts/chromeos6-row5-rack16-host1/1461297-repair/20181912152100/crashinfo.chromeos6-row5-rack16-host1/var'


And after that, the verifier failed *again*, i.e., /dev/kvm was not restored by the coldboot.

So a few leads here:
[1] repair.coldboot shouldn't fail because of crash collection failure (crash collection as a separate step should be blocking, but as a part of repair should according to me not even be attempted for each repair action)
[2] Why did repair.coldboot not succeed in restoring /dev/kvm
  [2.1] either: repair.coldboot failed before doing the ec reset
  [2.2] or: a coldboot wasn't enough to resotre /dev/kvm: so there is still a problem, either with nocturne (which refuses to enable VM when it should) or with the verify-repair pair added in CL:1262395.

I'll chase [1] and try to determine which of [2.1 / 2.2] is the case, then reassign this bug accordingly.
Owner: jkardatzke@chromium.org
Re #20[1]: The repair did ignore the non-critical log collection failure, but didn't log it as such. Uploaded https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1391574 to clarify logs.

So, repair passed, but /dev/kvm refused to be restored.

@jkardatzke: Why was this verify-repair pair added, and what does it mean if the cold reboot fails to restore /dev/kvm? Is this is an issue with the nocturne DUTs / nocturne OS / the verify-repair pair?

My guess is that we were on the right track initially: something in the nocturne OS image changed so that the VM isn't getting enabled, even after a cold reboot.

Also, the question raised in #20 is still valid: Why should repair care about VM being enabled. Provision / repair should only ensure a basic sane DUT -- most OS features should be tested by specific tests for those features. If all OS features start blocking repair, we'll end up with broken DUTs for a large variety of reasons and a poor signal for what exactly is wrong with the DUT.

I don't see any discussion of this on the relevant bug ( issue 892782 ), so I ask it here. Why should all ChromeOS DUTs in the lab have a working VM setup?
The answer has to be one off:

- DUT without working VM makes no sense.
- All autotest tests need the working VM.
- Tell me more.

If there is no good reason for this, the verify-repair pair should be removed in favor of a specific autotest test that belongs in suite:bvt-cq


That said, for this bug, we still need to figure out what's broke on nocturne right now, which is making this verify-repair pair fail.
Components: -OS>Systems>Containers -Infra>Client>ChromeOS>Test>Platform OS>Firmware>BIOS
Owner: nsanders@chromium.org
I tries the first failing build on my nocturne (R73-11432.0.0-rc1). /dev/kvm was present on the first build, but when I updated the firmware (chromeos-firmwareupdate --factory) from 10984.67.0 to 10984.69.0 /deev/kvm disappeared.
Owner: dtor@google.com
Hmm, I might have been too quick to blame, investigating more.
Owner: nsanders@chromium.org
OK, so my setup:

- 72.0.3626.37 (latest beta) on SSD
- 73.0.3644.0 (the fist failing build of Nocturne with KVM not found) on USB

Results:

- Latest beta always detects KVM
- If I simply reboot into 73 all is fine as well
- KVM is NOT there if I try to cold reboot into 73 (via ssh <host> ectool reboot_ec cold)
- KVM is detected just fine on 72 (both normal and cold reboot) if I update the kernel to ToT

This does not look to me like kernel issue given that newer kernels work fine on the beta build and to me that leaves firmware...


Project Member

Comment 26 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/597a31cd288576e183113d290ab591a659b4bdb8

commit 597a31cd288576e183113d290ab591a659b4bdb8
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 03 23:04:14 2019

autotest: Explicitly note a non-critical exception.

BUG= chromium:917053 

Change-Id: Ia9caf83f284a82afaf1ceb55ccb5d112b14281f4
Reviewed-on: https://chromium-review.googlesource.com/1391574
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/597a31cd288576e183113d290ab591a659b4bdb8/server/hosts/cros_repair.py

Owner: dtor@chromium.org
Can you try just changing the firmware rather than the entire image except kernel? You can do this by copying the chromeos-firmwareupdate from the older one onto the newer one and running it. 

If the old firmware on the new system always passes that's a good indication. 

Can you comment on what firmware would need to do to make /dev/kvm appear, and what the reasons might be that kernel wouldn't create it?
Owner: nsanders@chromium.org
Nick, I am OOO till mid next week, you'll have to try experimenting yourself in the mean time. As far as I know the kernel checks CPUID-related facilities to see if virtualization is enabled and if BIOS/firmware disables VT-x it will not create the KVM node. 
There weren't any changes in firmware to disable KVM, so I think it's unlikely a fw issue given comment #25 states that KVM is not disabled (re: "KVM is detected just fine on 72 (both normal and cold reboot) if I update the kernel to ToT", but maybe something else / unrelated changed in fw that somehow effected this).

I verified nocturne (still) has the flag set to enable KVM, making me further doubt that this is a fw issue :

nocturne/devicetree.cb:	register "VmxEnable" = "1"

I will look further into why the kernel logs that it is disabled by bios.
Cc: aaboagye@chromium.org
Cc: akes...@chromium.org abhishekbh@chromium.org chirantan@chromium.org dgreid@chromium.org
 Issue 920055  has been merged into this issue.
Cc: -aaboagye@chromium.org
Cc: aaboagye@chromium.org
I have a nocturne here that is running R73-11425.0.0 and /dev/kvm was present. I was running firmware 10984.66.0. I then updated to 10984.69.0 and did a cold reboot and /dev/kvm is still present.

I'll try running a newer OS image and see if /dev/kvm remains or not. See below for my logs.



☻  dssh 100.90.31.22
SSHin' to DUT @ 100.90.31.22...
localhost ~ # crossystem fwid
Google_Nocturne.10984.66.0localhost ~ # 
localhost ~ # 
localhost ~ # 
localhost ~ # cat /etc/lsb-release 
CHROMEOS_RELEASE_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_BOARD_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=5187506
CHROMEOS_ARC_ANDROID_SDK_VERSION=28
GOOGLE_RELEASE=11425.0.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=nocturne-release/R73-11425.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=11425
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=73
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=11425.0.0 (Official Build) dev-channel nocturne test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=nocturne
CHROMEOS_RELEASE_VERSION=11425.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
localhost ~ # ls -l /dev/kvm
crw-rw----. 1 root crosvm 10, 232 Jan  8 16:13 /dev/kvm
localhost ~ # 

localhost ~ # chromeos-firmwareupdate -m autoupdate

  Machine Model: nocturne
  Write Protect: HW=0/0 SW=1
  Last Boot Version: RO=Google_Nocturne.10984.66.0 ACT/A=Google_Nocturne.10984.66.0
  Firmware Updater:  RO=Google_Nocturne.10984.28.0    RW=Google_Nocturne.10984.69.0
  
>> Starting firmware updater.
>> Target image: bios.bin (RO:Google_Nocturne.10984.28.0, RW/A:Google_Nocturne.10984.69.0, RW/B:Google_Nocturne.10984.69.0).
INFO: Loading current system firmware...
>> Current system: /tmp/fwupdater.u2f0dx (RO:Google_Nocturne.10984.66.0, RW/A:Google_Nocturne.10984.66.0, RW/B:Google_Nocturne.10984.66.0).
>> Write protection: 0 (disabled; HW=0, SW=1).
WARNING: RO changed and no WP. Need full update.
>> FULL UPDATE: Updating whole firmware image(s), RO+RW.
INFO: Checking compatibility...
INFO: Executing: flashrom -w /tmp/fwupdater.bK6xVF -p host    
flashrom v0.9.9  : 2c7bca0 : Dec 09 2018 06:59:50 UTC on Linux 4.4.167-15564-gb6a09eb6c05a (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7aabf000.
WARNING: SPI Configuration Lockdown activated.
Erasing and writing flash chip... Verifying flash... VERIFIED.          
SUCCESS
INFO: Executing: flashrom -w /tmp/fwupdater.XAo3Eo -p ec    
flashrom v0.9.9  : 2c7bca0 : Dec 09 2018 06:59:50 UTC on Linux 4.4.167-15564-gb6a09eb6c05a (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7aabf000.
WARNING: SPI Configuration Lockdown activated.
Skip jumping to RO
Erasing and writing flash chip... Verifying flash... VERIFIED.          
SUCCESS
>> DONE: Firmware updater exits successfully.


***reboot the EC***

~
☻  dssh 100.90.31.22
SSHin' to DUT @ 100.90.31.22...
localhost ~ # crossystem fwid
Google_Nocturne.10984.69.0localhost ~ # 
localhost ~ # ls -l /dev/kvm
crw-rw----. 1 root crosvm 10, 232 Jan  9 13:51 /dev/kvm
localhost ~ # cat /etc/lsb-release 
CHROMEOS_RELEASE_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_BOARD_APPID={BD7F7139-CC18-49C1-A847-33F155CCBCA8}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=5187506
CHROMEOS_ARC_ANDROID_SDK_VERSION=28
GOOGLE_RELEASE=11425.0.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=nocturne-release/R73-11425.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=11425
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=73
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=11425.0.0 (Official Build) dev-channel nocturne test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=nocturne
CHROMEOS_RELEASE_VERSION=11425.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
localhost ~ # 

I tried rolling forward to R73-11552.0.0, and found that /dev/kvm was missing. This was with firmware 10984.69.0. I then tried rolling the firmware back to 10984.67.0 and /dev/kvm was still missing.

Therefore, I think it's something within the image and not necessarily the firmware.

Here's the kernel command line on this test image:

cros_secure console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4710400 verity payload=PARTUUID=f37bb06d-605c-134f-801c-b81b548140c4/PARTNROFF=1 hashtree=PARTUUID=f37bb06d-605c-134f-801c-b81b548140c4/PARTNROFF=1 hashstart=4710400 alg=sha1 root_hexdigest=35bd93152901f9a1c2573fcd94be49a4d345c415 salt=9c3eb737e5e26e66a2d72e16524727a854b4a6fc5d0f7cca9e20b52fabd7db17" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=f37bb06d-605c-134f-801c-b81b548140c4 add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic snd_soc_dmic.modeswitch_delay_ms=35 

Since nocturne is a 4.4 device it needs a kernel command line option of the form "disablevmx=off".  I don't see that in command line from comment 36.
Oh, that's a good tip chirantan@, I think I know what happened then. The poppy baseboard defines that in build_kernel_image.sh, but when intel added the change to add that new snd_soc_dmic.modeswitch_delay_ms parameter, they defined it in nocturne, and therefore we lost that?

Attached is a dmesg from my DUT as well.
dmesg.log
56.4 KB View Download
drinkcat@, yep :)

bleung@ has since reverted that here: https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1404356

So maybe the next canary which has that change will just work.
Yes, it was almost certainly caused by https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1381297/


That patch is being reverted in https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1404356 because apparently the name of the option is wrong.


So this will likely end up being fixed but for unrelated reasons.
Great investigation aaboagye.

It looks like there was a kernel command line modification landed that overrode the base board's modify_kernel_command_line function, so it is no longer told to enable VMs.

Yeah, that one drinkcat.

It lost a bunch of other args too. good thing this one had a test!

I'll copy the rest from the baseboard I guess, I don't want to disturbe atlas that shares the krabbylake parent.
So the other question is why did we mark the nocturne builder experimental instead of reverting the bad change in the first place?
I think we didn't figure out this was the bad change until now ,-(
Project Member

Comment 45 by bugdroid1@chromium.org, Jan 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/852b89749fdb2618c2cf357d1d81f9b64cb805d7

commit 852b89749fdb2618c2cf357d1d81f9b64cb805d7
Author: Benson Leung <bleung@google.com>
Date: Thu Jan 10 01:49:31 2019

Revert "nocturne: set modeswitch_delay_ms=35 in kernel commandline"

This reverts commit 27e2929e4e86ef05e8f0ac5b64cce063baa1876a.

Reason for revert: Two problems.
1. This change landed without updating the whitelist in signer
   (see chromium:920408), so this broke signing recovery images for
   3 weeks.
2. modeswitch_delay_ms doesn't seem to be the right module param name.
   The one introduced in https://patchwork.kernel.org/patch/10702955/
   is "modeswitch_delay" so I'm convinced this change does nothing.

Original change's description:
> nocturne: set modeswitch_delay_ms=35 in kernel commandline
>
> Set snd_soc_dmic.modeswitch_delay_ms=35 in kernel command line to enable
> a delay between DMIC mode change. This is to enable the new module param
> introduced in patch https://patchwork.kernel.org/patch/10702955/
>
> BUG=b:117679624
> TEST=Verify DMIC capture on nocturne after cold reboot
>
> Change-Id: Ia4d40a92c497152f63fe22de1328a6bbe6cf2646
> Signed-off-by: Jenny TC <jenny.tc@intel.com>
> Reviewed-on: https://chromium-review.googlesource.com/1381297
> Commit-Ready: Nick Vaccaro <nvaccaro@google.com>
> Tested-by: Nick Vaccaro <nvaccaro@google.com>
> Reviewed-by: Nick Vaccaro <nvaccaro@google.com>
> Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>

Signed-off-by: Benson Leung <bleung@chromium.org>

Bug=b:117679624, chromium:920408 , chromium:917053 
TEST=SignerTest passes on nocturne canary (M-73).

Change-Id: I4bb6b64919992a8800fa89367166cdc5dcfa11d0
Reviewed-on: https://chromium-review.googlesource.com/c/1404356
Commit-Queue: Nicolas Boichat <drinkcat@chromium.org>
Tested-by: Nicolas Boichat <drinkcat@chromium.org>
Trybot-Ready: Nicolas Boichat <drinkcat@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-by: Bob Moragues <moragues@chromium.org>

[delete] https://crrev.com/51648c27ccfc458b5b53c9b0d1824f05b7a59180/overlay-nocturne/scripts/build_kernel_image.sh

Sign in to add a comment