New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 822795 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Moblab VM is failing

Project Member Reported by haddowk@chromium.org, Mar 16 2018

Issue description


Example build failure

https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1073


From the moblab part of the logs it is a provision failure

03/16 08:57:57.372 ERROR|             utils:0282| [stderr] Autotest instance created: localhost
03/16 08:57:57.373 ERROR|             utils:0282| [stderr] 03-16-2018 [06:57:57] Submitted create_suite_job rpc
03/16 08:58:02.650 ERROR|             utils:0282| [stderr] 03-16-2018 [06:58:02] Created suite job: http://localhost/afe/#tab_id=view_job&object_id=1
03/16 08:58:02.651 ERROR|             utils:0282| [stderr] @@@STEP_LINK@Link to suite@http://localhost/afe/#tab_id=view_job&object_id=1@@@
03/16 09:04:44.640 ERROR|             utils:0282| [stderr] 03-16-2018 [07:04:44] Suite job is finished.
03/16 09:04:44.642 ERROR|             utils:0282| [stderr] 03-16-2018 [07:04:44] Start collecting test results and dump them to json.
03/16 09:04:44.898 ERROR|             utils:0282| [stderr] Suite job   [ PASSED ]
03/16 09:04:44.899 ERROR|             utils:0282| [stderr] provision   [ FAILED ]
03/16 09:04:44.900 ERROR|             utils:0282| [stderr] provision     FAIL: Unhandled AutoservRunError: command execution error
03/16 09:04:44.901 ERROR|             utils:0282| [stderr] 
03/16 09:04:44.902 ERROR|             utils:0282| [stderr] Suite timings:
03/16 09:04:44.904 ERROR|             utils:0282| [stderr] Downloads started at 2018-03-16 06:57:57
03/16 09:04:44.905 ERROR|             utils:0282| [stderr] Payload downloads ended at 2018-03-16 06:58:02
03/16 09:04:44.906 ERROR|             utils:0282| [stderr] Suite started at 2018-03-16 06:58:07
03/16 09:04:44.908 ERROR|             utils:0282| [stderr] Artifact downloads ended (at latest) at 2018-03-16 06:58:07
03/16 09:04:44.909 ERROR|             utils:0282| [stderr] Testing started at 2018-03-16 06:58:17
03/16 09:04:44.910 ERROR|             utils:0282| [stderr] Testing ended at 2018-03-16 07:03:06

Digging further into the provision logs

   --omaha_url=http://192.168.231.1:8080/update/moblab-generic-vm-
    paladin/R67-10494.0.0-rc2)\";fi; /usr/bin/update_engine_client --update
    --omaha_url=http://192.168.231.1:8080/update/moblab-generic-vm-
    paladin/R67-10494.0.0-rc2"
Exit status: 1
Duration: 219.780042171

stderr:
[0316/065903:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[0316/065903:INFO:update_engine_client.cc(488)] Initiating update check and install.
[0316/065903:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[0316/070243:ERROR:update_engine_client.cc(232)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kPostinstallRunnerError(5)
0

Digging into the auto_update logs

et boot target to /dev/sda5: Partition 5, Slot B
SetImage
KERNEL_CONFIG: console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4096000 verity payload=PARTUUID=%U/PARTNROFF=1 hashtree=PARTUUID=%U/PARTNROFF=1 hashstart=4096000 alg=sha1 root_hexdigest=133ce6429ff71637dffbba4cceb0280e84b9ab67 salt=a663aa4ba42806ac71c28539702ade940ec14e324125359da263f54faf8355c9" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=%U add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic 
Setting up verity.
Finished after 14 seconds.
Clearing network driver boot cache: /var/lib/preload-network-drivers.
Syncing filesystems before changing boot order...
Finished after 0 seconds.
Updating Partition Table Attributes using CgptManager...
Updated kernel 4 with Successful = 0 and NumTriesLeft = 6
Checking /mnt/stateful_partition/unencrypted permission.
Permission is ok.
Unlinked file /var/lib/ureadahead/pack
ChromeosChrootPostinst complete
Syncing filesystem at end of postinst...
Command: /bin/mkdir -p /tmp/boot_mnt
Finished after 0 seconds.
Command: /bin/mount /dev/sda12 /tmp/boot_mnt
Finished after 0 seconds.
Running LegacyPostInstall
Command: cp -nR '//boot/syslinux' '/tmp/boot_mnt'
Finished after 0 seconds.
CopyFile failed to open //boot/vmlinuz
Legacy PostInstall failed.
Command: /bin/umount /dev/sda12
Finished after 0 seconds.

[0316/070243:ERROR:postinstall_runner_action.cc(299)] Postinst command failed with code: 1
[0316/070243:ERROR:postinstall_runner_action.cc(343)] Postinstall action failed.
[0316/070243:INFO:prefs.cc(122)] delta-update-failures not present in /var/lib/update_engine/prefs
[0316/070243:INFO:action_processor.cc(116)] ActionProcessor: finished PostinstallRunnerAction with code ErrorCode::kPostinstallRunnerError
[0316/070243:INFO:action_processor.cc(121)] ActionProcessor: Aborting processing due to failure.
[0316/070243:INFO:update_attempter.cc(873)] Processing Done.
[0316/070243:INFO:cpu_limiter.cc(71)] Setting cgroup cpu shares to  1024
[0316/070243:INFO:cpu_limiter.cc(79)] CPU shares = 1024
[0316/070243:ERROR:update_attempter.cc(1278)] Update failed.
[0316/070243:INFO:payload_state.cc(248)] Updating payload state for error code: 5 (ErrorCode::kPostinstallRunnerError)
 
Cc: pprabhu@chromium.org
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
I can only take a look after Tuesday, so please find an owner / mark it experimental and wait for me to be free.

Assigning to deputy as this is P0 for CQ.

Comment 3 by pwang@chromium.org, Mar 16 2018

Cc: pwang@chromium.org ihf@chromium.org
Labels: Hotlist-Deputy

Comment 5 Deleted

I took a look and haven't made what would be called "progress", but some findings anyway.

The error is that /boot/vmlinuz is missing (obviously).

https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_postinst.cc?type=cs&q=%22Legacy+PostInstall%22&l=509
https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_legacy.cc?type=cs&q=RunLegacyUBootPostInstall&l=44
https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_legacy.cc?type=cs&q=RunLegacyUBootPostInstall&l=17

I downloaded the workspace and found no /boot/vmlinuz on the DUT (or the moblab).

Just for fun, I downloaded the most recent green betty and moblabvm images to set up my own workspace.  It fails.

It fails because the moblab image.zip was missing boot_images/vmlinuz  After hard linking vmlinuz.bin to boot_image/vmlinuz I was able to create the workspace.

I don't know if this is related, but it seems suspicious.

Anyway, there's no /boot/vmlinuz on the DUT in my new workspace either, so much for that.  Perhaps we're not supposed to be taking the kBiosTypeLegacy in the postinstall program in the first place.
Notes, postint called like

/postinst / /dev/sda3
postinst call is

 /postinst /dev/sda3

That calls cros_installer

 chroot . /usr/bin/cros_installer postinst "/" "$@"

https://cs.corp.google.com/chromeos_public/src/platform2/installer/cros_installer_main.cc?q=cros_installer&sq=package:%5Echromeos_public$&dr=C&l=19

cros_installer postinst <install_dev> <mount_point> [ args ]\n
That usage string sucks

    string install_dir = argv[optind++];
    string install_dev = argv[optind++];

So install_dir is / and install_dev is /dev/sda3
I do see the slotted kernels /syslinux/vmlinuz.{A,B} that are supposed to be created by copying /boot/vmlinuz in both the failed builds and my workspace
Actually WTF, these are EFI images, not legacy BIOS, right?  They shouldn't be going through this branch then.
Actually I'm rambling, since EFI uses the same logic and I'm not sure what controls EFI vs legacy anyway.  Basically, /boot/vmlinuz is missing and I made no progress.
tl;dr I spoke with Don and this is 99% caused by https://crrev.com/c/657262
Revert? This seems like an egrigious failure that the CL author needs to deal with sadly.
Project Member

Comment 15 by bugdroid1@chromium.org, Mar 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosutils/+/4409319b9bae3a5f6ea7f68859a04c58adffad1c

commit 4409319b9bae3a5f6ea7f68859a04c58adffad1c
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Mar 17 06:21:24 2018

Revert "Remove bulky vmlinu[zx] from rootfs."

This reverts commit 9e730a112ca18e0c1be1c79fc1bfd7bdca8314f6.

This is suspected to have broken moblab-generic-vm-paladin

BUG= chromium:461595 ,  chromium:822795 

Change-Id: I663611d830d894a62f2f77487a20d2da1ad10992
Reviewed-on: https://chromium-review.googlesource.com/966696
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/update_bootloaders.sh
[modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/build_library/base_image_util.sh
[modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/bin/cros_make_image_bootable

Comment 16 by ihf@chromium.org, Mar 17 2018

Cc: cra...@chromium.org
Labels: -Pri-0 Pri-2
Owner: cra...@chromium.org
Yep revert in #15 fixed it: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1081

Re-purposing bug for Craig to figure out the path ahead for re-landing their CL.
Labels: -Hotlist-Deputy

Comment 19 by sjg@chromium.org, Mar 20 2018

Pathhmesh, re #14, what do you mean by egregious failure? This went through the commit queue, right?
Oh, I didn't at all mean that the CL was egregiously bad. Just that the failure was so bad that I was unlikely to be able to quickly figure out what needs to be done to make moblab-generic-vm not hate this CL (I try to do that for changes not specifically targeted at this board that end up affecting it, because I'm the "maintainer" for the board). 

In this case, I was likely to need the CL authors help, so revert first.

Sorry that it sounded like I was blaming the CL / author.
About CL passing the CQ -- the affected board was marked experimental when the CL landed due to other issues. haddowk@ has fixed all those and marked it green again.

Let me know if you want my help to figure out how / if we can migrate this board off the old removed kernel bits.
Thanks for your help, pprabhu@.  Rolling back was definitely the right thing to do.  I believe the path forward is to wrap the "remove vmlinuz" logic in a USE flag and set it for all except for a small set of boards.  Sound reasonable?

Do you have any idea how to find all broken boards before submitting?

I haven't dug into this too much, but I think #6 and related comments are onto it. The problem is that the VM flow goes via legacy bios, which needs the kernel image to be in rootfs. I remember hitting this a few years ago when working on mobbuild.

It's not really a question of the installed packages (i.e. USE flag related), but of which bios we're using (UEFI vs legacy).

Most importantly, it seems to me that moblab-generic-vm and betty VMTest should behave similarly wrt this bug. And we should fix them similarly. Likely, the fix is not a USE flag change, but updated logic in the build scripts (that you modified) to only drop the kernel from rootfs if we know that it won't be needed for boot.

If you go down that path, you may have to additionally copy the kernel back into the rootfs in image_to_vm.sh, if the VM boots via legacy even though base image boots via EFI.

I'm rambling here, so take with a *big* grain of salt.
Thanks, that's useful context.  Do you have any thoughts on how to determine if a given board has a legacy bios or uses the boot partition?

Good point about image_to_vm, that's almost certainly necessary.

I've got an updated changelist that remove /boot/vmlinuz by default but allows boards to set a USE flag to disable this feature.  It's not at all clear to me which boards were affected and should have this feature disabled.

Is there anyone out there who can lend a hand in figuring out which boards to disable this feature on?
Status: Fixed (was: Assigned)

Sign in to add a comment