Moblab VM is failing |
||||||||
Issue descriptionExample build failure https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1073 From the moblab part of the logs it is a provision failure 03/16 08:57:57.372 ERROR| utils:0282| [stderr] Autotest instance created: localhost 03/16 08:57:57.373 ERROR| utils:0282| [stderr] 03-16-2018 [06:57:57] Submitted create_suite_job rpc 03/16 08:58:02.650 ERROR| utils:0282| [stderr] 03-16-2018 [06:58:02] Created suite job: http://localhost/afe/#tab_id=view_job&object_id=1 03/16 08:58:02.651 ERROR| utils:0282| [stderr] @@@STEP_LINK@Link to suite@http://localhost/afe/#tab_id=view_job&object_id=1@@@ 03/16 09:04:44.640 ERROR| utils:0282| [stderr] 03-16-2018 [07:04:44] Suite job is finished. 03/16 09:04:44.642 ERROR| utils:0282| [stderr] 03-16-2018 [07:04:44] Start collecting test results and dump them to json. 03/16 09:04:44.898 ERROR| utils:0282| [stderr] Suite job [ PASSED ] 03/16 09:04:44.899 ERROR| utils:0282| [stderr] provision [ FAILED ] 03/16 09:04:44.900 ERROR| utils:0282| [stderr] provision FAIL: Unhandled AutoservRunError: command execution error 03/16 09:04:44.901 ERROR| utils:0282| [stderr] 03/16 09:04:44.902 ERROR| utils:0282| [stderr] Suite timings: 03/16 09:04:44.904 ERROR| utils:0282| [stderr] Downloads started at 2018-03-16 06:57:57 03/16 09:04:44.905 ERROR| utils:0282| [stderr] Payload downloads ended at 2018-03-16 06:58:02 03/16 09:04:44.906 ERROR| utils:0282| [stderr] Suite started at 2018-03-16 06:58:07 03/16 09:04:44.908 ERROR| utils:0282| [stderr] Artifact downloads ended (at latest) at 2018-03-16 06:58:07 03/16 09:04:44.909 ERROR| utils:0282| [stderr] Testing started at 2018-03-16 06:58:17 03/16 09:04:44.910 ERROR| utils:0282| [stderr] Testing ended at 2018-03-16 07:03:06 Digging further into the provision logs --omaha_url=http://192.168.231.1:8080/update/moblab-generic-vm- paladin/R67-10494.0.0-rc2)\";fi; /usr/bin/update_engine_client --update --omaha_url=http://192.168.231.1:8080/update/moblab-generic-vm- paladin/R67-10494.0.0-rc2" Exit status: 1 Duration: 219.780042171 stderr: [0316/065903:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate. [0316/065903:INFO:update_engine_client.cc(488)] Initiating update check and install. [0316/065903:INFO:update_engine_client.cc(517)] Waiting for update to complete. [0316/070243:ERROR:update_engine_client.cc(232)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kPostinstallRunnerError(5) 0 Digging into the auto_update logs et boot target to /dev/sda5: Partition 5, Slot B SetImage KERNEL_CONFIG: console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 4096000 verity payload=PARTUUID=%U/PARTNROFF=1 hashtree=PARTUUID=%U/PARTNROFF=1 hashstart=4096000 alg=sha1 root_hexdigest=133ce6429ff71637dffbba4cceb0280e84b9ab67 salt=a663aa4ba42806ac71c28539702ade940ec14e324125359da263f54faf8355c9" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=%U add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic Setting up verity. Finished after 14 seconds. Clearing network driver boot cache: /var/lib/preload-network-drivers. Syncing filesystems before changing boot order... Finished after 0 seconds. Updating Partition Table Attributes using CgptManager... Updated kernel 4 with Successful = 0 and NumTriesLeft = 6 Checking /mnt/stateful_partition/unencrypted permission. Permission is ok. Unlinked file /var/lib/ureadahead/pack ChromeosChrootPostinst complete Syncing filesystem at end of postinst... Command: /bin/mkdir -p /tmp/boot_mnt Finished after 0 seconds. Command: /bin/mount /dev/sda12 /tmp/boot_mnt Finished after 0 seconds. Running LegacyPostInstall Command: cp -nR '//boot/syslinux' '/tmp/boot_mnt' Finished after 0 seconds. CopyFile failed to open //boot/vmlinuz Legacy PostInstall failed. Command: /bin/umount /dev/sda12 Finished after 0 seconds. [0316/070243:ERROR:postinstall_runner_action.cc(299)] Postinst command failed with code: 1 [0316/070243:ERROR:postinstall_runner_action.cc(343)] Postinstall action failed. [0316/070243:INFO:prefs.cc(122)] delta-update-failures not present in /var/lib/update_engine/prefs [0316/070243:INFO:action_processor.cc(116)] ActionProcessor: finished PostinstallRunnerAction with code ErrorCode::kPostinstallRunnerError [0316/070243:INFO:action_processor.cc(121)] ActionProcessor: Aborting processing due to failure. [0316/070243:INFO:update_attempter.cc(873)] Processing Done. [0316/070243:INFO:cpu_limiter.cc(71)] Setting cgroup cpu shares to 1024 [0316/070243:INFO:cpu_limiter.cc(79)] CPU shares = 1024 [0316/070243:ERROR:update_attempter.cc(1278)] Update failed. [0316/070243:INFO:payload_state.cc(248)] Updating payload state for error code: 5 (ErrorCode::kPostinstallRunnerError)
,
Mar 16 2018
I can only take a look after Tuesday, so please find an owner / mark it experimental and wait for me to be free. Assigning to deputy as this is P0 for CQ.
,
Mar 16 2018
,
Mar 16 2018
,
Mar 16 2018
I took a look and haven't made what would be called "progress", but some findings anyway. The error is that /boot/vmlinuz is missing (obviously). https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_postinst.cc?type=cs&q=%22Legacy+PostInstall%22&l=509 https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_legacy.cc?type=cs&q=RunLegacyUBootPostInstall&l=44 https://cs.corp.google.com/chromeos_public/src/platform2/installer/chromeos_legacy.cc?type=cs&q=RunLegacyUBootPostInstall&l=17 I downloaded the workspace and found no /boot/vmlinuz on the DUT (or the moblab). Just for fun, I downloaded the most recent green betty and moblabvm images to set up my own workspace. It fails. It fails because the moblab image.zip was missing boot_images/vmlinuz After hard linking vmlinuz.bin to boot_image/vmlinuz I was able to create the workspace. I don't know if this is related, but it seems suspicious. Anyway, there's no /boot/vmlinuz on the DUT in my new workspace either, so much for that. Perhaps we're not supposed to be taking the kBiosTypeLegacy in the postinstall program in the first place.
,
Mar 17 2018
Notes, postint called like /postinst / /dev/sda3
,
Mar 17 2018
postinst call is /postinst /dev/sda3 That calls cros_installer chroot . /usr/bin/cros_installer postinst "/" "$@" https://cs.corp.google.com/chromeos_public/src/platform2/installer/cros_installer_main.cc?q=cros_installer&sq=package:%5Echromeos_public$&dr=C&l=19 cros_installer postinst <install_dev> <mount_point> [ args ]\n
,
Mar 17 2018
That usage string sucks
string install_dir = argv[optind++];
string install_dev = argv[optind++];
So install_dir is / and install_dev is /dev/sda3
,
Mar 17 2018
I do see the slotted kernels /syslinux/vmlinuz.{A,B} that are supposed to be created by copying /boot/vmlinuz in both the failed builds and my workspace
,
Mar 17 2018
Actually WTF, these are EFI images, not legacy BIOS, right? They shouldn't be going through this branch then.
,
Mar 17 2018
Actually I'm rambling, since EFI uses the same logic and I'm not sure what controls EFI vs legacy anyway. Basically, /boot/vmlinuz is missing and I made no progress.
,
Mar 17 2018
tl;dr I spoke with Don and this is 99% caused by https://crrev.com/c/657262
,
Mar 17 2018
Revert? This seems like an egrigious failure that the CL author needs to deal with sadly.
,
Mar 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/crosutils/+/4409319b9bae3a5f6ea7f68859a04c58adffad1c commit 4409319b9bae3a5f6ea7f68859a04c58adffad1c Author: Ilja H. Friedel <ihf@chromium.org> Date: Sat Mar 17 06:21:24 2018 Revert "Remove bulky vmlinu[zx] from rootfs." This reverts commit 9e730a112ca18e0c1be1c79fc1bfd7bdca8314f6. This is suspected to have broken moblab-generic-vm-paladin BUG= chromium:461595 , chromium:822795 Change-Id: I663611d830d894a62f2f77487a20d2da1ad10992 Reviewed-on: https://chromium-review.googlesource.com/966696 Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> [modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/update_bootloaders.sh [modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/build_library/base_image_util.sh [modify] https://crrev.com/4409319b9bae3a5f6ea7f68859a04c58adffad1c/bin/cros_make_image_bootable
,
Mar 17 2018
,
Mar 17 2018
Yep revert in #15 fixed it: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1081 Re-purposing bug for Craig to figure out the path ahead for re-landing their CL.
,
Mar 17 2018
,
Mar 20 2018
Pathhmesh, re #14, what do you mean by egregious failure? This went through the commit queue, right?
,
Mar 20 2018
Oh, I didn't at all mean that the CL was egregiously bad. Just that the failure was so bad that I was unlikely to be able to quickly figure out what needs to be done to make moblab-generic-vm not hate this CL (I try to do that for changes not specifically targeted at this board that end up affecting it, because I'm the "maintainer" for the board). In this case, I was likely to need the CL authors help, so revert first. Sorry that it sounded like I was blaming the CL / author.
,
Mar 20 2018
About CL passing the CQ -- the affected board was marked experimental when the CL landed due to other issues. haddowk@ has fixed all those and marked it green again. Let me know if you want my help to figure out how / if we can migrate this board off the old removed kernel bits.
,
Mar 21 2018
Thanks for your help, pprabhu@. Rolling back was definitely the right thing to do. I believe the path forward is to wrap the "remove vmlinuz" logic in a USE flag and set it for all except for a small set of boards. Sound reasonable? Do you have any idea how to find all broken boards before submitting?
,
Mar 21 2018
I haven't dug into this too much, but I think #6 and related comments are onto it. The problem is that the VM flow goes via legacy bios, which needs the kernel image to be in rootfs. I remember hitting this a few years ago when working on mobbuild. It's not really a question of the installed packages (i.e. USE flag related), but of which bios we're using (UEFI vs legacy). Most importantly, it seems to me that moblab-generic-vm and betty VMTest should behave similarly wrt this bug. And we should fix them similarly. Likely, the fix is not a USE flag change, but updated logic in the build scripts (that you modified) to only drop the kernel from rootfs if we know that it won't be needed for boot. If you go down that path, you may have to additionally copy the kernel back into the rootfs in image_to_vm.sh, if the VM boots via legacy even though base image boots via EFI. I'm rambling here, so take with a *big* grain of salt.
,
Mar 21 2018
Thanks, that's useful context. Do you have any thoughts on how to determine if a given board has a legacy bios or uses the boot partition? Good point about image_to_vm, that's almost certainly necessary.
,
Mar 26 2018
I've got an updated changelist that remove /boot/vmlinuz by default but allows boards to set a USE flag to disable this feature. It's not at all clear to me which boards were affected and should have this feature disabled. Is there anyone out there who can lend a hand in figuring out which boards to disable this feature on?
,
Mar 30 2018
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by haddowk@chromium.org
, Mar 16 2018