amd64-generic-paladin broken by TastVMTest and UploadTestArtifacts running at same time
Issue description
Error Message: mount: /tmp/tmpYjduXB/dir-1: wrong fs type, bad option, bad superblock on /dev/loop2, missing codepage or helper program, or other error.
Error Log: https://luci-logdog.appspot.com/logs/chromiumos/bb/chromiumos/amd64-generic-paladin/33962/+/recipes/steps/UploadTestArtifacts/0/stdout#L10228_27
GE First Instance: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932899237996629792
It's the same error for all 3 of the previous runs.
,
Oct 12
There have been a lot of recent changes to payload related code, so cc'ing a few more people.
,
Oct 12
I'm going to re-instance the builder once the current CQ run finishes. I'm very skeptical that it will fix anything, but I don't have a lot of better ideas.
,
Oct 12
Re-instancing finished, we'll see next run if that fixed it...
,
Oct 12
From the errors, it seems that the stateful partition of the test image is corrupted and cannot be mounted. That's all I can see at this point.
,
Oct 12
Builder was marked experimental, so bumping this down to P1. We still need to find a resolution for this ASAP though.
,
Oct 12
,
Oct 12
at least for the chromite cleanups, i would have expected the paygen code to break everywhere or nowhere, not just for one bot
,
Oct 12
Same, but an external bot is a little special.
,
Oct 12
re #8: I think the chromite paygen stage is skipped on everything except release builders, so those changes should not matter here. I downloaded the paladin image and I cannot mount the stateful partition, per #5.
,
Oct 12
+other sheriffs. Re #10: does that mean that re-instancing the bot isn't going to fix it, and the image is indeed corrupt?
,
Oct 12
I just tried it again and it mounted. So again I'm confused! It seems the mount parameters are fine. There might be something on the builder that is causing this! I don't really know.
,
Oct 12
The reinstancing did not help, the problem reproduced again.
,
Oct 12
I'll start poking around again. Anyone have suggestions? This seems like a very perplexing issue. Because of the re-instance it seems improbable that it's an infra issue, but it also seems improbable that it's a CL...
,
Oct 16
The issue showed up in CQ from time to time. Also refer to https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?builderName=amd64-generic-paladin&buildNumber=33992 Thanks.
,
Oct 16
Trying to manually mount the image failed with the same error. I am very stumped on this issue, does anyone have pointers on where to go / what to look into? @josephsih what do you mean from time to time? This has been consistently failing amd64-generic-paladin since Friday, no?
,
Oct 16
> Trying to manually mount the image failed with the same error. I am very stumped on this issue, does anyone have pointers on where to go / what to look into? We need to determine if this is an infra failure or not. Copy the chromiumos_test_image.bin locally and attempt to mount it. If that fails to mount, it's not a infra issue because the build process is outputting an invalid disk image and needs to be routed to a sheriff to find a Platform owner.
,
Oct 16
FYI, I've downloaded and mounted the image, and it worked perfectly fine for me! I think it might be a problem with infra for this particular builder. (I extracted chromiumos_test_image.bin from image.zip, though.)
,
Oct 16
@ahassani how did you download the image locally? (I'm guessing there is a GCE bucket somewhere I don't know about). That's super confusing though, because I re-imaged the entire bot and it still failed the same way.
,
Oct 16
You have an SSH connection to the bot: get it from there to more directly assert that what is used at this stage is the same thing you would be testing locally.
,
Oct 16
Yep, tried that between builds. Mounting it on the bot failed with the same error. I grabbed a copy and put it into my home dir on the bot too. I'm curious how to pull it onto my workstation though, I assume these artifacts are published somewhere?
,
Oct 16
Well, looks like this works too: `gcloud compute --project "chromeos-bot" scp --zone "us-central1-b" "cros-beefy468-c2":/home/athilenius/* ~/tmp` It will mount on my workstation, but only without the `ro` flag.
,
Oct 16
Okay, so the kernel state is somehow corrupted on the bot. What's in the kernel logs? What happens if you use another loop device?
,
Oct 16
Because the bot was re-imaged, I assume that means the build itself is somehow corrupting the kernel? dmesg shows several of these:
[ 8369.251588] EXT4-fs (loop0): Couldn't mount because of unsupported optional features (10000)
and on trying to mount again I got it to spit out:
[ 8611.196954] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.202895] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.208772] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.216099] FAT-fs (loop0): invalid media value (0xb9)
None of this makes much sense though. Note that the first error might also be a warning (as it's scanning for an EXT2/EXT3 FS first). How do I use a different device to mount a loopback?
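One way to answer the loop-device question, as a rough sketch rather than anything taken from chromite: `losetup` accepts an explicit loop device, so the image can be attached to a device other than the one that has been failing. The helper below only builds the command lines (they need root to run). The function name, the `/dev/loop5` device, and the paths are hypothetical; the offset has to be the byte offset of the stateful partition inside the image.

```python
import shlex


def loop_mount_commands(image, offset, loop_dev, mount_dir, read_only=True):
    """Return argv lists to attach, mount, unmount and detach an image.

    Nothing is executed here. The caller runs the commands (as root) and
    should run the last two only after it is done with the mount.
    """
    attach = ["losetup"]
    if read_only:
        attach.append("--read-only")
    # Explicit loop device instead of letting losetup pick the first free one.
    attach += ["--offset", str(offset), loop_dev, image]

    mount = ["mount", "-t", "ext4", "-o", "ro" if read_only else "rw",
             loop_dev, mount_dir]
    unmount = ["umount", mount_dir]
    detach = ["losetup", "--detach", loop_dev]
    return [attach, mount, unmount, detach]


if __name__ == "__main__":
    # Hypothetical device and paths, for illustration only.
    for argv in loop_mount_commands("chromiumos_test_image.bin", 2311061504,
                                    "/dev/loop5", "/tmp/stateful"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping `read_only=True` for the first attempt avoids changing the image while it is still being investigated.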
,
Oct 16
,
Oct 16
Fwiw, the amd64-generic-chromium-pfq builder is green, and I believe it should be identical to this builder, so I think there's something wrong with this specific builder. https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=amd64-generic-chromium-pfq&buildBranch=master
,
Oct 16
Just a note: if you mount the image RW, the mount can modify its contents, including partition/FS metadata.
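A quick way to check whether some step modified the image in place is to hash the file before and after. This is a minimal sketch; `file_digest` and `changed_during` are names made up for illustration, not existing chromite helpers.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a (possibly multi-GB) file without loading it all into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_during(path, action, algorithm="sha256"):
    """Run action() and report whether the file's digest changed."""
    before = file_digest(path, algorithm)
    action()
    return file_digest(path, algorithm) != before
```

Here `action` would be whatever is suspected of touching the image, for example a callable that mounts and unmounts it.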
,
Oct 16
Update on this bug. lannm and I have been looking at this for the last few hours, and there is lots of weird stuff going on.

Mounting the pristine image (that was built on cros-beefy468-c2) as RO fails, with a dmesg log
```
[123947.733312] EXT4-fs (loop1): INFO: recovery required on readonly filesystem
[123947.733314] EXT4-fs (loop1): write access unavailable, cannot proceed (try mounting with noload)
```
But mounting it RW will show in dmesg that the FS was recovered, and will successfully mount the image (and change its MD5 hash).

Other AMD64 builders (including the tryjob builder that ran this tryjob https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932496979626823440) seem to work just fine, as do all the other CQ builds.

Also, this is a failure in mounting the FS, not in the loopback block device; you can `losetup` the image (with the offset given in the logs [2311061504 bytes]) just fine. It's the actual mounting of the FS from that block device that fails.

There are minor variations in kernel version and a few other things, but nothing that looks promising. No CLs landed anywhere around when it started failing that look at all relevant.

@vapier, do you know if anything special happens with the amd64-generic builds? Does it use a different image layout, different disk creation tools... anything at all?
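The "recovery required" message suggests the filesystem was captured with an unreplayed journal, which a read-only mount cannot replay but a RW mount can. That can be checked without mounting anything by reading the ext4 superblock straight out of the image. The sketch below assumes the standard ext4 on-disk layout (superblock 1024 bytes into the partition, magic at 0x38, feature words at 0x5C) and the partition offset quoted above; the function names are made up for illustration.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # superblock starts 1024 bytes into the partition
EXT4_MAGIC = 0xEF53
INCOMPAT_RECOVER = 0x0004  # "needs_recovery": journal has not been replayed


def parse_superblock(sb):
    """Extract the feature words from raw superblock bytes."""
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock (magic %#x)" % magic)
    compat, incompat, ro_compat = struct.unpack_from("<III", sb, 0x5C)
    return {
        "compat": compat,
        "incompat": incompat,
        "ro_compat": ro_compat,
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
    }


def read_superblock(image_path, partition_offset):
    """Read the superblock of the partition at partition_offset bytes."""
    with open(image_path, "rb") as f:
        f.seek(partition_offset + SUPERBLOCK_OFFSET)
        return parse_superblock(f.read(1024))
```

For the image discussed here, a call would look like `read_superblock("chromiumos_test_image.bin", 2311061504)`; a true `needs_recovery` would match the read-only mount failure.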
,
Oct 16
Note: in the above I'm talking about mounting on our workstations, we cannot get the image to mount on cros-beefy468-c2
,
Oct 17
,
Oct 17
Leading theory right now is that e2fsprogs is building the EXT4 partition with the EXT4_FEATURE_INCOMPAT_ENCRYPT flag set, but the older kernel on the amd64-generic-paladin builder is unable to handle that flag and thus fails to mount it. We are going to try to upgrade the kernel, but are first verifying that we can do so without breaking things. This doesn't explain several things though: why this just started failing (or why it ever worked in the first place) and why it works on other builders with the same kernel.
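The `(10000)` in the earlier "unsupported optional features" dmesg line fits this theory: the kernel prints the unsupported incompat bits in hex, and 0x10000 is the encrypt bit. A small decoder makes that easy to check. The table below is a partial list written from memory of the kernel's ext4 headers, so treat the names as an assumption and verify against the kernel source before relying on them.

```python
# Partial table of ext4 INCOMPAT feature bits (assumed from kernel headers).
EXT4_INCOMPAT_FLAGS = {
    0x0001: "compression",
    0x0002: "filetype",
    0x0004: "needs_recovery",
    0x0008: "journal_dev",
    0x0010: "meta_bg",
    0x0040: "extents",
    0x0080: "64bit",
    0x0100: "mmp",
    0x0200: "flex_bg",
    0x0400: "ea_inode",
    0x1000: "dirdata",
    0x2000: "metadata_csum_seed",
    0x4000: "large_dir",
    0x8000: "inline_data",
    0x10000: "encrypt",
    0x20000: "casefold",
}


def decode_incompat(mask):
    """Turn an incompat feature mask into a list of feature names."""
    names = [name for bit, name in sorted(EXT4_INCOMPAT_FLAGS.items())
             if mask & bit]
    known = sum(EXT4_INCOMPAT_FLAGS)  # sum of the dict keys = all known bits
    unknown = mask & ~known
    if unknown:
        names.append("unknown(%#x)" % unknown)
    return names


if __name__ == "__main__":
    # The value is printed by the kernel in hex, without a 0x prefix.
    print(decode_incompat(int("10000", 16)))
```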
,
Oct 17
New theory: TastVMTest runs in parallel with UploadTestArtifacts. If TastVMTest is mounting the image file RW, it could enable ext4 encryption before the image is uploaded. I'm going to teach TastVMTest to copy the image before using it.
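The "copy the image before using it" idea can be sketched as a context manager that hands the VM a private copy and cleans it up afterwards. This is only an illustration of the approach, not the chromite change; `private_image_copy` is a hypothetical name.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def private_image_copy(image_path):
    """Yield the path of a throwaway copy of image_path.

    Anything that boots or mounts the copy RW cannot disturb a stage that
    is reading the original file at the same time.
    """
    tmp_dir = tempfile.mkdtemp(prefix="vm-image-")
    try:
        copy_path = os.path.join(tmp_dir, os.path.basename(image_path))
        shutil.copyfile(image_path, copy_path)
        yield copy_path
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A full copy costs disk space and time for a multi-GB image, which is the trade-off against a copy-on-write overlay.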
,
Oct 17
And TastVMTest was enabled on amd64-generic-paladin right when we started seeing failures: crrev.com/c/1232554 Reverting.
,
Oct 17
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/451dc721b571c5dfa791c593b1f46e85cf83f7b9

commit 451dc721b571c5dfa791c593b1f46e85cf83f7b9
Author: Lann Martin <lannm@chromium.org>
Date: Wed Oct 17 18:24:09 2018

chromeos_config: Disable Tast VM tests on paladins.

These appear to be mutating the test image in place, causing issues in parallel stages.

BUG=chromium:894820
TEST=chromeos_config_unittest; inspect config_dump.json

Change-Id: I3eddc92d5051f9989d40ed2a8a0704ba1b3a4da2
Reviewed-on: https://chromium-review.googlesource.com/1286480
Commit-Ready: Lann Martin <lannm@chromium.org>
Tested-by: Lann Martin <lannm@chromium.org>
Reviewed-by: Alec Thilenius <athilenius@google.com>

[modify] https://crrev.com/451dc721b571c5dfa791c593b1f46e85cf83f7b9/config/chromeos_config.py
[modify] https://crrev.com/451dc721b571c5dfa791c593b1f46e85cf83f7b9/config/config_dump.json
,
Oct 17
Back when I added TastVMTest, I based it very heavily on what VMTest was doing at the time. I'm guessing that VMTest must've changed in the meantime -- does it copy the image now before booting it? Alternately, does VMTest not run in parallel with UploadTestArtifacts? Achuith, would your https://crrev.com/c/1285111 help here once it's in?
,
Oct 17
My CL wouldn't as it is, but cros_run_vm_test now has a copy-on-write feature that may help: https://cs.corp.google.com/chromeos_public/chromite/scripts/cros_vm.py?l=534-538
,
Oct 17
If there's any parallelism with running VM tests, you will run into this kind of corruption.
,
Oct 17
Issue 779267 from last year is related. Then, we were having trouble with the (then-newly-added) TastVMTest stage running in parallel with VMTest, with both apparently modifying the same image file.

In http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932899237996629792, the first failed run mentioned here, I don't see VMTest running at all, but TastVMTest and UploadTestArtifacts (which failed) started simultaneously.

In an IM conversation, Lann wrote "in a tryjob I ran, VMTest and TastVMTest both ran, with TastVMTest waiting for VMTest to complete. so the upload was complete before tast ran. it looks like VMTest uses a different image file, but I haven't looked too closely. chromiumos_qemu_image.bin".

And yeah, I think (it's a bit hard to follow the code) that cbuildbot/stages/vm_test_stages.py is passing a path with constants.VM_IMAGE_BIN to cros_run_vm_test, while cbuildbot/stages/tast_test_stages.py is passing a path with constants.TEST_IMAGE_BIN to bin/cros_run_tast_vm_test. I think that https://crrev.com/c/1174925 might have been what updated VMTest to use the VM image instead of the test image.

Achuith's https://crrev.com/c/1285111 is in the CQ to update TastVMTest to use cros_run_vm_test, so after that's in, I'll try switching tast_test_stages.py to use VM_IMAGE_BIN. If that doesn't work, I'll experiment with cros_vm.py's --copy-on-write flag (mentioned in #36).

I'm still not really sure how to test whether this works or not, though. My tryjob runs of https://crrev.com/c/1232554 passed, and the pre-CQ submitted it without actually going through the CQ to make sure that it worked there (I believe this is a known limitation in chromite testing). After switching to VM_IMAGE_BIN, should I just turn the stage back on and hope for the best?
,
Oct 18
FYI: most recent CQ run just passed the UploadTestArtifacts stage.
,
Oct 19
No longer a P1 as this is no longer breaking the CQ.
,
Oct 19
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/4ec1257ee98882223e78c2e4c71da740539ddbcb

commit 4ec1257ee98882223e78c2e4c71da740539ddbcb
Author: Daniel Erat <derat@chromium.org>
Date: Fri Oct 19 20:19:59 2018

chromeos_config: Reenable Tast on VM paladin builders.

Reenable Tast on amd64-generic-paladin and betty-paladin, reverting 451dc721.

Also make cros_run_vm_test use chromiumos_qemu_image.bin rather than chromiumos_test_image.bin in an attempt to avoid having the TastVMTest stage step on UploadTestArtifacts's toes (the reason Tast was disabled on these builders).

BUG=chromium:894820
TEST=running tryjobs

Change-Id: I0e6a883b1142bdcb1a4b989db8336433d5b3a1a4
Reviewed-on: https://chromium-review.googlesource.com/c/1287811
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Lann Martin <lannm@chromium.org>
Commit-Queue: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/config/chromeos_config.py
[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/config/config_dump.json
,
Oct 20
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/9b771ac49bd273f73a2b48b31bcd23e9d510891b

commit 9b771ac49bd273f73a2b48b31bcd23e9d510891b
Author: Dan Erat <derat@chromium.org>
Date: Sat Oct 20 00:39:04 2018

Revert "chromeos_config: Reenable Tast on VM paladin builders."

This reverts commit 4ec1257ee98882223e78c2e4c71da740539ddbcb.

Reason for revert: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932197775125580624 failed due to missing chromiumos_qemu_image.bin file.

Original change's description:
> chromeos_config: Reenable Tast on VM paladin builders.
>
> Reenable Tast on amd64-generic-paladin and betty-paladin,
> reverting 451dc721.
>
> Also make cros_run_vm_test use chromiumos_qemu_image.bin
> rather than chromiumos_test_image.bin in an attempt to avoid
> having the TastVMTest stage step on UploadTestArtifacts's
> toes (the reason Tast was disabled on these builders).
>
> BUG= chromium:894820
> TEST=running tryjobs
>
> Change-Id: I0e6a883b1142bdcb1a4b989db8336433d5b3a1a4
> Reviewed-on: https://chromium-review.googlesource.com/c/1287811
> Tested-by: Dan Erat <derat@chromium.org>
> Trybot-Ready: Dan Erat <derat@chromium.org>
> Reviewed-by: Don Garrett <dgarrett@chromium.org>
> Reviewed-by: Lann Martin <lannm@chromium.org>
> Commit-Queue: Dan Erat <derat@chromium.org>

Bug: chromium:894820
Change-Id: I64829beb6208db6c0f6e67ecb5f83d9d33c5dbb6
Reviewed-on: https://chromium-review.googlesource.com/c/1292677
Reviewed-by: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/config/chromeos_config.py
[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/config/config_dump.json
,
Oct 21
Issue 885016 is tracking my continuing efforts to get TastVMTest running on amd64-generic-paladin and betty-paladin.
,
Oct 22
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/ea294acc019e0354fc0fca98dee2a7bf61d52f82

commit ea294acc019e0354fc0fca98dee2a7bf61d52f82
Author: Daniel Erat <derat@chromium.org>
Date: Mon Oct 22 02:25:23 2018

cbuildbot: Reland "tast_test_stages: Use cros_run_vm_test"

This reverts commit 785371775dd409fec5048260877191af68feeb59 to enable using cros_run_vm_test in the TastVMTest stage.

It passes the --copy-on-write flag to cros_run_vm_test, which passes it through to cros_vm. This will hopefully avoid "permission denied" errors when opening chromiumos_test_image.bin.

BUG=chromium:891928, chromium:894820
TEST=ran tryjobs

Change-Id: Id2e01becd58131806410976360dfda9a73a5261d
Reviewed-on: https://chromium-review.googlesource.com/c/1292680
Reviewed-by: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/ea294acc019e0354fc0fca98dee2a7bf61d52f82/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/ea294acc019e0354fc0fca98dee2a7bf61d52f82/cbuildbot/stages/tast_test_stages_unittest.py
,
Oct 23
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/29e179cc0d5fa03dfeec7adab0d5ceecd247834c

commit 29e179cc0d5fa03dfeec7adab0d5ceecd247834c
Author: Daniel Erat <derat@chromium.org>
Date: Tue Oct 23 20:39:23 2018

chromeos_config: Reenable Tast on VM paladin builders again.

Run the TastVMTest stage on amd64-generic-paladin and betty-paladin yet again. chromiumos_qemu_image.bin doesn't appear to exist when the stage runs, so we're still using chromium_test_image.bin, but I'm hopeful that using cros_vm's --copy-on-write flag will prevent conflicts with the UploadTestArtifacts stage.

BUG=chromium:894820
TEST=tryjobs

Change-Id: I49578c91c0f2f5d30eb78b5c3e17d9f45985ed6f
Reviewed-on: https://chromium-review.googlesource.com/1292681
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/29e179cc0d5fa03dfeec7adab0d5ceecd247834c/config/config_dump.json
[modify] https://crrev.com/29e179cc0d5fa03dfeec7adab0d5ceecd247834c/config/chromeos_config_test.py
Comment 1 by athilenius@chromium.org
, Oct 12
Labels: -Pri-3 Pri-0