
Issue 894820 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




amd64-generic-paladin broken by TastVMTest and UploadTestArtifacts running at same time

Project Member Reported by saklein@chromium.org, Oct 12

Issue description

Error Message:
mount: /tmp/tmpYjduXB/dir-1: wrong fs type, bad option, bad superblock on /dev/loop2, missing codepage or helper program, or other error.

Error Log: https://luci-logdog.appspot.com/logs/chromiumos/bb/chromiumos/amd64-generic-paladin/33962/+/recipes/steps/UploadTestArtifacts/0/stdout#L10228_27

GE First Instance: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932899237996629792

It's the same error for all 3 of the previous runs.

 
Cc: dgarr...@chromium.org saklein@chromium.org
Labels: -Pri-3 Pri-0
Looks like this is blocking CQ. Don, any insights here? Can I restart the bot and then kill the CQ run? (Hopefully shortly after a CQ run finishes)
Cc: vapier@chromium.org ahass...@chromium.org
There have been a lot of recent changes to payload related code, so cc'ing a few more people.
I'm going to reinstance the builder once the current CQ run finishes. I'm very skeptical that it will fix anything, but I don't have a lot of better ideas.
Re-instancing finished, we'll see next run if that fixed it...
From the errors, it seems that the stateful partition of the test image is corrupted and cannot be mounted. That's all I can see at this point.
Builder was marked to experimental, bumping down to P1, still need to find a resolution for this ASAP though.
Labels: -Pri-0 Pri-1
at least for the chromite cleanups, i would have expected the paygen code to break everywhere or nowhere, not just for one bot
Same, but an external bot is a little special.
re #8: I think the chromite paygen stage is skipped on everything except release builders, so those changes will not have any effect here. I downloaded the paladin image and, per #5, I cannot mount the stateful partition.


Cc: tbroch@chromium.org yueherngl@chromium.org
+other sheriffs

#10 That means that re-instancing the bot isn't going to fix it, the image is indeed corrupt?
I just tried it again and it mounted. So again I'm confused! It seems the mount parameters are fine. There might be something on the builder that is causing this! I don't really know.
The reinstancing did not help, the problem reproduced again.
I'll start poking around again. Anyone have suggestions?

This seems like a very perplexing issue. Because of the re-instance it seems improbable that it's an infra issue, but it also seems improbable that it's a CL...
Trying to manually mount the image failed with the same error. I am very stumped on this issue, does anyone have pointers on where to go / what to look into?

@josephsih what do you mean from time to time? This has been consistently failing amd64-generic-paladin since Friday, no?
> Trying to manually mount the image failed with the same error. I am very stumped on this issue, does anyone have pointers on where to go / what to look into?

We need to determine if this is an infra failure or not. Copy the chromiumos_test_image.bin locally and attempt to mount it. If that fails to mount, it's not an infra issue: the build process is outputting an invalid disk image, and the bug needs to be routed to a sheriff to find a Platform owner.

FYI, I've downloaded and mounted the image, and it worked perfectly fine for me! I think it might be a problem with infra for this particular builder. (I extracted the chromiumos_test_image.bin from image.zip, though.)
@ahassani how did you download the image locally? (I'm guessing there is a GCE bucket somewhere I don't know about).

That's super confusing though, because I re-imaged the entire bot and it still failed the same way.
Status: Started (was: Untriaged)
You have an SSH connection to the bot: get it from there to more directly assert that what is used at this stage is the same thing you would be testing locally.
Yep, tried that between builds. Mounting it on the bot failed with the same error. I grabbed a copy and put it into my home dir on the bot too. I'm curious how to pull it onto my workstation though, I assume these artifacts are published somewhere?
Well, looks like this works too: `gcloud compute --project "chromeos-bot" scp --zone "us-central1-b" "cros-beefy468-c2":/home/athilenius/* ~/tmp`

It will mount on my workstation, but only without the `ro` flag.
Okay, so the kernel state is somehow corrupted on the bot. What's in the kernel logs? What happens if you use another loop device?
Because the bot was re-imaged, I assume that means the build itself is somehow corrupting the kernel state? dmesg shows several of these:

[ 8369.251588] EXT4-fs (loop0): Couldn't mount because of unsupported optional features (10000)

and trying to mount again I got it to spit out:
[ 8611.196954] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.202895] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.208772] EXT4-fs (loop0): VFS: Can't find ext4 filesystem
[ 8611.216099] FAT-fs (loop0): invalid media value (0xb9)

None of this makes much sense though. Note that the first error might also be just a warning (as it's scanning for an EXT2/EXT3 FS first).

How do I use a different device to mount a loopback?
Cc: rcui@chromium.org achuith@chromium.org
Fwiw, the amd64-generic-chromium-pfq builder is green, and I believe it should be identical to this builder, so I think there's something wrong with this specific builder.

https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=amd64-generic-chromium-pfq&buildBranch=master
Just a note: if you mount the image RW, the mount can modify its contents, including partition/FS metadata.
Update on this bug. lannm and I have been looking at this the last few hours, there is lots of weird stuff going on.

Mounting the pristine image (that was built on cros-beefy468-c2) as RO fails, with this in the dmesg log:
```
[123947.733312] EXT4-fs (loop1): INFO: recovery required on readonly filesystem
[123947.733314] EXT4-fs (loop1): write access unavailable, cannot proceed (try mounting with noload)
```
However, mounting it RW succeeds: dmesg shows that the FS was recovered, and the mount changes the image's MD5 hash. Other AMD64 builders (including the tryjob builder that ran this tryjob https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932496979626823440) seem to work just fine, as do all the other CQ builds.
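
The "mounting RW changes the MD5 hash" observation can be checked mechanically. Below is a minimal sketch, assuming nothing about chromite itself: `file_md5` and `was_modified` are hypothetical helper names, and the idea is simply to record a digest of the image right after it is built and compare it again later to see whether some stage mutated the file in place.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def was_modified(path, baseline_md5):
    """Return True if the file's current digest differs from the baseline."""
    return file_md5(path) != baseline_md5
```

Taking the baseline before any stage touches the image, and re-checking before upload, would show whether an RW mount (or anything else) rewrote the file in between.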

Also, this is a failure in mounting the FS, not in setting up the loopback block device; you can `losetup` the image (with the offset given in the logs [2311061504 bytes]) just fine. It's the actual mounting of the FS from that block device that fails. There are minor variations in kernel version and a few other things, but nothing that looks promising. No CLs that look at all relevant landed anywhere around when it started failing.

@vapier, do you know if anything special happens with the amd64-generic builds? Does it use a different image layout, different disk creation tools... anything at all?
Note: in the above I'm talking about mounting on our workstations, we cannot get the image to mount on cros-beefy468-c2
Cc: -tbroch@chromium.org
Cc: jclinton@chromium.org zwisler@chromium.org bmgordon@chromium.org
The leading theory right now is that e2fsprogs is building the EXT4 partition with the EXT4_FEATURE_INCOMPAT_ENCRYPT flag set, but the older kernel on the amd64-generic-paladin builder is unable to handle that flag and thus fails to mount the partition. We are going to try to upgrade the kernel, but are first verifying that we can do so without breaking things.

This doesn't explain several things though: why this just started failing (or why it ever worked in the first place) and why it works on other builders with the same kernel.
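
One way to test this theory without mounting anything is to read the feature flags straight out of the ext4 superblock. The following is a rough sketch, not something from chromite: the function names are hypothetical. The constants follow the ext4 on-disk layout: the superblock starts 1024 bytes into the filesystem, `s_magic` is at offset 0x38, and `s_feature_incompat` is at offset 0x60. The `partition_offset` argument is the byte offset of the partition inside the disk image (2311061504 in the logs above). The ENCRYPT bit is 0x10000, which matches the "unsupported optional features (10000)" line in dmesg, and the RECOVER bit (0x4) corresponds to the "recovery required" message.

```python
import struct

EXT4_SUPERBLOCK_OFFSET = 1024        # superblock starts 1024 bytes into the FS
EXT4_MAGIC = 0xEF53                  # expected value of s_magic
S_MAGIC_OFFSET = 0x38                # offset of s_magic within the superblock
S_FEATURE_INCOMPAT_OFFSET = 0x60     # offset of s_feature_incompat

FEATURE_INCOMPAT_RECOVER = 0x0004    # journal needs recovery
FEATURE_INCOMPAT_ENCRYPT = 0x10000   # ext4 encryption enabled


def read_incompat_features(image_path, partition_offset=0):
    """Return s_feature_incompat for the ext4 FS at `partition_offset` bytes."""
    with open(image_path, "rb") as f:
        f.seek(partition_offset + EXT4_SUPERBLOCK_OFFSET)
        sb = f.read(S_FEATURE_INCOMPAT_OFFSET + 4)
    if len(sb) < S_FEATURE_INCOMPAT_OFFSET + 4:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, S_MAGIC_OFFSET)
    if magic != EXT4_MAGIC:
        raise ValueError("no ext2/3/4 magic at this offset: 0x%04x" % magic)
    (incompat,) = struct.unpack_from("<I", sb, S_FEATURE_INCOMPAT_OFFSET)
    return incompat


def describe_incompat(incompat):
    """Report the two incompat bits that matter for this bug."""
    return {
        "needs_recovery": bool(incompat & FEATURE_INCOMPAT_RECOVER),
        "encrypt": bool(incompat & FEATURE_INCOMPAT_ENCRYPT),
    }
```

Running this against the pristine image and against a copy that has been booted or mounted RW would show whether either bit gets flipped after the build.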
Cc: derat@chromium.org
New theory: TastVMTest runs in parallel with UploadTestArtifacts. If TastVMTest is mounting the image file RW, it could enable ext4 encryption before the image is uploaded.

I'm going to teach TastVMTest to copy the image before using it.
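
A minimal sketch of the "copy the image before using it" idea, assuming a plain full copy into a scratch directory. `private_image_copy` is a hypothetical helper, not an existing chromite function, and a real stage would need to account for the size of the image and the time the copy takes.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def private_image_copy(image_path):
    """Yield the path of a throwaway copy of `image_path`.

    The caller may boot or mount the copy read-write; the original file is
    never opened for writing, so stages running in parallel still see the
    pristine image. The copy is deleted when the context exits.
    """
    tmpdir = tempfile.mkdtemp(prefix="vm-image-")
    try:
        copy_path = os.path.join(tmpdir, os.path.basename(image_path))
        shutil.copyfile(image_path, copy_path)
        yield copy_path
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```

A caller would wrap the VM launch in `with private_image_copy(path) as vm_image:` and pass `vm_image` to the VM instead of the shared file.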
And TastVMTest was enabled on amd64-generic-paladin right when we started seeing failures: crrev.com/c/1232554

Reverting.
Project Member

Comment 34 by bugdroid1@chromium.org, Oct 17

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/451dc721b571c5dfa791c593b1f46e85cf83f7b9

commit 451dc721b571c5dfa791c593b1f46e85cf83f7b9
Author: Lann Martin <lannm@chromium.org>
Date: Wed Oct 17 18:24:09 2018

chromeos_config: Disable Tast VM tests on paladins.

These appear to be mutating the test image in place, causing issues in
parallel stages.

BUG= chromium:894820 
TEST=chromeos_config_unittest; inspect config_dump.json

Change-Id: I3eddc92d5051f9989d40ed2a8a0704ba1b3a4da2
Reviewed-on: https://chromium-review.googlesource.com/1286480
Commit-Ready: Lann Martin <lannm@chromium.org>
Tested-by: Lann Martin <lannm@chromium.org>
Reviewed-by: Alec Thilenius <athilenius@google.com>

[modify] https://crrev.com/451dc721b571c5dfa791c593b1f46e85cf83f7b9/config/chromeos_config.py
[modify] https://crrev.com/451dc721b571c5dfa791c593b1f46e85cf83f7b9/config/config_dump.json

Back when I added TastVMTest, I based it very heavily on what VMTest was doing at the time. I'm guessing that VMTest must've changed in the meantime -- does it copy the image now before booting it? Alternately, does VMTest not run in parallel with UploadTestArtifacts?

Achuith, would your https://crrev.com/c/1285111 help here once it's in?
My CL wouldn't as it is, but cros_run_vm_test now has a copy-on-write feature that may help:
https://cs.corp.google.com/chromeos_public/chromite/scripts/cros_vm.py?l=534-538
If there's any parallelism with running VM tests, you will run into this kind of corruption.
Cc: athilenius@chromium.org zamorzaev@chromium.org
Owner: derat@chromium.org
Issue 779267 from last year is related. Then, we were having trouble with the (then-newly-added) TastVMTest stage running in parallel with VMTest, with both apparently modifying the same image file.

In http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932899237996629792, the first failed run mentioned here, I don't see VMTest running at all, but TastVMTest and UploadTestArtifacts (which failed) started simultaneously.

In an IM conversation, Lann wrote "in a tryjob I ran, VMTest and TastVMTest both ran, with TastVMTest waiting for VMTest to complete. so the upload was complete before tast ran. it looks like VMTest uses a different image file, but I haven't looked too closely. chromiumos_qemu_image.bin".

And yeah, I think (it's a bit hard to follow the code) that cbuildbot/stages/vm_test_stages.py is passing a path with constants.VM_IMAGE_BIN to cros_run_vm_test, while cbuildbot/stages/tast_test_stages.py is passing a path with constants.TEST_IMAGE_BIN to bin/cros_run_tast_vm_test. I think that https://crrev.com/c/1174925 might have been what updated VMTest to use the VM image instead of the test image.

Achuith's https://crrev.com/c/1285111 is in the CQ to update TastVMTest to use cros_run_vm_test, so after that's in, I'll try switching tast_test_stages.py to use VM_IMAGE_BIN. If that doesn't work, I'll experiment with cros_vm.py's --copy-on-write flag (mentioned in #36).

I'm still not really sure how to test whether this works or not, though. My tryjob runs of https://crrev.com/c/1232554 passed, and the pre-CQ submitted it without actually going through the CQ to make sure that it worked there (I believe this is a known limitation in chromite testing). After switching to VM_IMAGE_BIN, should I just turn the stage back on and hope for the best?
FYI: most recent CQ run just passed the UploadTestArtifacts stage.
Components: Tests>Tast
Labels: -Pri-1 Pri-2
Summary: amd64-generic-paladin broken by TastVMTest and UploadTestArtifacts running at same time (was: amd64-generic-paladin UploadTestArtifacts Failures)
No longer a P1 as this is no longer breaking the CQ.
Project Member

Comment 41 by bugdroid1@chromium.org, Oct 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/4ec1257ee98882223e78c2e4c71da740539ddbcb

commit 4ec1257ee98882223e78c2e4c71da740539ddbcb
Author: Daniel Erat <derat@chromium.org>
Date: Fri Oct 19 20:19:59 2018

chromeos_config: Reenable Tast on VM paladin builders.

Reenable Tast on amd64-generic-paladin and betty-paladin,
reverting 451dc721.

Also make cros_run_vm_test use chromiumos_qemu_image.bin
rather than chromiumos_test_image.bin in an attempt to avoid
having the TastVMTest stage step on UploadTestArtifacts's
toes (the reason Tast was disabled on these builders).

BUG= chromium:894820 
TEST=running tryjobs

Change-Id: I0e6a883b1142bdcb1a4b989db8336433d5b3a1a4
Reviewed-on: https://chromium-review.googlesource.com/c/1287811
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Lann Martin <lannm@chromium.org>
Commit-Queue: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/config/chromeos_config.py
[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/4ec1257ee98882223e78c2e4c71da740539ddbcb/config/config_dump.json

Project Member

Comment 42 by bugdroid1@chromium.org, Oct 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9b771ac49bd273f73a2b48b31bcd23e9d510891b

commit 9b771ac49bd273f73a2b48b31bcd23e9d510891b
Author: Dan Erat <derat@chromium.org>
Date: Sat Oct 20 00:39:04 2018

Revert "chromeos_config: Reenable Tast on VM paladin builders."

This reverts commit 4ec1257ee98882223e78c2e4c71da740539ddbcb.

Reason for revert: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932197775125580624 failed due to missing chromiumos_qemu_image.bin file.

Original change's description:
> chromeos_config: Reenable Tast on VM paladin builders.
> 
> Reenable Tast on amd64-generic-paladin and betty-paladin,
> reverting 451dc721.
> 
> Also make cros_run_vm_test use chromiumos_qemu_image.bin
> rather than chromiumos_test_image.bin in an attempt to avoid
> having the TastVMTest stage step on UploadTestArtifacts's
> toes (the reason Tast was disabled on these builders).
> 
> BUG= chromium:894820 
> TEST=running tryjobs
> 
> Change-Id: I0e6a883b1142bdcb1a4b989db8336433d5b3a1a4
> Reviewed-on: https://chromium-review.googlesource.com/c/1287811
> Tested-by: Dan Erat <derat@chromium.org>
> Trybot-Ready: Dan Erat <derat@chromium.org>
> Reviewed-by: Don Garrett <dgarrett@chromium.org>
> Reviewed-by: Lann Martin <lannm@chromium.org>
> Commit-Queue: Dan Erat <derat@chromium.org>

Bug:  chromium:894820 
Change-Id: I64829beb6208db6c0f6e67ecb5f83d9d33c5dbb6
Reviewed-on: https://chromium-review.googlesource.com/c/1292677
Reviewed-by: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/config/chromeos_config.py
[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/9b771ac49bd273f73a2b48b31bcd23e9d510891b/config/config_dump.json

Status: Fixed (was: Started)
Issue 885016 is tracking my continuing efforts to get TastVMTest running on amd64-generic-paladin and betty-paladin.
Project Member

Comment 44 by bugdroid1@chromium.org, Oct 22

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/ea294acc019e0354fc0fca98dee2a7bf61d52f82

commit ea294acc019e0354fc0fca98dee2a7bf61d52f82
Author: Daniel Erat <derat@chromium.org>
Date: Mon Oct 22 02:25:23 2018

cbuildbot: Reland "tast_test_stages: Use cros_run_vm_test"

This reverts commit 785371775dd409fec5048260877191af68feeb59
to enable using cros_run_vm_test in the TastVMTest stage. It
passes the --copy-on-write flag to cros_run_vm_test, which
passes it through to cros_vm. This will hopefully avoid
"permission denied" errors when opening
chromiumos_test_image.bin.

BUG= chromium:891928 , chromium:894820 
TEST=ran tryjobs

Change-Id: Id2e01becd58131806410976360dfda9a73a5261d
Reviewed-on: https://chromium-review.googlesource.com/c/1292680
Reviewed-by: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/ea294acc019e0354fc0fca98dee2a7bf61d52f82/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/ea294acc019e0354fc0fca98dee2a7bf61d52f82/cbuildbot/stages/tast_test_stages_unittest.py

Project Member

Comment 45 by bugdroid1@chromium.org, Oct 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/29e179cc0d5fa03dfeec7adab0d5ceecd247834c

commit 29e179cc0d5fa03dfeec7adab0d5ceecd247834c
Author: Daniel Erat <derat@chromium.org>
Date: Tue Oct 23 20:39:23 2018

chromeos_config: Reenable Tast on VM paladin builders again.

Run the TastVMTest stage on amd64-generic-paladin and
betty-paladin yet again. chromiumos_qemu_image.bin doesn't
appear to exist when the stage runs, so we're still using
chromium_test_image.bin, but I'm hopeful that using
cros_vm's --copy-on-write flag will prevent conflicts with
the UploadTestArtifacts stage.

BUG= chromium:894820 
TEST=tryjobs

Change-Id: I49578c91c0f2f5d30eb78b5c3e17d9f45985ed6f
Reviewed-on: https://chromium-review.googlesource.com/1292681
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/29e179cc0d5fa03dfeec7adab0d5ceecd247834c/config/config_dump.json
[modify] https://crrev.com/29e179cc0d5fa03dfeec7adab0d5ceecd247834c/config/chromeos_config_test.py
