BuildPackages failure: "Read-only file system" for "vdb_metadata.pickle.pid"
Reported by jrbarnette@chromium.org, Mar 21 2018
Issue description
This CQ slave run failed:
https://luci-milo.appspot.com/buildbot/chromeos/bob-paladin/2606
The failure was in the BuildPackages stage. The messages suggest a
problem on the builder, not a bug in any particular CL. In particular,
the complaint about "Read-only file system" is a bit concerning.
Here are the logs that appear most relevant:
====
INFO : Adding developer packages to chromiumos_image.bin
INFO : Mounting temporary rootfs /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs-local-build-dir to /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs//build/rootfs.
Setting up symlinks for /usr/local for /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/stateful/dev_image
INFO : Image specified by /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2 mounted at /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs successfully.
!!! Invalid binary package: '/build/bob/packages/dev-only-extras.tbz2'
!!! Missing metadata key(s): CATEGORY, PF, SLOT. This binary package is not
!!! recoverable and should be deleted.
Starting fast-emerge.
Building package chromeos-dev-root on /build/bob
Calculating deps...
parallel_emerge: Unhandled exception:
Traceback (most recent call last):
File "/mnt/host/source/chromite/bin/parallel_emerge", line 169, in <module>
DoMain()
File "/mnt/host/source/chromite/bin/parallel_emerge", line 165, in DoMain
commandline.ScriptWrapperMain(FindTarget)
File "/mnt/host/source/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
ret = target(argv[1:])
File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 1863, in main
return real_main(argv)
File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 1907, in real_main
deps_tree, deps_info = deps.GenDependencyTree()
File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 477, in GenDependencyTree
self.CreateDepgraph(emerge, packages)
File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 433, in CreateDepgraph
packages, emerge.spinner)
File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 9196, in backtrack_depgraph
myaction, myfiles, spinner)
File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 9233, in _backtrack_depgraph
success, favorites = mydepgraph.select_files(myfiles)
File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 3583, in select_files
return self._select_files(args)
File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 3592, in _select_files
self._load_vdb()
File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 590, in _load_vdb
fake_vartree.sync()
File "/usr/lib64/python2.7/site-packages/_emerge/FakeVartree.py", line 213, in sync
self._sync()
File "/usr/lib64/python2.7/site-packages/_emerge/FakeVartree.py", line 272, in _sync
real_vardb.flush_cache()
File "/usr/lib64/python2.7/site-packages/portage/dbapi/vartree.py", line 598, in flush_cache
f = atomic_ofstream(self._aux_cache_filename, 'wb')
File "/usr/lib64/python2.7/site-packages/portage/util/__init__.py", line 1288, in __init__
mode=mode, **portage._native_kwargs(kargs)))
IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs/var/cache/edb/vdb_metadata.pickle.12506'
INFO : Unmounting image from /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/stateful and /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs
INFO : Unmounting temporary rootfs /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs//build/rootfs.
====

Mar 21 2018
Seen twice now on two different builders; let's bump the priority.
Mar 21 2018
I'm not making much progress here.

"INFO    : Image specified by /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2 mounted at /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs successfully."

That print comes from mount_gpt_image.sh; probably we passed in the
"--safe" / FLAGS_safe flag to mount the partition as RO. Maybe we always
mount the partition in RO? In that case, maybe the write to vdb_metadata
is abnormal. From src/third_party/portage_tool/pym/portage/dbapi/vartree.py:

    self._aux_cache_filename = os.path.join(self._eroot,
        CACHE_PATH, "vdb_metadata.pickle")

    def flush_cache(self):
        """If the current user has permission and the internal aux_get cache has
        been updated, save it to disk and mark it unmodified. This is called
        by emerge after it has loaded the full vdb for use in dependency
        calculations. Currently, the cache is only written if the user has
        superuser privileges (since that's required to obtain a lock), but all
        users have read access and benefit from faster metadata lookups (as
        long as at least part of the cache is still valid)."""
        if self._flush_cache_enabled and \
            self._aux_cache is not None and \
            secpass >= 2 and \
            (len(self._aux_cache["modified"]) >= self._aux_cache_threshold or
            not os.path.exists(self._cache_delta_filename)):

            ensure_dirs(os.path.dirname(self._aux_cache_filename))

            self._owners.populate() # index any unindexed contents
            valid_nodes = set(self.cpv_all())
            for cpv in list(self._aux_cache["packages"]):
                if cpv not in valid_nodes:
                    del self._aux_cache["packages"][cpv]
            del self._aux_cache["modified"]
            timestamp = time.time()
            self._aux_cache["timestamp"] = timestamp

            f = atomic_ofstream(self._aux_cache_filename, 'wb')   <- fail here

Maybe we rarely meet those conditions to proceed with flush_cache in this
scenario, which would explain why most runs don't hit this path.
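For anyone wondering about the numeric suffix in the failing path:
portage's atomic_ofstream writes to a pid-suffixed temporary file next to
the target and renames it into place on close, so "vdb_metadata.pickle.12506"
is the temp file for pid 12506. A minimal sketch of that write-then-rename
pattern (not portage's actual implementation; the helper name is made up):

    import os

    def atomic_write(path, data):
        # Write to a pid-suffixed temp file in the same directory, then
        # rename over the destination. rename() is atomic on POSIX, so
        # readers never observe a partially written cache file.
        tmp_path = '%s.%d' % (path, os.getpid())
        f = open(tmp_path, 'wb')  # on a read-only fs, EROFS surfaces here,
        try:                      # before any data is written
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        finally:
            f.close()
        os.rename(tmp_path, path)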
Mar 21 2018
"Maybe we always mount the partition in RO?" - I don't think that's the case, it looks like we're installing packages to that partition shorty after (if we don't fail first). So the problem / difference from the successful case is that we're actually mounting the partition RO, possibly by passing "--safe" to mount_gpt_image.sh.
Mar 21 2018
The "Read-only file system" can happen when file system corruption happens. Testing out that theory, I found the following was logged on the build server for bob-paladin at the time of the failure: jrbarnette@cros-beefy69-c2:~$ grep loop /var/log/syslog | grep '^Mar 20 15:' | tail -10 Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.914330] EXT4-fs (loop1): mounting ext2 file system using the ext4 su bsystem Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.914722] EXT4-fs (loop1): mounted filesystem without journal. Opts: ( null) Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.931928] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null) Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.944034] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null) Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.114359] EXT4-fs error (device loop2): ext4_iget:4161: inode #2909: c omm python2 /mnt/ho: checksum invalid Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.124523] Aborting journal on device loop2-8. Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.129283] EXT4-fs (loop2): Remounting filesystem read-only Mar 20 15:40:11 cros-beefy69-c2 kernel: [ 6247.581464] EXT4-fs error (device loop2): ext4_put_super:797: Couldn't c lean up the journal Mar 20 15:42:43 cros-beefy69-c2 kernel: [ 6.118016] Calibrating delay loop (skipped) preset value.. 4600.00 Bogo MIPS (lpj=9200000) Mar 20 15:42:43 cros-beefy69-c2 kernel: [ 9.294656] loop: module loaded So... I don't know what caused it, but there was some sort of corruption error on the loop device.
Mar 21 2018
The same sequence of error messages shows up on the builder for
lakitu-st-paladin. I note that the two build servers are notably different
hardware: bob-paladin is hosted on a GCE server, whereas lakitu-st-paladin
is bare metal hosted in the Chrome GOLO. They are both running variations
of the 3.13.0 kernel, albeit different builds. I searched the logs, and
both servers have been running their respective kernels since 3/13, which
is as far back as either host's logs go. The upshot of all this is that it
feels like something caused by whatever we're doing in userland, not the
hardware, and not the kernel.

Full logs from the lakitu-st-paladin failure:

chrome-bot@build23-m2:(Linux 14.04):~$ sudo grep '^Mar 21 09:0' /var/log/syslog | grep loop
Mar 21 09:03:12 build23-m2 kernel: [ 4754.774448] EXT4-fs (loop1): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:13 build23-m2 kernel: [ 4755.656641] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:03:13 build23-m2 kernel: [ 4755.656878] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.162221] EXT4-fs (loop1): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.424561] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:03:14 build23-m2 kernel: [ 4756.424822] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.431800] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.438932] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.812727] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:09:12 build23-m2 kernel: [ 5113.812967] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.819819] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.826945] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:30 build23-m2 kernel: [ 5132.230232] loop1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12
Mar 21 09:09:52 build23-m2 kernel: [ 5153.943910] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:09:52 build23-m2 kernel: [ 5153.944337] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:09:52 build23-m2 kernel: [ 5153.951797] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:52 build23-m2 kernel: [ 5153.959304] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:54 build23-m2 kernel: [ 5155.692610] EXT4-fs error (device loop2): ext4_iget:4156: inode #131670: comm python2 /mnt/ho: checksum invalid
Mar 21 09:09:54 build23-m2 kernel: [ 5155.692954] Aborting journal on device loop2-8.
Mar 21 09:09:54 build23-m2 kernel: [ 5155.693102] EXT4-fs (loop2): Remounting filesystem read-only
Mar 21 09:09:54 build23-m2 kernel: [ 5155.739834] EXT4-fs error (device loop2): ext4_iget:4156: inode #131102: comm python2 /mnt/ho: checksum invalid
Mar 21 09:09:55 build23-m2 kernel: [ 5157.160437] EXT4-fs error (device loop2): ext4_put_super:797: Couldn't clean up the journal
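For checking other builders for the same failure signature, a small script
along these lines (hypothetical; the syslog path and the signature are
taken from the greps above) would pull out the checksum-error / journal-abort
/ remount-read-only sequence without the manual grep pipeline:

    import re

    # Match the three-step failure signature seen on both builders:
    # checksum error -> journal abort -> remount read-only.
    SIGNATURE = re.compile(
        r'EXT4-fs error \(device loop\d+\)'
        r'|Aborting journal on device loop\d+'
        r'|EXT4-fs \(loop\d+\): Remounting filesystem read-only')

    def find_ext4_failures(syslog_path='/var/log/syslog'):
        with open(syslog_path) as f:
            return [line.rstrip('\n') for line in f
                    if SIGNATURE.search(line)]

    if __name__ == '__main__':
        for line in find_ext4_failures():
            print(line)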
Mar 21 2018
My gut reaction is that there is a problem with the build machines.
Userland writing files to ext4-formatted images should never be hitting
checksum/write/journal errors.
,
Mar 21 2018
> My gut reaction is that there is a problem with the build machines.
I'd be first in line to buy that theory, but I've got some hang-ups
based on the data:
* It's two different servers with rather different hardware.
* The problem hasn't been seen before, yet now we see it two days in
a row.
* /var/log/syslog shows no other errors concurrent with the failures.
* The specific sequence of events logged on the two different servers
is identical.
That is, blaming the hardware requires accepting a level of coincidence
that strains my credulity.
So, I'm more inclined to blame software. However, the kernel hasn't changed,
and I assume (though I don't know) that the relevant userland code has been
stable, too. Which, I suppose, leaves us nowhere.
Is there something we can do to rule in or out the possibility of a kernel
problem?
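One way to take the kernel's runtime checks out of the equation: after a
failure, leave the image in place and run e2fsck against the rootfs
partition from userland. If e2fsck also reports bad checksums, the
corruption really is on disk and the kernel was just the messenger; if the
image checks out clean, suspicion shifts back toward the kernel. A sketch
(the image path, loop device, and partition number are all assumptions):

    import subprocess

    # The full disk image must first be mapped to loop devices, e.g.
    #   losetup -fP --show chromiumos_image.bin
    # so e2fsck sees the rootfs partition rather than the whole image.

    def check_rootfs(partition_dev):
        # -f: force a check even if the filesystem is marked clean.
        # -n: open read-only and answer "no" to all repair prompts.
        # Exit status 0 means no errors; nonzero means inconsistencies.
        return subprocess.call(['e2fsck', '-f', '-n', partition_dev])

    if __name__ == '__main__':
        # /dev/loop1p3 assumes the rootfs landed on partition 3.
        print('e2fsck exit status: %d' % check_rootfs('/dev/loop1p3'))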
Mar 23 2018
Happened again here:
https://luci-milo.appspot.com/buildbot/chromeos/kahlee-paladin/1699
Apr 5 2018
Another instance of this, now on the pre-cq: https://ci.chromium.org/p/chromeos/builds/b8950133071756921872#
Apr 12 2018
And another one: https://uberchromegw.corp.google.com/i/chromeos/builders/capri-paladin/builds/1209

IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs/var/cache/edb/vdb_metadata.pickle.20747'
INFO    : Unmounting image from /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/stateful and /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs
INFO    : Unmounting temporary rootfs /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs//build/rootfs.
rmdir: failed to remove '/mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs//build': Directory not empty
Apr 12 2018
And again: https://uberchromegw.corp.google.com/i/chromeos/builders/peppy-paladin/builds/18387

IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/peppy/R67-10569.0.0-rc3/rootfs/var/cache/edb/vdb_metadata.pickle.29325'