
Issue 823959

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




BuildPackages failure: "Read-only file system" for "vdb_metadata.pickle.pid"

Reported by jrbarnette@chromium.org, Mar 21 2018

Issue description

This CQ slave run failed:
    https://luci-milo.appspot.com/buildbot/chromeos/bob-paladin/2606

The failure was in the BuildPackages stage.  The messages suggest a
problem on the builder, not a bug in any particular CL.  In particular,
the complaint about "Read-only file system" is a bit concerning...

Here are the logs that appear most relevant:
====
INFO    : Adding developer packages to chromiumos_image.bin
INFO    : Mounting temporary rootfs /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs-local-build-dir to /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs//build/rootfs.
Setting up symlinks for /usr/local for /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/stateful/dev_image
INFO    : Image specified by /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2 mounted at /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs successfully.
!!! Invalid binary package: '/build/bob/packages/dev-only-extras.tbz2'
!!! Missing metadata key(s): CATEGORY, PF, SLOT. This binary package is not
!!! recoverable and should be deleted.
Starting fast-emerge.
 Building package chromeos-dev-root on /build/bob
Calculating deps...
parallel_emerge: Unhandled exception:
Traceback (most recent call last):
  File "/mnt/host/source/chromite/bin/parallel_emerge", line 169, in <module>
    DoMain()
  File "/mnt/host/source/chromite/bin/parallel_emerge", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/mnt/host/source/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 1863, in main
    return real_main(argv)
  File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 1907, in real_main
    deps_tree, deps_info = deps.GenDependencyTree()
  File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 477, in GenDependencyTree
    self.CreateDepgraph(emerge, packages)
  File "/mnt/host/source/chromite/scripts/parallel_emerge.py", line 433, in CreateDepgraph
    packages, emerge.spinner)
  File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 9196, in backtrack_depgraph
    myaction, myfiles, spinner)
  File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 9233, in _backtrack_depgraph
    success, favorites = mydepgraph.select_files(myfiles)
  File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 3583, in select_files
    return self._select_files(args)
  File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 3592, in _select_files
    self._load_vdb()
  File "/usr/lib64/python2.7/site-packages/_emerge/depgraph.py", line 590, in _load_vdb
    fake_vartree.sync()
  File "/usr/lib64/python2.7/site-packages/_emerge/FakeVartree.py", line 213, in sync
    self._sync()
  File "/usr/lib64/python2.7/site-packages/_emerge/FakeVartree.py", line 272, in _sync
    real_vardb.flush_cache()
  File "/usr/lib64/python2.7/site-packages/portage/dbapi/vartree.py", line 598, in flush_cache
    f = atomic_ofstream(self._aux_cache_filename, 'wb')
  File "/usr/lib64/python2.7/site-packages/portage/util/__init__.py", line 1288, in __init__
    mode=mode, **portage._native_kwargs(kargs)))
IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs/var/cache/edb/vdb_metadata.pickle.12506'
INFO    : Unmounting image from /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/stateful and /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs
INFO    : Unmounting temporary rootfs /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs//build/rootfs.
 

Comment 1 by sha...@chromium.org, Mar 21 2018

Cc: eugenegonzalez@chromium.org sha...@chromium.org
Here's another example from a different board / builder:

https://luci-milo.appspot.com/buildbot/chromeos/lakitu-st-paladin/2614

IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/lakitu-st/R67-10506.0.0-rc3/rootfs/var/cache/edb/vdb_metadata.pickle.2405'

I'll look into what this "vdb_metadata.pickle.pid" file is, and see if anything related changed recently.
Cc: -sha...@chromium.org
Owner: sha...@chromium.org
Status: Assigned (was: Untriaged)
Labels: -Pri-3 Pri-1
Summary: BuildPackages failure: "Read-only file system" for "vdb_metadata.pickle.pid" (was: BuildPackages failure in bob-paladin)
Seen twice now on two different builders; let's bump the priority.

Comment 4 by sha...@chromium.org, Mar 21 2018

I'm not making much progress here.

"INFO    : Image specified by /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2 mounted at /mnt/host/source/src/build/images/bob/R67-10505.0.0-rc2/rootfs successfully." --  that print comes from mount_gpt_image.sh, probably we passed in the "--safe" / FLAGS_safe flag to mount the partition as RO. Maybe we always mount the partition in RO? In that case, maybe the write to vdb_metadata is abnormal:

src/third_party/portage_tool/pym/portage/dbapi/vartree.py:

		self._aux_cache_filename = os.path.join(self._eroot,
			CACHE_PATH, "vdb_metadata.pickle")


	def flush_cache(self):
		"""If the current user has permission and the internal aux_get cache has
		been updated, save it to disk and mark it unmodified.  This is called
		by emerge after it has loaded the full vdb for use in dependency
		calculations.  Currently, the cache is only written if the user has
		superuser privileges (since that's required to obtain a lock), but all
		users have read access and benefit from faster metadata lookups (as
		long as at least part of the cache is still valid)."""
		if self._flush_cache_enabled and \
			self._aux_cache is not None and \
			secpass >= 2 and \
			(len(self._aux_cache["modified"]) >= self._aux_cache_threshold or
			not os.path.exists(self._cache_delta_filename)):

			ensure_dirs(os.path.dirname(self._aux_cache_filename))

			self._owners.populate() # index any unindexed contents
			valid_nodes = set(self.cpv_all())
			for cpv in list(self._aux_cache["packages"]):
				if cpv not in valid_nodes:
					del self._aux_cache["packages"][cpv]
			del self._aux_cache["modified"]
			timestamp = time.time()
			self._aux_cache["timestamp"] = timestamp

			f = atomic_ofstream(self._aux_cache_filename, 'wb') <- fail here


Maybe we only rarely meet those conditions, so flush_cache usually skips this write in this scenario.
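
For reference, the numeric suffix on "vdb_metadata.pickle.12506" / ".2405" looks like the PID of the emerge process: atomic_ofstream appears to write to a temporary "<name>.<pid>" file and then rename it over the real file. Here's a minimal sketch of that write-to-temp-then-rename pattern (my own illustration, not portage's actual code); the point is that the initial open() of the temp file is the step that raises IOError errno 30 (EROFS) once the kernel has made the filesystem read-only:

import errno
import os
import pickle


def atomic_pickle_dump(path, obj):
    """Write obj to path 'atomically': dump to '<path>.<pid>', then rename."""
    tmp_path = '%s.%d' % (path, os.getpid())
    try:
        # This open() is what fails with EROFS on a read-only mount,
        # matching the traceback above.
        with open(tmp_path, 'wb') as f:
            pickle.dump(obj, f)
        # Rename is atomic within the same filesystem.
        os.rename(tmp_path, path)
    except EnvironmentError as e:
        if e.errno == errno.EROFS:  # Errno 30: Read-only file system
            print('read-only file system; skipping cache write: %s' % tmp_path)
            return
        raise

If that's right, the ".pid" file in the summary is just the not-yet-renamed temp file, nothing special in itself.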

Comment 5 by sha...@chromium.org, Mar 21 2018

"Maybe we always mount the partition in RO?" - I don't think that's the case, it looks like we're installing packages to that partition shorty after (if we don't fail first). So the problem / difference from the successful case is that we're actually mounting the partition RO, possibly by passing "--safe" to mount_gpt_image.sh.
The "Read-only file system" can happen when file system corruption
happens.  Testing out that theory, I found the following was logged
on the build server for bob-paladin at the time of the failure:

jrbarnette@cros-beefy69-c2:~$ grep loop /var/log/syslog | grep '^Mar 20 15:' | tail -10
Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.914330] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.914722] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.931928] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 20 15:40:08 cros-beefy69-c2 kernel: [ 6243.944034] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.114359] EXT4-fs error (device loop2): ext4_iget:4161: inode #2909: comm python2 /mnt/ho: checksum invalid
Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.124523] Aborting journal on device loop2-8.
Mar 20 15:40:10 cros-beefy69-c2 kernel: [ 6246.129283] EXT4-fs (loop2): Remounting filesystem read-only
Mar 20 15:40:11 cros-beefy69-c2 kernel: [ 6247.581464] EXT4-fs error (device loop2): ext4_put_super:797: Couldn't clean up the journal
Mar 20 15:42:43 cros-beefy69-c2 kernel: [    6.118016] Calibrating delay loop (skipped) preset value.. 4600.00 BogoMIPS (lpj=9200000)
Mar 20 15:42:43 cros-beefy69-c2 kernel: [    9.294656] loop: module loaded

So... I don't know what caused it, but there was some sort of corruption
error on the loop device.

The same sequence of error messages shows up on the builder for
lakitu-st-paladin.

I note that the two build servers are notably different hardware;
bob-paladin is hosted on a GCE server, whereas lakitu-st-paladin is
bare metal hosted in the Chrome GOLO.  They are both running variations
of the 3.13.0 kernel, albeit different builds.  I searched the logs,
and both servers have been running their respective kernels since 3/13,
which is as far back as either host's logs go.

The upshot of all this is that it feels like something caused by whatever
we're doing in userland, not the hardware, and not the kernel.

Full logs from the lakitu-st-paladin failure:
chrome-bot@build23-m2:(Linux 14.04):~$ sudo grep '^Mar 21 09:0' /var/log/syslog | grep loop
Mar 21 09:03:12 build23-m2 kernel: [ 4754.774448] EXT4-fs (loop1): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:13 build23-m2 kernel: [ 4755.656641] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:03:13 build23-m2 kernel: [ 4755.656878] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.162221] EXT4-fs (loop1): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.424561] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:03:14 build23-m2 kernel: [ 4756.424822] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.431800] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:03:14 build23-m2 kernel: [ 4756.438932] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.812727] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:09:12 build23-m2 kernel: [ 5113.812967] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.819819] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:12 build23-m2 kernel: [ 5113.826945] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:30 build23-m2 kernel: [ 5132.230232]  loop1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12
Mar 21 09:09:52 build23-m2 kernel: [ 5153.943910] EXT4-fs (loop1): mounting ext2 file system using the ext4 subsystem
Mar 21 09:09:52 build23-m2 kernel: [ 5153.944337] EXT4-fs (loop1): mounted filesystem without journal. Opts: (null)
Mar 21 09:09:52 build23-m2 kernel: [ 5153.951797] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:52 build23-m2 kernel: [ 5153.959304] EXT4-fs (loop3): mounted filesystem with ordered data mode. Opts: (null)
Mar 21 09:09:54 build23-m2 kernel: [ 5155.692610] EXT4-fs error (device loop2): ext4_iget:4156: inode #131670: comm python2 /mnt/ho: checksum invalid
Mar 21 09:09:54 build23-m2 kernel: [ 5155.692954] Aborting journal on device loop2-8.
Mar 21 09:09:54 build23-m2 kernel: [ 5155.693102] EXT4-fs (loop2): Remounting filesystem read-only
Mar 21 09:09:54 build23-m2 kernel: [ 5155.739834] EXT4-fs error (device loop2): ext4_iget:4156: inode #131102: comm python2 /mnt/ho: checksum invalid
Mar 21 09:09:55 build23-m2 kernel: [ 5157.160437] EXT4-fs error (device loop2): ext4_put_super:797: Couldn't clean up the journal
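
For what it's worth, here's a small helper (hypothetical, just to save eyeballing syslog by hand) that scans for this same signature -- EXT4 checksum errors or read-only remounts on loop devices -- so we can quickly check other builders:

import re

EXT4_LOOP_ERROR = re.compile(
    r'EXT4-fs(?: error)? \((?:device )?loop\d+\): '
    r'.*(?:checksum invalid|Remounting filesystem read-only)')


def find_ext4_loop_errors(syslog_path='/var/log/syslog'):
    """Return syslog lines showing EXT4 corruption or RO-remounts on loop devices."""
    with open(syslog_path) as f:
        return [line.rstrip() for line in f if EXT4_LOOP_ERROR.search(line)]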

Cc: dgarr...@chromium.org vapier@chromium.org ihf@chromium.org

Comment 9 by vapier@chromium.org, Mar 21 2018

my gut reaction is that there is a problem with the build machines.  userland writing files to ext4 formatted images should never be hitting checksum/write/journal errors.
> my gut reaction is that there is a problem with the build machines.

I'd be first in line to buy that theory, but I've got some hang-ups
based on the data:
  * It's two different servers with rather different hardware.
  * The problem hasn't been seen before, yet now we see it two days in
    a row.
  * /var/log/syslog shows no other errors concurrent with the failures.
  * The specific sequence of events logged on the two different servers
    is identical.

That is, blaming the hardware requires accepting a level of coincidence
that strains my credulity.

So, I'm more inclined to blame software.  However, the kernel hasn't changed,
and I assume (though I don't know) that the relevant userland code has been
stable, too.  Which, I suppose, leaves us nowhere.

Is there something we can do to rule in or out the possibility of a kernel
problem?
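
One idea, purely as a sketch (the builders don't do this today): right after a failure, run a non-destructive e2fsck pass against the affected loop device (e.g. /dev/loop2 above) once the image is unmounted but before it's deleted. A clean pass would point toward a kernel-side false positive; real on-disk damage would point back at whatever wrote the image.

import subprocess


def fsck_loop_device(device):
    """Run a read-only e2fsck pass; returns True if no errors were found."""
    # -n: open the filesystem read-only and answer "no" to every prompt
    # -f: force a check even if the filesystem is marked clean
    # (run only after the filesystem has been unmounted)
    return subprocess.call(['e2fsck', '-n', '-f', device]) == 0

e.g. fsck_loop_device('/dev/loop2')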

Cc: shuqianz@chromium.org
Cc: smbar...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Another instance of this, now on the pre-cq: https://ci.chromium.org/p/chromeos/builds/b8950133071756921872#
Cc: lakitu-dev@google.com cra...@chromium.org
Issue 828054 has been merged into this issue.
And another one: https://uberchromegw.corp.google.com/i/chromeos/builders/capri-paladin/builds/1209

IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs/var/cache/edb/vdb_metadata.pickle.20747'
INFO    : Unmounting image from /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/stateful and /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs
INFO    : Unmounting temporary rootfs /mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs//build/rootfs.
rmdir: failed to remove '/mnt/host/source/src/build/images/capri/R67-10567.0.0-rc3/rootfs//build': Directory not empty
And again: https://uberchromegw.corp.google.com/i/chromeos/builders/peppy-paladin/builds/18387

IOError: [Errno 30] Read-only file system: '/mnt/host/source/src/build/images/peppy/R67-10569.0.0-rc3/rootfs/var/cache/edb/vdb_metadata.pickle.29325'
