Run e2fsck when "needs_recovery" filesystem feature is set
Issue description
Looking at crbug/760007, the log debug-logs_20170829-143125 contains "warning: mounting fs with errors, running e2fsck is recommended". The filesystem keeps being mounted with only light correction (ext4_orphan_cleanup()) while more damage may linger. We could test for the "needs_recovery" filesystem flag to trigger e2fsck before mounting stateful and /var. Timing "dumpe2fs -h /dev/mmcblk0p1 | grep features" shows the check costs around 20 ms.
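A minimal sketch of the proposed check, assuming plain sh and an example device path (real code would derive the stateful partition from rootdev rather than hardcoding it):

#!/bin/sh
# Hypothetical pre-mount hook: if the journal still needs replay, run a
# preening e2fsck before mounting stateful and /var.
STATE_DEV=/dev/mmcblk0p1  # example; derive from rootdev in practice
if dumpe2fs -h "${STATE_DEV}" 2>/dev/null \
    | grep -q '^Filesystem features:.*needs_recovery'; then
  e2fsck -p "${STATE_DEV}"
fi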
|
|
Sep 18 2017
Anecdotally, e2fsck sometimes wipes out files that were previously still showing up on the corrupted but working filesystem. We will probably need to be very careful about what we do here to ensure that e2fsck doesn't cause more harm than good -- like more device unenrollments or loss of user data.
Sep 19 2017
A few thoughts:
- Is it possible to have a UMA stat on how often this happens? (I was thinking grepping the logs once a day during log rotation would be less performance-intrusive as a first pass.)
- Doesn't just running fsck usually do nothing without the "needs_recovery" flag unless -f is given? (I wonder whether dumpe2fs or fsck is faster; one way to compare is sketched below.)
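One way to answer the timing question empirically; the device path is an example and no particular result is implied:

# Probe 1: read only the superblock header.
time sh -c 'dumpe2fs -h /dev/mmcblk0p1 2>/dev/null | grep "Filesystem features"'
# Probe 2: e2fsck without -f exits almost immediately when the
# filesystem is marked clean; -n keeps it strictly read-only.
time e2fsck -n /dev/mmcblk0p1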
Sep 19 2017
Yes, I think we should make UMA stats for several aspects of filesystem state:
1) count recoveries on stateful and encstateful
2) count "mounting fs with errors" for stateful and encstateful
3) count errors since last fsck from the ext4 superblock
We could break that off into a separate bug if we want; a sketch of collecting these follows below.
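A rough sketch of how (1) and (3) might be collected, assuming the stock metrics_client helper; the EncStateful_RecoveryNeeded histogram name matches the one that shows up later in this bug, and "FS Error count" is the label dumpe2fs prints once errors have been recorded:

#!/bin/sh
DEV=/dev/mapper/encstateful  # example device
HDR="$(dumpe2fs -h "${DEV}" 2>/dev/null)"
# (1) recovery needed: report a boolean sample as a two-bucket enum.
if echo "${HDR}" | grep -q '^Filesystem features:.*needs_recovery'; then
  metrics_client -e Platform.FileSystem.EncStateful_RecoveryNeeded 1 2
fi
# (3) errors since last fsck, from the superblock error counter.
ERRORS="$(echo "${HDR}" | sed -n 's/^FS Error count: *//p')"
[ -n "${ERRORS}" ] && logger -t fs-metrics "${DEV}: ${ERRORS} errors recorded"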
Sep 19 2017
filed 766452
Nov 7 2017
The metrics are in, and I see 1-2% of boots requiring fsck. I'm uncomfortable just letting those boots proceed with the file system in such a state.
Nov 7 2017
About 40% needing recovery sounds wrong too; why are we doing unclean shutdowns that often?
Nov 7 2017
probably because of issue 773474
Nov 8 2017
samus, kevin, caroline, and cyan are at around the same rate; minnie is higher. Issue 773474 does not really explain this.
Nov 8 2017
Hmm, I see:
- caroline (3.18 kernel), on beta channel for the last 7 days, is much lower at about 7%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CCAROLINE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
- kevin (4.4 kernel) is at 15%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CKEVIN%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
- minnie (3.14 kernel) is at 34%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CMINNIE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
- link (3.8 kernel) is much higher at 85%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CLINK%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
- blaze (3.10 kernel) is at 89%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CBLAZE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
So it does seem like issue 773474.
Nov 8 2017
Yes, that's exactly what I would expect: the older kernels show the problem, and then it quickly resolves once we ship the patches.
Nov 8 2017
With a timeline view, you can see the machines getting cleaner and cleaner once the patches are present: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=a9b43a9c86eab5dd845c9d66ea3790b1
Nov 9 2017
Nice, let's see what it looks like in a few weeks' time.
Nov 21 2017
Looking at a system with ext4 errors:
EXT4-fs error (device dm-1): ext4_lookup:1593: inode #7971: comm TaskSchedulerFo: deleted inode referenced: 8013
[more enclosed]
The system - guado - had an unexpected power loss between 2017-11-09 10:35:35 and 2017-11-14 11:46:05, causing disk corruption.
Modified chromeos_shutdown to force "mount-encrypted umount" to do a 'fsck -a':
MOUNT_ENCRYPTED_FSCK=1 mount-encrypted umount ...
but that was not enough:
[pid:5198] Starting.
[pid:5198] Unmounting /var.
[pid:5198] Unmounting /home/chronos.
[pid:5198] Unmounting /mnt/stateful_partition/encrypted.
fsck from util-linux 2.28.2
/dev/mapper/encstateful contains a file system with errors, check forced.
/dev/mapper/encstateful: Inode 24209 seems to have inline data but extent flag is set.
/dev/mapper/encstateful: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
ERROR[pid:5198] shutdown (../../../cryptohome-0.0.1/platform2/cryptohome/mount-encrypted.c, 1278): 'fsck -a /dev/mapper/encstateful' failed: 1024
[pid:5198] Removing /dev/mapper/encstateful.
[pid:5198] Unlooping /mnt/stateful_partition/encrypted.block (named encstateful).
In Min's guado case, we need an fsck -y (and potentially lose data).
Modified fsck by interposing a script:
#!/bin/sh
# Interposer installed in place of fsck: log the call, rewrite -a (preen)
# into -y (answer yes to every fix), then invoke the real fsck, which has
# been kept as /sbin/fsck_true.
echo "fsck with options: $@"
args="$@"
new_args="$(echo "${args}" | sed -e 's/-a/-y/')"
/sbin/fsck_true ${new_args}
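Presumably the wrapper was installed roughly like this (reconstructed from the script's reference to /sbin/fsck_true; the comment doesn't show this step, and the wrapper filename is hypothetical):

# Keep the real fsck under a new name, then drop the wrapper in its place.
mv /sbin/fsck /sbin/fsck_true
install -m 0755 fsck-wrapper.sh /sbin/fsck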
And dm-1 was fixed. [fsck output enclosed]
There is no notification when a filesystem is corrupted, and we could have continued using the device for a long time.
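A sketch of one way to surface this, assuming a periodic job is acceptable; the superblock's error flag persists until fsck clears it, so the "Filesystem state" field keeps reporting past errors while the volume stays mounted:

#!/bin/sh
# Hypothetical periodic health probe for the encrypted stateful volume.
DEV=/dev/mapper/encstateful
if tune2fs -l "${DEV}" 2>/dev/null \
    | grep -q '^Filesystem state:.*with errors'; then
  logger -t fs-health "${DEV} has recorded errors; e2fsck recommended"
fi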