New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 761911 link

Starred by 5 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on: View detail
issue 766347
issue 773590
issue 766452

Blocking:
issue 779567
issue 787970



Sign in to add a comment

Run e2fsck when "needs_recovery" Filesystem feature is set

Project Member Reported by gwendal@chromium.org, Sep 5 2017

Issue description

Looking at crbug/760007, log debug-logs_20170829-143125 contains
"warning: mounting fs with errors, running e2fsck is recommended". 
It keeps being mounted with only light correction (ext4_orphan_cleanup()) while more damage may linger.

We can test on filesystem flags "need_recovery" to trigger e2fsck before mount stateful and /var.

Using time "dumpe2fs -h /dev/mmcblk0p1 | grep features", the testing cost is around 20ms.

 
20ms to test a bit seems high to me.

how does this interact with the kernel's autorecovery features ?  is it the same flag ?  or does the kernel only support journal replays and that isn't the same thing as "need_recovery" ?

i guess the kernel recovery isn't online either ... it blocks while everything is processed and only then returns to userspace (right?).  so having a blocking recovery in userspace isn't worse than the status quo today.
Blockedon: 766347
Cc: sonnyrao@chromium.org
Anecdotally, e2fsck sometimes wipes out files that were previously still showing up on the corrupted but working filesystem.  

We will probably need to be very careful about what we do here to ensure that e2fsck doesn't cause more harm than good -- like more device unenrollments or loss of user data.

Comment 4 by uekawa@google.com, Sep 19 2017

few thoughts:

 - Is it possible to have a UMA stat on how often this happens? 
(I was thinking grepping for logs in log rotation once a day is less performance-intrusive as a first pass)

 - doesn't just running fsck usually not do anything without "need_recovery" flag unless -f is I given. (I wonder whether dumpe2fs or fsck is faster)
yes I think we should make UMA stats for several aspects of filesystem state

1) count recoveries on stateful and encstateful
2) count "mounting fs with errors" for stateful and encstateful
3) count errors since last fsck from ext4 superblock

We could break that off into a separate bug if we want

Comment 6 by uekawa@chromium.org, Sep 19 2017

Blockedon: 766452

Comment 7 by uekawa@chromium.org, Sep 19 2017

filed 766452

Comment 8 by uekawa@google.com, Oct 31 2017

Blocking: 779567

Comment 9 by uekawa@google.com, Nov 7 2017

The metrics are in, and I see 1-2% of boots requiring fsck.
I'm uncomfortable about just letting those run with such file system state.

Comment 10 by uekawa@google.com, Nov 7 2017

about 40% needing recovery sounds wrong too; why are we doing unclean shutdown that often?
probably because of  issue 773474 
Cc: gwendal@chromium.org

Comment 13 by uekawa@google.com, Nov 8 2017

samus and kevin and caroline and cyan are at around the same rate
minnie is higher.
 issue 773474  does not really explain this.




Hmm I see caroline - 3.18 kernel - on beta channel for the last 7 days is much lower at about 7%

https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CCAROLINE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

Kevin - 4.4 kernel - is at 15%
https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CKEVIN%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

Minnie - 3.14 kernel - is at 34%
https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CMINNIE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

Link - 3.8 kernel -  is much higher at 85%
https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CLINK%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

Blaze  - 3.10 kernel - is at 89%
https://uma.googleplex.com/p/chrome/histograms/?endDate=20171106&dayCount=7&histograms=Platform.FileSystem.EncStateful_RecoveryNeeded&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C3%2Cshort_hw_class%2Ccnt%2CBLAZE%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

so it does seem like 773474
Cc: kerrnel@chromium.org
Yes, that's exactly what I would expect4, for the older kernels to show the problem and then quickly resolve once we ship the patches.
With a timeline, you can see the machines are getting cleaner and cleaner when patches are present:
https://uma.googleplex.com/p/chrome/timeline_v2/?sid=a9b43a9c86eab5dd845c9d66ea3790b1

Comment 18 by uekawa@google.com, Nov 9 2017

nice, let's see what it looks like in a few weeks time.

Blockedon: 773590
Cc: mzhuo@chromium.org
Looking at a system with ext4 errors
EXT4-fs error (device dm-1): ext4_lookup:1593: inode #7971: comm TaskSchedulerFo: deleted inode referenced: 8013
[more enclosed]

The system - guado - had an unexpected power loss between 2017-11-09 10:35:35 and 2017-11-14 11:46:05, causing disk corruption.

Modified chromeos_shutdown to force "mount-encrypted umount" to a do a 'fsck -a', 

MOUNT_ENCRYPTED_FSCK=1 mount-encrypted umount ...

but that was not enough:

[pid:5198] Starting.
[pid:5198] Unmounting /var.
[pid:5198] Unmounting /home/chronos.
[pid:5198] Unmounting /mnt/stateful_partition/encrypted.
fsck from util-linux 2.28.2
/dev/mapper/encstateful contains a file system with errors, check forced.
/dev/mapper/encstateful: Inode 24209 seems to have inline data but extent flag is set.


/dev/mapper/encstateful: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
ERROR[pid:5198] shutdown (../../../cryptohome-0.0.1/platform2/cryptohome/mount-encrypted.c, 1278): 'fsck -a /dev/mapper/encstateful' failed: 1024
[pid:5198] Removing /dev/mapper/encstateful.
[pid:5198] Unlooping /mnt/stateful_partition/encrypted.block (named encstateful).


In Min's guado case, we need a fsck -y (and potentially lose data).

Modified fsck by interposing a script

#!/bin/sh
echo "fsck with option $@"
args="$@"
new_args="$(echo "${args}" | sed -e 's/-a/-y/')"
/sbin/fsck_true ${new_args}

And dm-1 was fixed. [fsck output enclosed]

There are no notification when a filesystem is corrupted, and we could have continued using the device for a long time. 

dmesg.txt
71.0 KB View Download
umount-encrypted.log
14.9 KB View Download
Blocking: 787970

Sign in to add a comment