
Issue 740503

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 701292
Owner:
Last visit > 30 days ago
Closed: Jul 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocked on:
issue 740780
issue 701292




ext4 migration failing on samus

Project Member Reported by bartfab@chromium.org, Jul 10 2017

Issue description

I tried to go through the ext4 migration on my samus just now (61.0.3142.0 / 9693.0.0-17.06.28) and ran into two serious issues and one less critical one:

* Migration failed after about 20 seconds, telling me that my data had been destroyed and giving me the option to report an issue or restart. This is absolutely unacceptable:
1) Migration was hardened to skip damaged files. Why did it fail?
2) If migration is impossible, disable ARC for that user, or wipe ARC, but do not wipe the entire profile, destroying all local files without warning.
I filed a feedback report, including my google.com e-mail in it for easy lookup. Unfortunately, I do not have access to feedback reports, so I cannot reference it from here myself.

* After reboot, I see my user name with an exclamation mark. When I click on the name, I am taken through the online auth flow, which always ends with "Sorry, your password could not be verified." There is nothing telling the user how to get out of this stuck situation (you have to know that there is a context menu which will allow you to delete the user).

* The "You've got a ciritical update. Sign in to get started." banner shows even though we know already that - for whatever reason - this user cannot be migrated.

Assigning to Dan for prioritization.

+ Cyrus and Naveen for visibility - this is an example of how migration can go sour.
 
Labels: -Pri-1 Pri-0
Bumping to P0 as this is worse than I thought.

This is a corp device, with forced re-enrollment on and dev mode off. After I had wiped the device (Esc-Reload-Power, Ctrl-D, then return to verified mode immediately), I got the usual OOBE flow. I was asked to re-enroll the device and my google.com account worked for this (showing that the account and my two-factor are OK).

When I tried adding myself as a user then, I got "Sorry, your password could not be verified." after going through GAIA login. This sounds like there was an auth error, but there was not: auth succeeded, but the system was unable to add me as a user, I suspect because there was still a corrupt cryptohome lying around.

Another Googler was able to log in and add himself without issues.

I now recovered the device from a USB key, which does a more thorough stateful wipe, and was finally able to re-add myself as a user.

Comment 2 by dspaid@chromium.org, Jul 10 2017

Will take a deeper look tomorrow morning, but something looks badly broken on that file system.  Lots of critical kernel errors, with the result being that nothing short of a format was ever going to fix that device.  Specifically, the problem (from the point of view of the migration) is that we had files that we couldn't delete (probably due to the kernel errors).  If we can't delete the files, we can't finish the migration (space issues aside, the system isn't designed to handle two profiles for the same user outside of migration).  Furthermore, since we can't remove the files, we also can't remove the parent cryptohome, so normal user removal doesn't work either.
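
For context, a hypothetical sketch (not the actual cryptohomed code; all names are invented) of why an undeletable source file has to abort the whole migration:

// Hypothetical sketch, not the actual cryptohomed code: shows why an
// undeletable source file is fatal to the migration described above.
#include <filesystem>
#include <iostream>
#include <system_error>

namespace fs = std::filesystem;

// Copy one file into the new (ext4 dircrypto) home, then remove the original.
// Returns false if either step fails; the caller then has to abort the whole
// migration, because the system cannot keep two copies of the same profile.
bool MigrateFile(const fs::path& src, const fs::path& dst) {
  std::error_code ec;
  fs::copy_file(src, dst, fs::copy_options::overwrite_existing, ec);
  if (ec) {
    std::cerr << "Failed to copy " << src << ": " << ec.message() << '\n';
    return false;
  }
  fs::remove(src, ec);
  if (ec) {
    // This is the failure in the log below: ext4 corruption makes the unlink
    // fail, so neither the migration nor a later normal user removal can proceed.
    std::cerr << "Failed to delete file " << src << ": " << ec.message() << '\n';
    return false;
  }
  return true;
}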

2017-07-10T12:12:04.504569+01:00 CRIT kernel: [27713.577234] EXT4-fs error (device sda1): ext4_iget:4044: inode #264168: comm MountThread: bad extra_isize (65663 != 256)
2017-07-10T12:12:04.539554+01:00 CRIT kernel: [27713.612681] EXT4-fs error (device sda1): ext4_lookup:1593: inode #262180: comm MountThread: deleted inode referenced: 262727
2017-07-10T12:12:04.545558+01:00 CRIT kernel: [27713.618426] EXT4-fs error (device sda1): ext4_iget:4229: inode #264340: comm MountThread: bogus i_mode (177774)
2017-07-10T12:12:04.548557+01:00 CRIT kernel: [27713.621118] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275486
2017-07-10T12:12:04.548570+01:00 CRIT kernel: [27713.621610] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275507
2017-07-10T12:12:04.549550+01:00 CRIT kernel: [27713.622642] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 266312
2017-07-10T12:12:04.549556+01:00 CRIT kernel: [27713.622956] EXT4-fs error (device sda1): ext4_iget:4044: inode #264166: comm MountThread: bad extra_isize (65663 != 256)
2017-07-10T12:12:04.550549+01:00 CRIT kernel: [27713.623338] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275312
2017-07-10T12:12:04.550554+01:00 CRIT kernel: [27713.623909] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 264443
2017-07-10T12:12:04.551547+01:00 CRIT kernel: [27713.624235] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275449
2017-07-10T12:12:05.346457+01:00 INFO cryptohomed[1699]: Number of files: 36823
2017-07-10T12:12:05.346509+01:00 INFO cryptohomed[1699]: Number of directories: 2611
2017-07-10T12:12:05.346517+01:00 INFO cryptohomed[1699]: Number of symlinks: 12
2017-07-10T12:12:05.346597+01:00 INFO cryptohomed[1699]: Preparation took 2437 ms.
2017-07-10T12:12:05.388809+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000032
2017-07-10T12:12:05.388929+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000009
2017-07-10T12:12:05.389058+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000021
2017-07-10T12:12:05.389081+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000089
2017-07-10T12:12:05.389178+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000085
2017-07-10T12:12:05.389307+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000046
2017-07-10T12:12:05.389633+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000032
2017-07-10T12:12:05.389797+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000009
2017-07-10T12:12:05.389983+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000032"
2017-07-10T12:12:05.390101+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000009"
2017-07-10T12:12:05.390428+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000021
2017-07-10T12:12:05.390696+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000021"
2017-07-10T12:12:05.392595+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000089
2017-07-10T12:12:05.392834+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000089"
2017-07-10T12:12:05.393636+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000085
2017-07-10T12:12:05.394011+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000085"
2017-07-10T12:12:05.394711+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000046
2017-07-10T12:12:05.395125+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000046"
2017-07-10T12:12:05.404156+01:00 ERR cryptohomed[1699]: Migration Failed, aborting.
2017-07-10T12:12:06.246038+01:00 ERR cryptohomed[1699]: Failed to migrate.
Labels: -M-60 M-61
I realized I observed this on M61 - the issue *may* not be affecting M60.

Comment 4 by dspaid@chromium.org, Jul 11 2017

Cc: gwendal@chromium.org
Looking at the dmesg output, we were getting a constant stream of kernel errors before migration was ever attempted, so I'm fairly confident that the filesystem corruption wasn't caused by the migration.  Adding Gwendal, who may have more insight into the kernel errors.

Comment 5 by dspaid@chromium.org, Jul 11 2017

Cc: uekawa@chromium.org hashimoto@chromium.org fukino@chromium.org kinaba@chromium.org

Comment 6 by dspaid@chromium.org, Jul 11 2017

Owner: uekawa@chromium.org
Assigning to Junichi for prioritization as there's not much we can do about it at migration time.  I did create a separate bug (crbug/740780) for fixing the UX so that users don't need to go through recovery themselves.
Original Report ID: 68013820014
https://feedback.corp.google.com/product/208/neutron?lView=rd&lRSort=1&lROrder=2&lRFilter=1&lReportSearch=bartfab&lReport=68013820014

From the event log:
3 | 2017-07-06 11:42:33 | Last post code in previous boot | 0xef | Resume Failure
4 | 2017-07-06 11:42:33 | SUS Power Fail

Bartosz,
Did you experience a crash, did you have to do hard reset around that time?
Was the migration initiated at that time?

I don't have dmesg since boot (only starting at 26688s). 
The filesystem keeps trying to access some inodes, but these are corrupted:

...
EXT4-fs error (device sda1): ext4_lookup:1593: inode #266765: comm TaskSchedulerFo: deleted inode referenced: 264442
EXT4-fs error (device sda1): ext4_iget:4044: inode #267602: comm TaskSchedulerFo: bad extra_isize (29638 != 256)
EXT4-fs error (device sda1): ext4_iget:4229: inode #264352: comm TaskSchedulerFo: bogus i_mode (177774)
EXT4-fs error (device sda1): ext4_iget:4044: inode #267601: comm TaskSchedulerFo: bad extra_isize (21712 != 256)
...

21712 != 256: the printed value includes the 128-byte base inode, so the stored extra_isize is 21712 - 128 = 21584, far bigger than is allowed (at most the inode size minus the base, i.e. 128 here, and never more than ~4K): the on-disk inode is corrupted.
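
For illustration, a minimal sketch of that bound (assuming a 256-byte on-disk inode and the 128-byte legacy base; the real check lives in the kernel's ext4_iget(), this only shows the arithmetic):

// Simplified sketch of the sanity check behind the "bad extra_isize" errors
// above, assuming a 256-byte on-disk inode and the 128-byte legacy base.
#include <cstdint>
#include <iostream>

constexpr uint32_t kGoodOldInodeSize = 128;  // legacy ext2/3 inode size
constexpr uint32_t kInodeSize = 256;         // s_inode_size on this filesystem

bool ExtraIsizeIsSane(uint32_t extra_isize) {
  // The extra fields must fit between the legacy base and the end of the
  // on-disk inode, so extra_isize can be at most 256 - 128 = 128 here.
  return kGoodOldInodeSize + extra_isize <= kInodeSize;
}

int main() {
  std::cout << ExtraIsizeIsSane(32) << '\n';     // 1: a plausible value
  std::cout << ExtraIsizeIsSane(21584) << '\n';  // 0: the corrupted value above
}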

Did your coworker from #1 file a feedback report?

Given that migration relies a lot on xattrs, it would be interesting to fake a corruption of inode extra_isize on a few inodes and see how the migration and the filesystem deal with them.
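
One possible way to stage that experiment, sketched with invented names and assumed details (the 16-bit i_extra_isize field sits right after the 128-byte legacy inode base, the inode's byte offset could come from debugfs's imap command, and a little-endian host is assumed), operating only on a disposable image copy, never a live device:

// Rough fault-injection sketch for the experiment suggested above (names and
// offsets are assumptions): overwrite the on-disk i_extra_isize of one inode
// in a scratch ext4 image with a bogus value.
#include <cstdint>
#include <fstream>
#include <string>

// |inode_offset| is the byte offset of the target inode inside the image,
// e.g. derived from "debugfs -R 'imap <ino>' image".
bool CorruptExtraIsize(const std::string& image_path, uint64_t inode_offset,
                       uint16_t bogus_value) {
  std::fstream img(image_path, std::ios::in | std::ios::out | std::ios::binary);
  if (!img) return false;
  // i_extra_isize follows the 128-byte legacy base of the on-disk inode.
  img.seekp(static_cast<std::streamoff>(inode_offset + 128));
  // Assumes a little-endian host, matching ext4's on-disk byte order.
  img.write(reinterpret_cast<const char*>(&bogus_value), sizeof(bogus_value));
  return static_cast<bool>(img);
}
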
The device took two updates on 6th July.
* The first update and reboot went smoothly, leaving me with the familiar login screen.
* I received a second update shortly after. The reboot failed (I got a blank screen and an unresponsive machine). I had to hard reboot. I presume this is what you are seeing in the logs. After the forced reboot, I got M61 and the new login screen.

Migration was offered to me on 6th July, but I declined it and only tried yesterday, on 10th July.

Unfortunately, we did not capture any further bug reports during experimentation.

One aspect of this bug is that migration failed (or did not even start). Another is that even though I had toggled dev mode on and off again, and I got back to OOBE, some state persisted on disk - the corruption was still there, and I was unable to add myself. This, I think, is the biggest issue here: when the machine goes back to OOBE, we should make sure that the stateful file system has been recreated and any corruption on there is gone.

Comment 9 by uekawa@google.com, Jul 12 2017

Blockedon: 701292 740780
From observing some bugs already filed:
 - for the migration experience: crbug.com/740780
 - for the stateful partition not being recreated based on enterprise policy: crbug.com/701292


I think there's something not great about cryptohome failing to delete, and the UI doesn't seem to be very useful.
I think s_h_o_u is different, and failing to delete the directory and recreating it shouldn't affect it?
If that part is the problem, cryptohomed could try renaming the directory before removing / recreating it.
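
A rough sketch of that rename-before-remove idea (names invented, not actual cryptohomed code):

// Hypothetical sketch: move the damaged cryptohome out of the way so a fresh
// one can be created immediately, then clean up the renamed copy best-effort.
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

bool ResetUserHome(const fs::path& home) {
  std::error_code ec;
  const fs::path trash = home.string() + ".trash";
  // Rename is a single metadata operation on the parent directory, so it can
  // succeed even when individual files inside the directory cannot be deleted.
  fs::rename(home, trash, ec);
  if (ec) {
    std::cerr << "Failed to move aside " << home << ": " << ec.message() << '\n';
    return false;
  }
  fs::create_directories(home, ec);
  if (ec) return false;
  // Best-effort cleanup; failures here no longer block re-adding the user.
  fs::remove_all(trash, ec);
  return true;
}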


I'm inclined to mark this as a dup of issue 701292.
Mergedinto: 701292
Status: Duplicate (was: Assigned)
I agree with your analysis.
