ext4 migration failing on samus
Issue description

I tried the ext4 migration on my samus just now (61.0.3142.0 / 9693.0.0-17.06.28) and ran into two serious issues and one less critical one:

* Migration failed after about 20 seconds, telling me that my data had been destroyed and offering to let me report an issue or restart. This is absolutely unacceptable: 1) Migration was hardened to skip damaged files - why did it fail? 2) If migration is impossible, disable ARC for that user, or wipe ARC, but do not wipe the entire profile, destroying all local files without warning. I filed a feedback report, including my google.com e-mail in it for easy lookup. Unfortunately, I do not have access to feedback reports, so I cannot reference it from here myself.

* After reboot, I see my user name with an exclamation mark. When I click on the name, I am taken through the online auth flow, which always ends with "Sorry, your password could not be verified." Nothing tells the user how to get out of this stuck state (you have to know that there is a context menu which allows you to delete the user).

* The "You've got a critical update. Sign in to get started." banner shows even though we already know that - for whatever reason - this user cannot be migrated.

Assigning to Dan for prioritization. + Cyrus and Naveen for visibility - this is an example of how migration can go sour.
Jul 10 2017
Will take a deeper look tomorrow morning, but something looks badly broken on that file system. Lots of critical kernel errors, with the result being that nothing short of a format was ever going to fix that device.

Specifically, the problem (from the point of view of the migration) is that we had files that we couldn't delete (probably due to the kernel errors). If we can't delete the files, we can't finish the migration (space issues aside, the system isn't designed to handle two profiles for the same user outside of migration). Furthermore, since we can't remove the files, we also can't remove the parent cryptohome, so normal user removal doesn't work either.

2017-07-10T12:12:04.504569+01:00 CRIT kernel: [27713.577234] EXT4-fs error (device sda1): ext4_iget:4044: inode #264168: comm MountThread: bad extra_isize (65663 != 256)
2017-07-10T12:12:04.539554+01:00 CRIT kernel: [27713.612681] EXT4-fs error (device sda1): ext4_lookup:1593: inode #262180: comm MountThread: deleted inode referenced: 262727
2017-07-10T12:12:04.545558+01:00 CRIT kernel: [27713.618426] EXT4-fs error (device sda1): ext4_iget:4229: inode #264340: comm MountThread: bogus i_mode (177774)
2017-07-10T12:12:04.548557+01:00 CRIT kernel: [27713.621118] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275486
2017-07-10T12:12:04.548570+01:00 CRIT kernel: [27713.621610] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275507
2017-07-10T12:12:04.549550+01:00 CRIT kernel: [27713.622642] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 266312
2017-07-10T12:12:04.549556+01:00 CRIT kernel: [27713.622956] EXT4-fs error (device sda1): ext4_iget:4044: inode #264166: comm MountThread: bad extra_isize (65663 != 256)
2017-07-10T12:12:04.550549+01:00 CRIT kernel: [27713.623338] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275312
2017-07-10T12:12:04.550554+01:00 CRIT kernel: [27713.623909] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 264443
2017-07-10T12:12:04.551547+01:00 CRIT kernel: [27713.624235] EXT4-fs error (device sda1): ext4_lookup:1593: inode #918297: comm MountThread: deleted inode referenced: 275449
2017-07-10T12:12:05.346457+01:00 INFO cryptohomed[1699]: Number of files: 36823
2017-07-10T12:12:05.346509+01:00 INFO cryptohomed[1699]: Number of directories: 2611
2017-07-10T12:12:05.346517+01:00 INFO cryptohomed[1699]: Number of symlinks: 12
2017-07-10T12:12:05.346597+01:00 INFO cryptohomed[1699]: Preparation took 2437 ms.
2017-07-10T12:12:05.388809+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000032
2017-07-10T12:12:05.388929+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000009
2017-07-10T12:12:05.389058+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000021
2017-07-10T12:12:05.389081+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000089
2017-07-10T12:12:05.389178+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000085
2017-07-10T12:12:05.389307+01:00 ERR cryptohomed[1699]: Unknown file type: user/local/00000046
2017-07-10T12:12:05.389633+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000032
2017-07-10T12:12:05.389797+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000009
2017-07-10T12:12:05.389983+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000032"
2017-07-10T12:12:05.390101+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000009"
2017-07-10T12:12:05.390428+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000021
2017-07-10T12:12:05.390696+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000021"
2017-07-10T12:12:05.392595+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000089
2017-07-10T12:12:05.392834+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000089"
2017-07-10T12:12:05.393636+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000085
2017-07-10T12:12:05.394011+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000085"
2017-07-10T12:12:05.394711+01:00 ERR cryptohomed[1699]: Failed to delete file user/local/00000046
2017-07-10T12:12:05.395125+01:00 ERR cryptohomed[1699]: Failed to migrate "user/local/00000046"
2017-07-10T12:12:05.404156+01:00 ERR cryptohomed[1699]: Migration Failed, aborting.
2017-07-10T12:12:06.246038+01:00 ERR cryptohomed[1699]: Failed to migrate.
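To make the failure mode concrete, here is a minimal sketch (hypothetical names, not cryptohomed's actual code) of the drain-then-replace invariant described above: every source entry must be copied and then deleted, so a single file that survives unlink() aborts the whole pass and also leaves the parent cryptohome non-empty, hence non-removable.

// Sketch only: illustrates why one undeletable file is fatal to migration.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

bool MigrateEntry(const fs::path& src, const fs::path& dst) {
  std::error_code ec;
  fs::copy(src, dst,
           fs::copy_options::recursive | fs::copy_options::overwrite_existing,
           ec);
  if (ec) return false;
  fs::remove_all(src, ec);  // the old home must be drained as we go...
  if (ec) {
    // ...but unlink() on an inode with a bogus i_mode / extra_isize fails,
    // so this entry can never leave the old home.
    std::cerr << "Failed to delete file " << src << "\n";
    return false;
  }
  return true;
}

bool MigrateHome(const fs::path& old_home, const fs::path& new_home) {
  for (const auto& entry : fs::directory_iterator(old_home)) {
    if (!MigrateEntry(entry.path(), new_home / entry.path().filename())) {
      std::cerr << "Migration Failed, aborting.\n";
      return false;  // old_home stays non-empty, so removing the parent
                     // cryptohome (and thus the user) fails as well.
    }
  }
  return true;
}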
Jul 10 2017
I realized I observed this on M61 - the issue *may* not affect M60.
Jul 11 2017
Looking at the dmesg output, we were getting a constant stream of kernel errors before migration was ever attempted, so I'm fairly confident that the filesystem corruption wasn't caused by the migration. Adding Gwendal, who may have more insight into the kernel errors.
Jul 11 2017
Assigning to Junichi for prioritization as there's not much we can do about it at migration time. I did create a separate bug (crbug/740780) for fixing the UX so that users don't need to go through recovery themselves.
Jul 11 2017
Original Report ID: 68013820014
https://feedback.corp.google.com/product/208/neutron?lView=rd&lRSort=1&lROrder=2&lRFilter=1&lReportSearch=bartfab&lReport=68013820014

From the event log:
3 | 2017-07-06 11:42:33 | Last post code in previous boot | 0xef | Resume Failure
4 | 2017-07-06 11:42:33 | SUS Power Fail

Bartosz, did you experience a crash, or did you have to do a hard reset around that time? Was the migration initiated at that time? I don't have dmesg since boot (only starting at 26688s). The filesystem keeps trying to access some inodes, but these are corrupted:

...
EXT4-fs error (device sda1): ext4_lookup:1593: inode #266765: comm TaskSchedulerFo: deleted inode referenced: 264442
EXT4-fs error (device sda1): ext4_iget:4044: inode #267602: comm TaskSchedulerFo: bad extra_isize (29638 != 256)
EXT4-fs error (device sda1): ext4_iget:4229: inode #264352: comm TaskSchedulerFo: bogus i_mode (177774)
EXT4-fs error (device sda1): ext4_iget:4044: inode #267601: comm TaskSchedulerFo: bad extra_isize (21712 != 256)
...

"21712 != 256" means extra_isize == 21712 - 128 = 21584, bigger than is allowed (~4K): the inode is corrupted on the disk.

Did your coworker file a feedback report at #1? Given that migration relies a lot on xattrs, it would be interesting to fake corruption of inode extra_isize on a few inodes and see how the migration and the file system deal with them.
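For reference, a small self-check of the extra_isize arithmetic above, assuming the ext4 convention that the kernel prints 128 (the original ext2 inode size) + i_extra_isize on the left and the superblock's configured inode size on the right:

// Sketch of the ext4_iget sanity check behind "bad extra_isize (21712 != 256)".
#include <cstdint>
#include <iostream>

constexpr uint16_t kGoodOldInodeSize = 128;  // fixed part of every inode

bool ExtraIsizeFits(uint16_t extra_isize, uint16_t inode_size) {
  // The fixed part plus the extra area (which holds extended attributes,
  // among other things) must fit inside one on-disk inode.
  return kGoodOldInodeSize + extra_isize <= inode_size;
}

int main() {
  // 21712 - 128 = 21584 bytes of "extra" area is impossible in a 256-byte
  // inode (at most 256 - 128 = 128 bytes), so the inode is rejected.
  std::cout << ExtraIsizeFits(21584, 256) << "\n";  // 0: corrupted
  std::cout << ExtraIsizeFits(32, 256) << "\n";     // 1: a typical value
}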
Jul 11 2017
The device took two updates on 6th July.

* The first update and reboot went smoothly, leaving me with the familiar login screen.
* I received a second update shortly after. Reboot failed (I got a blank screen and an unresponsive machine), and I had to hard-reboot. I presume this is what you are seeing in the logs. After the forced reboot, I got M61 and the new login screen.

Migration was offered to me on 6th July, but I declined it and only tried yesterday, on 10th July. Unfortunately, we did not take any further bug reports during experimentation.

One aspect of this bug is that migration failed (or did not even start). Another is that even though I had toggled dev mode on and off again, and I got back to OOBE, some state persisted on disk - the corruption was still there, and I was unable to add myself. This, I think, is the biggest issue here: when the machine goes back to OOBE, we should make sure that the stateful file system has been recreated and any corruption on it is gone.
Jul 12 2017
Some bugs already filed:
- for the migration experience: crbug.com/740780
- for the stateful partition not being recreated based on enterprise policy: crbug.com/701292

I think there's something about cryptohome failing to delete that is not great, and the UI doesn't seem to be very useful. I think s_h_o_u is different, and failing to delete and recreate the directory shouldn't affect ? If that part is the problem, cryptohomed can try renaming the directory before removing / recreating it (see the sketch below). I'm inclined to mark this as a dup of 701292.
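A minimal sketch of that rename-before-remove idea (hypothetical helper, not an actual cryptohomed patch; the ".broken" suffix is an assumed naming convention): rename() only rewrites the parent directory entry, so it succeeds even when entries inside the directory sit on corrupted inodes, and a clean home can be created immediately while deleting the old contents stays best-effort.

// Sketch only: recover a usable home even when its contents are undeletable.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

bool ResetHome(const fs::path& home) {
  std::error_code ec;
  const fs::path aside = home.string() + ".broken";
  fs::rename(home, aside, ec);  // atomic; only touches the parent dir entry
  if (ec) {
    std::cerr << "rename failed: " << ec.message() << "\n";
    return false;
  }
  fs::create_directory(home, ec);  // fresh, empty home for the user
  if (ec) return false;
  fs::remove_all(aside, ec);  // best effort; may fail on bad inodes,
                              // in which case a later cleanup can retry
  return true;  // the user gets a working home either way
}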