
Issue 832204


Issue metadata

Status: Archived
Owner: ----
Closed: Dec 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




terra: lost enrollment after ext4 corruption

Project Member Reported by apronin@chromium.org, Apr 12 2018

Issue description

Device lost enrollment (original report by marcore@chromium.org in  https://crbug.com/810716#c45 ): 

case: 15132165
model: Asus Chromebook C202 
version: 63.0.3239.140 / 10032.86.0
customer info & serial#: https://drive.google.com/open?id=1j-nSYUYxGvHNesiUM4D1pstSs-qscdkK0ZW7-DlNatQ
debug logs:  https://drive.google.com/open?id=1ZenXe2KZVn9qJB4AdJgJwXgVE2uDQ6h8
but in the log I have:
messages.5:2018-02-20T13:04:48.022483-08:00 INFO session_manager[1138]: [INFO:policy_key.cc(51)] No policy key on disk at /home/root/f394d6c960ed35d8d736e60486726787d1936532/session_manager/policy/key

chrome/chrome_20180227-124455:[1162:1162:0227/124457.038302:ERROR:device_cloud_policy_store_chromeos.cc(233)] Device policy read on enrolled device yields no DM token! Status: 5.
Status 5 == STORE_VALIDATION_ERROR (policy validation failure),
instead of 1 == STORE_KEY_UNAVAILABLE (owner key not yet configured).
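For triage it helps to map the numeric status in that log line back to an enum name. Only the two values quoted in this report are filled in below; the rest of the enum lives in device_cloud_policy_store_chromeos.cc and is not reproduced here, and the helper itself is just an illustrative sketch, not anything Chrome ships:

```python
# Map the numeric Status from device_cloud_policy_store_chromeos.cc log
# lines back to a readable name. Only the two values discussed in this
# report are filled in; other values are deliberately left unknown.
STORE_STATUS = {
    1: "STORE_KEY_UNAVAILABLE",   # owner key not yet configured
    5: "STORE_VALIDATION_ERROR",  # policy validation failure
}

def describe_status(code: int) -> str:
    """Return the enum name for a logged status code, if known."""
    return STORE_STATUS.get(code, f"UNKNOWN({code})")

print(describe_status(5))  # STORE_VALIDATION_ERROR
```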
 
I do see policy errors:
2018-02-27T11:04:09.014158-08:00 ERR shill[1104]: [ERROR:device_policy_impl.cc(132)] Policy on disk could not be parsed!
...
2018-02-27T11:04:09.116916-08:00 ERR session_manager[1132]: [ERROR:device_policy_impl.cc(132)] Policy on disk could not be parsed!

which started after ext4 errors:
2018-02-27T11:04:07.222269-08:00 WARNING kernel: [    1.377391] EXT4-fs (mmcblk0p1): warning: mounting fs with errors, running e2fsck is recommended
...
2018-02-27T11:04:07.222303-08:00 WARNING kernel: [    1.697537] EXT4-fs warning (device dm-1): ext4_clear_journal_err:4396: Filesystem error recorded from previous mount: IO failure
2018-02-27T11:04:07.222306-08:00 WARNING kernel: [    1.697560] EXT4-fs warning (device dm-1): ext4_clear_journal_err:4397: Marking fs in need of filesystem check.
2018-02-27T11:04:07.222315-08:00 WARNING kernel: [    1.697815] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
...
2018-02-27T11:04:07.222454-08:00 CRIT kernel: [    2.066768] EXT4-fs error (device mmcblk0p1): ext4_mb_generate_buddy:757: group 0, block bitmap and bg descriptor inconsistent: 23092 vs 23130 free clusters
< ... and many more like that ...>
2018-02-27T11:04:08.635131-08:00 CRIT kernel: [    3.666727] EXT4-fs error (device mmcblk0p1): ext4_mb_generate_buddy:757: group 19, block bitmap and bg descriptor inconsistent: 32357 vs 32406 free clusters

which in turn started after multiple occurrences of:
2018-02-26T09:09:43.537022-08:00 CRIT kernel: [ 7697.550318] EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 12 at logical offset 21653 with max blocks 2 with error 117
2018-02-26T09:09:43.537051-08:00 CRIT kernel: [ 7697.550341] EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost

after errors during trim:

2018-02-21T12:39:15.045683-08:00 CRIT kernel: [  634.956912] EXT4-fs error (device mmcblk0p1): ext4_mb_generate_buddy:757: group 0, block bitmap and bg descriptor inconsistent: 23092 vs 23130 free clusters
...
2018-02-21T12:39:31.363693-08:00 WARNING kernel: [  651.274401] EXT4-fs warning (device mmcblk0p1): ext4_trim_all_free:5118: Error -117 loading buddy information for 3
2018-02-21T12:39:31.366003-08:00 NOTICE /usr/sbin/chromeos-trim[6551]: Error Trimming /mnt/stateful_partition at current_offset 384M: 
2018-02-21T12:39:31.368002-08:00 NOTICE /usr/sbin/chromeos-trim[6552]: /mnt/stateful_partition trim interrupted at 384M out of 10481M, 16 seconds so far
2018-02-21T12:39:31.374432-08:00 INFO periodic_scheduler[6554]: trim: job completed

So, this sounds like filesystem corruption.
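The timeline above was pieced together by pulling the EXT4-fs error/warning lines out of the rotated messages files. A minimal sketch of that kind of pass (the regex and sample lines are illustrative, not from any Chrome OS tool):

```python
import re

# Extract (timestamp, message) pairs for every EXT4-fs error/warning
# line in a syslog excerpt, oldest first, to reconstruct when the
# corruption started.
EXT4_LINE = re.compile(r"^(\S+)\s.*(EXT4-fs (?:error|warning)[^\n]*)")

def ext4_events(lines):
    """Yield (timestamp, message) for each EXT4-fs error/warning line."""
    for line in lines:
        m = EXT4_LINE.match(line)
        if m:
            yield m.group(1), m.group(2)

sample = [
    "2018-02-21T12:39:15.045683-08:00 CRIT kernel: [  634.956912] "
    "EXT4-fs error (device mmcblk0p1): ext4_mb_generate_buddy:757: ...",
    "2018-02-21T12:40:00.000000-08:00 INFO kernel: unrelated line",
]
events = list(ext4_events(sample))
print(events[0][0])  # 2018-02-21T12:39:15.045683-08:00
```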

The earliest errors I could find are these from 02/13; the logs before that are lost, so it's unclear how it all started:

2018-02-13T12:52:24.315721-08:00 CRIT kernel: [ 1347.773684] EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 12 at logical offset 40096 with max blocks 31 with error 117
2018-02-13T12:52:24.315751-08:00 CRIT kernel: [ 1347.773709] EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
2018-02-13T12:52:24.315755-08:00 CRIT kernel: [ 1347.773709] 
2018-02-13T12:52:36.608720-08:00 CRIT kernel: [ 1360.065715] EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 12 at logical offset 40192 with max blocks 31 with error 117
2018-02-13T12:52:36.608750-08:00 CRIT kernel: [ 1360.065740] EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
2018-02-13T12:52:36.608754-08:00 CRIT kernel: [ 1360.065740] 
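For reference, error 117 in those delayed-allocation messages is EUCLEAN ("Structure needs cleaning" in the kernel's errno headers), the code ext4 returns when it trips over inconsistent on-disk metadata. A quick way to confirm the mapping on a Linux box:

```python
import os

# On Linux, errno 117 is EUCLEAN; strerror renders it as
# "Structure needs cleaning", matching the "error 117" in the
# delayed-allocation log lines above.
print(117, os.strerror(117))
```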
If I read the update_engine logs right, on 02/20 the device was still running 9901.77.0, i.e. M62, and it already had those errors.
Could this be something like  issue 760007  that went unnoticed over several AUs until now?
Cc: sonnyrao@chromium.org
Or could it be  issue 766786 ?
Those fixes were ported to 3.18, but only in M64, while we have M62/M63 here.
re #4 - the stuff in 766786 was mostly preventative.  I don't know of anything actively causing corruption on our releases in the field.
Yeah, the cause of the corruption could be something similar to 760007.

Looks like the problems started after reboot on 02/27 (it was still on M-62 then).

That's where we first see

2018-02-27T11:04:14.033109-08:00 WARNING cryptohomed[1493]: Could not load the device policy file.

right after the filesystem recovery on boot:

2018-02-27T11:04:07.222303-08:00 WARNING kernel: [    1.697537] EXT4-fs warning (device dm-1): ext4_clear_journal_err:4396: Filesystem error recorded from previous mount: IO failure
2018-02-27T11:04:07.222306-08:00 WARNING kernel: [    1.697560] EXT4-fs warning (device dm-1): ext4_clear_journal_err:4397: Marking fs in need of filesystem check.
2018-02-27T11:04:07.222315-08:00 WARNING kernel: [    1.697815] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
2018-02-27T11:04:07.222319-08:00 INFO kernel: [    1.697978] EXT4-fs (dm-1): recovery complete

On the previous reboot on 02/20 there are already ext4 errors, but no ext4 recovery on boot, and no complaints from cryptohomed.
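The per-boot comparison above (ext4 errors on 02/20 with no journal recovery, versus recovery plus policy failures on 02/27) can be made mechanical by splitting the syslog at kernel-uptime resets. A rough sketch; the reset heuristic is an assumption for illustration, not anything Chrome OS logging actually does:

```python
import re

# Split syslog lines into boot sessions: a kernel uptime that drops
# back below the previous value is treated as a reboot (heuristic).
UPTIME = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")

def split_boots(lines):
    """Group kernel log lines into per-boot lists."""
    boots, current, last = [], [], None
    for line in lines:
        m = UPTIME.search(line)
        if m:
            t = float(m.group(1))
            if last is not None and t < last:
                boots.append(current)
                current = []
            last = t
        current.append(line)
    if current:
        boots.append(current)
    return boots

sample = [
    "... kernel: [    1.6975] EXT4-fs warning (device dm-1): ...",
    "... kernel: [  634.9569] EXT4-fs error (device mmcblk0p1): ...",
    "... kernel: [    1.3773] EXT4-fs (mmcblk0p1): mounting fs with errors",
]
print(len(split_boots(sample)))  # 2
```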
Could be: if the device got corrupted by 760007, we didn't really do anything to correct the issue, like running fsck. My recollection is that it's a bit unclear whether fsck can sometimes do more harm than good. So the cleanest solution is to powerwash any device that has corruption.
Components: OS>Systems>Security
Status: Archived (was: Untriaged)
Looks like this was an isolated case (?) and has been inactive for a while; archiving.
