memd: avoid crash if files are corrupted on disk |
|||
Issue description
We got this out of a crash report:
2018-10-28T00:57:38.179175-05:00 WARNING memd[3480]: memd started
2018-10-28T00:57:38.182039-05:00 CRIT kernel: [ 9.861132] EXT4-fs error (device dm-1): ext4_find_entry:1283: inode #224: comm memd: checksumming directory block 0
2018-10-28T00:57:38.182044-05:00 CRIT kernel: [ 9.861350] EXT4-fs error (device dm-1): ext4_find_entry:1283: inode #224: comm memd: checksumming directory block 0
2018-10-28T00:57:38.182270-05:00 ERR memd[3480]: memd: panicked at 'memd failed: LogStaticParametersError(Os { code: 5, kind: Other, message: "Input/output error" })', libcore/result.rs:945:5
It looks like the state files on disk got corrupted -- can we try to handle this more gracefully?
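The panic itself is just memd unwrapping a Result on the EIO. A rough sketch of reading the state defensively instead of panicking (the file path and function names here are invented for illustration, not memd's actual code):

use std::fs;

// Hypothetical path; memd's real state file names differ.
const STATE_FILE: &str = "/var/log/memd/static-parameters";

// Read the saved parameters, but treat any I/O error (including the EIO
// that a corrupted ext4 directory block produces) as "no saved state"
// instead of panicking on the Result.
fn read_state() -> Option<String> {
    match fs::read_to_string(STATE_FILE) {
        Ok(s) => Some(s),
        Err(e) => {
            eprintln!("memd: ignoring unreadable state file: {}", e);
            None
        }
    }
}

fn main() {
    let state = read_state().unwrap_or_default();
    println!("starting with state: {:?}", state);
}

With something like this, an unreadable file degrades to a warning and a clean start instead of a crash loop.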
,
Nov 1
In this scenario I think the disk itself is fine but the files may be corrupted, so it might be recoverable if we deleted the directory and state files and started from a clean state. Also, memd will keep launching and crashing until, presumably, upstart stops respawning it -- so we'll get a bunch of crash reports for the same problem. I think it would be pretty easy to delete all the state and try starting from scratch, no?
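Something along these lines, sketch only -- STATE_DIR and reset_state_dir are made-up names, not memd's actual code:

use std::fs;
use std::io;
use std::path::Path;

// Hypothetical location of memd's on-disk state; the real path may differ.
const STATE_DIR: &str = "/var/log/memd";

// Throw away whatever is in the state directory and recreate it empty, so
// the daemon can start from scratch after detecting corruption.  Note that
// remove_dir_all can itself fail (e.g. with EIO) if the directory is badly
// damaged, so the caller still has to cope with an Err here.
fn reset_state_dir() -> io::Result<()> {
    let dir = Path::new(STATE_DIR);
    if dir.exists() {
        fs::remove_dir_all(dir)?;
    }
    fs::create_dir_all(dir)
}

fn main() -> io::Result<()> {
    reset_state_dir()
}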
,
Nov 1
#2 I suppose there's a chance that the directory is corrupted but it can still be removed successfully. But really, I have to wonder again if this is worth the trouble. We have ONE instance of this. It could be a one-off. Many other daemons could find themselves in an identical situation. Do they deal with it?
,
Nov 1
We have one instance of it on the canary channel, which is probably a very small percentage of our users. We'll see how common it is after the logging rolls out to more users; I wouldn't be surprised if it affects more than the one person. I'm not sure what other daemons do, but repeatedly crashing and not running isn't a great option for any daemon. memd has the luxury of not being system critical; if it were something like shill, I doubt it would have similar behavior.
,
Nov 1
OK, let's wait and see then. Actually, one thing we can do is avoid writing that file if it's already there. Right now it's updated because there's a chance that the information in it is stale. We could also write it only if its contents have changed (except for the timestamp).
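A sketch of the write-only-if-changed idea (the helper name and demo path are invented; the real code would also exclude the timestamp field from the comparison):

use std::fs;
use std::io;

// Rewrite `path` only when the new contents differ from what is already on
// disk, so an unchanged file is never touched again.  Stripping the
// timestamp field before the comparison is left to the caller.
fn write_if_changed(path: &str, new_contents: &str) -> io::Result<()> {
    match fs::read_to_string(path) {
        // Identical contents: skip the write entirely.
        Ok(old) if old == new_contents => Ok(()),
        // Missing, unreadable, or different: write the new contents.
        _ => fs::write(path, new_contents),
    }
}

fn main() -> io::Result<()> {
    write_if_changed("/tmp/memd-demo-state", "demo contents\n")
}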
,
Nov 8
<UI triage> Bug owners, please add the appropriate component to your bug. Thanks!
,
Jan 11
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.
|||
Comment 1 by semenzato@chromium.org
, Nov 1