
Issue 901050

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




memd: avoid crash if files are corrupted on disk

Project Member Reported by sonnyrao@chromium.org, Nov 1

Issue description

We got this out of a crash report:


2018-10-28T00:57:38.179175-05:00 WARNING memd[3480]: memd started
2018-10-28T00:57:38.182039-05:00 CRIT kernel: [ 9.861132] EXT4-fs error (device dm-1): ext4_find_entry:1283: inode #224: comm memd: checksumming directory block 0
2018-10-28T00:57:38.182044-05:00 CRIT kernel: [ 9.861350] EXT4-fs error (device dm-1): ext4_find_entry:1283: inode #224: comm memd: checksumming directory block 0
2018-10-28T00:57:38.182270-05:00 ERR memd[3480]: memd: panicked at 'memd failed: LogStaticParametersError(Os { code: 5, kind: Other, message: "Input/output error" })', libcore/result.rs:945:5


It looks like the state files on disk got corrupted -- can we try to handle this more gracefully?
 
We should consider the possibility that this is WAI.  It may be reasonable to crash if file operations fail with I/O errors.  Such failures are probably not very common.  I sure hope they aren't.

Or do you think that if a file operation returns an I/O error, we should just log it and exit?
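
For illustration, a minimal sketch (not memd's actual code) of what "log and exit" could look like; run_daemon here is a hypothetical stand-in for the daemon's setup and main loop:

use std::io;
use std::process::exit;

// Hypothetical stand-in for memd's setup and sampling loop.
fn run_daemon() -> io::Result<()> {
    // ... open state files, log static parameters, collect samples ...
    Ok(())
}

fn main() {
    match run_daemon() {
        Ok(()) => {}
        // EIO (os error 5): log and exit cleanly instead of panicking, so the
        // crash reporter doesn't file a report for what is really a disk problem.
        Err(e) if e.raw_os_error() == Some(5) => {
            eprintln!("memd: exiting due to I/O error: {}", e);
            exit(1);
        }
        // Anything else is unexpected; keep panicking so it still shows up
        // in crash reports.
        Err(e) => panic!("memd failed: {}", e),
    }
}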

It seems that other daemons would have the same issues too.

In this scenario I think the disk itself is fine but the files may be corrupted, so it might be recoverable if we deleted the directory and state files and started from a clean state.

Also, memd will keep launching and crashing until upstart eventually stops respawning it -- so we'll get a bunch of crash reports for the same problem.

I think it would be pretty easy to delete all the state and try starting from scratch, no?
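
As a rough sketch of that idea (the state directory path and the retry shape are assumptions, not memd's actual layout), something like this could wipe the state and retry once before giving up:

use std::fs;
use std::io;

// Hypothetical state directory; memd's real path may differ.
const STATE_DIR: &str = "/var/lib/memd";

// Run an operation that touches the state files; on failure, wipe the
// directory and retry once from a clean slate before giving up.
fn with_clean_retry<T>(mut op: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    match op() {
        Ok(v) => Ok(v),
        Err(e) => {
            eprintln!("memd: state files unusable ({}), starting clean", e);
            // Removing the directory can itself fail if it's badly corrupted,
            // in which case we give up and propagate the error.
            fs::remove_dir_all(STATE_DIR)?;
            fs::create_dir_all(STATE_DIR)?;
            op()
        }
    }
}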
#2 I suppose there's a chance that the directory is corrupted but it can still be removed successfully.

But really, I have to wonder again if this is worth the trouble.  We have ONE instance of this.  It could be a one-off.  Many other daemons could find themselves in an identical situation.  Do they deal with it?

We have one instance of it on the canary channel, which covers probably a very small percentage of our users.  We'll see how common it is after the logging rolls out to more users.  I wouldn't be surprised if it affects more than this one person.

I'm not sure what other daemons do, but repeatedly crashing and not running generally isn't a great option for any daemon.  memd has the luxury of not being system-critical; if it were something like shill, I doubt it would behave this way.
OK, let's wait and see then.

Actually, one thing we can do is avoid writing that file if it's already there.  Right now it's rewritten because there's a chance that the information in it is stale.  We could also write it only if its contents have changed (except for the timestamp).
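
A minimal sketch of the "write only if changed" idea, assuming (hypothetically) that the timestamp is the first line of the file; the real file layout may differ:

use std::fs;
use std::io;
use std::path::Path;

// Rewrite the static-parameters file only when its contents, ignoring the
// timestamp line, actually differ from what's already on disk.
fn write_if_changed(path: &Path, new_body: &str) -> io::Result<()> {
    // Assumption: the timestamp is the first line, so compare the rest.
    fn strip_first_line(s: &str) -> &str {
        s.splitn(2, '\n').nth(1).unwrap_or("")
    }
    let unchanged = match fs::read_to_string(path) {
        Ok(old) => strip_first_line(&old) == strip_first_line(new_body),
        Err(_) => false, // missing or unreadable: just write it
    };
    if !unchanged {
        fs::write(path, new_body)?;
    }
    Ok(())
}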
<UI triage> Bug owners, please add the appropriate component to your bug. Thanks!
Components: OS>Performance>Memory
Status: Assigned (was: Unconfirmed)
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.
