Collect machine check logs on x86 systems when an MCE occurs
Issue description
We see occasional MCEs (machine check exceptions) on our x86 systems, especially in preproduction. We should use a utility like mcelog to decode these at run time and log them: http://www.mcelog.org/
We should have mcelog run in daemon mode on all x86 systems. mcelog was pulled into the chromiumos overlay back in 2013, and that's still the version that's installed when I emerge it. We should update to a more recent version.
May 16 2018
does mcelog need to be running on the system all the time, or at all? isn't there a raw log we can just save and upload, with the analysis done on the server side? we really dislike doing any local processing if we can avoid it.
May 16 2018
We could use something like https://github.com/thockin/mcedaemon and upload for later decoding.
May 16 2018
can't we get away from running a persistent daemon though? ideally an MCE would be noticed either by a udev event or via the existing crash anomaly collector (which parses the kernel/syslog stream), and then that event would kick off the MCE data collection.
a glance at the docs suggests that MCEs are processed via a blocking read/poll on /dev/mcelog, so i guess we're stuck with adding another daemon. at least it would only take up RAM in the normal case, since it sits blocked until an MCE shows up.
adding a new crash handler type to crash-reporter shouldn't be too hard, and then we'd add a handler to /etc/mced/ which would call crash-reporter with the right flags, etc.
we will need to update the PDD to describe this new source of data, and document any PII, etc. that is collected.
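For illustration only, here is a minimal Python sketch of the "block on /dev/mcelog, save the raw bytes, decode later on the server" idea discussed above. It is not how mcelog or mcedaemon are implemented. The function names, the spool directory, the file naming scheme, and the read buffer size are all hypothetical. The real device may also require a read buffer sized from the kernel's reported record and log lengths, which this sketch does not query.

```python
import os
import select
import tempfile
import time

# Assumption: big enough for whatever the device hands back in one read.
DEFAULT_BUF_SIZE = 64 * 1024


def collect_once(fd, spool_dir, buf_size=DEFAULT_BUF_SIZE, timeout_ms=None):
    """Wait until fd is readable, then save the raw bytes under spool_dir.

    Returns the path of the saved file, or None on timeout or empty read.
    No decoding happens here; the bytes are stored exactly as read.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # timeout_ms=None blocks indefinitely; an empty result means timeout.
    if not poller.poll(timeout_ms):
        return None
    data = os.read(fd, buf_size)
    if not data:
        return None
    # Hypothetical naming scheme: timestamp prefix plus a unique suffix.
    prefix = "mce.%d." % int(time.time())
    out_fd, path = tempfile.mkstemp(prefix=prefix, suffix=".raw", dir=spool_dir)
    with os.fdopen(out_fd, "wb") as out:
        out.write(data)
    return path


def run_forever(device="/dev/mcelog", spool_dir="/var/spool/mce-raw"):
    """Daemon loop: sleep in poll() until the kernel reports an MCE."""
    os.makedirs(spool_dir, exist_ok=True)
    fd = os.open(device, os.O_RDONLY)
    try:
        while True:
            path = collect_once(fd, spool_dir)
            if path:
                # A real handler would hand this file to the crash
                # reporting pipeline here instead of printing.
                print("saved raw MCE data to", path)
    finally:
        os.close(fd)
```

Calling run_forever() as root would start the loop. The process stays blocked in poll() until something is readable, which matches the point above that such a daemon would only cost RAM in the normal case.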
Aug 2
Aug 16
Aug 17
Comment 1 by bleung@chromium.org, May 16 2018