Issue 843442

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature




Collect machine check logs on x86 systems when an MCE occurs

Project Member Reported by bleung@chromium.org, May 16 2018

Issue description

We see occasional MCEs (machine check exceptions) on our x86 systems, especially in preproduction. We should use a utility like mcelog to decode these at run time and log them.

http://www.mcelog.org/

We should have mcelog run in daemon mode on all x86 systems.

mcelog was pulled into the chromiumos overlay back in 2013, and that's still the version that's installed when I emerge it.  We should update to a more recent version.
 

Comment 1 by bleung@chromium.org, May 16 2018

Cc: kirtika@chromium.org
+kirtika as an fyi.

Comment 2 by vapier@chromium.org, May 16 2018

Components: Internals>CrashReporting
does mcelog need to be running on the system all the time, or at all?  isn't there a raw log we can just save & upload, and then do the analysis on the server side?

we really dislike doing any local processing if we can avoid it.

Comment 3 by bleung@chromium.org, May 16 2018

We could use something like https://github.com/thockin/mcedaemon and upload for later decoding.

Comment 4 by vapier@chromium.org, May 16 2018

can't we get away without running a persistent daemon though?  ideally an MCE would be noticed either by a udev event or via the existing crash anomaly collector (which parses the kernel/syslog stream), and then that event would kick off the MCE data collection.
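
For illustration only, the log-stream idea could look roughly like the sketch below. It is not the existing anomaly collector: the function name find_mce_lines and the pattern are hypothetical, and the pattern assumes the kernel tags MCE lines with an "mce: [Hardware Error]" prefix, which should be checked against real logs.

```python
import re
from typing import Iterable, Iterator

# Assumption: MCE-related kernel lines carry an "mce: [Hardware Error]" tag.
# Adjust the pattern to whatever the real kernel/syslog stream contains.
MCE_PATTERN = re.compile(r"mce: \[Hardware Error\]")


def find_mce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each log line that looks like an MCE report (hypothetical helper)."""
    for line in lines:
        if MCE_PATTERN.search(line):
            # Strip the trailing newline so callers get a clean record.
            yield line.rstrip("\n")
```

Any iterable of lines works as input, for example an open log file, so the first match could be the event that kicks off MCE data collection.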

a glance at the docs suggests that MCEs are processed via a blocking read/poll on /dev/mcelog.  i guess we're stuck with adding another daemon.  at least in the normal case it'd sit idle and only use up RAM at runtime.
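
A minimal sketch of the blocking read/poll pattern described above, assuming the device behaves like any other pollable file descriptor. The helper name wait_for_data and the buffer size are hypothetical; the real record length and read-size requirements of /dev/mcelog would need to be taken from the kernel interface rather than from this example.

```python
import os
import select


def wait_for_data(fd: int, timeout_ms: int, bufsize: int = 65536) -> bytes:
    """Block until fd is readable, then return one read of raw bytes.

    Returns b"" if the timeout expires first. bufsize is an assumed
    upper bound, not a value taken from the mcelog interface.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # The process sleeps here and uses no CPU until data arrives.
    if not poller.poll(timeout_ms):
        return b""
    return os.read(fd, bufsize)
```

A daemon along these lines would open /dev/mcelog read-only (which normally needs root), call the helper in a loop, and hand the raw bytes to whatever does the upload, leaving decoding to the server side.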

adding a new crash handler type to crash-reporter shouldn't be too hard, and then we'd add a handler to /etc/mced/ which would call crash-reporter with the right flags/etc...

will need to update the PDD to describe this new source of data, and document PII/etc... that is collected.
Status: Assigned (was: Available)
Cc: rajatja@google.com
Components: -Internals>CrashReporting OS>Systems>CrashReporting
