New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 724324 link

Starred by 3 users

Issue metadata

Status: Archived
Owner:
Closed: Jul 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

report recovery reason to UMA

Project Member Reported by semenzato@chromium.org, May 19 2017

Issue description

Gidon would be very interested (and rightfully so) in getting UMA metrics for the recovery reason.

I didn't find any except for UpdateEngine.Daily.RecoveryCount which however shows a count of 0 so maybe it's not completely working.

We have the information already with "mosys eventlog list", just a matter of parsing it carefully and not reporting it multiple times.

If the mosys mechanism doesn't work well, we can ask the firmware folks for another flashram field or some other persistent gizmo containing the "latest recovery reason" which we would send to UMA and clear.
 
We have some CLs for the recovery count, which as it says only repost the number of recoveries performed on a device. We landed the CL in the chromium tools side for the UMA metrics but at the end we decided to postpone the Chrome os side of it to M61. So eventually recovery count numbers will be ok.

I'm not sure though how you want to capture the reason for recovery? Are you looking for recovery reason for chromium itself? 
Cc: aaboagye@chromium.org
I believe we already log the recovery reason to mosys.  Aseda did some work on this related to FAFT testing and may remember more details.
#1 Gidon can comment on why they would like to keep track of the recovery reasons.  Offhand I'd say that it could be useful to know.

#2 Yes, I mentioned "mosys eventlog list" in the initial comment, is there some other way?  The "mosys eventlog list" source is a little messy to work with because we need to keep track of whether we've already sent the sample for a specific recovery.  If we keep track of it in the file system, we may send extra samples if multiple recoveries occur within the same eventlog.

Ideally there should be a byte that persists across reboot and contains the recovery reason (if any).  After reading that byte and sending the sample, we clear the byte.  There is still a race, but it's a small one and we may just ignore it.
Aren't the eventlog entries timestamped though? Could you use that to determine what the last recovery you sent was?
Cc: keta...@chromium.org stephenlin@chromium.org
The recovery count is kept in vpd so it persist across recoveries.

Comment 6 by gidon@google.com, May 19 2017

Tracking the recovery reasons will enable us to know and track recovery reasons in an aggregated level - what's causing the most pain. Being able to cross-reference Board name and OS version, etc would be even better... 
The goal of collecting this information is to help identify root cause and mitigate the need for CrOS Recovery (which is a bad user experience).
Labels: Restrict-View-Google
gidon@ can you please share some examples of recovery reasons? 

semenzato@ the work that Ahassani@ is doing right now is to save the recovery count and first active flag across recoveries. Since this information is saved in the vpd and persists across recoveries it goes through a lot of privacy scrutiny. Adding reasons, board, OS information is another level of information.

Comment 8 by gidon@google.com, May 19 2017

0x13: RW firmware verification failed
0x26, 0x29, 0x2a, 0x57 or 0x58: EC software sync - error updating or verifying EC image hash
0x42 or 0x43: No OS kernel detected or failed signature check <- disk is likely corrupted
0x45: RW firmware in dev mode, but dev is switched off
0x46: Shared data error in rewritable firmware
0x7f: Error in rewritable firmware
0x81or 0x8f: DM-verity error or error in kernel
Does these reasons force the user to do a USB recovery (by using recovery usb)? Because that's what we actually mean by recovery here. User does recovery using usb. This is usually is done in schools before start of the new school year.
#9 actually no, these samples would indicate that the device entered recovery mode for the given reason, and won't necessarily correspond to USB recoveries.  Some of the recovery reasons can be transient and not require a recovery.  Some ARE transient (reason: recovery button pressed).

A recovery image could also send a sample the very first time it runs---then you would know.  But there may be interference with a powerwash (i.e. if we're not careful, after a powerwash the OS might think that it's the first time it runs, so this needs some investigation).

Comment 12 by gidon@google.com, May 23 2017

What are the next steps to aggregate CrOS recovery error codes in UMA?

Comment 13 by gidon@google.com, Jun 5 2017

Can someone please provide an update.
I'm not sure if anyone is working on this. Is this a priority at this point?
I also don't think anybody is working on this.  A good fraction of the kernel team right now is working on some complicated emergency.

Comment 16 by gidon@google.com, Jun 5 2017

Yes, from my perspective this most certainly is a priority.
Capturing device recovery reasons in an aggregated manner will enable us to quantify and track device failures reasons. While some recovery reason codes are user generated or not failures; receiving Chrome OS Missing or Damaged error is associated to one of the worst user experiences - the user’s device is inoperable. 
gidon@ currently the team is heads down on various hi pri things. However, there are a couple of ways we can approach this. Let me know which one you prefer. In order to prioritize this over other fires we need a good business justification (e.g- serious regressions affecting all boards, all enterprise users unable to enterprise enroll etc). Since this is something that has been happening for a while (not a regression) and is more of a failure analysis effort it is going to be harder to justify over other fires. Another approach could be if you have cycles you could help Ahassani@ with this CL. WDYT?

Comment 18 by gidon@google.com, Jun 6 2017

I can imagine there are higher priority issues at hand and will try to assist Ahassani@. Let's reassess value vs effort when I've made some progress and things settle down.

Comment 19 by sjg@google.com, Jun 15 2017

Cc: sjg@chromium.org
Cc: adlr@chromium.org
So as we stand now, we know that the recovery error information is captured by mosys and can be seen by:
`mosys eventlog list`
mosys writes into a circular persistent log which survives recovery/powerwash/update/etc.
Errors and events are timestamped. This log can be parsed and proper error codes (which may require a mapping) be sent to UMA. The problem is duplication: How to make sure we don't sent the same error (with the same timestamp) to UMA after recovery. One solution is to write the last sent error and timestamp to VPD and don't sent it if it was there. The other way is to figure out if this duplication can be detected by UMA metrics server side. We also have to see if this duplication is a high occuring problem that needs attention or not. We certainly have to no data on it. The other possible problem is the limited size of mosys log and the possibility the last error code be overridden.

Comment 21 by sjg@google.com, Jun 15 2017

Owner: la...@chromium.org
Status: Available (was: Untriaged)

Comment 22 by la...@chromium.org, Jun 15 2017

I'll propose this:

1. at the end of a successful recovery (in recovery_init.sh), write a record to vpd including recovery timestamp and reason, e.g.: last_recovery="ts:123456789 reason:2"

2. on boot, check for that key in vpd. If present:
 a. erase from vpd 
 b. (optionally) extract a chunk of the eventlog based on the recovery timestamp
 c. report recovery to UMA

Thoughts? Is this worth a design doc?

Comment 23 by adlr@chromium.org, Jun 15 2017

I'm guessing if we get a spontaneous reboot in 2b that we are okay losing that data? I think that's likely okay as long as loggging to UMA is reliable (ie, it will store the info locally before trying to send it up to the server).
And by "on boot", I'm assuming you mean 'after boot' because writing into vpd is quite slow and we don't want to increase boot time. Just checking.

I agree with adlr@, you should erase the value once you successfully logged it in UMA.
If you do this you're only going to log all recovery mode events where the user actually recovered the system, though. Are you not interested in transient boot errors where the system just entered recovery mode once and a normal "turn it off and on again" fixed it?

Why not just store the timestamp of the last uploaded recovery timestamp so you won't miss those? (You could even store that on the stateful partition, no need for VPD. If the timestamp file is not initialized, you can assume the device was just recovered and scan the eventlog backwards for the last few boots to find the reason.)

Comment 26 by la...@chromium.org, Jun 16 2017

> I'm guessing if we get a spontaneous reboot in 2b that we are okay losing that data?

I was thinking it may be better to under-report (where UMA report fails after erase) than over-report (where erase fails after report), but now that I'm thinking about it again it may be that sending the report is much more likely to fail than erasing vpd, so erasing after might produce better data.

> And by "on boot", I'm assuming you mean 'after boot'

Yes, by "on boot" I mean "some 'suitable time' after boot".

> Are you not interested in [...] errors where the system just entered recovery mode

We could track every entry into recovery mode if that would be interesting data; those are already recorded in eventlog. As far as I can tell, there is not currently a way to determine whether an actual recovery has taken place, which would be the new feature here.

Comment 27 by la...@chromium.org, Jun 16 2017

After a closer look at the recovery script, I think this can be done without touching vfd. Proposal doc here: https://docs.google.com/a/google.com/document/d/1y9wNeoHP2GfaWCmNR6SbK29h2Er5raZChmITVG-DBUE/edit

Comment 28 by gidon@google.com, Jun 16 2017

I added a few comments to the proposal doc - feel free to setup some time with me to discuss further. 
A few thoughts: 
1. What other information can we collect from mosys?
Keep in mind that we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc 
2. Is mosys being used for some other purpose?
We should check before erasing any data
3. Does UMA reporting include a unique identifier - if a device fails (Example: 0x5b bad SSD) repeatedly 10 times in a day, we wouldn't want to consider this as 10 devices that failed but rather just the one device. 

Comment 29 by gidon@google.com, Jun 16 2017

I added a few comments to the proposal doc - feel free to setup some time with me to discuss further. 
A few thoughts: 
1. What other information can we collect from mosys?
Keep in mind that we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc 
2. Is mosys being used for some other purpose?
We should check before erasing any data
3. Does UMA reporting include a unique identifier - if a device fails (Example: 0x5b bad SSD) repeatedly 10 times in a day, we wouldn't want to consider this as 10 devices that failed but rather just the one device. 

Comment 30 by la...@chromium.org, Jun 16 2017

> we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc 

My understanding is that all UMA metrics are associated with this sort of information.

> We should check before erasing any data

The proposal in the doc no longer includes erasing anything except the temporary file passed from the recovery image.

> Does UMA reporting include a unique identifier

There is an ID, but I don't know that it would normally survive a recovery.

Comment 31 by la...@chromium.org, Jun 19 2017

Status: Started (was: Available)

Comment 34 Deleted

Comment 35 by sjg@google.com, Jul 5 2017

Labels: Team-BLD
Project Member

Comment 36 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f

commit c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f
Author: Lann Martin <lannm@chromium.org>
Date: Thu Jul 06 04:54:55 2017

Report recovery metrics to UMA after recovery

During recovery, write a recovery histograms file to the stateful partition
with metrics about the recovery process.

At next boot, check for the presence of this histograms file, send metrics
to UMA, and delete the file.

Metrics reported with this change:

Installer.Recovery.Count: number of times this system has been recovered
Installer.Recovery.Reason: reason for entering recovery mode

BUG= chromium:724324 
TEST=ran recovery, checked uma-events file

Change-Id: Icb34599a9df8d0a842c6c5e8bbadd36468013a9e
Reviewed-on: https://chromium-review.googlesource.com/540828
Commit-Ready: Lann Martin <lannm@chromium.org>
Tested-by: Lann Martin <lannm@chromium.org>
Reviewed-by: Mattias Nissler <mnissler@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[add] https://crrev.com/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f/init/upstart/send-recovery-metrics.conf
[modify] https://crrev.com/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f/installer/chromeos-install

Project Member

Comment 37 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/32fffa7d331c89094c0f91db5994e5b72dfa6092

commit 32fffa7d331c89094c0f91db5994e5b72dfa6092
Author: Lann Martin <lannm@chromium.org>
Date: Thu Jul 06 16:29:00 2017

Add UMA histogram "Installer.Recovery.Reason" for Chrome OS

This tracks the reason for entering recovery mode (e.g. "recovery
button") prior to a recovery in Chrome OS.

Also rename "UpdateEngine.Daily.RecoveryCount" to
"Installer.Recovery.Count" to better reflect how it will be reported.

BUG= chromium:724324 
BUG=chromium:717788

Change-Id: I6508aca15c9d99fc10861be89e527d83c36cd9b2
Reviewed-on: https://chromium-review.googlesource.com/541602
Commit-Queue: Lann Martin <lannm@chromium.org>
Reviewed-by: Mark Pearson <mpearson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#484623}
[modify] https://crrev.com/32fffa7d331c89094c0f91db5994e5b72dfa6092/tools/metrics/histograms/enums.xml
[modify] https://crrev.com/32fffa7d331c89094c0f91db5994e5b72dfa6092/tools/metrics/histograms/histograms.xml

Comment 39 by gidon@google.com, Jul 11 2017

This is great to see. Is there any way to show/filter by HWID and/or Board ID?

Comment 40 by la...@chromium.org, Jul 11 2017

I believe HWID is "Hardware class" in the UMA filters.

Comment 41 by la...@chromium.org, Jul 11 2017

Status: Fixed (was: Started)
Cc: tnagel@chromium.org
Ketaki, can we remove RVG from this issue? Skimming through it, I didn't come across anything that looked sensitive to me. 
Labels: -Restrict-View-Google OS-Chrome
After offline discussion with Ketaki: Reading through the bug I couldn't find confidential information, thus removing RVG.

Comment 44 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment