report recovery reason to UMA |
|||||||||||||
Issue descriptionGidon would be very interested (and rightfully so) in getting UMA metrics for the recovery reason. I didn't find any except for UpdateEngine.Daily.RecoveryCount which however shows a count of 0 so maybe it's not completely working. We have the information already with "mosys eventlog list", just a matter of parsing it carefully and not reporting it multiple times. If the mosys mechanism doesn't work well, we can ask the firmware folks for another flashram field or some other persistent gizmo containing the "latest recovery reason" which we would send to UMA and clear.
,
May 19 2017
I believe we already log the recovery reason to mosys. Aseda did some work on this related to FAFT testing and may remember more details.
,
May 19 2017
#1 Gidon can comment on why they would like to keep track of the recovery reasons. Offhand I'd say that it could be useful to know. #2 Yes, I mentioned "mosys eventlog list" in the initial comment, is there some other way? The "mosys eventlog list" source is a little messy to work with because we need to keep track of whether we've already sent the sample for a specific recovery. If we keep track of it in the file system, we may send extra samples if multiple recoveries occur within the same eventlog. Ideally there should be a byte that persists across reboot and contains the recovery reason (if any). After reading that byte and sending the sample, we clear the byte. There is still a race, but it's a small one and we may just ignore it.
,
May 19 2017
Aren't the eventlog entries timestamped though? Could you use that to determine what the last recovery you sent was?
,
May 19 2017
The recovery count is kept in vpd so it persist across recoveries.
,
May 19 2017
Tracking the recovery reasons will enable us to know and track recovery reasons in an aggregated level - what's causing the most pain. Being able to cross-reference Board name and OS version, etc would be even better... The goal of collecting this information is to help identify root cause and mitigate the need for CrOS Recovery (which is a bad user experience).
,
May 19 2017
gidon@ can you please share some examples of recovery reasons? semenzato@ the work that Ahassani@ is doing right now is to save the recovery count and first active flag across recoveries. Since this information is saved in the vpd and persists across recoveries it goes through a lot of privacy scrutiny. Adding reasons, board, OS information is another level of information.
,
May 19 2017
0x13: RW firmware verification failed 0x26, 0x29, 0x2a, 0x57 or 0x58: EC software sync - error updating or verifying EC image hash 0x42 or 0x43: No OS kernel detected or failed signature check <- disk is likely corrupted 0x45: RW firmware in dev mode, but dev is switched off 0x46: Shared data error in rewritable firmware 0x7f: Error in rewritable firmware 0x81or 0x8f: DM-verity error or error in kernel
,
May 19 2017
Does these reasons force the user to do a USB recovery (by using recovery usb)? Because that's what we actually mean by recovery here. User does recovery using usb. This is usually is done in schools before start of the new school year.
,
May 19 2017
,
May 19 2017
#9 actually no, these samples would indicate that the device entered recovery mode for the given reason, and won't necessarily correspond to USB recoveries. Some of the recovery reasons can be transient and not require a recovery. Some ARE transient (reason: recovery button pressed). A recovery image could also send a sample the very first time it runs---then you would know. But there may be interference with a powerwash (i.e. if we're not careful, after a powerwash the OS might think that it's the first time it runs, so this needs some investigation).
,
May 23 2017
What are the next steps to aggregate CrOS recovery error codes in UMA?
,
Jun 5 2017
Can someone please provide an update.
,
Jun 5 2017
I'm not sure if anyone is working on this. Is this a priority at this point?
,
Jun 5 2017
I also don't think anybody is working on this. A good fraction of the kernel team right now is working on some complicated emergency.
,
Jun 5 2017
Yes, from my perspective this most certainly is a priority. Capturing device recovery reasons in an aggregated manner will enable us to quantify and track device failures reasons. While some recovery reason codes are user generated or not failures; receiving Chrome OS Missing or Damaged error is associated to one of the worst user experiences - the user’s device is inoperable.
,
Jun 5 2017
gidon@ currently the team is heads down on various hi pri things. However, there are a couple of ways we can approach this. Let me know which one you prefer. In order to prioritize this over other fires we need a good business justification (e.g- serious regressions affecting all boards, all enterprise users unable to enterprise enroll etc). Since this is something that has been happening for a while (not a regression) and is more of a failure analysis effort it is going to be harder to justify over other fires. Another approach could be if you have cycles you could help Ahassani@ with this CL. WDYT?
,
Jun 6 2017
I can imagine there are higher priority issues at hand and will try to assist Ahassani@. Let's reassess value vs effort when I've made some progress and things settle down.
,
Jun 15 2017
,
Jun 15 2017
So as we stand now, we know that the recovery error information is captured by mosys and can be seen by: `mosys eventlog list` mosys writes into a circular persistent log which survives recovery/powerwash/update/etc. Errors and events are timestamped. This log can be parsed and proper error codes (which may require a mapping) be sent to UMA. The problem is duplication: How to make sure we don't sent the same error (with the same timestamp) to UMA after recovery. One solution is to write the last sent error and timestamp to VPD and don't sent it if it was there. The other way is to figure out if this duplication can be detected by UMA metrics server side. We also have to see if this duplication is a high occuring problem that needs attention or not. We certainly have to no data on it. The other possible problem is the limited size of mosys log and the possibility the last error code be overridden.
,
Jun 15 2017
,
Jun 15 2017
I'll propose this: 1. at the end of a successful recovery (in recovery_init.sh), write a record to vpd including recovery timestamp and reason, e.g.: last_recovery="ts:123456789 reason:2" 2. on boot, check for that key in vpd. If present: a. erase from vpd b. (optionally) extract a chunk of the eventlog based on the recovery timestamp c. report recovery to UMA Thoughts? Is this worth a design doc?
,
Jun 15 2017
I'm guessing if we get a spontaneous reboot in 2b that we are okay losing that data? I think that's likely okay as long as loggging to UMA is reliable (ie, it will store the info locally before trying to send it up to the server).
,
Jun 15 2017
And by "on boot", I'm assuming you mean 'after boot' because writing into vpd is quite slow and we don't want to increase boot time. Just checking. I agree with adlr@, you should erase the value once you successfully logged it in UMA.
,
Jun 16 2017
If you do this you're only going to log all recovery mode events where the user actually recovered the system, though. Are you not interested in transient boot errors where the system just entered recovery mode once and a normal "turn it off and on again" fixed it? Why not just store the timestamp of the last uploaded recovery timestamp so you won't miss those? (You could even store that on the stateful partition, no need for VPD. If the timestamp file is not initialized, you can assume the device was just recovered and scan the eventlog backwards for the last few boots to find the reason.)
,
Jun 16 2017
> I'm guessing if we get a spontaneous reboot in 2b that we are okay losing that data? I was thinking it may be better to under-report (where UMA report fails after erase) than over-report (where erase fails after report), but now that I'm thinking about it again it may be that sending the report is much more likely to fail than erasing vpd, so erasing after might produce better data. > And by "on boot", I'm assuming you mean 'after boot' Yes, by "on boot" I mean "some 'suitable time' after boot". > Are you not interested in [...] errors where the system just entered recovery mode We could track every entry into recovery mode if that would be interesting data; those are already recorded in eventlog. As far as I can tell, there is not currently a way to determine whether an actual recovery has taken place, which would be the new feature here.
,
Jun 16 2017
After a closer look at the recovery script, I think this can be done without touching vfd. Proposal doc here: https://docs.google.com/a/google.com/document/d/1y9wNeoHP2GfaWCmNR6SbK29h2Er5raZChmITVG-DBUE/edit
,
Jun 16 2017
I added a few comments to the proposal doc - feel free to setup some time with me to discuss further. A few thoughts: 1. What other information can we collect from mosys? Keep in mind that we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc 2. Is mosys being used for some other purpose? We should check before erasing any data 3. Does UMA reporting include a unique identifier - if a device fails (Example: 0x5b bad SSD) repeatedly 10 times in a day, we wouldn't want to consider this as 10 devices that failed but rather just the one device.
,
Jun 16 2017
I added a few comments to the proposal doc - feel free to setup some time with me to discuss further. A few thoughts: 1. What other information can we collect from mosys? Keep in mind that we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc 2. Is mosys being used for some other purpose? We should check before erasing any data 3. Does UMA reporting include a unique identifier - if a device fails (Example: 0x5b bad SSD) repeatedly 10 times in a day, we wouldn't want to consider this as 10 devices that failed but rather just the one device.
,
Jun 16 2017
> we'd ideally want to be able to slice the this UMA Errors report by Board ID and HWID, Firmware (OS) version when error occurred (not after recovery), etc My understanding is that all UMA metrics are associated with this sort of information. > We should check before erasing any data The proposal in the doc no longer includes erasing anything except the temporary file passed from the recovery image. > Does UMA reporting include a unique identifier There is an ID, but I don't know that it would normally survive a recovery.
,
Jun 19 2017
,
Jun 20 2017
,
Jun 20 2017
,
Jul 5 2017
,
Jul 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f commit c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f Author: Lann Martin <lannm@chromium.org> Date: Thu Jul 06 04:54:55 2017 Report recovery metrics to UMA after recovery During recovery, write a recovery histograms file to the stateful partition with metrics about the recovery process. At next boot, check for the presence of this histograms file, send metrics to UMA, and delete the file. Metrics reported with this change: Installer.Recovery.Count: number of times this system has been recovered Installer.Recovery.Reason: reason for entering recovery mode BUG= chromium:724324 TEST=ran recovery, checked uma-events file Change-Id: Icb34599a9df8d0a842c6c5e8bbadd36468013a9e Reviewed-on: https://chromium-review.googlesource.com/540828 Commit-Ready: Lann Martin <lannm@chromium.org> Tested-by: Lann Martin <lannm@chromium.org> Reviewed-by: Mattias Nissler <mnissler@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [add] https://crrev.com/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f/init/upstart/send-recovery-metrics.conf [modify] https://crrev.com/c4972eb84773f9bdf56f1fd82ebc5a36280bfa5f/installer/chromeos-install
,
Jul 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/32fffa7d331c89094c0f91db5994e5b72dfa6092 commit 32fffa7d331c89094c0f91db5994e5b72dfa6092 Author: Lann Martin <lannm@chromium.org> Date: Thu Jul 06 16:29:00 2017 Add UMA histogram "Installer.Recovery.Reason" for Chrome OS This tracks the reason for entering recovery mode (e.g. "recovery button") prior to a recovery in Chrome OS. Also rename "UpdateEngine.Daily.RecoveryCount" to "Installer.Recovery.Count" to better reflect how it will be reported. BUG= chromium:724324 BUG=chromium:717788 Change-Id: I6508aca15c9d99fc10861be89e527d83c36cd9b2 Reviewed-on: https://chromium-review.googlesource.com/541602 Commit-Queue: Lann Martin <lannm@chromium.org> Reviewed-by: Mark Pearson <mpearson@chromium.org> Cr-Commit-Position: refs/heads/master@{#484623} [modify] https://crrev.com/32fffa7d331c89094c0f91db5994e5b72dfa6092/tools/metrics/histograms/enums.xml [modify] https://crrev.com/32fffa7d331c89094c0f91db5994e5b72dfa6092/tools/metrics/histograms/histograms.xml
,
Jul 11 2017
This is great to see. Is there any way to show/filter by HWID and/or Board ID?
,
Jul 11 2017
I believe HWID is "Hardware class" in the UMA filters.
,
Jul 11 2017
,
Aug 3 2017
Ketaki, can we remove RVG from this issue? Skimming through it, I didn't come across anything that looked sensitive to me.
,
Aug 8 2017
After offline discussion with Ketaki: Reading through the bug I couldn't find confidential information, thus removing RVG.
,
Jan 22 2018
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by ahass...@chromium.org
, May 19 2017