Figure out how often wifi hardware fails to come up
Issue description
On Caroline, we recently encountered a dogfood unit on which the PCIe wifi module was not detected by coreboot (attached; see feedback 71346758639 for the complete log). It would be good to understand how commonplace this problem is on various Chrome OS devices, to see if some drivers / designs / modules / etc. have a recurring problem.
UMA provides the DevicePresenceStatus metric, but it is unclear to me whether this accurately records the absence of wifi when the system has no functioning network interfaces:
https://screenshot.googleplex.com/MKWmn510dbm
https://screenshot.googleplex.com/LaKORpv75Hf
Can we figure out whether this metric is cached across a reboot and reported later, or whether it is lost forever if the user reboots before connectivity is somehow re-established?
,
Aug 23 2017
I am still investigating, but so far it seems that UMA samples are propagated across reboots with high probability. There are three parts to the transport:
1. The Chrome OS metrics library appends samples to /var/lib/metrics/uma-events, which AFAICT persists across clean reboots.
2. Chrome collects samples from that file 30s after startup and at 30s intervals after that. It truncates the file at every collection. (Adding and deleting samples is synchronized via flock(), so there is no race there.)
3. Chrome sends the samples to the UMA servers after a longish interval, but it also saves them in a local file in case connectivity is lost or the browser crashes. In that case, Chrome attempts to send those samples in a later session.
There is a race in Chrome between getting the samples and writing them to the "backup" file, but I don't know how bad it is. I have asked the UMA folks and will report here. For now it seems reasonable to assume that the samples reflect reality. (Minus other bugs, of course.)
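A minimal sketch of the append/collect handshake from steps 1 and 2, for illustration only. It assumes newline-delimited text samples (the real on-disk format of uma-events is not described in this thread), and both helper names are hypothetical. It relies on the util-linux flock(1) command rather than the flock() system call that the metrics library and Chrome would use directly.

```shell
#!/bin/sh
# Hypothetical helpers; the text format of the samples is an assumption.

# append_sample <events_file> <sample>
# Writer side: take an exclusive lock, then append one sample.
append_sample() {
  (
    flock 9                  # lock is held until this subshell exits
    printf '%s\n' "$2" >&9   # fd 9 is opened in append mode below
  ) 9>>"$1"
}

# collect_samples <events_file>
# Reader side: print every pending sample, then truncate the file.
collect_samples() {
  (
    flock 9
    cat "$1"
    : > "$1"                 # truncate while still holding the lock
  ) 9>>"$1"
}
```

Because both sides take the same lock, a sample is either read by the current collection or left in the file for the next one; it cannot be dropped by a truncation that lands between a read and an append.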
,
Sep 13 2017
Heh, I was about to file a similar bug after looking through the UMA metrics and not finding them sufficient. For me, even if those stats are accurate, they don't quite measure what I care about. I'm interested in normalizing against "number of boots", not against "time running". I think it's safe to say that if a user boots up to find no Wifi device, they aren't likely to keep the system on for long ("try rebooting!"). And furthermore, for these types of problems, the Wifi device is likely to stay in whatever state it started in -- if we fail to detect it at boot, we're not likely to detect it later; and if we successfully probe it at boot, it's not very likely to disappear (this does happen of course, but those are a different class of issue).
So, I'd be interested in a metric that tracks a fixed number of samples per boot cycle. I think this metric would be more illuminating both for the Caroline issues described above, and for some Kevin issues I've looked at.
For reference, I see bugs like this on Kevin:
https://bugs.chromium.org/p/chromium/issues/detail?id=693724
which is tracked internally here:
https://b.corp.google.com/issues/36264732
(Finally seeing some progress! Yay!)
Basically, it looks like there's some flakiness in the firmware load sequence, such that the firmware is likely to die at boot if there's external noise on certain channels.
So WDYT? Is that a reasonable metric to add? Or even better: is there a way to filter existing metrics to produce my desired result?
,
Sep 13 2017
I have a partial answer to my queries from comment #2. When Chrome reads histogram samples from /var/lib/metrics/uma-events, it immediately appends them to a memory-mapped file which is maintained across clean reboots. From there they are read by an uploader, which also stores in the file system (in Local State prefs) any logs not sent. This state is also preserved across clean reboots (i.e. samples may be lost on a system crash). If the unsent logs grow beyond a certain size, further samples are discarded. I am trying to find out what the size is, but it's probably reasonably large.
So the answer to the original question is: yes, almost all metrics are cached across reboots.
For comment #3: it may be possible to use dremel to correlate samples sent only at boot with wifi functionality samples (by looking at time stamps), but I haven't touched dremel in years. It is fairly easy to create a new histogram that logs the state of wifi with samples at, say, 1, 2, and 5 minutes after boot, and even whether connectivity was ever present in some time interval. Just start a small shell script from upstart and send the samples with metrics_client.
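The "small shell script" idea could be sketched roughly as follows. This is an illustration, not the script that was eventually committed: sample_at_offsets and its reporter argument are hypothetical names, and the reporter is left abstract because the histogram name and metrics_client arguments are not given at this point in the thread.

```shell
#!/bin/sh
# sample_at_offsets <reporter_command> <offset_s>...
# Runs the reporter once at each offset (seconds since this function
# started). Offsets must be listed in increasing order.
sample_at_offsets() {
  reporter="$1"
  shift
  previous=0
  for offset in "$@"; do
    sleep $((offset - previous))   # wait only for the remaining delta
    "${reporter}" "${offset}"      # reporter receives the offset as $1
    previous="${offset}"
  done
}
```

A job started from upstart might call it as `sample_at_offsets report_wifi_state 60 120 300`, where report_wifi_state is whatever function checks the wifi state and hands the result to metrics_client.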
,
Sep 13 2017
From Steven:
https://cs.chromium.org/chromium/src/components/metrics/metrics_log_store.cc
You should typically be able to store logs from at least 4 Chrome sessions, since while you are offline, you would usually generate at most 2 logs per session (startup + one on clean shutdown).
,
Sep 14 2017
Additionally, the metrics daemon has an API for "persistent integers" used to propagate some counters across reboots. If useful, we can use that to keep track of how many boots in a row did not achieve connectivity before the session that finally did.
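To make the counting idea concrete, here is a rough sketch that uses a plain file as a stand-in for a persistent integer. It does not use the metrics daemon's actual API; record_boot_result and the counter file are hypothetical.

```shell
#!/bin/sh
# record_boot_result <counter_file> <connected: 0|1>
# If this boot achieved connectivity, print how many consecutive earlier
# boots did not, and reset the counter. Otherwise increment the counter
# and print nothing.
record_boot_result() {
  counter_file="$1"
  connected="$2"
  failed=0
  if [ -f "${counter_file}" ]; then
    failed="$(cat "${counter_file}")"
  fi
  # Treat an empty or corrupted counter as zero.
  case "${failed}" in
    ''|*[!0-9]*) failed=0 ;;
  esac
  if [ "${connected}" = "1" ]; then
    echo "${failed}"
    echo 0 > "${counter_file}"
  else
    echo $((failed + 1)) > "${counter_file}"
  fi
}
```

The printed value is what would be sent as the histogram sample once the device is back online, so failed boots are reported even though they never had a chance to upload anything themselves.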
,
Sep 14 2017
> It is fairly easy to create a new histogram that logs the state of wifi with samples at, say, 1, 2, and 5 minutes after boot, and even whether connectivity was ever present in some time interval. Just start a small shell script from upstart and send the samples with metrics_client.

Here is the script I added in 2015 to track the infamous Pixel 1 disappearing battery bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=458878#c30
Maybe we should remove it now that the bug is fixed... or maybe we should add other checks like "missing wifi device" to it. What do you think?
BTW, a simple test to see if there is at least one wifi device on the system:
grep -q "^DEVTYPE=wlan" /sys/class/net/*/uevent ; echo $?
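The grep one-liner above only answers "is there at least one?". A variant that counts devices, which is closer to what a histogram sample needs, might look like the sketch below. The function name is hypothetical, and the optional directory argument exists only so the function can be exercised against a fake sysfs tree.

```shell
#!/bin/sh
# count_wifi_devices [net_root]
# Prints the number of interfaces whose uevent file declares DEVTYPE=wlan.
# net_root defaults to /sys/class/net.
count_wifi_devices() {
  net_root="${1:-/sys/class/net}"
  count=0
  for uevent in "${net_root}"/*/uevent; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    [ -f "${uevent}" ] || continue
    if grep -q "^DEVTYPE=wlan" "${uevent}"; then
      count=$((count + 1))
    fi
  done
  echo "${count}"
}
```

On a healthy single-radio device this should print 1; a value of 0 corresponds to the failure this bug is about.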
,
Sep 14 2017
> maybe we should add other checks like "missing wifi device" to it

I like that. One difficulty is that I'd want to wait some reasonable time after boot, since Wifi drivers (a) often aren't built-in (so they'll sometimes load later -- although I think we have optimizations to try to load them manually before "the rest"?) and (b) often load their FW asynchronously, so the interface doesn't show up until some unspecified time after modprobe. "start on started system-services" seems a little early for that, IIUC, unless we just throw in some extra "sleep" or something.
,
Sep 14 2017
Agreed, the current deps for this upstart job are probably not correct for wifi driver checks. Apparently there was an "upstart-time-bridge" project in the works[0]. Not sure if it ever got finished (I don't see a command or a USE flag for it). If not, adding `sleep 10` may suffice.
[0] https://www.linuxplumbersconf.org/2013/ocw/system/presentations/1527/original/upstart-roadmap-plumbers-2013.pdf
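One alternative to a fixed `sleep 10` is to poll with a deadline, so the check finishes as soon as the interface appears but still gives slow firmware loads a chance. This is only a sketch of that idea, not what the upstart job does; wait_for_wifi_device is a hypothetical name, and the second argument exists for testing against a fake sysfs tree.

```shell
#!/bin/sh
# wait_for_wifi_device [timeout_s] [net_root]
# Polls once per second until some interface reports DEVTYPE=wlan.
# Returns 0 if one is found, 1 if timeout_s seconds pass without one.
wait_for_wifi_device() {
  timeout_s="${1:-10}"
  net_root="${2:-/sys/class/net}"
  elapsed=0
  while :; do
    # grep errors (e.g. no interfaces yet) are treated as "not found".
    if grep -q "^DEVTYPE=wlan" "${net_root}"/*/uevent 2>/dev/null; then
      return 0
    fi
    if [ "${elapsed}" -ge "${timeout_s}" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

The trade-off raised in this thread still applies: a longer timeout catches late-loading drivers, but the sample is lost if the user reboots before it expires.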
,
Sep 14 2017
For the record, the link in #7 points to a histograms.xml CL. I think that the actual script is here:
https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/272278/5/init/send-kernel-errors.conf
#9: yes, it would be easy to sample at 10s after boot. However, is it possible that connectivity appears later? But if we wait too long, the user might reboot...
,
Sep 20 2017
Sorry for the confusion, this is not about connectivity, just the device being present or not. CL at https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/675908
,
Sep 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform2/+/70996250916043cd5e54804c59ecd1aa135242a8

commit 70996250916043cd5e54804c59ecd1aa135242a8
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Fri Sep 22 02:38:27 2017

init: collect Platform.WifiDeviceCount sample.

This reports how many wifi devices the kernel has loaded at boot. We expect samples with value 1 in most cases.

BUG= chromium:758016
TEST=ran on chromebook, observed sample with value 1.

Change-Id: I8d56cde8bf257d185a240e32d3dfb8a4b9d6894c
Reviewed-on: https://chromium-review.googlesource.com/675908
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/70996250916043cd5e54804c59ecd1aa135242a8/init/upstart/send-kernel-errors.conf
,
Sep 25 2017
,
Sep 25 2017
This bug requires manual review: M62 has already been promoted to the beta branch, so this requires manual review.
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop)
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Sep 26 2017
Approved for 62.
,
Sep 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform2/+/fc8ff1feb4e0fed6190138b55f60957847cba5ac

commit fc8ff1feb4e0fed6190138b55f60957847cba5ac
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Sep 27 18:41:12 2017

init: collect Platform.WifiDeviceCount sample.

This reports how many wifi devices the kernel has loaded at boot. We expect samples with value 1 in most cases.

BUG= chromium:758016
TEST=ran on chromebook, observed sample with value 1.

Change-Id: I8d56cde8bf257d185a240e32d3dfb8a4b9d6894c
Reviewed-on: https://chromium-review.googlesource.com/675908
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
(cherry picked from commit 70996250916043cd5e54804c59ecd1aa135242a8)
Reviewed-on: https://chromium-review.googlesource.com/681617
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>

[modify] https://crrev.com/fc8ff1feb4e0fed6190138b55f60957847cba5ac/init/upstart/send-kernel-errors.conf
,
Sep 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4450c157f2587532ee7f32ad3f8f80814a324788

commit 4450c157f2587532ee7f32ad3f8f80814a324788
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Sep 27 21:03:20 2017

histograms.xml: add Platform.WiFiDeviceCount

BUG= chromium:758016
TEST=none

Change-Id: I6da6a2ebadabf39109534706fb8673b13ae95fff
Reviewed-on: https://chromium-review.googlesource.com/679274
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Reviewed-by: Robert Kaplow <rkaplow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504767}

[modify] https://crrev.com/4450c157f2587532ee7f32ad3f8f80814a324788/tools/metrics/histograms/histograms.xml
,
Sep 29 2017
Done.
,
Oct 2 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 2 2017
I assume #17 didn't need merged to M-62? Is it OK, as long as ToT has the appropriate metric listed? At any rate, I'm seeing some stats on M-62, so LGTM.
,
Oct 2 2017
,
Oct 2 2017
#20: yes, #17 is server-side so it doesn't need merging. Changes to histograms.xml are magically pushed to the UMA servers nightly. Thanks!
,
Jan 22 2018
,
Jan 23 2018
,
Sep 13
Comment 1 by rajatja@google.com, Aug 22 2017