
Issue 758016 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Task




Figure out how often wifi hardware fails to come up

Project Member Reported by cernekee@chromium.org, Aug 22 2017

Issue description

On Caroline, we recently encountered a dogfood unit on which the PCIe wifi module was not detected by coreboot (attached; see feedback 71346758639 for the complete log).  It would be good to understand how commonplace this problem is on various Chrome OS devices, to see if some drivers / designs / modules / etc. have a recurring problem.

UMA provides the DevicePresenceStatus metric, but it is unclear to me whether this accurately records the absence of wifi when the system has no functioning network interfaces:

https://screenshot.googleplex.com/MKWmn510dbm
https://screenshot.googleplex.com/LaKORpv75Hf

Can we figure out whether this metric is cached across a reboot and reported later, or if it is lost forever if the user reboots before connectivity is somehow re-established?
 
Attachments: good.txt (59.1 KB), bad.txt (59.1 KB)

Comment 1 by rajatja@google.com, Aug 22 2017

Cc: furquan@chromium.org
Status: Started (was: Untriaged)
I am still investigating, but so far it seems that UMA samples are propagated across reboots with high probability.  There are three parts to the transport:

1. the Chrome OS metrics library appends samples to /var/lib/metrics/uma-events, which AFAICT persists across clean reboots.

2. Chrome collects samples from that file 30s after startup and at 30s intervals after that.  It truncates the file at every collection. (Adding and deleting samples is synchronized via flock() so there is no race there).

3. Chrome sends the samples to the UMA servers after a longish interval, but it also saves them in a local file in case connectivity is lost or the browser crashes.  In that case, Chrome attempts to send those samples at a later session.

There is a race in Chrome between getting the samples and writing them to the "backup" file, but I don't know how bad it is.  I have asked the UMA folks and will report here.

For now it seems reasonable to assume that the samples reflect reality.  (Minus other bugs, of course.)
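
The hand-off between parts 1 and 2 can be sketched in shell as below. This is only an illustration of flock()-synchronized append and collect-then-truncate, not the metrics library's or Chrome's actual code. It assumes the util-linux `flock` command is available, uses a stand-in path (`EVENTS_FILE`) instead of /var/lib/metrics/uma-events, and uses plain text lines where the real file has its own record format.

```shell
#!/bin/sh
# Illustrative only: EVENTS_FILE is a stand-in for /var/lib/metrics/uma-events,
# and one text line per sample is an assumed, simplified record format.
EVENTS_FILE="${EVENTS_FILE:-/tmp/uma-events-demo}"

# Producer side: append one sample while holding an exclusive lock on fd 9.
append_sample() {
  (
    flock -x 9
    printf '%s\n' "$1" >&9
  ) 9>>"$EVENTS_FILE"
}

# Consumer side: print every pending sample, then truncate the file,
# all under the same lock so no sample is appended between the two steps.
drain_samples() {
  (
    flock -x 9
    cat "$EVENTS_FILE"
    : > "$EVENTS_FILE"
  ) 9>>"$EVENTS_FILE"
}
```

Because both sides take the same exclusive lock, a sample is either collected in the current pass or left in the file for the next one.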

Cc: briannorris@chromium.org
Heh, I was about to file a similar bug after looking through the UMA metrics and not finding them sufficient. For me, even if those stats are accurate, they don't quite measure what I care about. I'm interested in normalizing against "number of boots", not against "time running". I think it's safe to say that if a user boots up to find no Wifi device, they aren't likely to keep the system on for long ("try rebooting!"). And furthermore, for these types of problems, the Wifi device is likely to stay in whatever state it started in -- if we fail to detect it at boot, we're not likely to detect it later; and if we successfully probe it at boot, it's not very likely to disappear (this does happen of course, but those are a different class of issue).

So, I'd be interested in a metric that tracks a fixed number of samples per boot cycle. I think this metric would be more illuminating both for the Caroline issues described above, and for some Kevin issues I've looked at.

For reference, I see bugs like this on Kevin:

https://bugs.chromium.org/p/chromium/issues/detail?id=693724
which is tracked internally here:
https://b.corp.google.com/issues/36264732
(Finally seeing some progress! Yay!)

Basically, it looks like there's some flakiness in the firmware load sequence, such that the firmware is likely to die at boot if there's external noise on certain channels.

So WDYT? Is that a reasonable metric to add? Or even better: is there a way to filter existing metrics to produce my desired result?
I have a partial answer to my queries from comment #2.  When Chrome reads histogram samples from /var/lib/metrics/uma-events, it immediately appends them to a memory-mapped file which is maintained across clean reboots.  From there they are read by an uploader, which also stores any unsent logs in the file system (in Local State prefs).  This state is also preserved across clean reboots (i.e. samples may be lost on a system crash).

If the unsent logs grow beyond a certain size, further samples are discarded.  I am trying to find out what the size is, but it's probably reasonably large.  So the answer to the original question is: yes, almost all metrics are cached across reboots.

For comment #3: it may be possible to use dremel to correlate samples sent only at boot with wifi functionality samples (by looking at time stamps) but I haven't touched dremel in years.  It is fairly easy to create a new histogram that logs the state of wifi with samples at, say, 1, 2, and 5 minutes after boot, and even whether connectivity was ever present in some time interval.  Just start a small shell script from upstart and send the samples with metrics_client.
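
A minimal sketch of such a script, under several assumptions: the histogram name prefix and the per-offset suffix are hypothetical, `SEND_CMD` is a hypothetical indirection that defaults to `echo` for a dry run, and `SYSFS_NET` exists only so the check can be pointed at a test directory. It samples device presence rather than connectivity. The metrics_client arguments are deliberately not shown; check its usage text before wiring it in.

```shell
#!/bin/sh
# Hypothetical names throughout: SEND_CMD, HISTOGRAM and SYSFS_NET are
# illustration-only knobs, not part of any existing Chrome OS script.
SEND_CMD="${SEND_CMD:-echo}"
HISTOGRAM="${HISTOGRAM:-Platform.WifiPresentAfterBoot}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Succeeds if at least one network device reports DEVTYPE=wlan.
wifi_present() {
  grep -q "^DEVTYPE=wlan" "$SYSFS_NET"/*/uevent 2>/dev/null
}

# sample_at_offsets "60 120 300"
# Sleeps until each offset (seconds after the script started, ascending)
# and emits one sample per offset: 1 if a wifi device is present, else 0.
sample_at_offsets() {
  elapsed=0
  for offset in $1; do
    sleep "$((offset - elapsed))"
    elapsed="$offset"
    if wifi_present; then present=1; else present=0; fi
    $SEND_CMD "${HISTOGRAM}.${offset}s" "$present"
  done
}
```

Calling `sample_at_offsets "60 120 300"` from an upstart job would then emit three samples per boot, which also gives the fixed number of samples per boot cycle asked for in comment #3.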


Cc: holte@chromium.org
From Steven:

https://cs.chromium.org/chromium/src/components/metrics/metrics_log_store.cc

You should typically be able to store logs from at least 4 chrome sessions, since while you are offline, you would usually generate at most 2 logs per session (startup + one on clean shutdown).

Additionally, the metrics daemon has an API for "persistent integers" used to propagate some counters across reboots.  If useful, we can use that to keep track of how many boots in a row did not achieve connectivity before the session that finally did.
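
To illustrate the idea only: the sketch below keeps a boot-streak counter in a plain file. It does not use the metrics daemon's persistent-integer API; `COUNTER_FILE` and the function names are hypothetical stand-ins.

```shell
#!/bin/sh
# Hypothetical stand-in for a persistent integer: a counter kept in a file
# that survives reboots. COUNTER_FILE is an assumed path.
COUNTER_FILE="${COUNTER_FILE:-/tmp/boots-without-connectivity}"

# Print the current counter value (0 if the file does not exist yet).
read_counter() {
  if [ -r "$COUNTER_FILE" ]; then cat "$COUNTER_FILE"; else echo 0; fi
}

# Call once for each boot that ends without connectivity.
record_offline_boot() {
  n="$(read_counter)"
  echo "$((n + 1))" > "$COUNTER_FILE"
}

# Call when a session first achieves connectivity: prints how many
# consecutive boots went without it, then resets the counter.
report_and_reset() {
  read_counter
  echo 0 > "$COUNTER_FILE"
}
```

The value printed by `report_and_reset` is what would be sent as a sample by the session that finally gets online.
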
Cc: matthewmwang@chromium.org
> It is fairly easy to create a new histogram that logs the state of wifi with samples at, say, 1, 2, and 5 minutes after boot, and even whether connectivity was ever present in some time interval.  Just start a small shell script from upstart and send the samples with metrics_client.

Here is the script I added in 2015 to track the infamous Pixel 1 disappearing battery bug:

https://bugs.chromium.org/p/chromium/issues/detail?id=458878#c30

Maybe we should remove it now that the bug is fixed... or maybe we should add other checks like "missing wifi device" to it.  What do you think?


BTW, a simple test to see if there is at least one wifi device on the system:

grep -q "^DEVTYPE=wlan" /sys/class/net/*/uevent ; echo $?
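
The same check can be extended from "at least one" to a count, which is closer to what a device-count histogram needs. This is a sketch, not the code from any CL; the function name and the optional directory argument (there so it can be tried against a fake sysfs tree) are assumptions.

```shell
#!/bin/sh
# count_wifi_devices [sysfs_net_dir]
# Prints how many network devices report DEVTYPE=wlan in their uevent file.
# The directory argument is an illustration-only hook; it defaults to the
# real sysfs location.
count_wifi_devices() {
  dir="${1:-/sys/class/net}"
  count=0
  for uevent in "$dir"/*/uevent; do
    # Skip the unexpanded glob (no devices) and unreadable entries.
    [ -r "$uevent" ] || continue
    if grep -q "^DEVTYPE=wlan" "$uevent"; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

A result of 0 after the drivers have had time to load is the "wifi hardware failed to come up" case this bug is about.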

> maybe we should add other checks like "missing wifi device" to it

I like that. One difficulty is that I'd want to wait some reasonable time after boot, since Wifi drivers

(a) often aren't built-in (so they'll sometimes load later -- although I think we have optimizations to try to load them manually before "the rest"?) and
(b) often load their FW asynchronously; so the interface doesn't show up until some unspecified time after modprobe.


"start on started system-services" seems a little early for that, IIUC, unless we just throw in some extra "sleep" or something.
Agreed, the current deps for this upstart job are probably not correct for wifi driver checks.

Apparently there was an "upstart-time-bridge" project in the works[0].  Not sure if it ever got finished (I don't see a command or a USE flag for it).  If not, adding `sleep 10` may suffice.

[0] https://www.linuxplumbersconf.org/2013/ocw/system/presentations/1527/original/upstart-roadmap-plumbers-2013.pdf
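
As an alternative to a fixed `sleep 10`, the job could poll for the device with an upper bound on the wait. The sketch below is one way to do that; the function name, the one-second poll interval, and the optional directory argument are assumptions for illustration.

```shell
#!/bin/sh
# wait_for_wifi timeout_seconds [sysfs_net_dir]
# Polls once per second until a device with DEVTYPE=wlan appears.
# Returns 0 as soon as one is found, 1 if none shows up within the timeout.
wait_for_wifi() {
  timeout="$1"
  dir="${2:-/sys/class/net}"
  waited=0
  while :; do
    if grep -q "^DEVTYPE=wlan" "$dir"/*/uevent 2>/dev/null; then
      return 0
    fi
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}
```

Polling returns early on healthy systems and only spends the full timeout when the device is missing, at the cost of making the sample time vary from boot to boot.
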
For the record, the link in #7 points to a histograms.xml CL.  I think the actual script is here:

https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/272278/5/init/send-kernel-errors.conf

#9: yes it would be easy to sample at 10s after boot.  However, is it possible that connectivity appears later?  But if we wait too long, the user might reboot...
Sorry for the confusion, this is not about connectivity, just the device being present or not.

CL at https://chromium-review.googlesource.com/#/c/chromiumos/platform2/+/675908

Project Member

Comment 12 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/70996250916043cd5e54804c59ecd1aa135242a8

commit 70996250916043cd5e54804c59ecd1aa135242a8
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Fri Sep 22 02:38:27 2017

init: collect Platform.WifiDeviceCount sample.

This reports how many wifi devices the kernel has loaded at boot.
We expect samples with value 1 in most cases.

BUG= chromium:758016 
TEST=ran on chromebook, observed sample with value 1.

Change-Id: I8d56cde8bf257d185a240e32d3dfb8a4b9d6894c
Reviewed-on: https://chromium-review.googlesource.com/675908
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/70996250916043cd5e54804c59ecd1aa135242a8/init/upstart/send-kernel-errors.conf

Labels: Merge-Request-62
Project Member

Comment 14 by sheriffbot@chromium.org, Sep 25 2017

Labels: -Merge-Request-62 Merge-Review-62 Hotlist-Merge-Review
This bug requires manual review: M62 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-62 Merge-Approved-62
Approved for 62.
Project Member

Comment 16 by bugdroid1@chromium.org, Sep 27 2017

Labels: merge-merged-release-R62-9901.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/fc8ff1feb4e0fed6190138b55f60957847cba5ac

commit fc8ff1feb4e0fed6190138b55f60957847cba5ac
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Sep 27 18:41:12 2017

init: collect Platform.WifiDeviceCount sample.

This reports how many wifi devices the kernel has loaded at boot.
We expect samples with value 1 in most cases.

BUG= chromium:758016 
TEST=ran on chromebook, observed sample with value 1.

Change-Id: I8d56cde8bf257d185a240e32d3dfb8a4b9d6894c
Reviewed-on: https://chromium-review.googlesource.com/675908
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
(cherry picked from commit 70996250916043cd5e54804c59ecd1aa135242a8)
Reviewed-on: https://chromium-review.googlesource.com/681617
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>

[modify] https://crrev.com/fc8ff1feb4e0fed6190138b55f60957847cba5ac/init/upstart/send-kernel-errors.conf

Project Member

Comment 17 by bugdroid1@chromium.org, Sep 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4450c157f2587532ee7f32ad3f8f80814a324788

commit 4450c157f2587532ee7f32ad3f8f80814a324788
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Sep 27 21:03:20 2017

histograms.xml: add Platform.WiFiDeviceCount

BUG= chromium:758016 
TEST=none

Change-Id: I6da6a2ebadabf39109534706fb8673b13ae95fff
Reviewed-on: https://chromium-review.googlesource.com/679274
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Reviewed-by: Robert Kaplow <rkaplow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504767}
[modify] https://crrev.com/4450c157f2587532ee7f32ad3f8f80814a324788/tools/metrics/histograms/histograms.xml

Status: Fixed (was: Started)
Done.
Project Member

Comment 19 by sheriffbot@chromium.org, Oct 2 2017

Cc: bhthompson@google.com
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-62 Merge-Merged
I assume #17 didn't need to be merged to M-62? Is it OK, as long as ToT has the appropriate metric listed? At any rate, I'm seeing some stats on M-62, so LGTM.
Labels: -Hotlist-Merge-Review
#20: yes, #17 is server-side so it doesn't need merging.  Changes to histograms.xml are magically pushed to the UMA servers nightly.  Thanks!

Comment 23 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 24 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)
Status: Verified (was: Fixed)
