New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 698457 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 9
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Shill should be resilient to wifi interface not being present

Project Member Reported by kirtika@chromium.org, Mar 4 2017

Issue description

Chrome OS R58

Spawned off from  crbug.com/698394 . 
When the wifi interface isn't present because of driver issues,
current shill behavior is to crash and be respawned. 
Eventually something notices that shill is respawning too fast and stops, as per syslog. 
This is bad behavior, as it cuts off the option to recover via usb-ethernet or usb-cellular both of which need shill to work. 

What steps will reproduce the problem?
(1) Boot  9334.0.0 on a 4.4 kernel Intel-wireless board like Eve or Reef. 
(2) Check syslogs for Shill respawning. 

What is the expected result?
Shill should be stable and usb-ethernet connections should work. 

What happens instead?
Chromebook has no networking. If its still in OOBE (no user signed in)
the Chromebook becomes a brick as one can't get past OOBE without
any connectivity. 

 
Cc: diand...@chromium.org
> When the wifi interface isn't present because of driver issues, current shill behavior is to crash and be respawned. 

I believe there are some unresolved issues on Veyron/Broadcom systems that can cause the wifi device to fail to come up.  When this happens in the field, I think we probably want to log whatever information we can, increment an UMA stat, and initiate a graceful reboot.

On test builds it's fine if we skip the reboot, since a developer might want to plug in a USB/Ethernet adapter.  But on production builds with verified boot, I'd rather reboot until the device gets back into a working state, because it is poor UX for a Chromebook to come up in a "living dead" state.

WDYT?
Cc: benchan@chromium.org
>I believe there are some unresolved issues on Veyron/Broadcom systems that can cause the wifi device to fail to come up. 

Not just those systems, the LOG(FATAL) in shill I am talking about is about 50% of our crashes in the field. There are BYT systems where the Intel wifi driver
doesn't initialize - we've not yet been able to figure out whether the cause is one of two open bugs (PCIE jitter or a claimed wifi-bluetooth init race per Intel) or something else. 
As we change anything, we should be careful not to lose that current information that we are getting from the field.

bool ChromeosSupplicantProcessProxy::GetInterface(
    const std::string& ifname, string* rpc_identifier) {
  SLOG(&supplicant_proxy_->GetObjectPath(), 2) << __func__ << ": " << ifname;
  if (!service_available_) {
    LOG(ERROR) << "Supplicant process not present";
    return false;
  }

  dbus::ObjectPath path;
  brillo::ErrorPtr error;
  if (!supplicant_proxy_->GetInterface(ifname, &path, &error)) {
 // This shows up a lot in the field
    LOG(FATAL) << "Failed to get interface " << ifname << ": "
               << error->GetCode() << " " << error->GetMessage();
    return false;  // Make the compiler happy.
  }
  *rpc_identifier = path.value();
  return rpc_identifier;
}






> On test builds it's fine if we skip the reboot, since a developer might want to > plug in a USB/Ethernet adapter.  But on production builds with verified boot, 
> I'd rather reboot until the device gets back into a working state, because it 
> is poor UX for a Chromebook to come up in a "living dead" state.

Are our end users comfortable plugging in usb-eth adapters? If so, we should
make that option available to them else they end up with a brick in this case,
which is only recoverable by getting into dev mode and flashing a good build from USB. 
 

Cc: quiche@chromium.org
I came cross this code and also wondered why ChromeosSupplicantProcessProxy::GetInterface() uses LOG(FATAL) while its caller uses LOG(ERROR):

  if (!supplicant_process_proxy_->CreateInterface(
      create_interface_args, &supplicant_interface_path_)) {
    // Interface might've already been created, attempt to retrieve it.
    if (!supplicant_process_proxy_->GetInterface(link_name(),
                                                 &supplicant_interface_path_)) {
      // TODO(quiche): Is it okay to crash here, if device is missing?
      LOG(ERROR) << __func__ << ": Failed to create interface with supplicant.";
      return;
    }
  }

+quiche seemed to puzzle about the same question before, perhaps he can offer more insights on this.


> As we change anything, we should be careful not to lose that current information that we are getting from the field.

I'd actually like to get more information if possible.  In the case of a missing network interface, the kernel log is far more useful than a shill minidump.

One option is to have shill trigger a kernel panic (from a known function) and let crash_reporter handle it on the next boot.  Although that probably won't leave the stateful partition in a clean state.

> Are our end users comfortable plugging in usb-eth adapters?

Yes, that is one possible workaround for a hardware failure.  I would assume this is primarily power users.  I've seen it mentioned on chromeos-discuss as a way to make use of Chromebooks that have broken wifi modules.

If the bug is infrequent enough, rebooting once on error (and then refusing to auto-reboot on the next attempt if the interface is still missing) might be an acceptable compromise.
>+quiche seemed to puzzle about the same question before, perhaps he can offer more insights on this.

pstew@ provided comments on this very related bug (Intel wifi driver couldn't talk to hardware, the interface went missing and shill crashed): https://bugs.chromium.org/p/chromium/issues/detail?id=594336#c10

"As a result, the WiFi interface disappeared briefly so that when shill asked for it supplicant didn't have it.  It is valid for this to be a fatal exception within shill, since there's no valid path forward for shill if the interface will unexpectedly toggle like this."

Personally, I'd prefer users have a path available to them to get to a stable point, where they can work or update image (which can be made possible if shill accepts loss of wifi but continues to manage eth0) than nothing at all. 

The typical user will fix a broken image via recovery.  So it's more of a question of how to support 1) power users running with broken hardware, and 2) developers.

I'm mostly interested in case (2) but it's worth thinking about (1) as well.
> So it's more of a question of how to support 1) power users running with broken > hardware, and 2) developers.
> I'm mostly interested in case (2) but it's worth thinking about (1) as well.

Adding since I overlooked it myself until now - for (2), we should also consider what happens to the test labs when shill refuses to work. In this case ( crbug.com/698394 ) I was told the servos take care of repairing, but I have no idea of the exact mechanism (is there somewhere the servo can query the last working image and fetch from to apply to the DUT?)


How do the test labs recover if we ship a broken kernel and it bootloops?
FWIW, the reason we didn't trigger a panic (or other reboot mechanism) is that I couldn't think of an obvious way to tell if a problem was transient, or permanent.

In the case of a transient failure (e.g. race condition in initialization), the best thing to do is for shill to restart. On the restart, things come up and the user disruption is minimized.

The downside is as you've observed... In the persistent case, shill's in a crash-loop. I'm not entirely sure that the kernel panic solves this, though. It runs the risk of escalating a shill crash-loop into a system crash-loop. That could be useful in some cases (force a fallback to the previous OS version following a bad update), but harmful in others (user just wants to play an Android game that doesn't need network access).
> ...else they end up with a brick in this case,
which is only recoverable by getting into dev mode and flashing a good build from USB.

dev mode is unnecessary in such a case. Users can run the ChromeOS recovery tool on some other system with the Chrome browser, to flash a USB recovery stick [1][2]. That stick can be used to recover a Chromebook without going into dev mode.

[1] https://support.google.com/chromebook/answer/1080595?hl=en
[2] https://chrome.google.com/webstore/detail/chromebook-recovery-utili/jndclpdbaamdhonoechobihbbiimdgai?hl=en
> FWIW, the reason we didn't trigger a panic (or other reboot mechanism) is that I couldn't think of an obvious way to tell if a problem was transient, or permanent.

I was thinking in terms of leaving a recognizable marker in console-ramoops (or elsewhere) so that on the next boot, we can tell that the system already tried to fix it through rebooting and that did not work.  Hopefully these problems are sufficiently rare that we don't need to reboot several times in a row to work around the bug.

Do we not have any other peripherals that occasionally come up in a brain-dead state?  Even across the whole fleet?  I worked on  bug 458878  a while back, in which the battery fuel gauge chip occasionally became inaccessible until the EC was reset.  It would be surprising if wifi and the fuel gauge were the only two cases of flaky peripherals.

One downside of this approach is that if you're running on permanently damaged hardware, you'll always see an extra panic/reboot.  The other question is how to plumb it up (maybe by having shill and crash_reporter talk to session_manager)?
Agreed that what you propose in #11 avoids a reboot-loop. But it may not do a great job of coping with failures due to race conditions.

My model of these problems is that they are rare in the sense that very few patches cause such problems. But, when such patches land, the failure itself is not rare.

That said, I should admit that "my model" is based on vague recollection, and not hard data.
Jumping a bit late into this, but:

1. I think this bug is about being resilient to the WiFi interface being present but somehow malfunctioning, right?  I'm pretty sure I can boot up a system that has no WiFi at all and it will boot up and shill will work.


NOTE: I'd agree that we should try to be resilient against a malfunctioning WiFi module.  Hardware (especially laptops) eventually breaks and if the user can still get some use out of the device with a $10 Ethernet adapter then that seems nice to let them do.

--

2. I'd probably vote against auto-rebooting the whole system in the case that something went wrong.  We should try to ensure that WiFi comes up robustly and if a small percentage of the time it doesn't then at least the user can reboot themselves.  This contrasts with failures to bring up the UI (where we _do_ reboot one time to see if the UI will come up the second time).  When the UI fails to come up most average users won't know what to do.  If WiFi doesn't come up then (hopefully) user can figure out to turn it off and on again.

--

Of course, the above is just my opinion...
For some reason I have this bug open (tab explosion FTW!), possibly as a potential bug for newcomers to work on? Anyway, I wanted to leave a few notes:

* There are similar problems still happening, e.g.:
https://b.corp.google.com/issues/36264732
It's unclear if many users are seeing this in the field, and recently it's been harder to reproduce for some reason, but it *does* happen. I'm not sure if this is something that needs to be addressed in shill. Flaky drivers are flaky drivers, and trying to force reboots and such seems a bit presumptuous to me.

* Like Doug said, I'm pretty sure shill doesn't always crash in theses cases.

* But I *do* know that if I rapidly add/remove a Wifi driver (e.g., rmmod/modprobe) I can (or at least could, didn't check in the last week or 2) get shill to crash/restart.

* I don't recall any recent cascading failures (where shill sticks in a restart loop) like described in the OP, even amidst some pretty nonstandard tests (rmmod/modprobe loops, reset loops, etc.)
Cc: sjg@chromium.org
Labels: Hotlist-GoodFirstBug

Comment 16 Deleted

Comment 17 Deleted

Cc: -yoshi@chromium.org
Owner: snanda@chromium.org
2.13%.1977
#7 Crash for M64 Beta: 10176.22.0

Reassigning since owner has not viewed in some time.
Owner: cernekee@chromium.org
raise-cc26c0f7 from c#22 and c#18 is an arc-networkd crash, not a shill crash.  I have updated crash.corp to point to bug 736684.
Issue 810195 has been merged into this issue.
Merged per feedback from benchan@.

#2 crash for stable ChromeOS (10176.68). 

10.93% of the crashes (count = 610)
Labels: -Pri-2 Pri-1
Ping?  Increasing impact on the stable release
I was tracing through the code to see if it might be a race condition between shill recognizing the device and supplicant recognizing the device, but I came up empty.  If it's failing both CreateInterface() and GetInterface(), that hints at the problem described in c#5 (wlan0 momentarily appearing, then disappearing).

Note that we'll only hit this crash if a WiFi (subclass of Device) object is created in the first place, so wlan0 has to exist at some point.  shill won't run WiFi::ConnectToSupplicant() until after wlan0 is detected.

Any way to get logs / feedback reports from a unit experiencing the issue?
Cc: abodenha@google.com
cernekee@ I get bit by this a couple times a week. I'll file feedback the next time it happens.

Comment 29 by kirtika@google.com, Feb 20 2018

Hi Albert,
I am very interested in having your device if its hitting this issue regularly. 
What model is it? Do you see the issue fresh after boot/reboot or after resume from suspend? Or is it occurring randomly in the middle of your work?


Sorry for noise. crash/ had a crash with a different signature pointing at this bug for some reason.  Looking closer I see the stacks are totally different.
Kirtika has graciously provided two feedback reports that show this LOG(FATAL) condition:

84957450952 says:

2018-01-20T20:11:33.536153-08:00 INFO shill[7581]: [INFO:modem_manager.cc(59)] Modem manager org.freedesktop.ModemManager1 appeared.
2018-01-20T20:11:33.536382-08:00 INFO shill[7581]: [INFO:wifi.cc(2207)] WPA supplicant appeared.
2018-01-20T20:11:33.536499-08:00 INFO shill[7581]: [INFO:wifi.cc(2289)] wlan0: enabled supplicant: present proxy: null
2018-01-20T20:11:36.352974-08:00 ERR wpa_supplicant[693]: Could not set interface wlan0 flags (UP): Input/output error
2018-01-20T20:11:36.352993-08:00 ERR wpa_supplicant[693]: nl80211: Could not set interface 'wlan0' UP
2018-01-20T20:11:36.353007-08:00 NOTICE wpa_supplicant[693]: nl80211: deinit ifname=wlan0 disabled_11b_rates=0
2018-01-20T20:11:36.353187-08:00 ERR wpa_supplicant[693]: wlan0: Failed to initialize driver interface
2018-01-20T20:11:36.353714-08:00 ERR shill[7581]: [ERROR:object_proxy.cc(582)] Failed to call method: fi.w1.wpa_supplicant1.CreateInterface: object_path= /fi/w1/wpa_supplicant1: fi.w1.wpa_supplicant1.UnknownError: wpa_supplicant couldn't grab this interface.
2018-01-20T20:11:36.353756-08:00 ERR shill[7581]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=fi.w1.wpa_supplicant1.UnknownError, Message=wpa_supplicant couldn't grab this interface.
2018-01-20T20:11:36.353791-08:00 ERR shill[7581]: [ERROR:chromeos_supplicant_process_proxy.cc(130)] Failed to create interface: fi.w1.wpa_supplicant1.UnknownError wpa_supplicant couldn't grab this interface.
2018-01-20T20:11:36.354698-08:00 ERR shill[7581]: [ERROR:object_proxy.cc(582)] Failed to call method: fi.w1.wpa_supplicant1.GetInterface: object_path= /fi/w1/wpa_supplicant1: fi.w1.wpa_supplicant1.InterfaceUnknown: wpa_supplicant knows nothing about this interface.
2018-01-20T20:11:36.354731-08:00 ERR shill[7581]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=fi.w1.wpa_supplicant1.InterfaceUnknown, Message=wpa_supplicant knows nothing about this interface.
2018-01-20T20:11:36.355092-08:00 CRIT shill[7581]: [FATAL:chromeos_supplicant_process_proxy.cc(168)] Failed to get interface wlan0: fi.w1.wpa_supplicant1.InterfaceUnknown wpa_supplicant knows nothing about this interface.#012/usr/lib64/libbase-core-395517.so(bas<IPv6: 9>ug<IPv6: 1>StackTr<IPv6: 10>StackTrace()+0x13) [0x7c0e0b5d1153]#012
2018-01-20T20:11:36.355167-08:00 ERR shill[7582]: [FATAL:chromeos_supplicant_process_proxy.cc(168)] Failed to get interface wlan0: fi.w1.wpa_supplicant1.InterfaceUnknown wpa_supplicant knows nothing about this interface.

syslog shows the same symptoms as http://b/65570878 and bug 667083:

syslog=<multiline>
---------- START ----------
[12] Replay Timer Timeout  
2018-01-20T20:15:31.370361-08:00 INFO kernel: [  805.733576] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.370373-08:00 ERR kernel: [  805.733588] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Transmitter ID)
2018-01-20T20:15:31.370378-08:00 ERR kernel: [  805.733593] pcieport 0000:00:1c.0:   device [8086:0f48] error status/mask=00003001/00002000
2018-01-20T20:15:31.370380-08:00 ERR kernel: [  805.733599] pcieport 0000:00:1c.0:    [ 0] Receiver Error        
2018-01-20T20:15:31.370383-08:00 ERR kernel: [  805.733603] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout  
2018-01-20T20:15:31.370386-08:00 INFO kernel: [  805.733612] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.370389-08:00 DEBUG kernel: [  805.733623] pcieport 0000:00:1c.0: can't find device of ID00e0
2018-01-20T20:15:31.370392-08:00 INFO kernel: [  805.733628] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.370394-08:00 DEBUG kernel: [  805.733637] pcieport 0000:00:1c.0: can't find device of ID00e0
2018-01-20T20:15:31.387353-08:00 INFO kernel: [  805.750790] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.387365-08:00 ERR kernel: [  805.750802] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)
2018-01-20T20:15:31.387369-08:00 ERR kernel: [  805.750808] pcieport 0000:00:1c.0:   device [8086:0f48] error status/mask=00003000/00002000
2018-01-20T20:15:31.387372-08:00 ERR kernel: [  805.750813] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout  
2018-01-20T20:15:31.400355-08:00 INFO kernel: [  805.763353] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.400366-08:00 ERR kernel: [  805.763367] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)
2018-01-20T20:15:31.400371-08:00 ERR kernel: [  805.763373] pcieport 0000:00:1c.0:   device [8086:0f48] error status/mask=00003000/00002000
2018-01-20T20:15:31.400374-08:00 ERR kernel: [  805.763378] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout  
2018-01-20T20:15:31.400376-08:00 INFO kernel: [  805.764005] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
2018-01-20T20:15:31.400379-08:00 ERR kernel: [  805.764016] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
2018-01-20T20:15:31.400382-08:00 ERR kernel: [  805.764021] pcieport 0000:00:1c.0:   device [8086:0f48] error status/mask=00000001/00002000
2018-01-20T20:15:31.400384-08:00 ERR kernel: [  805.764026] pcieport 0000:00:1c.0:    [ 0] Receiver Error        

The shill/supplicant handshake repeated itself a couple of times, then eventually succeeded after this:

2018-01-20T20:15:45.070453-08:00 INFO kernel: [    2.067007] Intel(R) Wireless WiFi driver for Linux
2018-01-20T20:15:45.070460-08:00 INFO kernel: [    2.067016] Copyright(c) 2003- 2015 Intel Corporation
2018-01-20T20:15:45.070467-08:00 WARNING kernel: [    2.070756] iwlwifi 0000:01:00.0: Direct firmware load for iwl-dbg-cfg.ini failed with error -2
2018-01-20T20:15:45.070474-08:00 INFO kernel: [    2.088639] usb 1-2.2: New USB device found, idVendor=0b95, idProduct=1790
2018-01-20T20:15:45.070481-08:00 INFO kernel: [    2.088648] usb 1-2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
2018-01-20T20:15:45.070494-08:00 INFO kernel: [    2.088654] usb 1-2.2: Product: AX88179
2018-01-20T20:15:45.070502-08:00 INFO kernel: [    2.088658] usb 1-2.2: Manufacturer: ASIX Elec. Corp.
2018-01-20T20:15:45.070509-08:00 INFO kernel: [    2.088663] usb 1-2.2: SerialNumber: 3
2018-01-20T20:15:45.070516-08:00 INFO kernel: [    2.103881] cfg80211: World regulatory domain updated:
2018-01-20T20:15:45.070521-08:00 INFO kernel: [    2.103891] cfg80211:  DFS Master region: unset
2018-01-20T20:15:45.070528-08:00 INFO kernel: [    2.103895] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
2018-01-20T20:15:45.070542-08:00 INFO kernel: [    2.103901] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
2018-01-20T20:15:45.070550-08:00 INFO kernel: [    2.103906] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
2018-01-20T20:15:45.070557-08:00 INFO kernel: [    2.103910] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
2018-01-20T20:15:45.070564-08:00 INFO kernel: [    2.103914] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
2018-01-20T20:15:45.070571-08:00 INFO kernel: [    2.103919] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz), (N/A, 2000 mBm), (0 s)
2018-01-20T20:15:45.070578-08:00 INFO kernel: [    2.103923] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
2018-01-20T20:15:45.070584-08:00 INFO kernel: [    2.103928] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
2018-01-20T20:15:45.070597-08:00 INFO kernel: [    2.103932] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
2018-01-20T20:15:45.070606-08:00 INFO kernel: [    2.109358] iwlwifi 0000:01:00.0: loaded firmware version 17.589903.0 op_mode iwlmvm
2018-01-20T20:15:45.116371-08:00 INFO crash_reporter[458]: Enabling user crash handling
2018-01-20T20:15:45.118083-08:00 INFO kernel: [    2.164106] iwlwifi 0000:01:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
2018-01-20T20:15:45.124108-08:00 INFO kernel: [    2.169627] iwlwifi 0000:01:00.0: L1 Enabled - LTR Disabled
2018-01-20T20:15:45.124142-08:00 INFO kernel: [    2.169948] iwlwifi 0000:01:00.0: L1 Enabled - LTR Disabled



77814601491 shows the shill error looping over and over again, never fixing itself.  I don't see any PCIe errors (or anything else that's terribly interesting regarding wlan0) in the kernel log.  Also FWIW this system has been up 8000+ seconds = >2 hours.

The key problem here seems to be that the Intel driver can register a "living dead" wlan0 interface which cannot be brought up:

    2017-10-25T22:55:05.596882-04:00 ERR wpa_supplicant[491]: nl80211: Could not set interface 'wlan0' UP

If the interface was missing entirely, shill and supplicant would be able to cope with that.  Since the interface appears but does not function in the expected way, it violates assumptions in the user code which results in this crash.

c#32: Yes, looks like the same problem.

The good news is that on M68 dev channel, this failure accounts for 97% of all shill crashes so far: https://goto.google.com/iwdlp

(It's good news because it means we fixed most of the other shill crashes, and all that's left is an assert that's known to be triggered by a hardware/driver issue.)
Cc: -quiche@chromium.org
Cc: -cernekee@chromium.org -abodenha@google.com briannorris@chromium.org
Labels: -Pri-1 Pri-2
Owner: ----
Status: Available (was: Assigned)
I think the following crash is the same issue:

https://crash.corp.google.com/browse?q=expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27shill%3A%3AWiFi%3A%3AOnWiFiDebugScopeChanged%27

Thread 0 (id: 0x38f) CRASHED [SIGSEGV @ 0x00000000 ]
0xb6e94406	(shill -wifi.cc:2248 )	shill::WiFi::OnWiFiDebugScopeChanged(bool)
0xb6e953d7	(shill -wifi.cc:2307 )	shill::WiFi::ConnectToSupplicant()
0xb6e9372d	(shill -wifi.cc:2225 )	shill::WiFi::OnSupplicantAppear()
0xb6ad1353	(libbase-core-395517.so -callback.h:397 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&)
0xb6ae88c1	(libbase-core-395517.so -message_loop.cc:478 )	base::MessageLoop::RunTask(base::PendingTask const&)
0xb6ae8af3	(libbase-core-395517.so -message_loop.cc:487 )	base::MessageLoop::DeferOrRunPendingTask(base::PendingTask const&)
0xb6ae8dcf	(libbase-core-395517.so -message_loop.cc:604 )	base::MessageLoop::DoWork()
0xb6aeae1b	(libbase-core-395517.so -message_pump_libevent.cc:224 )	base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)
0xb6ae85e3	(libbase-core-395517.so -message_loop.cc:442 )	base::MessageLoop::RunHandler()
0xb6b0244f	(libbase-core-395517.so -run_loop.cc:35 )	base::RunLoop::Run()
0xb6b6e457	(libbrillo-core-395517.so + 0x0001c457 )	

It seems like for some reason, this is seeing a bit of a spike, but a scattering of reports have gone back pretty far (so it's not an entirely new failure path). Maybe some new device flakiness?
On the bright side (in case the goal is to work on resilience, not fixing driver issues): it's super easy to repro the earlier crash with a loop like the following ('phy2' and 'wlan0' may depend on your system; I had already destroyed/recreated phy0, phy1, so I was left with phy2):

while :
do
  iw dev wlan0 del && iw phy phy2 interface add wlan0 type station || break
  echo "iteration $((i+=1))"
done

I got this:

(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xe72b25b2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#2  0xe72b337a in __GI_abort () at abort.c:89
#3  0xe76965b6 in base::debug::(anonymous namespace)::DebugBreak () at base/debug/debugger_posix.cc:219
#4  base::debug::BreakDebugger () at base/debug/debugger_posix.cc:249
#5  0xe76acdf8 in logging::LogMessage::~LogMessage (this=0xffb7d350) at base/logging.cc:755
#6  0x00906b2a in shill::ChromeosSupplicantProcessProxy::GetInterface (this=0x103b920, ifname=..., rpc_identifier=0xfe5348) at ../../../../../../../tmp/portage/chromeos-base/shill-9999/work/shill-9999/aosp/system/connectivity/shill/dbus/chromeos_supplicant_process_proxy.cc:168
#7  0x009680ca in shill::WiFi::ConnectToSupplicant (this=0xfe5158) at ../../../../../../../tmp/portage/chromeos-base/shill-9999/work/shill-9999/aosp/system/connectivity/shill/wifi/wifi.cc:2412
#8  0x009661aa in shill::WiFi::OnSupplicantAppear (this=0xfe5158) at ../../../../../../../tmp/portage/chromeos-base/shill-9999/work/shill-9999/aosp/system/connectivity/shill/wifi/wifi.cc:2317
#9  0xe7697678 in base::Callback<void (), (base::internal::CopyMode)1>::Run() const (this=<optimized out>) at ./base/callback.h:397
#10 base::debug::TaskAnnotator::RunTask (this=0xf28f98, queue_function=0xe772530d "MessageLoop::PostTask", pending_task=...) at base/debug/task_annotator.cc:51
#11 0xe76b2cc2 in base::MessageLoop::RunTask (this=0xf28ed8, pending_task=...) at base/message_loop/message_loop.cc:478
#12 0xe76b2e9e in base::MessageLoop::DeferOrRunPendingTask (this=0xf28ed8, pending_task=...) at base/message_loop/message_loop.cc:487
#13 0xe76b311a in base::MessageLoop::DoWork (this=0xf28ed8) at base/message_loop/message_loop.cc:604
#14 0xe76b6db0 in base::MessagePumpLibevent::Run (this=0xf28870, delegate=0xf28ed8) at base/message_loop/message_pump_libevent.cc:224
#15 0xe76b29d2 in base::MessageLoop::RunHandler (this=0xf28ed8) at base/message_loop/message_loop.cc:442
#16 0xe76d41d0 in base::RunLoop::Run (this=0xffb7d974) at base/run_loop.cc:35
#17 0xe77591c8 in brillo::BaseMessageLoop::Run (this=0xffb7db68) at ../../../../../../../tmp/portage/chromeos-base/libbrillo-0.0.1-r1327/work/libbrillo-0.0.1/platform2/libbrillo/brillo/message_loops/base_message_loop.cc:212
#18 0xe77526f0 in brillo::Daemon::Run (this=0xffb7db28) at ../../../../../../../tmp/portage/chromeos-base/libbrillo-0.0.1-r1327/work/libbrillo-0.0.1/platform2/libbrillo/brillo/daemons/daemon.cc:31
#19 0x007a6c6a in main (argc=<optimized out>, argv=0xffb7df04) at ../../../../../../../tmp/portage/chromeos-base/shill-9999/work/shill-9999/aosp/system/connectivity/shill/shill_main.cc:259
Labels: M-69
Labels: M-70
Labels: XAct
Labels: M-71
Cc: abhishekbh@chromium.org ejcaruso@chromium.org kirtika@chromium.org
Labels: -Hotlist-GoodFirstBug
Owner: briannorris@chromium.org
Status: Assigned (was: Available)
I'm sorta poking at this one.

AFAICT, it's fine for most cases to just make that LOG(FATAL) into a non-fatal. We could hit this if, e.g., the device is disappearing shortly after it was added -- e.g., with the artificial loop in #36. If the device comes back, we'd get a new Device instance, and things would roll along smoothly.

There are 2 caveats:

1. the crash in #35 shows that a very similar sort of race can yield a segfault -- not good -- which means there's probably another bug hiding in here. Looking through the code, I suspect maybe there's a lifetime issue, where although the WiFi device is being actively destroyed (the only way to destroy the supplicant_process_proxy_ AFAICT, which is evidently reset to NULL) while there's still a WiFi task callback in the message queue.

2. it's *also* super easy to force this crash by making a mac80211 driver's start() method fail -- basically, if the interface is there but fails the transition to IF_UP, then we hit this assertion repeatedly. e.g., this gives me a nice test for reproducing this on v4.4/ar10k:

diff --git a/drivers/net/wireless/ar10k/ath/ath10k/mac.c b/drivers/net/wireless/ar10k/ath/ath10k/mac.c
index f6b151edf8f5..7ad35dcd6a00 100644
--- a/drivers/net/wireless/ar10k/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ar10k/ath/ath10k/mac.c
@@ -3982,12 +3982,21 @@ static int ath10k_set_antenna(struct ieee80211_hw *hw, u32 tx_ant, u32 rx_ant)
 	return ret;
 }
 
+static unsigned int ath10k_failure_count = 1;
+module_param_named(failure_count, ath10k_failure_count, uint, 0644);
+
 static int ath10k_start(struct ieee80211_hw *hw)
 {
 	struct ath10k *ar = hw->priv;
 	u32 param;
 	int ret = 0;
 
+	if (ath10k_failure_count) {
+		WARN(1, "faking ath10k start failure: %u\n", ath10k_failure_count);
+		ath10k_failure_count--;
+		return -EBUSY;
+	}
+
 	/*
 	 * This makes sense only when restarting hw. It is harmless to call
 	 * uncoditionally. This is necessary to make sure no HTT/WMI tx


For #1: I have a decent suspicion as to where the error is, but my C++ isn't quite up there on how exactly to best solve this. (All these callbacks, Bind(), refptrs, ... this ain't C any more!)

For #2: The open question is -- do we expect start() failures to be transient (e.g., like my debug diff above)? If so, then the current behavior, while crude, would successfully bring us up eventually -- restart enough times, and the link may come up. If we just make this non-fatal, then the interface will stay there in the DOWN state yet shill will still think the device is available. The user would have to rescue themselves by toggling Wifi off/on...

So, to resolve stuff like #2 (if "transient failure" is a condition to worry about): then we might want to add yet another "delayed device" handling -- schedule another link-up attempt?

Tips, corrections, and insults are welcome.

Also, I'll hazard a claim that maybe this *isn't* a "first bug" type of issue.
Project Member

Comment 42 by bugdroid1@chromium.org, Dec 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/ee5a32322061d5055f5ab5cdfdf8890ce35155ca

commit ee5a32322061d5055f5ab5cdfdf8890ce35155ca
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Dec 18 04:46:55 2018

shill: wifi: remove netlink handlers upon teardown

We add callbacks to the list but never remove them. This leaks memory if
our WiFi device comes and goes.

Noticed while investigating $BUG, although it's only tangentially
related.

BUG= chromium:698457 
TEST=unit tests; watch +rtnl scope messages

Change-Id: Ie4b3e5d21acc2c328e3cb9aa97d3da2851ee4a17
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1377575
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wifi_test.cc
[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wake_on_wifi.cc
[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wake_on_wifi.h
[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wifi.cc
[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wifi.h
[modify] https://crrev.com/ee5a32322061d5055f5ab5cdfdf8890ce35155ca/shill/wifi/wake_on_wifi_test.cc

Project Member

Comment 43 by bugdroid1@chromium.org, Dec 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/307508703b2deeca464a3e55243e96ae207ab6e7

commit 307508703b2deeca464a3e55243e96ae207ab6e7
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Dec 18 04:46:55 2018

shill: manage lifetime of SupplicantProcessProxy callback

This proxy object maintains its lifetime with the WiFi object, but it
also posts tasks to the event loop using this Closure. This means
there's a possibility of queueing tasks to the event loop that will run
after we are destroyed.

Instead of an unmanaged Unretained(), give the callback a weak pointer,
so the run loop can decide whether or not to run the callback based on
the weak pointer's validity.

This should avoid some nullptr dereferences we've seen, where
OnSupplicantAppear() is called just after its WiFi object was destroyed.

BUG= chromium:698457 
TEST=unit tests; manual test - add/remove Wifi devices (e.g., with 'iw')

Change-Id: Ibd1b14ec51bee8a322ef36a586dd625273cd9a9f
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1377576
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/307508703b2deeca464a3e55243e96ae207ab6e7/shill/wifi/wifi.cc

Project Member

Comment 44 by bugdroid1@chromium.org, Dec 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/06bae3bc74aa2b7b50e98831aa79167120ed6fc6

commit 06bae3bc74aa2b7b50e98831aa79167120ed6fc6
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Dec 18 04:46:54 2018

shill: wifi: construct WeakPtrFactory last

It's good practice to place WeakPtrFactory last in your class
declarations, so that it
(a) gets constructed last (i.e., is only available for farming out
    references after the rest of the object is (mostly) constructed)
(b) gets destroyed first (and so its pointers will be invalid before
    most of the rest of the object gets destroyed).

Do this in preparation of some other bugfixes to the use of weak
pointers.

BUG= chromium:698457 
TEST=unit tests

Change-Id: I36e03d2fe7cca62a7928a11730c1be41ed626db0
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1379312
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/06bae3bc74aa2b7b50e98831aa79167120ed6fc6/shill/wifi/wifi.cc
[modify] https://crrev.com/06bae3bc74aa2b7b50e98831aa79167120ed6fc6/shill/wifi/wifi.h

Project Member

Comment 45 by bugdroid1@chromium.org, Dec 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/c6a518bd91c3936a368cf1e1382c57de627b5ed3

commit c6a518bd91c3936a368cf1e1382c57de627b5ed3
Author: Brian Norris <briannorris@chromium.org>
Date: Mon Dec 24 03:19:27 2018

Revert "shill: manage lifetime of SupplicantProcessProxy callback"

This reverts commit 307508703b2deeca464a3e55243e96ae207ab6e7.
(CL:1377576)

While the use of weak pointers is valid in terms of memory safety, it
has unintended side effects, because I didn't notice we invalidate all
our weak pointers during Stop(). (Incidentally, doing that causes
several other bugs, but they are probably less severe.) In this case,
the result is: we no longer receive Supplicant{Appeared,Vanished}()
events after the first Stop(). This could happen, e.g., if
wpa_supplicant restarts.

We should revisit this an apply a similar bugfix while accounting for
the existing Stop() invalidation, but in the meantime, simply revert.

BUG= chromium:698457 
TEST=Wifi reconnects after `restart wpasupplicant`

Change-Id: I368aedf1f2af1d3bdc7e33c24cc39cc30c8dbcac
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1387993
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>

[modify] https://crrev.com/c6a518bd91c3936a368cf1e1382c57de627b5ed3/shill/wifi/wifi.cc

Project Member

Comment 46 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/d6fa83cc02448d7e78ef3cabeddc59f38fab4741

commit d6fa83cc02448d7e78ef3cabeddc59f38fab4741
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Jan 03 23:04:00 2019

shill: (retry) manage lifetime of SupplicantProcessProxy callback

This is a second stab at https://crrev.com/c/1377576 (reverted
https://crrev.com/c/1387993). Important change: use a new weak pointer
factory with a longer lifetime, since the original factory is
invalidated in Stop(). The supplicant callbacks (unlike many of our
others) need to survive beyond Stop()/(Re)Start().

Original description still applies, + a new (manual) TEST:

This proxy object maintains its lifetime with the WiFi object, but it
also posts tasks to the event loop using this Closure. This means
there's a possibility of queueing tasks to the event loop that will run
after we are destroyed.

Instead of an unmanaged Unretained(), give the callback a weak pointer,
so the run loop can decide whether or not to run the callback based on
the weak pointer's validity.

This should avoid some nullptr dereferences we've seen, where
OnSupplicantAppear() is called just after its WiFi object was destroyed.

BUG= chromium:698457 
TEST=unit tests
TEST=manual test - add/remove Wifi devices (e.g., with 'iw')
TEST=`restart wpasupplicant`; shill re-establishes Wifi services

Change-Id: Ibb7c2ecc25379bf5ed0a96107e9f2ec0e2ab85f6
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1389416
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/d6fa83cc02448d7e78ef3cabeddc59f38fab4741/shill/wifi/wifi.cc
[modify] https://crrev.com/d6fa83cc02448d7e78ef3cabeddc59f38fab4741/shill/wifi/wifi.h

Project Member

Comment 47 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/716218122c03673b3d78f8ca70ae4567b4fe81ae

commit 716218122c03673b3d78f8ca70ae4567b4fe81ae
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Jan 03 23:04:00 2019

shill: wifi: fix other callback lifetimes across Stop()/Restart()

The Mac80211Monitor and DebugScopeChanged callbacks should be available
even after a device Restart, but today, they use the weak pointer
factory that gets invalidated during Stop().

For OnWiFiDebugScopeChanged(), this callback is safe as long as the
process proxy object is valid, which is the same lifetime as the WiFi
object. So there's no need to invalidate it in Stop().

For Mac80211Monitor, it's currently safe enough to keep its lifetime
(the real work under StartScanTimer() is eventually gated by
'enabled()'), but it's much clearer if the callback itself is gated
appropriately.

Note that this uncovers a bad unit test:
ConnectToServiceWithoutRecentIssues was ensuring that we never touched
the debug level at all (rather than checking that we don't increase it).
This is violated in the test TearDown(), where we reset levels to
'-wifi'. This is a desireable behavior, because as long as we still can
talk to wpa_supplicant (even if we're disconnected), we should be
resetting its debug level. Convert this assertion to just checking for
whether we think we're debugging the connection.

BUG= chromium:698457 
TEST=unit tests
TEST=watch scan behavior while connected or disconnected, and restarting
     wpasupplicant

Change-Id: Ibca1acde34e56b9608557a759f71f5c4c04351d6
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1389417
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/716218122c03673b3d78f8ca70ae4567b4fe81ae/shill/wifi/wifi.cc
[modify] https://crrev.com/716218122c03673b3d78f8ca70ae4567b4fe81ae/shill/wifi/wifi.h
[modify] https://crrev.com/716218122c03673b3d78f8ca70ae4567b4fe81ae/shill/wifi/wifi_test.cc

Project Member

Comment 48 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/9b1b03f008484e18ade0a8b2265df759e4130ae2

commit 9b1b03f008484e18ade0a8b2265df759e4130ae2
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Jan 03 23:04:00 2019

shill: wifi: rename weak_ptr_factory_

Common weak pointer usage aligns an object's weak pointer lifetime with
the lifetime of the object itself, but the usage in WiFi is a bit
non-traditional, instead following the lifetime of device
Start()/Stop(). This has helped to introduce subtle bugs that only
appear after we disable/re-enable the device (e.g., in the network UI).

I plan to introduce a new weak pointer factory that follows the WiFi
object lifetime (and fixes several bugs), so let's rename the existing
one. I do this in a separate CL to separate the functional changes from
this noisy rename/refactor.

BUG= chromium:698457 
TEST=unit tests

Change-Id: Ice5f4fac5132fc180779ea6bd086eeffad9fcfa0
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1393908
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/9b1b03f008484e18ade0a8b2265df759e4130ae2/shill/wifi/wifi.cc
[modify] https://crrev.com/9b1b03f008484e18ade0a8b2265df759e4130ae2/shill/wifi/wifi.h
[modify] https://crrev.com/9b1b03f008484e18ade0a8b2265df759e4130ae2/shill/wifi/wifi_test.cc

Project Member

Comment 49 by bugdroid1@chromium.org, Jan 8

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e

commit 6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Jan 08 22:33:29 2019

shill: wifi: don't crash; just retry wpa_supplicant CreateInterface

We should not crash simply because wpa_supplicant failed to bring up the
interface. The supplicant can fail for a number reasons, including (most
commonly) a Wifi driver failing to reach IFF_UP. This is often permanent
(e.g., driver bug or hardware failure), and we definitely shouldn't put
shill into an abort/respawn loop, as this just prevents stable
connectivity via other means (e.g., ethernet or cellular).

Instead, let's just schedule an arbitrary interval at which shill will
retry bringing up the interface, and try that a few times.

If the device eventually disappears (e.g., Wifi firmware crashes and
kernel device completely unregisters itself), the entire WiFi device
will get removed and invalidate any outsanding callbacks.

If wpa_supplicant restarts, we'll see OnSupplicantVanish() which will
terminate the retry loop.

If the user disables WiFi in the tray, we'll terminate via
'Device::enabled() == false'.

If we don't succeed within this arbitrary retry loop, we
non-persistently disable the interface. This allows the user to
re-enable the device, if somehow the device recovers later.

I also add a metric (Network.Shill.WiFi.SupplicantAttempts) to log
either success-after-N-attempts or a large ("Max") value for failure.
This can help us continue to identify device populations that are having
lots of driver failures in the field, as well as evaluate whether this
is truly transient (e.g., succeeds after a few retries) or permanent.

NB: after studying more of the existing device issues that hit this
LOG(FATAL) behavior in shill, I suspect that these failures do *not*
tend to be transient and instead require some other intervention to
recover the device. (We're working on these interventions elsewhere.) If
the metrics tell us the retries are never useful, perhaps we can scale
this back to simply stop trying after the first failure.

BUG= chromium:698457 
TEST=force a mac80211 Wifi driver to fail its start() method a few times
     -- see shill retry to bring it up
TEST=manually re-enable Wifi (e.g., in UI) and see shill try to
     reconnect a failed device
TEST=unit tests
TEST=watch chrome://histograms

Change-Id: I1c735b474609dfe0de437f6eb65b1b74fae33aab
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1383253
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>

[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/metrics.cc
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/mock_metrics.h
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/wifi/wifi_test.cc
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/wifi/wifi.cc
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/wifi/wifi.h
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/dbus/chromeos_supplicant_process_proxy.cc
[modify] https://crrev.com/6cae1e7f9a85ddc9ada243761fbf4d6a83b9334e/shill/metrics.h

Cc: -abhishekbh@chromium.org
Project Member

Comment 51 by bugdroid1@chromium.org, Jan 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9cabe959346e10ba92042b1997bcdcec93c2dbff

commit 9cabe959346e10ba92042b1997bcdcec93c2dbff
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 09 01:31:20 2019

Create histogram for shill's attempts to connect to supplicant

We might retry setting up a WiFi device via WPA supplicant a few times.
This adds a metric to count how many attempts are required before
success -- or, if we give up, we report a "Max" attempt number to mean
failure. Implemented in CL:1383253.

Bug:  698457 
Change-Id: I091f49ca112d0e48de83c6662088927db1b72c51
Reviewed-on: https://chromium-review.googlesource.com/c/1400263
Commit-Queue: Brian Norris <briannorris@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#620984}
[modify] https://crrev.com/9cabe959346e10ba92042b1997bcdcec93c2dbff/tools/metrics/histograms/histograms.xml

Labels: -M-67 -M-64 -M-69 -M-70 -M-71 M-73
Status: Fixed (was: Assigned)
There's one small update to the metric parameters still in flight, but otherwise, this should be Fixed [0].

This doesn't mean that all the underlying device/driver issues are gone [1], of course, but it should make the system a little more robust. We should look at the new Network.Shill.WiFi.SupplicantAttempts metric (once a Chrome uprev goes through) to monitor how devices are doing in the field, in lieu of watching the shill crash statistic.

Also, for the record: I have suspicion that if we fail to bring the device up once we never will (i.e., SupplicantAttempts will always either be 1 or MAX), but I'd like to watch the statistics for this. If we determine the retries aren't helpful at all, then we can simplify the logic and just disable the device the first time it fails like this.

Say goodbye to one of the top-20 crash reports on ChromeOS!

[0] For some definition of "wifi interface not being present", in $subject. This was centered around failure to start() the interface. There are other failures (e.g., missing nl80211 protocol) that we don't handle well still.

[1] Some sampling of outstanding bugs for device failures:
crbug.com/808058
crbug.com/904025
https://issuetracker.google.com/119271221
crbug.com/918190
Project Member

Comment 53 by bugdroid1@chromium.org, Jan 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/896b7f1e6f9d05269b343a5ad558133da943f0fd

commit 896b7f1e6f9d05269b343a5ad558133da943f0fd
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 09 19:55:38 2019

shill: update SupplicantAttempts bucketization

To properly record a linear histogram of [1, 10), we should also include
buckets for over- and under-flow.

BUG= chromium:698457 
TEST=watch chrome://histograms

Change-Id: If710be9c27f953612dca35505ff0e8e146c37a15
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1401351
Reviewed-by: Luigi Semenzato <semenzato@chromium.org>

[modify] https://crrev.com/896b7f1e6f9d05269b343a5ad558133da943f0fd/shill/metrics.cc

Sign in to add a comment