New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 918190 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 922232



Sign in to add a comment

[eve 11151.61.0]: [Assert] shill::ChromeosSupplicantProcessProxy::GetInterface - raise-2d5dffe3

Project Member Reported by dgagnon@google.com, Dec 28

Issue description

Stack Quality100%Show frame trust levels
0x00007dab4b80ddd2	(libc-2.23.so -raise.c:54 )	raise
0x00007dab4b80fbf5	(libc-2.23.so -abort.c:89 )	abort
0x00007dab4c34c924	(libbase-core-395517.so -debugger_posix.cc:249 )	base::debug::BreakDebugger()
0x00007dab4c3681bb	(libbase-core-395517.so -logging.cc:755 )	logging::LogMessage::~LogMessage()
0x000057e69471b3e1	(shill -chromeos_supplicant_process_proxy.cc:156 )	shill::ChromeosSupplicantProcessProxy::GetInterface(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)
0x000057e6947a4d0e	(shill -wifi.cc:2437 )	shill::WiFi::ConnectToSupplicant()
0x000057e6947a1c2c	(shill -wifi.cc:2342 )	shill::WiFi::OnSupplicantAppear()
0x00007dab4c34ef70	(libbase-core-395517.so -callback.h:397 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&)
0x00007dab4c36edbd	(libbase-core-395517.so -message_loop.cc:478 )	base::MessageLoop::RunTask(base::PendingTask const&)
0x00007dab4c36f10a	(libbase-core-395517.so -message_loop.cc:487 )	base::MessageLoop::DeferOrRunPendingTask(base::PendingTask const&)
0x00007dab4c36f52a	(libbase-core-395517.so -message_loop.cc:604 )	base::MessageLoop::DoWork()
0x00007dab4c3724e8	(libbase-core-395517.so -message_pump_libevent.cc:224 )	base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)
0x00007dab4c36e9b7	(libbase-core-395517.so -message_loop.cc:442 )	base::MessageLoop::RunHandler()
0x00007dab4c3919a4	(libbase-core-395517.so -run_loop.cc:35 )	base::RunLoop::Run()
0x00007dab4c5479a2	(libbrillo-core-395517.so -base_message_loop.cc:212 )	brillo::BaseMessageLoop::Run()
0x00007dab4c53e953	(libbrillo-core-395517.so -daemon.cc:31 )	brillo::Daemon::Run()
0x000057e69453dea0	(shill -shill_main.cc:247 )	main
0x00007dab4b7fa735	(libc-2.23.so -libc-start.c:289 )	__libc_start_main
0x000057e69453cf18	(shill + 0x00019f18 )	_start
0x00007ffde31822b7		
This issue currently accounts for 12%+ of all crashes on 11151.61.0 and affects 73 unique clients.
Labels: OS-Chrome
Owner: kirtika@chromium.org
Status: Assigned (was: Untriaged)
This is essentially a duplicate of  bug 698457 , but depending on what exactly you care about, perhaps we shouldn't mark this duplicate.

This should fix one related problem:

https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1389416

and this should "fix" another:

https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1383253

but that's really just papering over a symptom of an already-bad problem -- the Wifi driver must be malfunctioning such that it (temporarily, I hope?) doesn't work at all.

If you really care about the quality of the experience here, we probably shouldn't be worried only about the shill crash, but about the Wifi driver behavior.

> 12%+ of all crashes on 11151.61.0

That's big!

> and affects 73 unique clients.

That's tiny! So we probably have some small segment of devices whose Wifi is persistently bad. If the Wifi is non-recoverable, shill will continue to crash loop forever.

I'm going to assign this to Kirtika for now, to look at any Eve-specific driver problems.
BTW, do we have any logs for these? e.g., feedback reports? The crash reports themselves are essentially useless.
@5: I'm not sure. (I forgot crash/ links some reports like that.) Many of those don't have the right context still, as the "crash" occurred much earlier.

At least one pointed to non-Eve issues that I already reported here:

https://buganizer.corp.google.com/issues/119271221#comment6

I haven't yet found an Eve report that has the right context.
Regarding Eve, I could post details here but I'd have to mark it RVG. where do you want a follow-up? We have a known PCI drop-off there basically, and am going to add the rescan work-around that we have on Cyan. 

Cc: rajatja@chromium.org
Summary: Eve: [Assert] shill::ChromeosSupplicantProcessProxy::GetInterface - raise-2d5dffe3 (was: [Assert] shill::ChromeosSupplicantProcessProxy::GetInterface - raise-2d5dffe3)
BTW, if this is primarily a concern for Eve, then I'd look at these stats

https://crash.corp.google.com/browse?q=product_name%3D%27ChromeOS%27+AND+stable_signature%3D%27raise-2d5dffe3%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28productdata%29+WHERE+Key%3D%27hwclass%27+AND+STRPOS%28Value%2C+%27EVE%27%29+%3E+0%29

and say: this looks like a consistent problem that isn't occurring all that much more frequently now. i.e., not really a SW regression.

---

Also, Kirtika dug up a report that has enough log context on Eve, and it does seem to have a similar beahvior as [1]:

https://listnr.corp.google.com/product/208/report/85856927137

[15540.174625] iwlwifi 0000:01:00.0: Failed to wake NIC for hcmd
[15540.174719] iwlwifi 0000:01:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5

After the above, the device is basically non-responsive, and after a suspend/resume attempt, we fail to bring the link back up at all:

[16404.852340] iwlwifi 0000:01:00.0: Error, can not clear persistence bit

which puts us into the known LOG(FATAL) in shill

[16404.993367] init: shill main process (1642) killed by ABRT signal
[16404.993414] init: shill main process ended, respawning

The abort/respawn goes on ad infinum and never recovers.

> I could post details here but I'd have to mark it RVG. where do you want a follow-up?

I'm not sure we really need the hardware details here. Just the SW-relevant details (e.g., how can we work around this?). And those details seem perfectly fine to be on this bug IMO. Or you can file a partner bug to track the HW details, if that's really needed. (We ain't changing the HW now...)

> We have a known PCI drop-off there basically, and am going to add the rescan work-around that we have on Cyan. 

Are you sure this is the same SW behavior? I see your rescan rules look like this:

SUBSYSTEM=="pci", ACTION=="remove", ATTRS{vendor}=="0x8086", RUN+="/sbin/initctl emit --no-wait pci-rescan-to-revive-wifi"

which means the PCI device has to actually disappear ("remove"). In this case, the device still seems to be present, but it just fails to do anything useful, and instead puts us in a shill respawn/retry loop that goes nowhere.

I think I'd need to see that rescan logic actually work for this Eve issue before I'd encourage using it.

Do we have any devices on hand that see this issue often enough?

[1] https://buganizer.corp.google.com/issues/119271221
> which means the PCI device has to actually disappear ("remove"). In this case, the device still seems to be present, but it just fails to do anything useful, and instead puts us in a shill respawn/retry loop that goes nowhere.

I guess I should repeat my comment from here:

https://buganizer.corp.google.com/issues/119271221#comment11

This code actually causes the *driver* to self-remove the device, if we enable the right command line parameters:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/808978/9/drivers/net/wireless/iwl7000/iwlwifi/pcie/trans.c#2059

So I guess it would be up to Kirtika to investigate whether that is *really* the right thing for Eve. So far, I believe it might work, but I'd like to see real evidence. Even better, a comment from Intel (e.g., on a partner bug like https://buganizer.corp.google.com/issues/119271221).

---

A separate concern is: how high priority is this *really*? Basically, is this a true SW regression in functionality, or is this just exposed by a shift in how the same basic symptom (network failures due to bad PCIe designs).

Toward the former: if things limped along OK (and we could retain connectivity) before Core38 [1], and now Wifi is even more dead, then we might be in bad shape (i.e., high priority).

Toward the latter: PCIe may have always been misbehaving occasionally on these systems (i.e., no connectivity), but only now do those show up as shill crashes.

I'm leaning toward the latter based on my understanding, so I don't think this should be a release blocker. It probably should remain high priority.

Side note: if this was indeed exacerbated by Core38...that's been live on M70 already.

[1] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1183859
@dgagnon:

> We bumped up the priority as this needs to be resolved by the next stable.

What "priority" are you bumping, BTW? You only just filed the bug :D
Cc: -djkurtz@chromium.org -semenzato@chromium.org
Per #10 we created the bug with higher priority than normal given the impact and need to get it resolved.  The data originated for go/cros-crash on a search for the eve stable version...
Re: the counts on #3:  The counts were compiled shortly after a percentage push to eve-only devices.   Eve is only pushed at 20% as of last night, so I expect the absolute counts to be low (and perhaps misleading).  The percentage of the overall crash landscape is a more meaningful metric.

Current counts are 13.46%, 2763 crashes, 147 unique clients (again, we just went to 20% and this is an eve-only push).

And yep, we'd like to address the root cause rather than papering over a crash statistic.
Summary: [eve 11151.61.0]: [Assert] shill::ChromeosSupplicantProcessProxy::GetInterface - raise-2d5dffe3 (was: Eve: [Assert] shill::ChromeosSupplicantProcessProxy::GetInterface - raise-2d5dffe3)
Cc: stevehuang@chromium.org
Labels: XAct
Assume owners are out this week (anyone to cover?); bumping since trying to make an eventual stable refresh.  Thanks
Not really out yet -- just busy, and (like me) not agreeing with RBS :D

This behavior really doesn't seem to be new. This particular signature (restricted to Eve -- see the crash/ link in #9) is at 13% of 10895.56.0 and 10.9% of 11021.56.0. It's currently at 13% of 11151.61.0.

In the meantime, I'm idly seeing if I can reproduce this in the background without success. I'd rather not rush to add the workaround we discussed in #7/#8/#9 without a little more knowledge.

BTW, for impact context: this is probably fixed by a reboot. If it isn't fixed by a reboot, then the workaround we *might* try wouldn't help either.
Labels: -ReleaseBlock-Stable
Owner: briannorris@chromium.org
[Chatted with Kirtika about this enough. She won't be getting to this soon.]

I'm pretty confident this is not a regression. It existed before Core38 and has been present in multiple stable releases. The root cause is some sort of PCIe failures (there are many known forms of this, not necessarily specific to Eve; there may be special Eve issues as well), and they typically show up as a kernel signature like this:

https://crash.corp.google.com/browse?q=product_name%3D%27ChromeOS%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28productdata%29+WHERE+Key%3D%27exec_name%27+AND+Value%3D%27kernel-wifi-warning%27%29+AND+STRPOS%28expanded_custom_data.ChromeCrashProto.magic_signature_1.name%2C+%27iwl_trans_pcie_grab_nic_access%27%29+>+0+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28productdata%29+WHERE+Key%3D%27hwclass%27+AND+Value+LIKE+%27EVE%25%27%29

You'll notice that there are no spikes in this -- Eve's failure rate has been basically constant.

We still need to investigate more, and there's basically no way we're "fixing" this on M71. Removing R-B-S.

BTW, I managed to scrape another interesting failure out of an unrelated feedback report:

https://listnr.corp.google.com/product/208/report/85874209121

2018-12-28T19:57:31.790024-05:00 ERR kernel: [ 3190.825229] iwlwifi 0000:01:00.0: Failed to wake NIC for hcmd
2018-12-28T19:57:31.790866-05:00 ERR kernel: [ 3190.825640] iwlwifi 0000:01:00.0: Error sending CMD_DTS_MEASUREMENT_TRIGGER_WIDE: enqueue_hcmd failed: -5
2018-12-28T19:57:31.790876-05:00 ERR kernel: [ 3190.825649] iwlwifi 0000:01:00.0: Failed to get the temperature (err=-5)
2018-12-28T19:57:31.792317-05:00 NOTICE temp_logger[14376]: 00:42 02:29 03:29 04:35 05:32 06:42 07:0
2018-12-28T19:57:32.985597-05:00 ERR kernel: [ 3192.020779] iwlwifi 0000:01:00.0: Failed to wake NIC for hcmd
2018-12-28T19:57:32.985926-05:00 ERR kernel: [ 3192.020933] iwlwifi 0000:01:00.0: Error sending MAC_PM_POWER_TABLE: enqueue_hcmd failed: -5
...
2018-12-28T19:57:36.250926-05:00 WARNING kernel: [ 3195.284946] Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)

This failed immediately when the user was requesting feedback report logs. I believe somewhere along that process we request thermal zone information, and the WiFi NIC temperature reading failed? That seems to suggest that network connectivity might have been working OK until we poked the "wrong" thing.

---

A few more notes on pre-M70 (and pre-Core38) behavior:
I dug up several reports of the same shill "crash", and they have a slightly different mis-behavior of the iwlwifi driver. e.g.:

https://listnr.corp.google.com/product/208/report/85764427112

2018-11-04T12:20:56.847831-05:00 INFO kernel: [28329.709024] iwlwifi 0000:01:00.0: L1 Disabled - LTR Disabled
2018-11-04T12:20:56.847842-05:00 INFO kernel: [28329.709039] iwlwifi 0000:01:00.0: L1 substates: Enabled (L1SS_CTL1=40A0000F)
2018-11-04T12:20:57.605296-05:00 ERR kernel: [28330.466838] iwlwifi 0000:01:00.0: Could not load the [0] uCode section
2018-11-04T12:20:57.605306-05:00 ERR kernel: [28330.466847] iwlwifi 0000:01:00.0: Failed to start INIT ucode: -5
2018-11-04T12:20:57.605307-05:00 ERR kernel: [28330.466849] iwlwifi 0000:01:00.0: Failed to run INIT ucode: -5
2018-11-04T12:20:57.605307-05:00 ERR kernel: [28330.466850] iwlwifi 0000:01:00.0: Failed to start RT ucode: -5
...
2018-11-04T12:20:59.503244-05:00 CRIT shill[27826]: [FATAL:chromeos_supplicant_process_proxy.cc(168)] Failed to get interface wlan0: fi.w1.wpa_supplicant1.InterfaceUnknown wpa_supplicant knows nothing about this interface.#012/usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x795ea2cd7b13]#012


Looking deeper into the log, it looks to have 1 hour previously dumped a Wifi crash with this ID:

https://crash.corp.google.com/browse?q=reportid=%27219ed3978a5bf660%27

which has a iwl_trans_pcie_grab_nic_access signature -- i.e., the same PCIe failure mode.

So all this means is that the only pre-Core38 vs Core38+ differences are that Intel added some new sequencing to the start_hw() phase [1], which means new failure modes in start_hw(). Pre-Core38, we seem to get all the way to start_fw() and fail there instead. This is no different from a user's perspective.



TL;DR: This is all a PCIe issue that looks similar to existing known issue and will not be resolved in the M-71 timeframe. Rebooting the device may fix the issue temporarily.


[1] The typical start() sequence for iwlwifi is start_hw() followed by start_fw().
The $subject signature should be gone now, on M-73:
https://bugs.chromium.org/p/chromium/issues/detail?id=698457#c52

The new Network.Shill.WiFi.SupplicantAttempts metric should help track the equivalent behavior (certain failures to bring up the Wifi interface) soon.

Kernel WARNINGs with a iwl_trans_pcie_grab_nic_access should also continue to be a good proxy for observing the incidence of this problem.
Blockedon: 922232

Sign in to add a comment