New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 13 users

Issue metadata

Status: Verified
Owner:
User never visited
Closed: Apr 2015
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

daisy: lucas loses network connection with high cpu+memory load

Reported by silberst@chromium.org, Mar 18 2015

Issue description

Version: R40
OS: ChromeOS

To Reproduce:
run the attached script:
(I copied the attached files to /root and ran)
bash loadedNetworkTest.sh 1 8 250 10000 2


What is the expected output? What do you see instead?
The connection drops - this is usually one of 3 cases
1 - link_monitor failure with an attempt to reassociate
2 - network watchdog triggers a mwifiex reset
3 - mwifiex experiences multiple tx timeouts and resets itself


Please use labels and text to provide additional information.

The device I am testing on is a lucas.  It has a R40 stable (test) image.
1. I did do: "USE=dyndebug emerge-daisy chromeos-kernel-3_8;" so I could enable mwifiex logging if needed.
2. I also have wpa_debug excessive (for some, but not all tests).

I have attached the files needed to reproduce the scenario.
 
balloon.arm
22.4 KB Download
loadedNetworkTest.sh
1.1 KB View Download
launchBalloons.sh
296 bytes View Download
Cc: semenzato@chromium.org

Comment 2 by quiche@chromium.org, Mar 18 2015

This is a total shot in the dark, but I just want to throw it out there...

LWN has had a couple of articles recently about low memory behavior. The interesting bit is that small allocations (1 page or less) are never allowed to fail. Instead, the kernel will just keep retrying, hoping that someone else has freed up some memory.

So one crazy possibility is that memory is getting low enough that the SDIO or WiFi driver code is busy looping in a memory allocation. Eventually, the allocation succeeds. But, by the time that happens, we've lost some interrupts.
Would you be so kind to add some more repro information?  Specifically, what we should look for in which logs, and how to enable mwifiex logging.  Also, is "wpa_debug excessive" the actual command used?  Finally, how long (roughly) does it take to reproduce?  Thanks.
I am in the process of gathering some of the output from a run I did yesterday.  :)

The script should output the three failure modes I have seen by catching the two card resets and also the link_monitor failure - as far as I can tell, which failure you get is random.  I think I have had a failure occur with every run with the parameters above (bash loadedNetworkTest.sh 1 8 250 10000 2)

wpa_debug excessive shouldn't be necessary to reproduce this - I only included that to be accurate about any of the changes I had made to the system.  ("wpa_debug excessive" is the command)

#2 That should be easy to verify with a printk.  I suspect it is NOT happening because the page walkers started by the script are 8 x 250 MB each = 2 GB which is enough to trigger zram swapping, but not to use a large amount of swap.  By default, the pages created by "balloon.arm" compress very well---their first byte has value 1, the others 0.  I think what we're seeing here is a lot of CPU contention: 8 balloon threads always runnable + 2 curl processes (don't know how many threads each) + kswapd + other background tasks.  So there may be one or more timeouts occurring and causing cascading failures.

Incidentally, is chrome running?  I am assuming it is, or you would have specified otherwise.
Here is a sample of what the errors I am grepping for look like (the file they came from is attached).

localhost / # grep 'mwifiex.*reset\|ERROR.*Disconnect\|ERROR:link_monitor' /var/log/messages
2015-03-17T13:31:45.850562-07:00 shill[1010]: [ERROR:link_monitor.cc(189)] Link monitor has reached the failure threshold with 5 broadcast failures and 0 unicast failures.
2015-03-17T13:44:58.405043-07:00 shill[1010]: [ERROR:wifi.cc(996)] Disconnected due to inability to connect to the service.
2015-03-17T13:45:01.902846-07:00 shill[1010]: [ERROR:wifi.cc(996)] Disconnected due to inability to connect to the service.
2015-03-17T13:45:09.222221-07:00 kernel: [52166.066799] WARNING: at /mnt/host/source/src/third_party/kernel/v3.8/drivers/net/wireless-3.4/mwifiex/sdio.c:1867 mwifiex_sdio_card_reset+0x34/0x5c [mwifiex_sdio]()
2015-03-17T13:45:09.222256-07:00 kernel: [52166.067120] [<c0123bb0>] (warn_slowpath_null+0x44/0x5c) from [<bf08bd00>] (mwifiex_sdio_card_reset+0x34/0x5c [mwifiex_sdio])
2015-03-17T13:45:09.336723-07:00 kernel: [52166.067147] [<bf08bd00>] (mwifiex_sdio_card_reset+0x34/0x5c [mwifiex_sdio]) from [<bf065e08>] (mwifiex_cmd_timeout_func+0x238/0x29c [mwifiex])
lNT_attach_messages
158 KB View Download
#5 yes, chrome is running.  I had 5 tabs running with this, 4 were local (file: or crosh) and 1 was gmail.

I just kicked off a run of the script after doing "stop ui"  I will update after that finishes.  (Before I got the script ironed out I hadn't been able to eliminate chrome as a factor due to randomness of the failure.)
Update on lucas test without chrome running:
Round 1:
  ran "stop ui"
  ran ifconfig to make sure I still had a connection
  ran test with params above ("bash loadedNetworkTest.sh 1 8 250 10000 2")
  Finished and the machine looked quite happy

Round 2:
  ran "stop ui"
  ran ifconfig to make sure I had a network connection
  ran test with "bash loadedNetworkTest.sh 1 12 250 10000 2"
  (it was still running...  probably going to kill it soon)
  saw error messages (link_monitor failures) in the output from the tail -F | grep
I have been playing with the parameters for the loadedNetworkTest script.

1 - I have run this without any curl requests (I can upload an updated version of the script if anyone is interested) and I still see link_monitor failures.

2 - I have also seen a new failure mode other than a link_monitor failure.  This is a test done with a machine in a R40 test image + reverts of 4 CLs related to sdio polling + dyndebug.  The parameters for this instance were:

  bash loadedNetworkTest.sh 1 8 300 500 2

The new failure is:
  2015-03-23T22:52:12.898878-07:00 shill[1002]: [ERROR:arp_client.cc(173)] Socket sendto failed: No buffer space available
2015-03-23T22:52:12.898932-07:00 shill[1002]: [ERROR:link_monitor.cc(333)] Failed to send ARP request.  Stopping.


I am now running two machines side by side to compare failures:
machine a: R40 from xbuddy - no mods
machine b: R40 from xbuddy + revert of sdio polling CLs + dyndebug

Side-by-side experiments with two lucas devices.  First I am just copying the results into this bug.

ui running, but no users are logged in
connected to GoogleGuest
both machines are plugged in

machine a: R40 from xbuddy - no mods
machine b: R40 from xbuddy + revert of sdio polling CLs + dyndebug

round 1 - params in test are loadedNetworkTest 1 8 250 500 2

machine a: started at 14:35:15 ended at 14:36:13
machine b: started at 14:35:15 ended at 14:36:29

round 2 - same params as above
ran /usr/sbin/chromeos-cleanup-logs on both machines

machine a: started at 14:38:50 ended at 14:40:14
machine b: started at 14:38:50 ended at 14:40:26

round 3 - same params as above
test user logged in:  chrome open with 1 tab - gmail
/usr/sbin/chromeos-cleanup-logs

machine a: started at 14:44:10 ended at 14:45:15
machine b: started at 14:44:10 ended at 14:45:25

round 4 - params in test are loadedNetworkTest 1 8 275 500 2
/usr/sbin/chromeos-cleanup-logs

machine a: started at 14:47:28 ended (killed) at 15:30:14
*link monitor failure @ 14:48:01 - reached failure threshold with 1 bcast and 4 unicast failures
*link monitor failure @ 14:55:41 - Failed to send ARP request.  Stopping
*link monitor failure @ 15:14:51 - reached failure threshold with 5 bcast and 0 unicast failures
*link monitor failure @ 15:21:25 - reached failure threshold with 1 bcast and 4 unicast failures
machine b: started at 14:47:28 ended at 15:07:35

round 5 - same params as round 4
/usr/sbin/chromeos-cleanup-logs

machine a: started at 15:34:32 ended at 15:44:01
machine b: started at 15:34:32 ended at 15:38:16

let the machines sit overnight.

checked logs in the morning, looked fine.

ran top on both - approximately the same memory used and cpu use looks about the same

now for more testing.

round 6 - repeat round 5 (same as round 4)
/usr/sbin/chromeos-cleanup-logs

machine a: started at 12:06:32 ended at 12:09:28
machine b: started at 12:06:33 ended at 12:11:02
*link_monitor failure @ 12:08:56 - Failed to send ARP request.  Stopping
*card reset @ 12:10:32 - tx_timeout_cnt exceeds threshold.  Triggering card reset

round 7 - repeat round 6
/usr/sbin/chromeos-cleanup-logs

machine a: started at 12:15:03 ended at 12:32:21
*link_monitor failure @ 12:16:02 - Failed to send ARP request. Stopping
machine b: started at 12:15:03 ended at 12:18:54


round 8 - params in test are loadedNetworkTest 1 4 550 500 2
/usr/sbin/chromeos-cleanup-logs

machine a: started at 13:10:28 ended (killed) at 15:02:00
*card reset @ 13:49:31 - mwifiex_cmd_timeout_func
*link_monitor failure @ 13:49:31 - reached failure threshold with 2 bcast and 3 unicast failures
*link_monitor failure @ 14:33:46 - reached failure threshold with 4 bcast and 1 unicast failures
*link_monitor failure @ 14:44:06 - reached failure threshold with 3 bcast and 2 unicast failures
*card reset @ 14:45:43 - tx_timeout_cnt
machine b: started at 13:10:28 ended at 14:22:17
*link_monitor failure @ 13:18:48 - reached failure threshold with 2 bcast and 3 unicast failures
*link_monitor failure @ 13:27:28 - Failed to send ARP request.  Stopping
*card reset @ 13:47:03 - mwifiex_cmd_timeout_func
*link_monitor failure @ 13:47:03 - Failed to send ARP request. Stopping


round 9 - params in test are loadedNetworkTest 1 16 137 500 2
/usr/sbin/chromeos-cleanup-logs

machine a: started at 15:05:51 ended at 15:23:33
machine b: started at 15:05:51 ended at 15:14:56


round 10 - params in test are loadedNetworkTest 1 8 275 500 4
/usr/sbin/chromeos-cleanup-logs

machine a: started at 15:29:27 ended at 15:31:01
machine b: started at 15:29:27 ended at 15:41:41
*link_monitor failure @ 15:35:25 - Failed to send ARP request.  Stopping


round 11 - repeat round 10
/usr/sbin/chromeos-cleanup-logs

machine a: started at 15:43:57 ended at 15:45:10
machine b: started at 15:43:57 ended at 16:14:31
*link_monitor failure @ 15:44:48 - Failed to send ARP request.  Stopping
*card reset @ 15:47:34 - mwifiex_cmd_timeout_func
*link_monitor failure @ 16:01:25 - reached failure threshold with 2 bcast and 3 unicast failures
*link_monitor failure @ 16:03:28 - Failed to send ARP request.  Stopping


round 12 - params in test are loadedNetworkTest 1 8 275 500 8
/usr/sbin/chromeos-cleanup-logs

machine a: started at 16:16:19 ended at 16:18:29
*link_monitor failure @ 16:16:36 - Failed to send ARP request. Stopping
machine b: started at 16:16:19 ended at 16:19:17


round 13 - params in test are loadedNetworkTest 1 8 275 500 16
/usr/sbin/chromeos-cleanup-logs

machine a: started at 17:00:33 ended at 17:03:03
machine b: started at 17:00:33 ended at 17:03:07
*link_monitor failure @ 17:01:19 - Failed to send ARP request. Stopping
*link_monitor failure @ 17:01:41 - Failed to send ARP request. Stopping

Labels: Hotlist-Enterprise
As an little experiment, I added another part to the test to grab the interrupt count while the system is under pressure.  (I used cat /proc/interrupts | grep "110: "  - please let me know if that isn't the right way to grab the info.  I have another script in the works with a little more light into the real time between grabbing the interrupt count - it is attached as well)

I have attached the scripts and the resulting graph (very simple gnuplot for the interrupts).  In the script I calculate the diff from the last count and append it to a file.  The plot is just the row num (x-axis) vs the interrupt count diff (y-axis).

The other graph is the shill log processor viz.  This graph shows the shill service state at that point in the log.  the x-axis is the elapsed time from the start of that network manager and the y-axis is the service state (states are recorded when they change, it was in the online state at the start of the experiment).

The failures are certainly correlated with the interrupts, but I haven't yet determined which one causes the other - working on that now.

test overview:
-start a loop grabbing the interrupt count (there is a second version of that also attached that I am working with now)
-start up x curl requests to generate network traffic - loops to keep network traffic in play
-have n balloons start up to place memory and cpu pressure on the system

This test was run on a lucas (machine b from above) on my home network with:
bash loadedNetworkTest.sh 1 8 300 1000 2

I ended up killing the test instead of letting it run its full course, but grabbed the logs.  Also note, do not attempt to line up the graphs - one is a count and the other is elapsed time.

The two failures were:

2015-03-26T22:19:19.521466-07:00 shill[999]: [ERROR:link_monitor.cc(189)] Link monitor has reached the failure threshold with 2 broadcast failures and 3 unicast failures.

2015-03-26T22:30:36.205081-07:00 shill[999]: [ERROR:link_monitor.cc(189)] Link monitor has reached the failure threshold with 4 broadcast failures and 1 unicast failures.

balloon.arm
22.4 KB Download
catInterruptCount2.sh
197 bytes View Download
loadedNetworkTest.sh
1.2 KB View Download
launchBalloons.sh
296 bytes View Download
catInterruptCounts.sh
290 bytes View Download
interruptGraph.png
89.3 KB View Download
Cc: diand...@chromium.org krisr@chromium.org
Update for today:

Thanks to dianders for chatting with me today!  I have something I still need to try from that conversation, but I do have an update on a test I ran tonight.

dianders suggested I try to repro the failures with pit.  I picked one up today and installed R40 (test image).  I had to play around with params, but I was able to get network drops using my loadedNetworkTest script with the device connected to my personal home AP.  My test account was logged in to the UI, but I only had 4 tabs open (gmail, crosh, file - net.log, file - messages)  I did bump up the debugging from supplicant...  not sure if that played a role.  I did that with:

wpa_debug excessive

the params in the test were:

bash loadedNetworkTest.sh 1 8 550 500 2


I saw the following errors:
link monitor failure  - reached failure threshold with 3 bcast and 2 unicast failures
link monitor failure  - reached failure threshold with 2 bcast and 3 unicast failures
link monitor failure  - reached failure threshold with 3 bcast and 2 unicast failures
link monitor failure  - reached failure threshold with 2 bcast and 3 unicast failures
Disconnected do to inability to connect to the service
Disconnected do to inability to connect to the service
link monitor failure  - reached failure threshold with 2 bcast and 3 unicast failures
(test is still running - I will update if any new types of failures pop up or if the card resets.
Have you tried the patch I suggested to keep polling for interrupts?
I am just getting started on it now.
Cc: pmccloud@chromium.org
Are these tests being run with M41 or M40? I just want to make sure that we're focusing on the current stable as c#13 mentioned M40 image.
I have been focusing on 40 since that is where the trouble started and where our users were (until recently).  From feedback reports, 41 was about the same for users.

I can do a run on 41 next.
Forgot to update here about #14 - I tried the patch on 3/31 and still saw the errors (card resets and link monitor failures).

I also added the debugging dianders suggested for daisy - did not see any errors pop up unless the card was reset.  Also added slub_debug - no change there either.

Pit:  After running more rounds of the test on pit, I have now seen card resets (most recent instance was a tx_timeout).

Cc: chron@chromium.org
I am currently testing a fix that changes the strategy that backs off in looping to look for interrupts to continue looping every second. (using R40 and my loadedNetworkTest script)

I tested it at home and had not seen any resets, in one of the loops on the machine with the fix I have seen a reset while at work on GoogleGuest.  I have re-enabled the higher level debugging and ftrace with this fix on the machine to see if there is anything useful to help with the issue.

+chron
I am still testing the fix.  I found an issue with the existing backoff calculation this morning, so I have fixed that and am testing it again.  I have observed that running with the polling fix slows down the loadedNetworkTest script, but this could be due to more productive network work to satisfy the curl requests.  I added logic to report the number of curl requests completed during the test and it is running now.
Cc: matthewyuan@chromium.org
Labels: M-43 M-42 Merge-Requested
Owner: josa...@chromium.org
Status: Assigned
Don't know where bugdroid is, but <https://chromium-review.googlesource.com/#/c/264492/> landed.  

This is a very safe change.  Rebecca has done a bunch of testing and it looks good, but even if we somehow find that the issue isn't fixed this change won't hurt anything.  We should merge to M-43 (and probably M-42) as soon as we're able.
Labels: -Merge-Requested Merge-Review Hotlist-Merge-Review
[Automated comment] No bugdroid (commit) comments found, couldn't auto-approve, needs manual review.

Comment 23 by dxie@google.com, Apr 8 2015

for josafat
Labels: -Merge-Review Merge-Approved
Approved for 42, but please let it bake a day before merging.
Can we get an update when this lands in M42? I want to try to get this validated out in the field with some enterprise customers. 
Owner: silberst@chromium.org
approved for M43 too 
Project Member

Comment 27 by bugdroid1@chromium.org, Apr 9 2015

Project  : chromiumos/third_party/kernel
Branch   : release-R43-6946.B-chromeos-3.8
Author   : Rebecca Silberstein <silberst@chromium.org>
Committer: Douglas Anderson <dianders@chromium.org>
Commit   : 86d29597807bd1404926c102139ddaeaba8d2749

Code-Review  0 : Rebecca Silberstein
Code-Review  +2: Douglas Anderson
Trybot-Ready 0 : Douglas Anderson, Rebecca Silberstein
Verified     0 : Rebecca Silberstein
Verified     +1: Douglas Anderson
Commit Queue   : Chumped
Change-Id      : I0149366044377f63586a6696a86c7f65b3cb686e
Reviewed-at    : https://chromium-review.googlesource.com/265036

dw_mmc: reduce max polling wait to one second

Under high memory pressure, SDIO interrupts are missed triggering card
resets and link monitor failures.  This is combated by reducing the
maximum time between polling loops to one second.  This commit also
fixes the increasing backoff time between loops.  The wait time is
checked after the calculation to verify it is not longer than 1 second.

BUG= chromium:468342 
TEST=Multiple runs of the loaded network test script that maintains high
memory pressure while performing curl requests - watched for link
monitor failures and card resets.

Signed-off-by: Rebecca Silberstein &lt;silberst@chromium.org&gt;
(cherry picked from commit 38b89697c8d236231098046e953920f70e10c2d7)

drivers/mmc/core/sdio_irq.c
Project Member

Comment 28 by bugdroid1@chromium.org, Apr 9 2015

Project  : chromiumos/third_party/kernel
Branch   : release-R42-6812.B-chromeos-3.8
Author   : Rebecca Silberstein <silberst@chromium.org>
Committer: Douglas Anderson <dianders@chromium.org>
Commit   : fa1542facbc5c9a5b5f072a87b35a2615ddbfbe6

Code-Review  0 : Rebecca Silberstein
Code-Review  +2: Douglas Anderson
Trybot-Ready 0 : Douglas Anderson, Rebecca Silberstein
Verified     0 : Rebecca Silberstein
Verified     +1: Douglas Anderson
Commit Queue   : Chumped
Change-Id      : I0149366044377f63586a6696a86c7f65b3cb686e
Reviewed-at    : https://chromium-review.googlesource.com/265006

dw_mmc: reduce max polling wait to one second

Under high memory pressure, SDIO interrupts are missed triggering card
resets and link monitor failures.  This is combated by reducing the
maximum time between polling loops to one second.  This commit also
fixes the increasing backoff time between loops.  The wait time is
checked after the calculation to verify it is not longer than 1 second.

BUG= chromium:468342 
TEST=Multiple runs of the loaded network test script that maintains high
memory pressure while performing curl requests - watched for link
monitor failures and card resets.

Signed-off-by: Rebecca Silberstein &lt;silberst@chromium.org&gt;
(cherry picked from commit 38b89697c8d236231098046e953920f70e10c2d7)

drivers/mmc/core/sdio_irq.c
Labels: -Merge-Approved Merge-Merged
Status: Fixed
OK, 27 hours > 1 day, so landed in M-43 and M-42.

Note that the M-42 builder has already kicked off for the day, so it probably won't be in the next M-42 but in the next next M-42.
This change is in M42 build 6812.69.0.
Status: Verified
Daisy R40-6457.107.0 and R44 6955.0.0
Cc: chirantan@chromium.org silberst@chromium.org
 Issue 360862  has been merged into this issue.

Comment 33 by olofj@google.com, Apr 27 2015

Cc: dtor@chromium.org olofj@chromium.org
With this fix in place we now see issues manifesting in a different way.  Please see  bug 481227 

Sign in to add a comment