High number of Chrome browser hang in M65

Project Member Reported by vkhabarov@chromium.org, Mar 13 2018

Issue description

ChromeOS version: 65.0.3325.89 - 65.0.3325.150
ChromeOS device model: Cave and Chell 
Case#: 15118606

Description:
Devices are crashing on M65 betas and require a hard reboot. This was suspected to be related to crbug.com/803594, but upgrading to 65.0.3325.150 didn't resolve it and the crash doesn't look similar.

There were no crashes prior to the M65 beta, and the customer is really concerned that the update might break their devices.


Steps to reproduce: 
No known particular pattern. No peripherals, no network switch

Current Behavior / Reproduction: 
Device crashes

Expected Behavior: 
No crashes

Drive link to logs: 
crash report ID 47fa08a115fe579b (for version 150)
https://drive.google.com/file/d/11z51h7IzbU41R3I9M5eWiJbSt9NLWkrv/view?usp=sharing (version 150)

https://drive.google.com/file/d/1-skwk2HvJVrzO1a-BG-ou51FEnpeWiRf/view?usp=sharing (crash at 9:23 March 8)
Crash ID for one of the previous crashes - 7d6a813c007f9788
 
Showing comments 78 - 177 of 177
> The CL seems making the problem happen more frequently.

Can someone translate this into tentative reproduction steps?
> netlink packets can be dropped by the kernel under memory pressure conditions

This should probably fail the syscall with ENOBUFS?  AIUI the kernel always tries to warn netlink users that things have gotten out of sync.

Unless the sendto() is failing (which should be logged), I do find it strange that we don't see either ENOBUFS or a NLMSG_DONE response.  We shouldn't just get radio silence from the kernel.

When I paste this code into a test program and run it in an "empty" netns with lo up, it receives 3 replies from the kernel: one message with the IPv4 address, one message with the IPv6 address, and one message with type NLMSG_DONE.
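
For readers without the test program at hand, here is a rough Python sketch of the same exchange: it packs an RTM_GETADDR dump request and walks a reply buffer until it sees NLMSG_DONE. This is not the Chrome code or the attached program. The function names (build_getaddr_dump, walk_messages, dump_finished) are made up for illustration; the numeric constants are the ones from <linux/netlink.h> and <linux/rtnetlink.h>.

```python
import struct

# Constants from <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_ERROR = 2
NLMSG_DONE = 3
RTM_NEWADDR = 20
RTM_GETADDR = 22
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH
AF_UNSPEC = 0

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid (16 bytes).
NLMSGHDR = struct.Struct("=IHHII")


def nlmsg_align(length):
    """Round up to the 4-byte netlink alignment (NLMSG_ALIGN)."""
    return (length + 3) & ~3


def build_getaddr_dump(seq):
    """Pack an RTM_GETADDR dump request: nlmsghdr followed by a 1-byte rtgenmsg."""
    payload = struct.pack("=B", AF_UNSPEC)  # rtgenmsg.rtgen_family
    length = NLMSGHDR.size + len(payload)   # NLMSG_LENGTH(sizeof(rtgenmsg))
    header = NLMSGHDR.pack(length, RTM_GETADDR,
                           NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    message = header + payload
    # Pad the buffer (not the length field) out to the aligned size.
    return message.ljust(nlmsg_align(length), b"\0")


def walk_messages(buf):
    """Yield (type, payload) for each netlink message in one recv() buffer."""
    offset = 0
    while offset + NLMSGHDR.size <= len(buf):
        length, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(buf, offset)
        if length < NLMSGHDR.size or offset + length > len(buf):
            break  # truncated or malformed message; stop parsing
        yield msg_type, buf[offset + NLMSGHDR.size:offset + length]
        offset += nlmsg_align(length)


def dump_finished(buf):
    """Return True once the buffer contains the NLMSG_DONE terminator."""
    return any(msg_type == NLMSG_DONE for msg_type, _ in walk_messages(buf))
```

A reader loop would keep calling recv() and feeding each buffer to dump_finished() until it returns True. The hang discussed here corresponds to that loop never seeing either NLMSG_DONE or an error.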

> (chrome -address_tracker_linux.cc:209 ) net::internal::AddressTrackerLinux::Init()

I believe :209 means the first dump request, RTM_GETADDR, is blocking.  So it doesn't ever reach the second dump request (RTM_GETLINK).

If something is going wrong on the kernel side, it is possible that the kernel is unable to acquire rtnl_lock.  However, when I look at crash.corp data for M65 (10323.*) I don't see unusually high numbers of e.g. hung_tasks crashes.
> This should probably fail the syscall with ENOBUFS?  AIUI the kernel always tries to warn netlink users that things have gotten out of sync.

Not if it drops the packets before the recv call, and the recv call is blocking, I think?
Btw, the fact that there are no hung tasks in the kernel tells me that this is probably not a kernel issue... That's why I suspect it's simply the blocking call in user space that's the root cause.
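
To illustrate the "blocking call in user space" theory, here is a minimal Python sketch of a bounded wait: instead of calling recv() and waiting forever, it waits for readability with a deadline and gives up if no reply arrives. This is only a sketch of the general technique, not what AddressTrackerLinux does; recv_with_deadline is a hypothetical helper name.

```python
import select
import socket


def recv_with_deadline(sock, timeout_s, bufsize=4096):
    """Wait up to timeout_s for data; return bytes, or None if nothing arrived."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        # The reply never came. The caller can log, retry, or give up
        # instead of blocking its thread indefinitely.
        return None
    return sock.recv(bufsize)
```

This can be tried without a netlink socket by using socket.socketpair() as a stand-in: with a silent peer the call returns None after the timeout, and once the peer sends data it returns those bytes.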
c#7:
> Crash: Uploaded Crash Report ID ec89295e5a8a18e7 (Local Crash ID: Chrome)

This hang is interesting because it is in a different part of the code.  Magic signature is "media_router::GetDiscoveryNetworkInfoList()" and it is calling getifaddrs() in libc.  The backtrace suggests that it is also stuck waiting for a netlink reply that never arrives.

Looking at the crash stats: https://goto.google.com/cwzal

 - The earliest recorded instance was on 64.0.3282.24 / 10176.13.1
 - This was the first M64 beta channel push
 - 99.94% of these crashes on Chrome OS are on 3.18 kernels; cyan hit hardest
 - It is plausible that only beta/stable channels have enough users to hit the bug

Some notable 3.18 kernel changes between M63 beta promotion and M64 beta promotion: https://goto.google.com/aploc

 - packet socket lock fixes (only seems to involve a spinlock though)
 - KEYS fixes
 - KAISER :-(
 - netlink dump start callback

Regarding the socket lock fixes, we seem to be missing a follow-up fix; unsure if it's related:

https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/1013410
Just gotta insert, during the deep dive into code, that both Edgar and Parrot have been rock stable since switching name servers to 1.1.1.1/1.0.0.1; two days running now.
The comments above make me think that the speed of these devices' background page requests is masking the kernel problem you all are investigating.

I'm just a Top Contributor CBC user, and sharing anecdotal, not diagnostic, data.
Appreciate you all opening up this bug so that we Top Contributors can watch, and comment.

Other notes:

I enabled lock debugging on cyan, did not see anything relevant in the logs:

$ git diff | grep "^+"
+++ b/chromeos/config/base.config
+CONFIG_DEBUG_LIST=y
+CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
+CONFIG_PROVE_LOCKING=y

If I start ~10,000 iterations of the attached test program in parallel with a bash for loop, none of them get stuck waiting for a reply.

When I looked up the affected client IDs with the media_router signature (c#83, c#7) in crash.corp, most only had the 1 (Chrome) crash logged.  A few had multiple instances of the Chrome crash.  None of them showed crashes in the kernel or in other modules.
Attachment: netlink.c (1.3 KB)
cernekee@, found this interesting crash: http://crash/cfaec84753bccd8b

If you look at the "Threads" tab, chrome has 100 threads. 47 of them have a stack of net::HaveOnlyLoopbackAddresses() -> getifaddrs(). HaveOnlyLoopbackAddresses() is a job posted to the task scheduler from HostResolverImpl::OnIPAddressChanged [1][2]. Having 47 of them probably means getifaddrs() is somehow stuck. WDYT?

[1] https://cs.chromium.org/chromium/src/net/dns/host_resolver_impl.cc?rcl=f9115e6c040cfe10c044dc45b75304170d09db67&l=2527
[2] https://cs.chromium.org/chromium/src/net/dns/host_resolver_impl.cc?rcl=f9115e6c040cfe10c044dc45b75304170d09db67&l=2448
Another possibility is that a socket fd somehow gets corrupted. I will land a CL (https://chromium-review.googlesource.com/c/chromium/src/+/1017867) that puts a close guard on AddressTrackerLinux's |netlink_fd_| to see if it could catch anything.

Hi guys!

I am a user and was redirected here from the Chromebook support forum. I am an unfortunate victim of random OS crashes without a traceable reason. Please check my forum question here: https://productforums.google.com/forum/?utm_medium=email&utm_source=footer#!msg/chromebook-central/gJIZreO52uY/h9XxQd9NCgAJ

I have a Samsung Chromebook 3 (Chromebook Samsung XE500C13 32Gb) with stable 65.0.3325.209 official 64 bit. The issue happens randomly once or twice a day.

Please let me know how I can help with fixing the issue, as I am willing to contribute. If you need a log report or something, let me know.

Thanks!

re:Comment 88

Please submit a bug report (shift + alt + i) with this information; this bug is not the place for it.
Ok. My apologies. I was not reporting a bug (I did that in the Chromebook forum) I was just hoping to contribute.
My CL to move GetCurrentNetworkID off the IO thread:
https://chromium-review.googlesource.com/c/chromium/src/+/1020297

And the CL in #78:
https://chromium-review.googlesource.com/#/c/chromium/src/+/1012751

If we land either of these, we should at least stop blocking the IO thread and freezing the screen.
Moving GetCurrentNetworkID off of the IO thread seems like a good solution. If that's the path we take, I wonder if we can CHECK that somehow we're not running on the IO thread. Basically the opposite of DCHECK(io_thread_checker_.CalledOnValidThread());
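
As a language-neutral illustration of that idea (not the actual CL), here is a small Python sketch: the potentially blocking lookup runs on a worker thread, and the lookup itself fails hard if it is ever called on the thread marked as the IO thread. All names here (mark_current_thread_as_io, check_not_on_io_thread, get_current_network_id, get_network_id_async) and the returned value are hypothetical stand-ins.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: recorded once by whichever thread plays the IO thread.
_io_thread_id = None


def mark_current_thread_as_io():
    """Remember the calling thread as the IO thread."""
    global _io_thread_id
    _io_thread_id = threading.get_ident()


def check_not_on_io_thread():
    """Inverse of a 'called on valid thread' check; always on, like CHECK."""
    if threading.get_ident() == _io_thread_id:
        raise RuntimeError("blocking call made on the IO thread")


def get_current_network_id():
    """Stand-in for a lookup that may block (e.g. waiting on a netlink reply)."""
    check_not_on_io_thread()
    return "example-ssid"  # placeholder value for illustration


_workers = ThreadPoolExecutor(max_workers=1)


def get_network_id_async(callback):
    """Called from the IO thread: run the blocking lookup on a worker thread."""
    future = _workers.submit(get_current_network_id)
    # The callback runs on the worker thread once the lookup completes.
    future.add_done_callback(lambda f: callback(f.result()))
    return future
```

With this shape, a stuck lookup ties up a worker thread rather than the IO thread, which matches the "crippled instead of frozen" trade-off described in this thread; it does not address why the reply never arrives.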
Project Member

Comment 93 by bugdroid1@chromium.org, Apr 19 2018

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4917ffb987b4bdd2b202b99907d1771801691dac

commit 4917ffb987b4bdd2b202b99907d1771801691dac
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Apr 19 23:25:18 2018

UPSTREAM: net/packet: fix a race in packet_bind() and packet_notifier()

[ Upstream commit 15fe076edea787807a7cdc168df832544b58eba6 ]

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.

Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945!  ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS:  0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
 tun_detach drivers/net/tun.c:670 [inline]
 tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
 __fput+0x333/0x7f0 fs/file_table.c:210
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x199/0x270 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x9bb/0x1ae0 kernel/exit.c:865
 do_group_exit+0x149/0x400 kernel/exit.c:968
 SYSC_exit_group kernel/exit.c:979 [inline]
 SyS_exit_group+0x1d/0x20 kernel/exit.c:977
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

(cherry picked from commit 18f0f8c1e866a5c22d2bebcc368c5217670753cf)

BUG= chromium:821607 
TEST=buildbots

Change-Id: I4170dbf965371dc3ef84e745d5a5a59499665bf4
Reviewed-on: https://chromium-review.googlesource.com/1013410
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>

[modify] https://crrev.com/4917ffb987b4bdd2b202b99907d1771801691dac/net/packet/af_packet.c

Comment 94 by 0spor...@gmail.com, Apr 20 2018

Here are 2 file:///var/log/messages texts from when I get a browser hang.

Acer R11 (cyan), 65.0.3325.209 Stable. Yes, I sent alt+shift+i feedback.

I would send a recent one but the last time it happened, it happened twice in the same day, and now the file:///var/log/messages only downloads a corrupted file without a file extension when I try to access it, so I apologize for the lack of current files.

For apr82018.txt, see 2018-04-08T09:39:03 for when hang begins
For apr92018.txt, see 2018-04-09T16:32:09 for when hang begins
Attachments: apr92018.txt (165 KB), apr82018.txt (1.3 MB)
Re #94: In both logs I saw the device rebooting after sleep rather than freezing.

Comment 96 by 0spor...@gmail.com, Apr 20 2018

Re# 95: I rebooted a little while after it stopped responding for both. Either holding down power button or hard reset. I didn't let it run through for long while frozen. Does it look like I rebooted before anything happened?
After some discussion, we will likely temporarily remove Media Router's use of getifaddrs (disabling our device caching mechanism) until it can be replaced with a non-blocking netlink socket directly.  We currently don't know how many of the hangs we may be responsible for, though, so it's hard to say how much this will help.  mfoltz@ may comment with more information later.
Do we know that the fix landed in comment #93 addresses this issue?
cernekee@ knows best. His comment in #83 says:
  "Regarding the socket lock fixes, we seem to be missing a follow-up fix; unsure if it's related:"

I'll land a workaround to move a couple of get network id calls off the IO thread to make it less painful for the user. We would still get the shutdown hang crashes if the problem still happens (i.e. if #93 fix does not address the underlying issue).
Cave on DEV channel: I used to hit this freeze several times a week. It seems to have vanished about one month ago.
Cc: -derat@chromium.org
> Do we know that the fix landed in comment #93 addresses this issue?

I don't have any evidence that this will fix the issue.  It was just something I noticed missing when looking at the kernel code.
Project Member

Comment 103 by bugdroid1@chromium.org, Apr 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a

commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Mon Apr 23 21:04:27 2018

cros: Move GetCurrentNetworkId call off IO thread.

Make DataReductionProxyConfig/NetworkQualityEstimator call
net::GetWifiSSID() on a worker thread instead of the IO thread
on ChromeOS as a work around for  https://crbug.com/821607 .

This CL does not solve the underlying problem that is still being
investigated. It gives the user a crippled system instead of a dead
one with a frozen screen.

Bug:  821607 
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
Reviewed-on: https://chromium-review.googlesource.com/1020297
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Tarun Bansal <tbansal@chromium.org>
Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#552828}
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/net/nqe/network_quality_estimator.h
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/services/network/network_service.cc

> This CL does not solve the underlying problem that is still being investigated. It gives the user a crippled system instead of a dead one with a frozen screen.

Is the "crippled system" condition logged in crash.corp, UMA, etc.?  Since we do not have a repro case for this, we'll want to figure out how often it is happening in the field.
The CL in #103 moves the blocking call off the IO thread to avoid a frozen screen. We should still get a shutdown hang crash like in issue 806125.
To follow up on Media Router's use of getifaddrs, we are actually always calling it on a task runner that specifies MayBlock in its TaskTraits.  As a result, we think the previous shutdown hang fix is all we need to do.  Feel free to let us know if you want any further action from us.
I was pinged at https://bugs.chromium.org/p/chromium/issues/detail?id=773764#c12, which mentions a crash (https://crash/ad6ad5d5bb265650) that is related to this issue.

Trying to dig up more info, I looked at all crashes from the device:
https://crash.corp.google.com/browse?q=ClientID%3D%27ca6f25bc5e10461a8d2df6aae2e67268%27#samplereports

There are some kernel crashes and I wonder whether they could be relevant to this issue:

http://crash/ccca397c704d7e5d
<1>[18559.910093] chrome: Corrupted page table at address 16700d0bf640

http://crash/84a8c24003d6b9e5
<0>[32483.547370] PANIC: double fault, error_code: 0x0

http://crash/a5def7cf65292bff, wifi driver crash ?
e02db452-iwl_trans_pcie_send_hcmd+0x42d/0x54a [iwlwifi]()

http://crash/5ef643533fe3619d This one is more interesting because it has socket calls on stack
<4>[26891.633251] general protection fault: 0000 [#1] PREEMPT SMP 
<0>[26891.635500] gsmi: Log Shutdown Reason 0x03
<4>[26891.635513] Modules linked in: rfcomm ip6t_REJECT nf_reject_ipv6 ccm cmac uinput xt_nat snd_soc_dmic bridge snd_skl_nau88l25_max98357a snd_soc_hdac_hdmi snd_soc_skl snd_soc_skl_ipc snd_soc_sst_acpi snd_soc_sst_ipc snd_soc_sst_dsp memconsole_x86_legacy snd_hda_ext_core memconsole snd_hda_core stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 acpi_als industrialio_triggered_buffer kfifo_buf iptable_nat snd_soc_ssm4567 snd_soc_max98357a industrialio snd_soc_nau8825 nf_nat_ipv4 nf_nat zram xt_mark fuse ip6table_filter snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlmvm iwlwifi iwl7000_mac80211 cfg80211 uvcvideo btusb btrtl btbcm btintel bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
<4>[26891.635726] CPU: 0 PID: 3317 Comm: chrome Tainted: G        W      3.18.0-16387-g09d1f8eebf5f #1
<4>[26891.635743] Hardware name: Google sentry/sentry, BIOS Google_Sentry.7820.314.0 06/08/2017
<4>[26891.635761] task: ffff88020e69bfe0 ti: ffff88020e4fc000 task.ti: ffff88020e4fc000
<4>[26891.635775] RIP: 0010:[<ffffffff8ac839e5>]  [<ffffffff8ac839e5>] ttwu_stat+0x79/0xd0
<4>[26891.635799] RSP: 0018:ffff88020e4ffa78  EFLAGS: 00010006
<4>[26891.635811] RAX: 4b5099c43f3f9a00 RBX: ffff8802751ce460 RCX: 00000000000f4240
<4>[26891.635826] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8802751ce460
<4>[26891.635839] RBP: ffff88020e4ffab8 R08: 0000000000000000 R09: 000000000009d9c0
<4>[26891.635853] R10: 00000000000bf88d R11: 0000000000019c94 R12: 0000000000013a40
<4>[26891.635866] R13: 0000000000000000 R14: ffff88027ec13a40 R15: 0000000000000003
<4>[26891.635881] FS:  000070870b84c780(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
<4>[26891.635897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[26891.635909] CR2: 00003eb4e9698000 CR3: 000000020e54e000 CR4: 00000000003607f0
<4>[26891.635922] Stack:
<4>[26891.635928]  000000000000024d 00000000751ce460 ffff88027ed93a40 ffff8802751ce460
<4>[26891.635949]  0000000000000001 0000000000013a40 0000000000000000 0000000000000003
<4>[26891.635972]  ffff88020e4ffb18 ffffffff8ac86f28 0000000180100010 0000000000000046
<4>[26891.635993] Call Trace:
<4>[26891.636007]  [<ffffffff8ac86f28>] try_to_wake_up+0x1ce/0x1ed
<4>[26891.636023]  [<ffffffff8ac86f92>] default_wake_function+0x12/0x14
<4>[26891.636038]  [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.636053]  [<ffffffff8ac96a8e>] __wake_up_locked+0x13/0x15
<4>[26891.636068]  [<ffffffff8ad87ba1>] ep_poll_callback+0x106/0x145
<4>[26891.636082]  [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.636097]  [<ffffffff8ac96ee5>] __wake_up_sync_key+0x49/0x5f
<4>[26891.636113]  [<ffffffff8b1ab398>] sock_def_readable+0x5b/0x5d
<4>[26891.636130]  [<ffffffff8b26284e>] unix_stream_sendmsg+0x2d9/0x365
<4>[26891.636147]  [<ffffffff8b1a578a>] __sock_sendmsg_nosec+0x25/0x27
<4>[26891.636162]  [<ffffffff8b1a80d7>] sock_sendmsg+0x7d/0xb2
<4>[26891.636177]  [<ffffffff8ad6ade1>] ? __fget+0x70/0x7b
<4>[26891.636190]  [<ffffffff8ad6b045>] ? __fget_light+0x44/0x56
<4>[26891.636205]  [<ffffffff8b1a93d3>] SYSC_sendto+0x145/0x188
<4>[26891.636221]  [<ffffffff8ace0b66>] ? seccomp_phase1+0x48/0x95
<4>[26891.636238]  [<ffffffff8ac0eb2d>] ? syscall_trace_enter_phase1+0xf5/0x151
<4>[26891.636256]  [<ffffffff8b1a94b6>] SyS_sendto+0xe/0x10
<4>[26891.636272]  [<ffffffff8b2a8992>] system_call_fastpath+0x1c/0x21
<4>[26891.636284] Code: 4a 48 ff 87 78 01 00 00 89 55 cc e8 a4 49 02 00 48 63 55 cc 4c 89 e0 48 03 04 d5 00 5f 8f 8b 48 8b 80 40 09 00 00 48 85 c0 74 1b <4c> 0f a3 b8 28 01 00 00 19 d2 85 d2 74 08 ff 80 fc 00 00 00 eb 
<1>[26891.636434] RIP  [<ffffffff8ac839e5>] ttwu_stat+0x79/0xd0
<4>[26891.636449]  RSP <ffff88020e4ffa78>
<4>[26891.636459] ---[ end trace 63c7927576143327 ]---
<4>[26891.643658] general protection fault: 0000 [#2] PREEMPT SMP 
<4>[26891.643666] Modules linked in: rfcomm ip6t_REJECT nf_reject_ipv6 ccm cmac uinput xt_nat snd_soc_dmic bridge snd_skl_nau88l25_max98357a snd_soc_hdac_hdmi snd_soc_skl snd_soc_skl_ipc snd_soc_sst_acpi snd_soc_sst_ipc snd_soc_sst_dsp memconsole_x86_legacy snd_hda_ext_core memconsole snd_hda_core stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 acpi_als industrialio_triggered_buffer kfifo_buf iptable_nat snd_soc_ssm4567 snd_soc_max98357a industrialio snd_soc_nau8825 nf_nat_ipv4 nf_nat zram xt_mark fuse ip6table_filter snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlmvm iwlwifi iwl7000_mac80211 cfg80211 uvcvideo btusb btrtl btbcm btintel bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
<4>[26891.643756] CPU: 1 PID: 1244 Comm: chrome Tainted: G      D W      3.18.0-16387-g09d1f8eebf5f #1
<4>[26891.643761] Hardware name: Google sentry/sentry, BIOS Google_Sentry.7820.314.0 06/08/2017
<4>[26891.643766] task: ffff880072422da0 ti: ffff880267dc4000 task.ti: ffff880267dc4000
<4>[26891.643771] RIP: 0010:[<ffffffff8ac8dfa8>]  [<ffffffff8ac8dfa8>] select_task_rq_fair+0x2d3/0x7e7
<4>[26891.643781] RSP: 0018:ffff880267dc79c8  EFLAGS: 00010006
<4>[26891.643786] RAX: 000000000000000f RBX: 00000000ffffffff RCX: ffff880275a4ed80
<4>[26891.643790] RDX: ffff88027ed13a40 RSI: 0000000000000004 RDI: 0000000000000002
<4>[26891.643794] RBP: ffff880267dc7a98 R08: ffff8801f797a4a0 R09: 0000000000000002
<4>[26891.643799] R10: 0000000000000029 R11: ffff880268af4340 R12: ffff88023f3f8e00
<4>[26891.643803] R13: f4b2d1dbf797a500 R14: ffff88019fa36d80 R15: f4b2d1dbf797a520
<4>[26891.643808] FS:  000076cba510a780(0000) GS:ffff88027ec80000(0000) knlGS:0000000000000000
<4>[26891.643813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[26891.643817] CR2: 0000112a9257b9f0 CR3: 00000002660ca000 CR4: 00000000003607e0
<4>[26891.643822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[26891.643826] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[26891.643829] Stack:
<4>[26891.643832]  ffff880267dc7a28 ffffffff8ac8ff4b 0000000000577476 0000000000000000
<4>[26891.643840]  000000000000049a 0000880200000001 ffff880267dc7a08 ffff88000179b000
<4>[26891.643848]  ffff880267dc7a28 ffffffff8ac8d2f0 ffff880134e67800 ffff88027ec13a40
<4>[26891.643856] Call Trace:
<4>[26891.643865]  [<ffffffff8ac8ff4b>] ? enqueue_entity+0x535/0x64f
<4>[26891.643873]  [<ffffffff8ac8d2f0>] ? cpu_overutilized+0x1d/0x43
<4>[26891.643881]  [<ffffffff8ac90142>] ? enqueue_task_fair+0xdd/0xe6
<4>[26891.643890]  [<ffffffff8ac86267>] select_task_rq+0x11/0x45
<4>[26891.643898]  [<ffffffff8ac86e4b>] try_to_wake_up+0xf1/0x1ed
<4>[26891.643907]  [<ffffffff8ad49e69>] ? __kmalloc_track_caller+0x78/0x135
<4>[26891.643916]  [<ffffffff8ac86f92>] default_wake_function+0x12/0x14
<4>[26891.643922]  [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.643929]  [<ffffffff8ac96a8e>] __wake_up_locked+0x13/0x15
<4>[26891.643937]  [<ffffffff8ad87ba1>] ep_poll_callback+0x106/0x145
<4>[26891.643943]  [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.643950]  [<ffffffff8ac96ee5>] __wake_up_sync_key+0x49/0x5f
<4>[26891.643958]  [<ffffffff8b1ab398>] sock_def_readable+0x5b/0x5d
<4>[26891.643966]  [<ffffffff8b26284e>] unix_stream_sendmsg+0x2d9/0x365
<4>[26891.643975]  [<ffffffff8b1a578a>] __sock_sendmsg_nosec+0x25/0x27
<4>[26891.643983]  [<ffffffff8b1a80d7>] sock_sendmsg+0x7d/0xb2
<4>[26891.643990]  [<ffffffff8ad55c9f>] ? __sb_end_write+0x2e/0x5d
<4>[26891.643997]  [<ffffffff8ad6ade1>] ? __fget+0x70/0x7b
<4>[26891.644004]  [<ffffffff8ad6b045>] ? __fget_light+0x44/0x56
<4>[26891.644013]  [<ffffffff8b1a93d3>] SYSC_sendto+0x145/0x188
<4>[26891.644022]  [<ffffffff8ad52cd3>] ? fsnotify_modify+0x57/0x5f
<4>[26891.644030]  [<ffffffff8ad52c4a>] ? fdput.isra.12+0xf/0x11
<4>[26891.644038]  [<ffffffff8ad52c75>] ? fdput_pos.isra.13+0x29/0x30
<4>[26891.644047]  [<ffffffff8b1a94b6>] SyS_sendto+0xe/0x10
<4>[26891.644056]  [<ffffffff8b2a8992>] system_call_fastpath+0x1c/0x21
<4>[26891.644060] Code: 8d ff ff 84 c0 74 da 89 df e8 c1 a4 ff ff 85 c0 74 cf 89 5d cc e9 67 01 00 00 49 8b 86 08 03 00 00 4d 8d 7d 20 83 cb ff 83 e0 0f <49> 85 45 20 75 30 4d 8b 6d 00 4d 3b 6c 24 10 75 de 4d 8b 64 24 
<1>[26891.644147] RIP  [<ffffffff8ac8dfa8>] select_task_rq_fair+0x2d3/0x7e7
<4>[26891.644155]  RSP <ffff880267dc79c8>
<4>[26891.644160] ---[ end trace 63c7927576143328 ]---
<0>[26891.663895] Kernel panic - not syncing: Fatal exception
<0>[26892.727364] Kernel Offset: 0x9c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[26892.727516] gsmi: Log Shutdown Reason 0x02



> Trying to dig more info, I looked at all crashes from the device

When I checked earlier, most of the Client IDs reporting this crash only had one crash on file (i.e. no kernel issues).

If we can establish a pattern of similar kernel crashes across multiple devices that are experiencing this issue, that would be an interesting finding.

FWIW, crash ad6ad5d5bb265650 doesn't have the same signature as the other hangs we've been investigating on this bug.  It is in epoll_dispatch().


> <1>[18559.910093] chrome: Corrupted page table at address 16700d0bf640

> <0>[32483.547370] PANIC: double fault, error_code: 0x0

When I see reports like these, I usually assume that it's either one device with bad RAM / cooling / power / etc. OR that something is randomly scribbling on kernel memory.

It wouldn't be too surprising to see random hangs as one manifestation of corruption.
epoll_dispatch() with SIGABRT is another incarnation of the shutdown hang crash. If you look at the "Threads" tab of the ad6ad5d5bb265650 crash, you will find AddressTrackerLinux::ReadMessages on its IO thread.
Cc: ericsheh@google.com
Hi team,

This is Eric from Shanghai Techstop, redirected from crbug/773764. Our local recruiting team has been somewhat heavily affected by this bug too (and more heavily by crbug/820307). 

This is the crash report for this bug, and I've sent a feedback through alt+shift+i after alt+volUp+x but not sure how to find a link to it or if you can locate it by using the crash report ID. It was sent via wezhao@'s Lenovo Chromebook 13".
https://crash.corp.google.com/browse?stbtiq=ad6ad5d5bb265650#1

It seems to be random. What the affected users have in common is that they all use the Hangouts App from the Web Store https://chrome.google.com/webstore/detail/google-hangouts/knipolnnllmklapflnccelgolnpehhpl/related, they make lots of video and audio / phone calls every day, they use headsets connected to the 3.5mm audio jack, and they use Lenovo Chromebook 13" (TVCs).

These two bugs come to their Chromebooks in a random manner every now and then (multiple times per day), and when using Alt+VolUp+X to restore the browser, they usually can't hear anything when making calls through the Hangouts App, and a full system restart always restores the audio functionalities. 

Could you please shed some light? And we are happy to try what you suggest to help with the troubleshooting since they can easily reproduce this problem every day.

Thanks!
Oops sorry, just found that what I reported in comment#111 was already captured by comment#107 - #109.
I can confirm this is also affecting the 30+ Lenovo Chromebook 13's (Type 20GL) that I manage. Unlike Eric's, these Chromebooks are only used for accessing EHR software. There are no video/audio calls or headsets connected. There are USB mice connected, and bluetooth is disabled.

Comment 114 by mh...@gmrsd.com, Apr 25 2018

I can confirm we are getting multiple random reboots on HP 11 G4 and G5 units.
Labels: Merge-Request-67
Request merge https://chromium-review.googlesource.com/1020297 in #103 to M67.
Re #114 This issue has nothing to do with random reboots. This is about a hang.  

If you're seeing reboots/crashes, please report a new bug. Include a detailed description of the problem, version information from chrome://version and the contents of chrome://crashes from an affected machine.  Thanks.
Chell on 66.0.3359.117 crashed yesterday with https://crash.corp.google.com/browse?stbtiq=68e5c398aad7d913

It just hung, Alt+Vol_Up+X  did not unfreeze the screen.

Together with other crashes for recent days:
89b72ddb8e7489f9
0e36f0f2392351ee
afa448b3e6f7d67a
f8a2d99f6f8a16fd
bc016ba5927d4798
dd10e36dbfd3d381
c85be17d3a8a35af
305ea8bec3b40e56
fc37ed424823ba7b
Attachment: CrashReports.png (223 KB)
Labels: -M-65 M-67
Project Member

Comment 119 by sheriffbot@chromium.org, Apr 26 2018

Labels: -Merge-Request-67 Merge-Approved-67 Hotlist-Merge-Approved
Your change meets the bar and is auto-approved for M67. Please go ahead and merge the CL to branch 3396 manually. Please contact milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 120 by bugdroid1@chromium.org, Apr 26 2018

Labels: -merge-approved-67 merge-merged-3396
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ca1a959283cab03aa00aaf036712c2eeb7ab00dc

commit ca1a959283cab03aa00aaf036712c2eeb7ab00dc
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Thu Apr 26 19:27:31 2018

Merge M67 "cros: Move GetCurrentNetworkId call off IO thread."

> Make DataReductionProxyConfig/NetworkQualityEstimator call
> net::GetWifiSSID() on a worker thread instead of the IO thread
> on ChromeOS as a work around for  https://crbug.com/821607 .
>
> This CL does not solve the underlying problem that is still being
> investigated. It gives the user a crippled system instead of a dead
> one with a frozen screen.
>
> Bug:  821607 
> Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
> Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
> Reviewed-on: https://chromium-review.googlesource.com/1020297
> Reviewed-by: Matt Menke <mmenke@chromium.org>
> Reviewed-by: Tarun Bansal <tbansal@chromium.org>
> Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#552828}

(cherry picked from commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a)

Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I6e01df1eb38996cb223e9e1105f21195fad7211e
Reviewed-on: https://chromium-review.googlesource.com/1030971
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#337}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/net/nqe/network_quality_estimator.h
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/services/network/network_service.cc

Comment 121 Deleted

I've experienced this frequently on lars under 65 and 66. 64 appeared to be fine. Seems to be network related. Hangs occur with Bluetooth either enabled or disabled.

For me, the issue occurs *only* in one specific location where wifi coverage is known to be patchy. I got an ERR_NAME_RESOLUTION_FAILED in an open browser tab about two seconds before the latest hang.

A list of crashes from my machine:

687bd09bb906aea3
0bbe035358fbe37d
48d3f4bdfc3e047b
a9d541a93746d468
2ec31561e9c002be
f490b44129739133
7b5cd7113f4e8229
Project Member

Comment 123 by sheriffbot@chromium.org, May 2 2018

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
cernekee@, found the following this morning:

https://groups.google.com/forum/#!topic/fa.linux.kernel/zoWnuxWdJFk

It seems in line with the symptoms of this issue (netlink socket read hangs). Could you check whether these patches are relevant and whether they should be applied to 3.18?
Cc: grundler@chromium.org
c#124:
> cernekee@, found the following this morning:

Interesting find, thanks for the info.

> https://patchwork.ozlabs.org/patch/519245/ 
> https://patchwork.ozlabs.org/patch/520824/

The first link (519245 - landed as kernel commit 1f770c0a09da8 "netlink: Fix autobind race condition that leads to zero port ID") says:

    The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID.  This led to kernel
    deadlocks that were observed by multiple people.

c0bb07df7d9 was introduced in 4.1-rc5.  Our 3.18 tree does not have the offending commit.

c0bb07df7d9 was a fix for c5adde9468b07 ("netlink: eliminate nl_sk_hash_lock") which landed in 4.0-rc1.  Our 3.18 tree DOES have this commit; it was backported to satisfy Jetstream requirements as part of  bug 556861 .

The second link (520824 - landed as kernel commit da314c9923fe "netlink: Replace rhash_portid with bound") fixed more races created by 1f770c0a09da.  This commit is also not in our tree.

> For 4.0.x, you _really_ need to update to 4.0.9 to get the following two patches. 
>
> cf8befcc1a55 netlink: Disable insertions/removals during rehash 
> 18889a4315a5 netlink: Reset portid after netlink_insert failure 

Neither of these is in our 3.18 tree, either.

cf8befcc1a55 says:

    netlink: Disable insertions/removals during rehash
    
    [ Upstream commit: Not applicable ]
    
    The current rhashtable rehash code is buggy and can't deal with
    parallel insertions/removals without corrupting the hash table.
    
    This patch disables it by partially reverting
    c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate
    nl_sk_hash_lock").

18889a4315a5 is a backport of c0bb07df7d9.

So, since we have a "special" 3.18 tree that includes a backport of c5adde9468b07 (3.18.y -stable does not), I think we may need to take the following fixes from 4.0.y -stable:

919d9db95218 netlink: Fix netlink_insert EADDRINUSE error
18889a4315a5 netlink: Reset portid after netlink_insert failure
cf8befcc1a55 netlink: Disable insertions/removals during rehash

However... 4.0.y -stable includes a backport of c0bb07df7d9 (18889a4315a5) but it does not include the follow-up fixes for the new bugs that c0bb07df7d9 created.  This might be due to the fact that 4.0.y reached EOL a few months before 1f770c0a09da landed upstream.  So we would also want to backport:

1f770c0a09da netlink: Fix autobind race condition that leads to zero port ID
da314c9923fe netlink: Replace rhash_portid with bound

The other option is to try to revert the backported changes that introduced deadlocks.  I'm guessing that stock 3.18.y doesn't have these issues.

Grant, WDYT?
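
One way to double-check this kind of "is fix X in our tree?" audit is to scan commit messages for cherry-pick trailers, because a backport gets a new hash of its own. The sketch below assumes the backports were made with `git cherry-pick -x` (or otherwise carry a "(cherry picked from commit ...)" line); the function names are hypothetical. A trailer names whichever commit was actually picked, so a fix taken from a -stable branch would show up under its stable hash (e.g. 18889a4315a5) rather than the mainline one.

```python
import re

# Matches the trailer that `git cherry-pick -x` appends to a commit message.
CHERRY_PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})")


def picked_upstream_hashes(log_text):
    """Return the set of source hashes named in cherry-pick trailers."""
    return set(CHERRY_PICK_RE.findall(log_text))


def missing_fixes(log_text, wanted):
    """Return the entries of `wanted` that no trailer in log_text refers to.

    `wanted` maps an (abbreviated) commit hash to its subject line.
    """
    present = picked_upstream_hashes(log_text)
    return {
        short: subject
        for short, subject in wanted.items()
        if not any(full.startswith(short) for full in present)
    }
```

The `log_text` input could come from something like `git log --format=%B -- net/netlink` run in the tree being audited.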
Labels: M-66 Merge-Request-66
Request to merge https://chromium-review.googlesource.com/1020297 to M66.
Labels: -Merge-Request-66 Merge-Approved-66
Approved as per b/79122581 reports/validation
Project Member

Comment 128 by bugdroid1@chromium.org, May 3 2018

Labels: -merge-approved-66 merge-merged-3359
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/197d96429cbd24e22ed28faa6f370bd62992ea49

commit 197d96429cbd24e22ed28faa6f370bd62992ea49
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Thu May 03 01:45:50 2018

Merge M66 "cros: Move GetCurrentNetworkId call off IO thread."

> Make DataReductionProxyConfig/NetworkQualityEstimator call
> net::GetWifiSSID() on a worker thread instead of the IO thread
> on ChromeOS as a work around for  https://crbug.com/821607 .
>
> This CL does not solve the underlying problem that is still being
> investigated. It gives the user a crippled system instead of a dead
> one with a frozen screen.
>
> Bug:  821607 
> Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
> Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
> Reviewed-on: https://chromium-review.googlesource.com/1020297
> Reviewed-by: Matt Menke <mmenke@chromium.org>
> Reviewed-by: Tarun Bansal <tbansal@chromium.org>
> Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#552828}

(cherry picked from commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a)

Bug: b/79122581
Change-Id: Ib4e336fbaefc1bb6c1c3ab019f8b2ed7a35ed18b
Reviewed-on: https://chromium-review.googlesource.com/1041277
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/branch-heads/3359@{#791}
Cr-Branched-From: 66afc5e5d10127546cc4b98b9117aff588b5e66b-refs/heads/master@{#540276}
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/chrome/browser/io_thread.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/net/nqe/network_quality_estimator.h

Cc: -josephsih@chromium.org -ericsheh@google.com kevinhayes@google.com kyan@chromium.org kkunduru@chromium.org
re comment #125

Kevin, I have three thoughts:
1) This is going to be messy whichever option we choose. I'd rather pull in fixes from upstream and/or stable trees than revert the 106 rhashtable patches. I think your outline of what to backport is reasonable.

   But maybe 3.18 needs more than 5 netlink patches backported?
    git log --oneline v3.18.. -- net/netlink | fgrep netlink: | wc -l
    97

   I've added Kan Yan, Kishan Kunduru, and Kevin Hayes as FYI. One or more of them should be CCd on code reviews since this code is shared with "gale" kernel ("Google Wifi").


2) 3.18 is "special" because it's not running the native wireless stack. :/
   See the original bug why I pulled in most of the rhashtable support. Reverting most of these really isn't an option.
    https://bugs.chromium.org/p/chromium/issues/detail?id=556861

    "rhashtable.c in chromeos-3.18 branch .... won't work for USE=wireless42 builds."


3) This is a canonical example of why skylake chipset should update to a newer kernel version. We will be supporting these until "Nov 2022" (HP Chromebook 13 G1 == chell) and this will just get more painful every year.
Components: OS>Kernel
Labels: -Pri-0 Pri-1
cernekee@/grundler@, could either of you take this over or help find an owner to figure out the next step for the kernel fix?

Tag with OS>Kernel.
Drop to P1 since we had a workaround CL in M66, M67 and M68/ToT.
Owner: cernekee@chromium.org
Kevin said he would take a look at it. I promised to review any proposed changes.
Cc: xiy...@chromium.org
Thanks. cc myself since I am interested to learn. :)
Will EOL devices get an update to 65 that will fix this issue? I'm assuming 66 will not make it to EOL devices to address this. 
-CS

I'm getting reports of random hangs from my users (using Asus C302). Is there anything specific that I need to look out for or report back?

Comment 135 by scamdyn@gmail.com, May 14 2018

Reporting this JUST started happening on 68.0.3429.0 Canary 64-bit
Sent an Alt+Shift+i report stating the hang is similar to M65. Logs included.
This started happening and has the EXACT same symptoms as M65. 
Dell CB1C13 Wolf (Haswell)
-CS

Comment 136 by scamdyn@gmail.com, May 14 2018

Forgot to add, it happens about every minute and only Alt+VolumeUP+X or Refresh+Power will unfreeze. Unresponsive. Forced to move back to Beta. 
-CS
Re #135: The issue you observed in 68.0.3429.0 is issue 842505, where a debugging dump blocks the UI thread. The offending CL is reverted and should be fixed in the next dev build.

Comment 138 by scamdyn@gmail.com, May 15 2018

Re #137:

Thanks for that information. Will move back to Canary soon. 
-CS
Do we have a commit on this fix? Which Milestones will get it rolled out? Will it go Dev-->Beta-->Stable, or jump right to Stable?
Labels: Hotlist-ConOps-CrOS
(Bulk Edit) Adding the new conops Chrome OS hotlist to all open issues with the "#CBC-RS/TC-watchlist" tag, our former tracking tag.
Owner: ----
Status: Available (was: Assigned)
Kevin C no longer works for Google.
Owner: snanda@chromium.org
Status: Assigned (was: Available)
snanda@ any thoughts on who can pick this up? I know some workarounds landed, but it's not clear to me what remaining work needs to be done here.
FYI, I powerwashed on May 29 because of this issue.

I disabled this flag, chrome://flags/#arc-boot-completed-broadcast, on May 30 after the last crash.

It has stopped the browser hanging and crashes.

Uploaded Crash Report ID a4939159e859af03 (Local Crash ID: Chrome)
Crash report uploaded on Wednesday, May 30, 2018 at 8:06:31 PM

Uploaded Crash Report ID c01d4ead7c03b633 (Local Crash ID: Chrome)
Crash report uploaded on Wednesday, May 30, 2018 at 8:04:36 PM

Uploaded Crash Report ID dff7dc112ddd0079 (Local Crash ID: ChromeOS_ARC)
Crash report uploaded on Wednesday, May 30, 2018 at 6:28:36 PM

Uploaded Crash Report ID d7e679bcd6eaf2e2 (Local Crash ID: ChromeOS)
Crash report uploaded on Tuesday, May 29, 2018 at 4:20:36 PM


Google Chrome	67.0.3396.69 (Official Build) beta (32-bit)
Revision	dafe8337c251b6f1f248539bd06eeb3a685d865c-refs/branch-heads/3396@{#722}
Platform	10575.52.0 (Official Build) beta-channel veyron_minnie
Firmware Version	Google_Veyron_Minnie.6588.237.0
ARC	4808759
JavaScript	V8 6.7.288.43
Flash	29.0.0.171 /opt/google/chrome/pepper/libpepflashplayer.so
User Agent	Mozilla/5.0 (X11; CrOS armv7l 10575.52.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.69 Safari/537.36
Owner: briannorris@chromium.org
Cc: -r...@chromium.org -dmitrygr@google.com -mcchou@chromium.org
Components: OS>Systems>Network
Blockedon: 849872
Issue 832084 has been merged into this issue.
Project Member

Comment 149 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0cdaad0cc65125c1a0726ed347b627ff662ee77f

commit 0cdaad0cc65125c1a0726ed347b627ff662ee77f
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:32 2018

UPSTREAM: netlink: Reset portid after netlink_insert failure

The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
because it doesn't reset portid after a failed netlink_insert.

This means that should autobind fail the first time around, then
the socket will be stuck in limbo as it can never be bound again
since it already has a non-zero portid.

Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c0bb07df7d981e4091432754e30c9c720e2c0c78)

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I5cfee0c833c70f1ad3b82e3f6d4cf5bee189256e
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091452
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/0cdaad0cc65125c1a0726ed347b627ff662ee77f/net/netlink/af_netlink.c
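
The "stuck in limbo" state described in this commit message can be illustrated with a toy model. This is a deliberately simplified Python sketch, not the kernel code: `ToySocket`, `ToyTable`, `autobind`, and the `reset_on_failure` switch are all invented for illustration, and the error names are plain strings.

```python
class ToySocket:
    def __init__(self):
        self.portid = 0  # 0 means "not bound yet"


class ToyTable:
    """Toy stand-in for the netlink socket hash table."""

    def __init__(self):
        self.by_portid = {}

    def insert(self, sock, portid, reset_on_failure):
        if sock.portid:
            return "EBUSY"  # socket already claims a port ID
        sock.portid = portid  # claimed before hashing
        if portid in self.by_portid:
            if reset_on_failure:
                sock.portid = 0  # the fix: undo the claim
            return "EADDRINUSE"
        self.by_portid[portid] = sock
        return "OK"


def autobind(table, sock, candidates, reset_on_failure):
    """Try candidate port IDs in order, retrying only on EADDRINUSE."""
    result = "EADDRINUSE"
    for portid in candidates:
        result = table.insert(sock, portid, reset_on_failure)
        if result != "EADDRINUSE":
            break
    return result


def is_hashed(table, sock):
    """True if the socket can actually be found in the table."""
    return any(s is sock for s in table.by_portid.values())
```

With `reset_on_failure=False`, a socket whose first candidate port ID collides keeps that non-zero ID, so the retry is rejected and the socket is never hashed. With `reset_on_failure=True`, the retry succeeds on the next candidate.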

Project Member

Comment 150 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7c67f79118352aefa4f470136e799c3945ea944e

commit 7c67f79118352aefa4f470136e799c3945ea944e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:33 2018

UPSTREAM: netlink: Use default rhashtable hashfn

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  As the key length is a multiple of 4, this means that
we will actually end up using jhash2.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 11b58ba146ccd7b105c4962c75f2e744053c85bc)

BUG= chromium:821607 ,  chromium:849872 
TEST=build and boot

Change-Id: Ifd74f8ccc3be372ede6105fee47d832e95bf73b7
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091453
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/7c67f79118352aefa4f470136e799c3945ea944e/net/netlink/af_netlink.c

Project Member

Comment 151 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/1f735d252607b26f34fd88aae42e2cd6471b7861

commit 1f735d252607b26f34fd88aae42e2cd6471b7861
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:35 2018

UPSTREAM: netlink: Fix autobind race condition that leads to zero port ID

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separate rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1f770c0a09da855a2b51af6d19de97fb955eca85)

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I065a53d0d8a897ce648e4a6e99b6fc28e3f46625
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091454
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/1f735d252607b26f34fd88aae42e2cd6471b7861/net/netlink/af_netlink.c
[modify] https://crrev.com/1f735d252607b26f34fd88aae42e2cd6471b7861/net/netlink/af_netlink.h
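
The TEST= line above ("netlink send/recv repeatedly, on many threads - watch for timeouts") could be approximated along the following lines. This is a rough Python sketch, not the linked gist and not the test that was actually run. It is Linux-only, the function names and default parameters are assumptions, and the numeric constants are the usual values from the Linux netlink/rtnetlink headers.

```python
import socket
import struct
from concurrent.futures import ThreadPoolExecutor

NETLINK_ROUTE = 0
RTM_GETADDR = 22
NLMSG_ERROR = 2
NLMSG_DONE = 3
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH
NLMSG_HDR = struct.Struct("=IHHII")  # len, type, flags, seq, pid
IFADDRMSG = struct.Struct("=BBBBI")  # family, prefixlen, flags, scope, index


def build_dump_request(seq):
    """Build an RTM_GETADDR dump request for all address families."""
    length = NLMSG_HDR.size + IFADDRMSG.size
    hdr = NLMSG_HDR.pack(length, RTM_GETADDR, NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return hdr + IFADDRMSG.pack(socket.AF_UNSPEC, 0, 0, 0, 0)


def parse_nlmsgs(buf):
    """Split a receive buffer into (type, payload) pairs."""
    msgs, off = [], 0
    while off + NLMSG_HDR.size <= len(buf):
        length, mtype, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, off)
        if length < NLMSG_HDR.size or off + length > len(buf):
            break  # truncated or malformed message
        msgs.append((mtype, buf[off + NLMSG_HDR.size:off + length]))
        off += (length + 3) & ~3  # messages are 4-byte aligned
    return msgs


def dump_once(seq, timeout_s):
    """Return True if the dump reached NLMSG_DONE, False on timeout."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_ROUTE) as s:
        s.settimeout(timeout_s)
        # The socket is never bound explicitly, so the kernel has to
        # autobind it on first send.
        s.sendto(build_dump_request(seq), (0, 0))
        try:
            while True:
                for mtype, _payload in parse_nlmsgs(s.recv(65536)):
                    if mtype == NLMSG_DONE:
                        return True
                    if mtype == NLMSG_ERROR:
                        raise OSError("netlink error reply")
        except socket.timeout:
            return False


def stress(threads=8, iterations=1000, timeout_s=2.0):
    """Return how many dumps timed out across all worker threads."""
    def worker(tid):
        return sum(not dump_once(tid * iterations + i + 1, timeout_s)
                   for i in range(iterations))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(worker, range(threads)))
```

On a healthy kernel one would expect `stress()` to return 0. A non-zero count would mean some request never received its NLMSG_DONE reply, which is the "radio silence" symptom discussed earlier in this bug.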

Project Member

Comment 152 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9

commit 663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:36 2018

BACKPORT: FROMGIT: netlink: Disable insertions/removals during rehash

[ Upstream commit: Not applicable ]

The current rhashtable rehash code is buggy and can't deal with
parallel insertions/removals without corrupting the hash table.

This patch disables it by partially reverting
c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate
nl_sk_hash_lock").

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit cf8befcc1a5538b035d478424efcc2d50e66928e
 git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.0.y)

Conflicts:
   net/netlink/af_netlink.c
[rhashtable_remove_fast vs. rhashtable_remove]

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I6063076587c0a9ede57e319989a426ee6f6ebe61
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091455
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9/net/netlink/af_netlink.c

Project Member

Comment 153 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/32a47cf484dc30229a88cb0746076be99799bc3e

commit 32a47cf484dc30229a88cb0746076be99799bc3e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:38 2018

UPSTREAM: netlink: Replace rhash_portid with bound

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not be visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit da314c9923fed553a007785a901fd395b7eb6c19)

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I4baab91ca840fcb07a0844ac9f48dcc71fddd509
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091506
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/32a47cf484dc30229a88cb0746076be99799bc3e/net/netlink/af_netlink.c
[modify] https://crrev.com/32a47cf484dc30229a88cb0746076be99799bc3e/net/netlink/af_netlink.h

How confident are we these recent CLs will solve the crash?

If we are very confident we can merge them directly into 65 to try to stabilize the impacted boards.
Labels: -Hotlist-Merge-Approved
> How confident are we these recent CLs will solve the crash?

Point of clarity (since this is a large, noisy bug): IIUC this issue mostly (entirely?) doesn't actually produce crashes -- it produces hung threads. The main way this has shown up in crash reports is when someone forces a Chrome restart (Alt-VolUp-X) to escape the hang. (Someone correct me if I'm wrong.)

But given the analysis that has happened so far (mostly without me; but I tried to validate what I could), it seems very likely that these hangs should be fixed.

> If we are very confident we can merge them directly into 65 to try to stabilize the impacted boards.

I'd be most worried about regressions. This is pretty critical code here, and it'd be a shame to introduce further regressions (the series of CLs even includes and reverts/modifies 2 different attempts that upstream developers made at fixing subtle race conditions). I would usually prefer this get a full cycle of testing and observing any additional reported issues (or lack thereof) to gain confidence.

I also don't understand the suggestion for an M65 merge -- isn't M65 long superseded, with M66 out for weeks and M67 rolling to stable now? Feel free to educate me off-bug if needed.
R65 was the AUE milestone for Sandybridge systems (butterfly, parrot, lumpy, stumpy), and it was particularly unfortunate that R65 was hanging/crashing on them more than we would like. 

The R65 merge would be a one-off push just for these systems; as you point out, 65 is otherwise deprecated.

Agreed, we should be confident in these fixes before we do that. We can see how well they fare on 69 for a couple of weeks and look into the 65 aspect afterward.
> R65 was the AUE milestone for Sandybridge systems (butterfly, parrot, lumpy, stumpy), and it was particularly unfortunate that R65 was hanging/crashing on them more than we would like. 

Those aren't running the 3.18 kernel, which is where the above fixes were targeted (we believe that issue was specific to our kernel 3.18). If there were problems on 3.8 kernels, they were very likely a different issue, and we should probably fork a different bug, like I did for  bug 849872 . This one is already extremely noisy, with everybody and their mother/father dumping their issues. (It's possible that even  bug 849872  is somewhat of a sidetrack? But at least it was one clearly-identified problem in the mix here.)

If you can point me at specific points in this bug that apply to those systems, then I can try to extract details to a new bug.

I'm tempted to close this bug soon (still not sure if it needs to be pushed to pre-M69 at all). If archaeology can pull out additional actionable details for independent issues, then we can still file new follow-up bugs.
My apologies, I was confusing this with https://bugs.chromium.org/p/chromium/issues/detail?id=844256 which may or may not be related to some of the comments on this centibug, but I agree with your assessment that if there is nothing actionable left here we should close this and move on.
Labels: -M-67 -M-66 M-69
Status: Fixed (was: Assigned)
OK, let's say Fixed for M-69 (not sure how effective the mitigations for M-67/M-66 were?). Do holler (preferably on a new bug) if a related issue is still hanging around.
But if the bug is still present on EOL devices with M-65 (I cannot access bug report 844256), why is this bug marked as Fixed? I own one of these devices and it has been freezing constantly for the last two months. Where can I find updated information about the resolution of this bug for EOL devices with M-65?
Giulio,
As explained above, this particular bug is focused on devices running a specific kernel (3.18 kernel). We're closing this bug and planning to file new, follow-up bugs if necessary for other devices.

There is a Google-restricted bug investigating the stability issues reported on recent EOL devices like yours: <https://bugs.chromium.org/p/chromium/issues/detail?id=844256>. We don't have an update to share publicly yet and are still working on reproducing those issues reliably. We'll likely post an update in the Chromebook Forum once we have some news to report.
Project Member

Comment 162 by bugdroid1@chromium.org, Jun 18 2018

Labels: merge-merged-release-R68-10718.B-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/5079028f7918373835f76e13e2825e35230524fc

commit 5079028f7918373835f76e13e2825e35230524fc
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon Jun 18 18:43:26 2018

UPSTREAM: netlink: Reset portid after netlink_insert failure

The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
because it doesn't reset portid after a failed netlink_insert.

This means that should autobind fail the first time around, then
the socket will be stuck in limbo as it can never be bound again
since it already has a non-zero portid.

Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c0bb07df7d981e4091432754e30c9c720e2c0c78)

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I5cfee0c833c70f1ad3b82e3f6d4cf5bee189256e
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091452
Reviewed-by: Grant Grundler <grundler@chromium.org>
(cherry picked from commit 0cdaad0cc65125c1a0726ed347b627ff662ee77f)
Reviewed-on: https://chromium-review.googlesource.com/1104942

[modify] https://crrev.com/5079028f7918373835f76e13e2825e35230524fc/net/netlink/af_netlink.c

Project Member

Comment 163 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e00564d51de68cca9f008bfc6963d79e9bb4e852

commit e00564d51de68cca9f008bfc6963d79e9bb4e852
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon Jun 18 18:43:33 2018

UPSTREAM: netlink: Use default rhashtable hashfn

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  As the key length is a multiple of 4, this means that
we will actually end up using jhash2.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 11b58ba146ccd7b105c4962c75f2e744053c85bc)

BUG= chromium:821607 ,  chromium:849872 
TEST=build and boot

Change-Id: Ifd74f8ccc3be372ede6105fee47d832e95bf73b7
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091453
Reviewed-by: Grant Grundler <grundler@chromium.org>
(cherry picked from commit 7c67f79118352aefa4f470136e799c3945ea944e)
Reviewed-on: https://chromium-review.googlesource.com/1104943

[modify] https://crrev.com/e00564d51de68cca9f008bfc6963d79e9bb4e852/net/netlink/af_netlink.c

Project Member

Comment 164 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/3520cd6a3ad97bb052dfa4a0928baab974117a9d

commit 3520cd6a3ad97bb052dfa4a0928baab974117a9d
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon Jun 18 18:43:51 2018

UPSTREAM: netlink: Fix autobind race condition that leads to zero port ID

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separate rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1f770c0a09da855a2b51af6d19de97fb955eca85)

BUG= chromium:821607 ,  chromium:849872 
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I065a53d0d8a897ce648e4a6e99b6fc28e3f46625
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091454
Reviewed-by: Grant Grundler <grundler@chromium.org>
(cherry picked from commit 1f735d252607b26f34fd88aae42e2cd6471b7861)
Reviewed-on: https://chromium-review.googlesource.com/1104944

[modify] https://crrev.com/3520cd6a3ad97bb052dfa4a0928baab974117a9d/net/netlink/af_netlink.c
[modify] https://crrev.com/3520cd6a3ad97bb052dfa4a0928baab974117a9d/net/netlink/af_netlink.h

Project Member

Comment 165 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e7fd03b8b5253057ce756b9e854d46e2ae9e771e

commit e7fd03b8b5253057ce756b9e854d46e2ae9e771e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon Jun 18 18:43:58 2018

BACKPORT: FROMGIT: netlink: Disable insertions/removals during rehash

[ Upstream commit: Not applicable ]

The current rhashtable rehash code is buggy and can't deal with
parallel insertions/removals without corrupting the hash table.

This patch disables it by partially reverting
c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate
nl_sk_hash_lock").

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit cf8befcc1a5538b035d478424efcc2d50e66928e
 git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.0.y)

Conflicts:
   net/netlink/af_netlink.c
[rhashtable_remove_fast vs. rhashtable_remove]

BUG=chromium:821607, chromium:849872
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I6063076587c0a9ede57e319989a426ee6f6ebe61
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091455
Reviewed-by: Grant Grundler <grundler@chromium.org>
(cherry picked from commit 663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9)
Reviewed-on: https://chromium-review.googlesource.com/1104945

[modify] https://crrev.com/e7fd03b8b5253057ce756b9e854d46e2ae9e771e/net/netlink/af_netlink.c

Project Member

Comment 166 by bugdroid1@chromium.org, Jun 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/76eb986d89481d1fe9c9319eeeb77bab7d4afccb

commit 76eb986d89481d1fe9c9319eeeb77bab7d4afccb
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon Jun 18 18:44:02 2018

UPSTREAM: netlink: Replace rhash_portid with bound

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not be visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit da314c9923fed553a007785a901fd395b7eb6c19)

BUG=chromium:821607, chromium:849872
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts;
     similar to this test code:
     https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

Change-Id: I4baab91ca840fcb07a0844ac9f48dcc71fddd509
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091506
Reviewed-by: Grant Grundler <grundler@chromium.org>
(cherry picked from commit 32a47cf484dc30229a88cb0746076be99799bc3e)
Reviewed-on: https://chromium-review.googlesource.com/1104946

[modify] https://crrev.com/76eb986d89481d1fe9c9319eeeb77bab7d4afccb/net/netlink/af_netlink.c
[modify] https://crrev.com/76eb986d89481d1fe9c9319eeeb77bab7d4afccb/net/netlink/af_netlink.h
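
The binding rule described in the commit message above (EBUSY only when the socket is already bound to a different port ID, and "bound" set only after the socket has been hashed) can be illustrated with a small toy model. This is not the kernel code: ToySocket, ToyTable and race_bind are hypothetical names, and a single lock stands in for the kernel's socket lock and write barrier, so it shows the ordering and the return-value rule only.

```python
import errno
import threading


class ToySocket:
    """Stand-in for a netlink socket: just the two fields under discussion."""

    def __init__(self):
        self.portid = 0
        self.bound = False


class ToyTable:
    """Stand-in for the port ID hash table; a lock replaces the kernel's barriers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_portid = {}

    def insert(self, sock, portid):
        with self._lock:
            if sock.bound:
                # Already bound: only a *different* port ID is an error.
                return 0 if sock.portid == portid else -errno.EBUSY
            if portid in self._by_portid:
                return -errno.EADDRINUSE
            self._by_portid[portid] = sock  # hash the socket first ...
            sock.portid = portid
            sock.bound = True               # ... and publish "bound" last
            return 0

    def lookup(self, portid):
        with self._lock:
            return self._by_portid.get(portid)


def race_bind(table, sock, portids):
    """Bind the same socket from one thread per port ID; return {portid: result}."""
    results = {}

    def worker(portid):
        results[portid] = table.insert(sock, portid)

    threads = [threading.Thread(target=worker, args=(p,)) for p in portids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(race_bind(ToyTable(), ToySocket(), [100, 200]))
```

In this model exactly one of two competing binds with different port IDs succeeds, and any reader that sees bound set can also find the socket in the table. Repeating the winning bind returns 0 rather than EBUSY.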

Have TWO Acer c710:

THIS one is:

Google Chrome	65.0.3325.209 (Official Build) (64-bit)
Revision	0
Platform	10323.67.0 (Official Build) stable-channel parrot
Firmware Version	Google_Parrot.2685.37.0
JavaScript	V8 6.5.254.43
Flash	29.0.0.113

Would have to come back here using other unit to post its info.

BUT, maybe not so relevant since THIS one only works in Guest-mode now (thanks to YET unfixed BLSOD bug -- the Black Login Screen Of Doom update) and it STILL does the freeze-restart now.

More specifically, both units freeze and/or restart now (and both usually always have mail.com in some tab or window); but the other one (logged in, but with long-time profile error) much more often, and often on Facebook; and this one, with possibly increasing frequency but on no site in particular, though often there is a Flash page (or two) in another tab or window.

More simply though, earlier today, with various tabs open, this one froze and restarted. Then, right after that restart, did it again while logging in to just mail.com in one tab and searching google in another.  Similar has happened before IIRC.

NOTE: Would have put this under the "new" bug per @161 but no such bug-link was left there -- and it took rather a long time just to RE-find this bug after that last crash, as guest-mode has no bookmarks or history of course (not even a simple session-history for re-opening mistakenly closed tabs, for some bizarre reason.)

It has now been one FULL day, and Week, and MONTH since question #160 and answer #161 concerning EOL-devices -- and since despite the status of THIS bug, the "High number of Chrome browser hang in M65" bug is certainly NOT fixed, even in GUEST-mode with only ONE website loaded (consistently, mail.com BTW) then would somebody in charge please consider:

1. How about updating THIS bug-title per #161 to "High number of Chrome browser hang in M65 with kernel 3.18"? 

2. How about starting a new "High number of Chrome browser hang in M65 with kernel OTHER than 3.18" bug? 

3. How about adding a brief explanation at the beginning of this bug about how to find your kernel version?

4. How about REVERTING our collective EOL devices back to M64 until you find some fix for OUR bug??



Having the text-cursor disappear consistently at certain columns in some text boxes, even address-bar, is a bit bothersome.

Having to keep the room real quiet and yet still struggle to hear some popular video sites, gets a little inconvenient.

Having webpage tasks routinely swell up in memory until the GUI gets sluggish, is somewhat annoying.

Being "permanently" stuck in Guest-mode now, thanks to BSOD-login bug, is rather distressing.

But repeatedly CRASHING now -- even in GUEST-MODE -- is REALLY AGGRAVATING.


PLEASE FIX or REVERT -- ASAP!

Concur: this has been unresolved for more than four months. Why can't EOL devices be reverted to M64? This is sad.
Re c#169,

You can download the recovery file of M64 on this page:

https://cros-updates-serving.appspot.com

Look for the name of your device in the AUE Devices section at the bottom of the page. I think there is a way to prevent it from updating back to M65, but I am not sure.
#161: "There is a Google-restricted bug investigating the stability issues reported on recent EOL devices like yours. We don't have an update to share publicly yet and are still working on reproducing those issues reliably."

re: ..."still working on reproducing those issues reliably."

Again, as explained above, just loading and/or logging-in on Mail.com reliably causes hangs and/or reboots, even in GUEST-mode (which should eliminate a great deal of unit-specific config variables), and even right after a reboot or sometimes a wakeup.  But, once logged in, system may run as normal for many, many hours (except for annoying unrelated issues in #168.)

#170: "You can download the recovery file" -- but this is a Chromebook, and the page/website gives no instructions about what to do with the file.

It appears the problem may be specific to certain hardware. So far we have not seen it on the units we have, but we are procuring more.

The recovery image file can be written to a USB drive using the recovery utility; its gear icon should offer the option 'Use local image'.
This bug is not where the more recent instability issues were being debugged, but since folks have been watching it, I want to point out that we have found a suspected bad change and pushed a new R65 build with a revert (10323.67.9). So far this appears to have stopped the crash types that were causing problems on SandyBridge devices in R65.

If you are still seeing this, you can try the new version by going to about://help and clicking 'Check for updates' on the stable channel. So far it is live for only 1% of devices, but checking for updates manually will give you the update. If we don't find any other issues, we will continue to ramp up to the rest of the SandyBridge fleet.
Since there is no "check for updates" button on EOL devices, something that seems to have worked for me was going to chrome://help and then refreshing the page multiple times until I received an update and was asked to restart the system. :-o
Thank you for addressing this issue, much appreciated.  

IIUC it looks like omahaproxy & the update servers will grab version 10323.67.9 / 65.0.3325.209 for these AUE devices but, from what I can tell, the recovery images are still kind of a mixed bag:

- lumpy 10176.76.0 / 64.0.3282.190
- butterfly, parrot, stumpy: 10323.62.0 / 65.0.3325.184

This may not be all the devices affected but it's the ones I grabbed in a quick search.

Would it be possible & perhaps prudent to update recovery.conf with the fix for these devices and/or make them available at https://cros-updates-serving.appspot.com/ under the 'Recovery' column for download?

Just thinking out loud and trying to anticipate any problems we might run into when this fix gets more attention.

Thanx for your efforts and indulgence.
Attachment: Screenshot 2018-08-30 at 11.37.56 AM.png (241 KB)
Yes, the recovery images have not been updated yet, but even if they are used, the devices should get an update shortly thereafter. Once we are more confident in the build and it has rolled out further, we can update the recovery images.
#c176: That makes total sense, thanx for the feedback and explanation; the rollout strategy is complex for sure and I only know a piece of it.

As a point of reference, I recovered my Acer C7 parrot/sandybridge yesterday with 10323.62.0 / 65.0.3325.184. It didn't seem to update after repeatedly reloading the 'About Chrome OS' page so I switched to the beta channel which is on the same version and I eventually got a 'Restart' notification. I'm not sure what triggered the update but I did get it.

I have not experienced any 'hangs' so I believe the revert fixed it.

Thanx again.