High number of Chrome browser hangs in M65
Issue description

ChromeOS version: 65.0.3325.89 - 65.0.3325.150
ChromeOS device model: Cave and Chell
Case#: 15118606

Description: Devices crashing on M65 Betas, requiring a hard reboot. This was suspected to be related to crbug.com/803594, but upgrading to 65.0.3325.150 didn't resolve it and the crash doesn't look similar. There were no crashes prior to M65 Beta, and the customer is really concerned that the update might break their devices.

Steps to reproduce: No known particular pattern. No peripherals, no network switch.

Current Behavior / Reproduction: Device crashes
Expected Behavior: No crashes

Drive link to logs:
crash report ID 47fa08a115fe579b (for version 150)
https://drive.google.com/file/d/11z51h7IzbU41R3I9M5eWiJbSt9NLWkrv/view?usp=sharing (version 150)
https://drive.google.com/file/d/1-skwk2HvJVrzO1a-BG-ou51FEnpeWiRf/view?usp=sharing (crash at 9:23 March 8)
Crash ID for one of the previous crashes - 7d6a813c007f9788
,
Apr 14 2018
What about something like https://chromium-review.googlesource.com/#/c/chromium/src/+/1012751
,
Apr 14 2018
> The CL seems making the problem happen more frequently.

Can someone translate this into tentative reproduction steps?
,
Apr 14 2018
> netlink packets can be dropped by the kernel under memory pressure conditions

This should probably fail the syscall with ENOBUFS? AIUI the kernel always tries to warn netlink users that things have gotten out of sync. Unless the sendto() is failing (which should be logged), I do find it strange that we don't see either ENOBUFS or a NLMSG_DONE response. We shouldn't just get radio silence from the kernel.

When I paste this code into a test program and run it in an "empty" netns with lo up, it receives 3 replies from the kernel: one message with the IPv4 address, one message with the IPv6 address, and one message with type NLMSG_DONE.

> (chrome -address_tracker_linux.cc:209 ) net::internal::AddressTrackerLinux::Init()

I believe :209 means the first dump request, RTM_GETADDR, is blocking. So it doesn't ever reach the second dump request (RTM_GETLINK). If something is going wrong on the kernel side, it is possible that the kernel is unable to acquire rtnl_lock. However, when I look at crash.corp data for M65 (10323.*) I don't see unusually high numbers of e.g. hung_tasks crashes.
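To make the "3 replies ending in NLMSG_DONE" expectation concrete, here is a minimal Python sketch that walks a netlink receive buffer and reports the message types it contains. It is not the test program mentioned above. The helper names (`message_types`, `dump_is_complete`, `fake_message`) are hypothetical, and the buffer in the usage note is synthetic rather than real kernel output. Only the header layout and the two constants are taken from the Linux UAPI headers.

```python
import struct

# Netlink message header (struct nlmsghdr in <linux/netlink.h>), native byte
# order: total length, type, flags, sequence number, sender port ID.
NLMSG_HDR = struct.Struct("=IHHII")
NLMSG_DONE = 3      # <linux/netlink.h>: terminates a multipart dump
RTM_NEWADDR = 20    # <linux/rtnetlink.h>: one address entry in a dump reply


def nlmsg_align(length):
    """Round a length up to the 4-byte netlink alignment."""
    return (length + 3) & ~3


def message_types(buf):
    """Return the nlmsg_type of every complete message in a receive buffer."""
    types = []
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size or offset + length > len(buf):
            break  # truncated or malformed message: stop parsing
        types.append(msg_type)
        offset += nlmsg_align(length)
    return types


def dump_is_complete(buf):
    """True once the buffer contains the NLMSG_DONE terminator."""
    return NLMSG_DONE in message_types(buf)


def fake_message(msg_type, payload=b""):
    """Build a synthetic message for illustration (hypothetical helper)."""
    length = NLMSG_HDR.size + len(payload)
    padding = b"\0" * (nlmsg_align(length) - length)
    return NLMSG_HDR.pack(length, msg_type, 0, 1, 0) + payload + padding
```

A reader that never sees `NLMSG_DONE` (and never sees an error such as ENOBUFS) is in the "radio silence" state described above: `dump_is_complete(fake_message(RTM_NEWADDR))` is false, while appending `fake_message(NLMSG_DONE)` makes it true.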
,
Apr 14 2018
> This should probably fail the syscall with ENOBUFS? AIUI the kernel always tries to warn netlink users that things have gotten out of sync.

Not if it drops the packets before the recv call, and the recv call is blocking, I think?
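The difference between a blocking recv and a bounded wait can be sketched as follows. This is only an illustration in Python, not Chrome's code: a `socketpair` stands in for the netlink socket, and the names `recv_with_timeout`, `kernel`, and `client` are hypothetical. If the reply never arrives, the bounded version returns control to the caller instead of parking the thread forever.

```python
import select
import socket


def recv_with_timeout(sock, timeout_s, bufsize=4096):
    """Wait up to timeout_s seconds for data; return bytes, or None on timeout."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # no reply arrived: the caller can log, retry, or give up
    return sock.recv(bufsize)


def demo():
    # "kernel" and "client" are stand-ins for the two ends of the conversation.
    kernel, client = socket.socketpair()
    try:
        silent = recv_with_timeout(client, 0.05)    # nothing was ever sent
        kernel.send(b"reply")
        answered = recv_with_timeout(client, 0.05)  # data is already queued
        return silent, answered
    finally:
        kernel.close()
        client.close()
```

In the first call of `demo()` a plain blocking `client.recv()` would never return, which matches the stuck stacks discussed in this thread. With the timeout the caller gets `None` and can decide what to do next.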
,
Apr 14 2018
Btw, the fact that there are no hung tasks in the kernel tells me that this is probably not a kernel issue... That's why I suspect it's simply the blocking call in user space that's the root cause.
,
Apr 14 2018
c#7:
> Crash: Uploaded Crash Report ID ec89295e5a8a18e7 (Local Crash ID: Chrome)

This hang is interesting because it is in a different part of the code. Magic signature is "media_router::GetDiscoveryNetworkInfoList()" and it is calling getifaddrs() in libc. The backtrace suggests that it is also stuck waiting for a netlink reply that never arrives.

Looking at the crash stats: https://goto.google.com/cwzal
- The earliest recorded instance was on 64.0.3282.24 / 10176.13.1
- This was the first M64 beta channel push
- 99.94% of these crashes on Chrome OS are on 3.18 kernels; cyan hit hardest
- It is plausible that only beta/stable channels have enough users to hit the bug

Some notable 3.18 kernel changes between M63 beta promotion and M64 beta promotion: https://goto.google.com/aploc
- packet socket lock fixes (only seems to involve a spinlock though)
- KEYS fixes
- KAISER :-(
- netlink dump start callback

Regarding the socket lock fixes, we seem to be missing a follow-up fix; unsure if it's related: https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/1013410
,
Apr 14 2018
Just gotta insert, during the deep dive into code, that both Edgar and Parrot have been rock stable since switching Named Servers to 1.1.1.1/1.0.0.1; two days running now. The comments above make me think that the speed of these devices' background page requests is masking the kernel problem you all are investigating. I'm just a Top Contributor CBC user, sharing anecdotal, not diagnostic, data. Appreciate you all opening up this bug so that we Top Contributors can watch and comment. f
,
Apr 14 2018
Other notes:

I enabled lock debugging on cyan, did not see anything relevant in the logs:

$ git diff | grep "^+"
+++ b/chromeos/config/base.config
+CONFIG_DEBUG_LIST=y
+CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
+CONFIG_PROVE_LOCKING=y

If I start ~10,000 iterations of the attached test program in parallel with a bash for loop, none of them get stuck waiting for a reply.

When I looked up the affected client IDs with the media_router signature (c#83, c#7) in crash.corp, most only had the 1 (Chrome) crash logged. A few had multiple instances of the Chrome crash. None of them showed crashes in the kernel or in other modules.
,
Apr 17 2018
cernekee@, found this interesting crash: http://crash/cfaec84753bccd8b

If you look at the "Threads" tab, chrome has 100 threads. 47 of them have a stack of net::HaveOnlyLoopbackAddresses() -> getifaddrs(). HaveOnlyLoopbackAddresses() is a job posted to the task scheduler from HostResolverImpl::OnIPAddressChanged [1][2]. 47 of them probably means getifaddrs() is somehow stuck. WDYT?

[1] https://cs.chromium.org/chromium/src/net/dns/host_resolver_impl.cc?rcl=f9115e6c040cfe10c044dc45b75304170d09db67&l=2527
[2] https://cs.chromium.org/chromium/src/net/dns/host_resolver_impl.cc?rcl=f9115e6c040cfe10c044dc45b75304170d09db67&l=2448
,
Apr 18 2018
Another possibility is that a socket fd somehow gets corrupted. I will land a CL (https://chromium-review.googlesource.com/c/chromium/src/+/1017867) that puts a close guard on AddressTrackerLinux's |netlink_fd_| to see if it could catch anything.
,
Apr 19 2018
Hi guys! I am a user and was redirected here from the Chromebook support forum. I am an unfortunate victim of the random OS crash without a traceable reason. Please check my forum question here: https://productforums.google.com/forum/?utm_medium=email&utm_source=footer#!msg/chromebook-central/gJIZreO52uY/h9XxQd9NCgAJ

I have a Samsung Chromebook 3 (Chromebook Samsung XE500C13 32Gb) with stable 65.0.3325.209 official 64 bit. The issue happens randomly once or twice a day. Please let me know how I can help with fixing the issue, as I am willing to contribute. If you need a log report or something, let me know. Thanks!
,
Apr 19 2018
Re: comment 88: Please submit a bug report (shift + alt + i) with this information; this bug is not the place for it.
,
Apr 19 2018
Ok. My apologies. I was not reporting a bug (I did that in the Chromebook forum); I was just hoping to contribute.
,
Apr 19 2018
My CL to move GetCurrentNetworkID off the IO thread: https://chromium-review.googlesource.com/c/chromium/src/+/1020297
And the CL in #78: https://chromium-review.googlesource.com/#/c/chromium/src/+/1012751
If we land either of these, we should at least no longer block the IO thread and freeze the screen.
,
Apr 19 2018
Moving GetCurrentNetworkID off of the IO thread seems like a good solution. If that's the path we take, I wonder if we can CHECK that somehow we're not running on the IO thread. Basically the opposite of DCHECK(io_thread_checker_.CalledOnValidThread());
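The "opposite of a thread checker" idea can be illustrated with a small Python sketch. This is not Chromium code and does not use any Chromium API: the thread name `Chrome_IOThread`, the function `assert_not_on_io_thread`, and the stand-in `get_current_network_id` are all assumptions made up for the example. The point is only that a potentially blocking call refuses to run when it finds itself on the thread that must stay responsive.

```python
import threading

IO_THREAD_NAME = "Chrome_IOThread"  # hypothetical name, used only in this sketch


def assert_not_on_io_thread():
    """Fail loudly if a potentially blocking call is made on the IO thread."""
    if threading.current_thread().name == IO_THREAD_NAME:
        raise RuntimeError("blocking call made on the IO thread")


def get_current_network_id():
    """Stand-in for a lookup that may block on the kernel."""
    assert_not_on_io_thread()
    return "network-id"


def run_on_thread(name, fn):
    """Run fn on a thread with the given name and capture its outcome."""
    outcome = {}

    def body():
        try:
            outcome["value"] = fn()
        except RuntimeError as exc:
            outcome["error"] = str(exc)

    thread = threading.Thread(target=body, name=name)
    thread.start()
    thread.join()
    return outcome
```

Calling `run_on_thread("worker", get_current_network_id)` returns the value, while the same call on a thread named `IO_THREAD_NAME` reports an error instead of silently risking a hang.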
,
Apr 19 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4917ffb987b4bdd2b202b99907d1771801691dac

commit 4917ffb987b4bdd2b202b99907d1771801691dac
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Apr 19 23:25:18 2018

UPSTREAM: net/packet: fix a race in packet_bind() and packet_notifier()

[ Upstream commit 15fe076edea787807a7cdc168df832544b58eba6 ]

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases po->bind_lock, another thread can run packet_notifier() and process an NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before first thread is able to grab again po->bind_lock.

Fixes this issue by temporarily setting po->num to 0, as suggested by David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945! ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS: 0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
tun_detach drivers/net/tun.c:670 [inline]
tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
__fput+0x333/0x7f0 fs/file_table.c:210
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:865
do_group_exit+0x149/0x400 kernel/exit.c:968
SYSC_exit_group kernel/exit.c:979 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:977
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 18f0f8c1e866a5c22d2bebcc368c5217670753cf)

BUG= chromium:821607
TEST=buildbots

Change-Id: I4170dbf965371dc3ef84e745d5a5a59499665bf4
Reviewed-on: https://chromium-review.googlesource.com/1013410
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>

[modify] https://crrev.com/4917ffb987b4bdd2b202b99907d1771801691dac/net/packet/af_packet.c
,
Apr 20 2018
Here are 2 file:///var/log/messages texts from when I got a browser hang. Acer R11 (cyan), 65.0.3325.209 Stable. Yes, I do send alt+shift+i feedback. I would send a more recent one, but the last time it happened, it happened twice in the same day, and now file:///var/log/messages only downloads a corrupted file without a file extension when I try to access it, so I apologize for the lack of current files.

For apr82018.txt, see 2018-04-08T09:39:03 for when the hang begins.
For apr92018.txt, see 2018-04-09T16:32:09 for when the hang begins.
,
Apr 20 2018
Re #94: In both logs I see the device rebooting after sleep rather than freezing.
,
Apr 20 2018
Re# 95: I rebooted a little while after it stopped responding for both. Either holding down power button or hard reset. I didn't let it run through for long while frozen. Does it look like I rebooted before anything happened?
,
Apr 20 2018
After some discussion, we will likely temporarily remove Media Router's use of getifaddrs (disabling our device caching mechanism) until it can be replaced with a non-blocking netlink socket directly. We currently don't know how many of the hangs we may be responsible for, though, so it's hard to say how much this will help. mfoltz@ may comment with more information later.
,
Apr 23 2018
Do we know that the fix landed in comment #93 addresses this issue?
,
Apr 23 2018
cernekee@ knows best. His comment in #83 says:

"Regarding the socket lock fixes, we seem to be missing a follow-up fix; unsure if it's related:"

I'll land a workaround to move a couple of get network id calls off the IO thread to make it less painful for the user. We would still get the shutdown hang crashes if the problem still happens (i.e. if the #93 fix does not address the underlying issue).
,
Apr 23 2018
Cave on DEV channel: I used to hit this freeze several times a week. It seems to have vanished about one month ago.
,
Apr 23 2018
,
Apr 23 2018
> Do we know that the fix landed in comment #93 addresses this issue?

I don't have any evidence that this will fix the issue. It was just something I noticed missing when looking at the kernel code.
,
Apr 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a

commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Mon Apr 23 21:04:27 2018

cros: Move GetCurrentNetworkId call off IO thread.

Make DataReductionProxyConfig/NetworkQualityEstimator call net::GetWifiSSID() on a worker thread instead of the IO thread on ChromeOS as a work around for https://crbug.com/821607 .

This CL does not solve the underlying problem that is still being investigated. It gives the user a crippled system instead of a dead one with a frozen screen.

Bug: 821607
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
Reviewed-on: https://chromium-review.googlesource.com/1020297
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Tarun Bansal <tbansal@chromium.org>
Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#552828}

[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/net/nqe/network_quality_estimator.h
[modify] https://crrev.com/afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a/services/network/network_service.cc
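The general pattern behind this workaround (run the possibly blocking lookup on a worker, deliver the result through a callback) can be sketched in Python as below. This is an illustration of the pattern only, not the contents of the CL: `get_wifi_ssid_blocking`, `fetch_network_id`, and the returned SSID string are hypothetical. Note one difference from a post-task-and-reply design: with `concurrent.futures`, the done-callback normally runs on the worker thread rather than being posted back to the caller's thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def get_wifi_ssid_blocking():
    """Hypothetical stand-in for a lookup that may block indefinitely."""
    return "example-ssid", threading.current_thread().name


def fetch_network_id(executor, on_done):
    """Run the blocking lookup on a worker and hand the result to on_done."""
    future = executor.submit(get_wifi_ssid_blocking)
    future.add_done_callback(lambda f: on_done(*f.result()))
    return future


def demo():
    results = []
    with ThreadPoolExecutor(max_workers=1, thread_name_prefix="worker") as pool:
        fetch_network_id(pool, lambda ssid, thread: results.append((ssid, thread)))
    # Leaving the "with" block waits for the worker to finish.
    return results, threading.current_thread().name
```

If the lookup never returned, only the worker would be stuck. The calling thread stays free, which corresponds to the "crippled system instead of a dead one" trade-off described in the commit message.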
,
Apr 23 2018
> This CL does not solve the underlying problem that is still being investigated. It gives the user a crippled system instead of a dead one with a frozen screen.

Is the "crippled system" condition logged in crash.corp, UMA, etc.? Since we do not have a repro case for this, we'll want to figure out how often it is happening in the field.
,
Apr 23 2018
#103 CL moves the blocking call off the IO thread to avoid having a frozen screen. We should still get a shutdown hang crash like in issue 806125.
,
Apr 23 2018
To follow up on Media Router's use of getifaddrs, we are actually always calling it on a task runner that specifies MayBlock in its TaskTraits. As a result, we think the previous shutdown hang fix is all we need to do. Feel free to let us know if you want any further action from us.
,
Apr 24 2018
Pinged at https://bugs.chromium.org/p/chromium/issues/detail?id=773764#c12, it mentions a crash (https://crash/ad6ad5d5bb265650) that is related to this issue.

Trying to dig more info, I looked at all crashes from the device:
https://crash.corp.google.com/browse?q=ClientID%3D%27ca6f25bc5e10461a8d2df6aae2e67268%27#samplereports

There are some kernel crashes and I wonder whether they could be relevant to this issue:

http://crash/ccca397c704d7e5d
<1>[18559.910093] chrome: Corrupted page table at address 16700d0bf640

http://carsh/84a8c24003d6b9e5
<0>[32483.547370] PANIC: double fault, error_code: 0x0

http://crash/a5def7cf65292bff, wifi driver crash ?
e02db452-iwl_trans_pcie_send_hcmd+0x42d/0x54a [iwlwifi]()

http://crash/5ef643533fe3619d
This one is more interesting because it has socket calls on stack

<4>[26891.633251] general protection fault: 0000 [#1] PREEMPT SMP
<0>[26891.635500] gsmi: Log Shutdown Reason 0x03
<4>[26891.635513] Modules linked in: rfcomm ip6t_REJECT nf_reject_ipv6 ccm cmac uinput xt_nat snd_soc_dmic bridge snd_skl_nau88l25_max98357a snd_soc_hdac_hdmi snd_soc_skl snd_soc_skl_ipc snd_soc_sst_acpi snd_soc_sst_ipc snd_soc_sst_dsp memconsole_x86_legacy snd_hda_ext_core memconsole snd_hda_core stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 acpi_als industrialio_triggered_buffer kfifo_buf iptable_nat snd_soc_ssm4567 snd_soc_max98357a industrialio snd_soc_nau8825 nf_nat_ipv4 nf_nat zram xt_mark fuse ip6table_filter snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlmvm iwlwifi iwl7000_mac80211 cfg80211 uvcvideo btusb btrtl btbcm btintel bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
<4>[26891.635726] CPU: 0 PID: 3317 Comm: chrome Tainted: G W 3.18.0-16387-g09d1f8eebf5f #1
<4>[26891.635743] Hardware name: Google sentry/sentry, BIOS Google_Sentry.7820.314.0 06/08/2017
<4>[26891.635761] task: ffff88020e69bfe0 ti: ffff88020e4fc000 task.ti: ffff88020e4fc000
<4>[26891.635775] RIP: 0010:[<ffffffff8ac839e5>] [<ffffffff8ac839e5>] ttwu_stat+0x79/0xd0
<4>[26891.635799] RSP: 0018:ffff88020e4ffa78 EFLAGS: 00010006
<4>[26891.635811] RAX: 4b5099c43f3f9a00 RBX: ffff8802751ce460 RCX: 00000000000f4240
<4>[26891.635826] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8802751ce460
<4>[26891.635839] RBP: ffff88020e4ffab8 R08: 0000000000000000 R09: 000000000009d9c0
<4>[26891.635853] R10: 00000000000bf88d R11: 0000000000019c94 R12: 0000000000013a40
<4>[26891.635866] R13: 0000000000000000 R14: ffff88027ec13a40 R15: 0000000000000003
<4>[26891.635881] FS: 000070870b84c780(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
<4>[26891.635897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[26891.635909] CR2: 00003eb4e9698000 CR3: 000000020e54e000 CR4: 00000000003607f0
<4>[26891.635922] Stack:
<4>[26891.635928] 000000000000024d 00000000751ce460 ffff88027ed93a40 ffff8802751ce460
<4>[26891.635949] 0000000000000001 0000000000013a40 0000000000000000 0000000000000003
<4>[26891.635972] ffff88020e4ffb18 ffffffff8ac86f28 0000000180100010 0000000000000046
<4>[26891.635993] Call Trace:
<4>[26891.636007] [<ffffffff8ac86f28>] try_to_wake_up+0x1ce/0x1ed
<4>[26891.636023] [<ffffffff8ac86f92>] default_wake_function+0x12/0x14
<4>[26891.636038] [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.636053] [<ffffffff8ac96a8e>] __wake_up_locked+0x13/0x15
<4>[26891.636068] [<ffffffff8ad87ba1>] ep_poll_callback+0x106/0x145
<4>[26891.636082] [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.636097] [<ffffffff8ac96ee5>] __wake_up_sync_key+0x49/0x5f
<4>[26891.636113] [<ffffffff8b1ab398>] sock_def_readable+0x5b/0x5d
<4>[26891.636130] [<ffffffff8b26284e>] unix_stream_sendmsg+0x2d9/0x365
<4>[26891.636147] [<ffffffff8b1a578a>] __sock_sendmsg_nosec+0x25/0x27
<4>[26891.636162] [<ffffffff8b1a80d7>] sock_sendmsg+0x7d/0xb2
<4>[26891.636177] [<ffffffff8ad6ade1>] ? __fget+0x70/0x7b
<4>[26891.636190] [<ffffffff8ad6b045>] ? __fget_light+0x44/0x56
<4>[26891.636205] [<ffffffff8b1a93d3>] SYSC_sendto+0x145/0x188
<4>[26891.636221] [<ffffffff8ace0b66>] ? seccomp_phase1+0x48/0x95
<4>[26891.636238] [<ffffffff8ac0eb2d>] ? syscall_trace_enter_phase1+0xf5/0x151
<4>[26891.636256] [<ffffffff8b1a94b6>] SyS_sendto+0xe/0x10
<4>[26891.636272] [<ffffffff8b2a8992>] system_call_fastpath+0x1c/0x21
<4>[26891.636284] Code: 4a 48 ff 87 78 01 00 00 89 55 cc e8 a4 49 02 00 48 63 55 cc 4c 89 e0 48 03 04 d5 00 5f 8f 8b 48 8b 80 40 09 00 00 48 85 c0 74 1b <4c> 0f a3 b8 28 01 00 00 19 d2 85 d2 74 08 ff 80 fc 00 00 00 eb
<1>[26891.636434] RIP [<ffffffff8ac839e5>] ttwu_stat+0x79/0xd0
<4>[26891.636449] RSP <ffff88020e4ffa78>
<4>[26891.636459] ---[ end trace 63c7927576143327 ]---
<4>[26891.643658] general protection fault: 0000 [#2] PREEMPT SMP
<4>[26891.643666] Modules linked in: rfcomm ip6t_REJECT nf_reject_ipv6 ccm cmac uinput xt_nat snd_soc_dmic bridge snd_skl_nau88l25_max98357a snd_soc_hdac_hdmi snd_soc_skl snd_soc_skl_ipc snd_soc_sst_acpi snd_soc_sst_ipc snd_soc_sst_dsp memconsole_x86_legacy snd_hda_ext_core memconsole snd_hda_core stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 acpi_als industrialio_triggered_buffer kfifo_buf iptable_nat snd_soc_ssm4567 snd_soc_max98357a industrialio snd_soc_nau8825 nf_nat_ipv4 nf_nat zram xt_mark fuse ip6table_filter snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlmvm iwlwifi iwl7000_mac80211 cfg80211 uvcvideo btusb btrtl btbcm btintel bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
<4>[26891.643756] CPU: 1 PID: 1244 Comm: chrome Tainted: G D W 3.18.0-16387-g09d1f8eebf5f #1
<4>[26891.643761] Hardware name: Google sentry/sentry, BIOS Google_Sentry.7820.314.0 06/08/2017
<4>[26891.643766] task: ffff880072422da0 ti: ffff880267dc4000 task.ti: ffff880267dc4000
<4>[26891.643771] RIP: 0010:[<ffffffff8ac8dfa8>] [<ffffffff8ac8dfa8>] select_task_rq_fair+0x2d3/0x7e7
<4>[26891.643781] RSP: 0018:ffff880267dc79c8 EFLAGS: 00010006
<4>[26891.643786] RAX: 000000000000000f RBX: 00000000ffffffff RCX: ffff880275a4ed80
<4>[26891.643790] RDX: ffff88027ed13a40 RSI: 0000000000000004 RDI: 0000000000000002
<4>[26891.643794] RBP: ffff880267dc7a98 R08: ffff8801f797a4a0 R09: 0000000000000002
<4>[26891.643799] R10: 0000000000000029 R11: ffff880268af4340 R12: ffff88023f3f8e00
<4>[26891.643803] R13: f4b2d1dbf797a500 R14: ffff88019fa36d80 R15: f4b2d1dbf797a520
<4>[26891.643808] FS: 000076cba510a780(0000) GS:ffff88027ec80000(0000) knlGS:0000000000000000
<4>[26891.643813] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[26891.643817] CR2: 0000112a9257b9f0 CR3: 00000002660ca000 CR4: 00000000003607e0
<4>[26891.643822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[26891.643826] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[26891.643829] Stack:
<4>[26891.643832] ffff880267dc7a28 ffffffff8ac8ff4b 0000000000577476 0000000000000000
<4>[26891.643840] 000000000000049a 0000880200000001 ffff880267dc7a08 ffff88000179b000
<4>[26891.643848] ffff880267dc7a28 ffffffff8ac8d2f0 ffff880134e67800 ffff88027ec13a40
<4>[26891.643856] Call Trace:
<4>[26891.643865] [<ffffffff8ac8ff4b>] ? enqueue_entity+0x535/0x64f
<4>[26891.643873] [<ffffffff8ac8d2f0>] ? cpu_overutilized+0x1d/0x43
<4>[26891.643881] [<ffffffff8ac90142>] ? enqueue_task_fair+0xdd/0xe6
<4>[26891.643890] [<ffffffff8ac86267>] select_task_rq+0x11/0x45
<4>[26891.643898] [<ffffffff8ac86e4b>] try_to_wake_up+0xf1/0x1ed
<4>[26891.643907] [<ffffffff8ad49e69>] ? __kmalloc_track_caller+0x78/0x135
<4>[26891.643916] [<ffffffff8ac86f92>] default_wake_function+0x12/0x14
<4>[26891.643922] [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.643929] [<ffffffff8ac96a8e>] __wake_up_locked+0x13/0x15
<4>[26891.643937] [<ffffffff8ad87ba1>] ep_poll_callback+0x106/0x145
<4>[26891.643943] [<ffffffff8ac96a4c>] __wake_up_common+0x4c/0x7b
<4>[26891.643950] [<ffffffff8ac96ee5>] __wake_up_sync_key+0x49/0x5f
<4>[26891.643958] [<ffffffff8b1ab398>] sock_def_readable+0x5b/0x5d
<4>[26891.643966] [<ffffffff8b26284e>] unix_stream_sendmsg+0x2d9/0x365
<4>[26891.643975] [<ffffffff8b1a578a>] __sock_sendmsg_nosec+0x25/0x27
<4>[26891.643983] [<ffffffff8b1a80d7>] sock_sendmsg+0x7d/0xb2
<4>[26891.643990] [<ffffffff8ad55c9f>] ? __sb_end_write+0x2e/0x5d
<4>[26891.643997] [<ffffffff8ad6ade1>] ? __fget+0x70/0x7b
<4>[26891.644004] [<ffffffff8ad6b045>] ? __fget_light+0x44/0x56
<4>[26891.644013] [<ffffffff8b1a93d3>] SYSC_sendto+0x145/0x188
<4>[26891.644022] [<ffffffff8ad52cd3>] ? fsnotify_modify+0x57/0x5f
<4>[26891.644030] [<ffffffff8ad52c4a>] ? fdput.isra.12+0xf/0x11
<4>[26891.644038] [<ffffffff8ad52c75>] ? fdput_pos.isra.13+0x29/0x30
<4>[26891.644047] [<ffffffff8b1a94b6>] SyS_sendto+0xe/0x10
<4>[26891.644056] [<ffffffff8b2a8992>] system_call_fastpath+0x1c/0x21
<4>[26891.644060] Code: 8d ff ff 84 c0 74 da 89 df e8 c1 a4 ff ff 85 c0 74 cf 89 5d cc e9 67 01 00 00 49 8b 86 08 03 00 00 4d 8d 7d 20 83 cb ff 83 e0 0f <49> 85 45 20 75 30 4d 8b 6d 00 4d 3b 6c 24 10 75 de 4d 8b 64 24
<1>[26891.644147] RIP [<ffffffff8ac8dfa8>] select_task_rq_fair+0x2d3/0x7e7
<4>[26891.644155] RSP <ffff880267dc79c8>
<4>[26891.644160] ---[ end trace 63c7927576143328 ]---
<0>[26891.663895] Kernel panic - not syncing: Fatal exception
<0>[26892.727364] Kernel Offset: 0x9c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[26892.727516] gsmi: Log Shutdown Reason 0x02
,
Apr 24 2018
> Trying to dig more info, I looked at all crashes from the device

When I checked earlier, most of the Client IDs reporting this crash only had one crash on file (i.e. no kernel issues). If we can establish a pattern of similar kernel crashes across multiple devices that are experiencing this issue, that would be an interesting finding.

FWIW, crash ad6ad5d5bb265650 doesn't have the same signature as the other hangs we've been investigating on this bug. It is in epoll_dispatch().

> <1>[18559.910093] chrome: Corrupted page table at address 16700d0bf640
> <0>[32483.547370] PANIC: double fault, error_code: 0x0

When I see reports like these, I usually assume that it's either one device with bad RAM / cooling / power / etc. OR that something is randomly scribbling on kernel memory. It wouldn't be too surprising to see random hangs as one manifestation of corruption.
,
Apr 24 2018
epoll_dispatch() with SIGABRT is another incarnation of the shutdown hang crash. If you look at the "Threads" tab of the ad6ad5d5bb265650 crash, you will find AddressTrackerLinux::ReadMessages on its IO thread.
,
Apr 25 2018
,
Apr 25 2018
Hi team,

This is Eric from Shanghai Techstop, redirected from crbug/773764. Our local recruiting team has been fairly heavily affected by this bug too (and more heavily by crbug/820307).

This is the crash report for this bug. I've sent feedback through alt+shift+i after alt+volUp+x, but I'm not sure how to find a link to it or whether you can locate it using the crash report ID. It was sent via wezhao@'s Lenovo Chromebook 13".
https://crash.corp.google.com/browse?stbtiq=ad6ad5d5bb265650#1

It seems to be random. What the affected users have in common is that they all use the Hangouts App from the Web Store (https://chrome.google.com/webstore/detail/google-hangouts/knipolnnllmklapflnccelgolnpehhpl/related), they make lots of video and audio / phone calls every day, they use headsets connected to the 3.5mm audio jack, and they use Lenovo Chromebook 13" (TVCs).

These two bugs hit their Chromebooks randomly every now and then (multiple times per day). When they use Alt+VolUp+X to restore the browser, they usually can't hear anything when making calls through the Hangouts App, and a full system restart always restores the audio functionality.

Could you please shed some light? We are happy to try whatever you suggest to help with the troubleshooting, since they can easily reproduce this problem every day. Thanks!
,
Apr 25 2018
Oops sorry, just found that what I reported in comment#111 was already captured by comment#107 - #109.
,
Apr 25 2018
I can confirm this is also affecting the 30+ Lenovo Chromebook 13's (Type 20GL) that I manage. Unlike in Eric's case, these Chromebooks are only used for accessing EHR software. There are no video/audio calls or headsets connected. There are USB mice connected, and bluetooth is disabled.
,
Apr 25 2018
I can confirm we are getting multiple random reboots on HP 11 G4 and G5 units.
,
Apr 25 2018
Requesting merge of https://chromium-review.googlesource.com/1020297 (#103) to M67.
,
Apr 25 2018
Re #114 This issue has nothing to do with random reboots. This is about a hang. If you're seeing reboots/crashes, please report a new bug. Include a detailed description of the problem, version information from chrome://version and the contents of chrome://crashes from an affected machine. Thanks.
,
Apr 25 2018
Chell on 66.0.3359.117 crashed yesterday with https://crash.corp.google.com/browse?stbtiq=68e5c398aad7d913
It just hung; Alt+Vol_Up+X did not unfreeze the screen.

Other crashes from recent days:
89b72ddb8e7489f9
0e36f0f2392351ee
afa448b3e6f7d67a
f8a2d99f6f8a16fd
bc016ba5927d4798
dd10e36dbfd3d381
c85be17d3a8a35af
305ea8bec3b40e56
fc37ed424823ba7b
,
Apr 26 2018
,
Apr 26 2018
Your change meets the bar and is auto-approved for M67. Please go ahead and merge the CL to branch 3396 manually. Please contact milestone owner if you have questions. Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ca1a959283cab03aa00aaf036712c2eeb7ab00dc

commit ca1a959283cab03aa00aaf036712c2eeb7ab00dc
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Thu Apr 26 19:27:31 2018

Merge M67 "cros: Move GetCurrentNetworkId call off IO thread."

> Make DataReductionProxyConfig/NetworkQualityEstimator call
> net::GetWifiSSID() on a worker thread instead of the IO thread
> on ChromeOS as a work around for https://crbug.com/821607 .
>
> This CL does not solve the underlying problem that is still being
> investigated. It gives the user a crippled system instead of a dead
> one with a frozen screen.
>
> Bug: 821607
> Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
> Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
> Reviewed-on: https://chromium-review.googlesource.com/1020297
> Reviewed-by: Matt Menke <mmenke@chromium.org>
> Reviewed-by: Tarun Bansal <tbansal@chromium.org>
> Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#552828}

(cherry picked from commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a)

Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I6e01df1eb38996cb223e9e1105f21195fad7211e
Reviewed-on: https://chromium-review.googlesource.com/1030971
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#337}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}

[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/net/nqe/network_quality_estimator.h
[modify] https://crrev.com/ca1a959283cab03aa00aaf036712c2eeb7ab00dc/services/network/network_service.cc
,
Apr 29 2018
I've experienced this frequently on lars under 65 and 66. 64 appeared to be fine.

Seems to be network related. Hangs occur with Bluetooth either enabled or disabled. For me, the issue occurs *only* in one specific location where wifi coverage is known to be patchy. I got an ERR_NAME_RESOLUTION_FAILED in an open browser tab about two seconds before the latest hang.

A list of crashes from my machine:
687bd09bb906aea3
0bbe035358fbe37d
48d3f4bdfc3e047b
a9d541a93746d468
2ec31561e9c002be
f490b44129739133
7b5cd7113f4e8229
,
May 2 2018
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 2 2018
cernekee@, found the following this morning: https://groups.google.com/forum/#!topic/fa.linux.kernel/zoWnuxWdJFk It seems in line with the symptoms of this issue (netlink socket read hangs). Could you help check whether they are relevant and whether the patches should be applied to 3.18?
,
May 2 2018
c#124:
> cernekee@, found the following this morning:
Interesting find, thanks for the info.
> https://patchwork.ozlabs.org/patch/519245/
> https://patchwork.ozlabs.org/patch/520824/
The first link (519245 - landed as kernel commit 1f770c0a09da8 "netlink: Fix autobind race condition that leads to zero port ID") says:
The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink: Reset portid after netlink_insert failure") introduced a race condition where if two threads try to autobind the same socket one of them may end up with a zero port ID. This led to kernel deadlocks that were observed by multiple people.
c0bb07df7d9 was introduced in 4.1-rc5. Our 3.18 tree does not have the offending commit.
c0bb07df7d9 was a fix for c5adde9468b07 ("netlink: eliminate nl_sk_hash_lock") which landed in 4.0-rc1. Our 3.18 tree DOES have this commit; it was backported to satisfy Jetstream requirements as part of bug 556861.
The second link (520824 - landed as kernel commit da314c9923fe "netlink: Replace rhash_portid with bound") fixed more races created by 1f770c0a09da. This commit is also not in our tree.
> For 4.0.x, you _really_ need to update to 4.0.9 to get the following two patches.
>
> cf8befcc1a55 netlink: Disable insertions/removals during rehash
> 18889a4315a5 netlink: Reset portid after netlink_insert failure
Neither of these are in our 3.18 tree, either. cf8befcc1a55 says:
netlink: Disable insertions/removals during rehash
[ Upstream commit: Not applicable ]
The current rhashtable rehash code is buggy and can't deal with parallel insertions/removals without corrupting the hash table. This patch disables it by partially reverting c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock").
18889a4315a5 is a backport of c0bb07df7d9.
So, since we have a "special" 3.18 tree that includes a backport of c5adde9468b07 (3.18.y -stable does not), I think we may need to take the following fixes from 4.0.y -stable:
919d9db95218 netlink: Fix netlink_insert EADDRINUSE error
18889a4315a5 netlink: Reset portid after netlink_insert failure
cf8befcc1a55 netlink: Disable insertions/removals during rehash
However... 4.0.y -stable includes a backport of c0bb07df7d9 (18889a4315a5) but it does not include the follow-up fixes for the new bugs that c0bb07df7d9 created. This might be due to the fact that 4.0.y reached EOL a few months before 1f770c0a09da landed upstream. So we would also want to backport:
1f770c0a09da netlink: Fix autobind race condition that leads to zero port ID
da314c9923fe netlink: Replace rhash_portid with bound
The other option is to try to revert the backported changes that introduced deadlocks. I'm guessing that stock 3.18.y doesn't have these issues.
Grant, WDYT?
,
May 3 2018
Request to merge https://chromium-review.googlesource.com/1020297 to M66.
,
May 3 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/197d96429cbd24e22ed28faa6f370bd62992ea49
commit 197d96429cbd24e22ed28faa6f370bd62992ea49
Author: Xiyuan Xia <xiyuan@chromium.org>
Date: Thu May 03 01:45:50 2018
Merge M66 "cros: Move GetCurrentNetworkId call off IO thread."
> Make DataReductionProxyConfig/NetworkQualityEstimator call
> net::GetWifiSSID() on a worker thread instead of the IO thread
> on ChromeOS as a work around for https://crbug.com/821607 .
>
> This CL does not solve the underlying problem that is still being
> investigated. It gives the user a crippled system instead of a dead
> one with a frozen screen.
>
> Bug: 821607
> Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
> Change-Id: I8e4db5091e8b080ed2bd7f9bc4a3e04b6e6e8018
> Reviewed-on: https://chromium-review.googlesource.com/1020297
> Reviewed-by: Matt Menke <mmenke@chromium.org>
> Reviewed-by: Tarun Bansal <tbansal@chromium.org>
> Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#552828}
(cherry picked from commit afe6dab96bcc8f71d4879f8e8d7f5a8fd199da8a)
Bug: b/79122581
Change-Id: Ib4e336fbaefc1bb6c1c3ab019f8b2ed7a35ed18b
Reviewed-on: https://chromium-review.googlesource.com/1041277
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>
Cr-Commit-Position: refs/branch-heads/3359@{#791}
Cr-Branched-From: 66afc5e5d10127546cc4b98b9117aff588b5e66b-refs/heads/master@{#540276}
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/chrome/browser/io_thread.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/chrome/browser/profiles/profile_impl_io_data.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/components/data_reduction_proxy/core/browser/data_reduction_proxy_config.h
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/197d96429cbd24e22ed28faa6f370bd62992ea49/net/nqe/network_quality_estimator.h
,
May 3 2018
re comment #125
Kevin, I have three thoughts:
1) This is going to be messy for any option. I'd prefer to pull in fixes from upstream and/or stable trees than revert the 106 rhashtable patches. I think your outline on what to backport is reasonable.
But maybe 3.18 needs more than 5 netlink patches backported?
git log --oneline v3.18.. -- net/netlink | fgrep netlink: | wc -l
97
I've added Kan Yan, Kishan Kunduru, and Kevin Hayes as FYI. One or more of them should be CCd on code reviews since this code is shared with "gale" kernel ("Google Wifi").
2) 3.18 is "special" because it's not running the native wireless stack. :/
See the original bug for why I pulled in most of the rhashtable support. Reverting most of these really isn't an option.
https://bugs.chromium.org/p/chromium/issues/detail?id=556861
"rhashtable.c in chromeos-3.18 branch .... won't work for USE=wireless42 builds."
3) This is a canonical example of why the Skylake chipset should update to a newer kernel version. We will be supporting these until "Nov 2022" (HP Chromebook 13 G1 == chell) and this will just get more painful every year.
,
May 4 2018
cernekee@/grundler@, could either of you take this over or help find an owner to figure out the next step for the kernel fix? Tag with OS>Kernel. Drop to P1 since we had a workaround CL in M66, M67 and M68/ToT.
,
May 4 2018
Kevin said he would take a look at it. I promised to review any proposed changes.
,
May 4 2018
Thanks. cc myself since I am interested to learn. :)
,
May 6 2018
Will EOL devices get an update to 65 that will fix this issue? I'm assuming 66 will not make it to EOL devices to address this. -CS
,
May 10 2018
I'm getting reports of random hangs from my users (using Asus C302). Is there anything specific that I need to look out for or report back?
,
May 14 2018
Reporting that this JUST started happening on 68.0.3429.0 Canary 64-bit. Sent an Alt+Shift+i report stating the hang is similar to M65; logs included. It has the EXACT same symptoms as M65. Device: Dell CB1C13 Wolf (Haswell). -CS
,
May 14 2018
Forgot to add, it happens about every minute and only Alt+VolumeUP+X or Refresh+Power will unfreeze. Unresponsive. Forced to move back to Beta. -CS
,
May 15 2018
Re #135: The issue you observed in 68.0.3429.0 is issue 842505, where a debugging dump blocks the UI thread. The offending CL is reverted and should be fixed in the next dev build.
,
May 15 2018
Re #137: Thanks for that information. Will move back to Canary soon. -CS
,
May 21 2018
Do we have a commit on this fix? Which Milestones will get it rolled out? Will it go Dev-->Beta-->Stable, or jump right to Stable?
,
Jun 4 2018
(Bulk Edit) Adding the new conops Chrome OS hotlist to all open issues with the "#CBC-RS/TC-watchlist" tag, our former tracking tag.
,
Jun 4 2018
Kevin C no longer works for Google.
,
Jun 4 2018
snanda@ any thoughts on who can pick this up? I know some workarounds landed, but it's not clear to me what remaining work needs to be done here.
,
Jun 4 2018
FYI, I powerwashed on May 29 because of this issue. I disabled this flag, chrome://flags/#arc-boot-completed-broadcast, on May 30 after the last crash. It has stopped the browser hanging and crashes.
Uploaded Crash Report ID a4939159e859af03 (Local Crash ID: Chrome) Crash report uploaded on Wednesday, May 30, 2018 at 8:06:31 PM
Uploaded Crash Report ID c01d4ead7c03b633 (Local Crash ID: Chrome) Crash report uploaded on Wednesday, May 30, 2018 at 8:04:36 PM
Uploaded Crash Report ID dff7dc112ddd0079 (Local Crash ID: ChromeOS_ARC) Crash report uploaded on Wednesday, May 30, 2018 at 6:28:36 PM
Uploaded Crash Report ID d7e679bcd6eaf2e2 (Local Crash ID: ChromeOS) Crash report uploaded on Tuesday, May 29, 2018 at 4:20:36 PM
Google Chrome 67.0.3396.69 (Official Build) beta (32-bit)
Revision dafe8337c251b6f1f248539bd06eeb3a685d865c-refs/branch-heads/3396@{#722}
Platform 10575.52.0 (Official Build) beta-channel veyron_minnie
Firmware Version Google_Veyron_Minnie.6588.237.0
ARC 4808759
JavaScript V8 6.7.288.43
Flash 29.0.0.171 /opt/google/chrome/pepper/libpepflashplayer.so
User Agent Mozilla/5.0 (X11; CrOS armv7l 10575.52.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.69 Safari/537.36
,
Jun 8 2018
Issue 832084 has been merged into this issue.
,
Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0cdaad0cc65125c1a0726ed347b627ff662ee77f commit 0cdaad0cc65125c1a0726ed347b627ff662ee77f Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed Jun 13 04:50:32 2018 UPSTREAM: netlink: Reset portid after netlink_insert failure The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock") breaks the autobind retry mechanism because it doesn't reset portid after a failed netlink_insert. This means that should autobind fail the first time around, then the socket will be stuck in limbo as it can never be bound again since it already has a non-zero portid. Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit c0bb07df7d981e4091432754e30c9c720e2c0c78) BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I5cfee0c833c70f1ad3b82e3f6d4cf5bee189256e Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091452 Reviewed-by: Grant Grundler <grundler@chromium.org> [modify] https://crrev.com/0cdaad0cc65125c1a0726ed347b627ff662ee77f/net/netlink/af_netlink.c
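Note for readers: the "stuck in limbo" failure described in this commit message can be shown with a toy user-space model. This is a sketch based only on the commit message above, not kernel code. `ToySocket`, `ToyPortTable`, `Autobind`, and the `reset_on_failure` switch are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <set>

// Toy stand-in for a netlink socket: portid == 0 means "not bound yet".
struct ToySocket {
  uint32_t portid = 0;
};

// Toy stand-in for the port hash table.
class ToyPortTable {
 public:
  // Marks a port as already taken by some other socket.
  void Reserve(uint32_t portid) { used_.insert(portid); }

  // Tries to bind `sk` to `portid`. A socket with a non-zero portid is
  // treated as already bound and rejected.
  bool Insert(ToySocket& sk, uint32_t portid, bool reset_on_failure) {
    if (sk.portid != 0) return false;
    sk.portid = portid;  // Set before the table insert is attempted.
    if (!used_.insert(portid).second) {
      // Collision. Without the reset, the stale portid blocks every retry.
      if (reset_on_failure) sk.portid = 0;
      return false;
    }
    return true;
  }

 private:
  std::set<uint32_t> used_;
};

// Tries each candidate port in order until one insert succeeds.
bool Autobind(ToyPortTable& table, ToySocket& sk,
              std::initializer_list<uint32_t> candidates,
              bool reset_on_failure) {
  for (uint32_t candidate : candidates) {
    if (table.Insert(sk, candidate, reset_on_failure)) return true;
  }
  return false;
}
```

With port 100 reserved and candidates {100, 101}, the model without the reset leaves the socket holding portid 100 while never being in the table, so the retry with 101 is rejected. With the reset, the retry succeeds. As the later commits in this thread explain, the real fix went through further revisions because the reset itself introduced a race; this single-threaded model does not capture that.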
,
Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7c67f79118352aefa4f470136e799c3945ea944e commit 7c67f79118352aefa4f470136e799c3945ea944e Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed Jun 13 04:50:33 2018 UPSTREAM: netlink: Use default rhashtable hashfn This patch removes the explicit jhash value for the hashfn parameter of rhashtable. As the key length is a multiple of 4, this means that we will actually end up using jhash2. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 11b58ba146ccd7b105c4962c75f2e744053c85bc) BUG= chromium:821607 , chromium:849872 TEST=build and boot Change-Id: Ifd74f8ccc3be372ede6105fee47d832e95bf73b7 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091453 Reviewed-by: Grant Grundler <grundler@chromium.org> [modify] https://crrev.com/7c67f79118352aefa4f470136e799c3945ea944e/net/netlink/af_netlink.c
,
Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/1f735d252607b26f34fd88aae42e2cd6471b7861 commit 1f735d252607b26f34fd88aae42e2cd6471b7861 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed Jun 13 04:50:35 2018 UPSTREAM: netlink: Fix autobind race condition that leads to zero port ID The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink: Reset portid after netlink_insert failure") introduced a race condition where if two threads try to autobind the same socket one of them may end up with a zero port ID. This led to kernel deadlocks that were observed by multiple people. This patch reverts that commit and instead fixes it by introducing a separte rhash_portid variable so that the real portid is only set after the socket has been successfully hashed. Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 1f770c0a09da855a2b51af6d19de97fb955eca85) BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I065a53d0d8a897ce648e4a6e99b6fc28e3f46625 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091454 Reviewed-by: Grant Grundler <grundler@chromium.org> [modify] https://crrev.com/1f735d252607b26f34fd88aae42e2cd6471b7861/net/netlink/af_netlink.c [modify] https://crrev.com/1f735d252607b26f34fd88aae42e2cd6471b7861/net/netlink/af_netlink.h
,
Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9 commit 663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed Jun 13 04:50:36 2018 BACKPORT: FROMGIT: netlink: Disable insertions/removals during rehash [ Upstream commit: Not applicable ] The current rhashtable rehash code is buggy and can't deal with parallel insertions/removals without corrupting the hash table. This patch disables it by partially reverting c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock"). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit cf8befcc1a5538b035d478424efcc2d50e66928e git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.0.y) Conflicts: net/netlink/af_netlink.c [rhashtable_remove_fast vs. rhashtable_remove] BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I6063076587c0a9ede57e319989a426ee6f6ebe61 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091455 Reviewed-by: Grant Grundler <grundler@chromium.org> [modify] https://crrev.com/663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9/net/netlink/af_netlink.c
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/32a47cf484dc30229a88cb0746076be99799bc3e
commit 32a47cf484dc30229a88cb0746076be99799bc3e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Jun 13 04:50:38 2018
UPSTREAM: netlink: Replace rhash_portid with bound
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way. You have to pair store_release
> and load_acquire. Besides, it isn't a particularly good idea to
OK I've decided to drop the acquire/release helpers as they don't help us at all and simply pessimises the code by using full memory barriers (on some architectures) where only a write or read barrier is needed.
> depend on memory barriers embedded in other data structures like the
> above. Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.
But you are right we do need an explicit write barrier here to ensure that the hashing is visible.
> There's no reason to be overly smart here. This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.
It's not about being overly smart. It's about actually understanding what's going on with the code. I've seen too many instances of people simply sprinkling synchronisation primitives around without any knowledge of what is happening underneath, which is just a recipe for creating hard-to-debug races.
> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> > }
> > }
> >
> > - if (!nlk->portid) {
> > + if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable. That doesn't change anything. Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.
The reason this one is OK is because we do not use nlk->portid or try to get nlk from the hash table before we return to user-space.
However, there is a real bug here that none of these acquire/release helpers discovered. The two bound tests here used to be a single one. Now that they are separate it is entirely possible for another thread to come in the middle and bind the socket. So we need to repeat the portid check in order to maintain consistency.
> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> > return -EPERM;
> >
> > - if (!nlk->portid)
> > + if (!nlk->bound)
>
> Don't we need load_acquire here too? Is this path holding a lock
> which makes that unnecessary?
Ditto.
---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink: Fix autobind race condition that leads to zero port ID") created some new races that can occur due to inconcsistencies between the two port IDs.
Tejun is right that a barrier is unavoidable. Therefore I am reverting to the original patch that used a boolean to indicate that a user netlink socket has been bound.
Barriers have been added where necessary to ensure that a valid portid and the hashed socket is visible.
I have also changed netlink_insert to only return EBUSY if the socket is bound to a portid different to the requested one. This combined with only reading nlk->bound once in netlink_bind fixes a race where two threads that bind the socket at the same time with different port IDs may both succeed.
Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit da314c9923fed553a007785a901fd395b7eb6c19)
BUG= chromium:821607 , chromium:849872
TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473
Change-Id: I4baab91ca840fcb07a0844ac9f48dcc71fddd509
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1091506
Reviewed-by: Grant Grundler <grundler@chromium.org>
[modify] https://crrev.com/32a47cf484dc30229a88cb0746076be99799bc3e/net/netlink/af_netlink.c
[modify] https://crrev.com/32a47cf484dc30229a88cb0746076be99799bc3e/net/netlink/af_netlink.h
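Note for readers: the "bound flag plus barriers" idea in this commit message can be approximated in portable C++. This is a user-space analogy under stated assumptions, not the kernel implementation. C++ release/acquire fences stand in for the kernel's write and read barriers, and `ToySocket`, `Publish`, and `ReadPortIfBound` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Toy socket: `portid` is plain data that is only valid once `bound` is true.
struct ToySocket {
  uint32_t portid = 0;
  std::atomic<bool> bound{false};
};

// Writer side: fill in the data first, then publish the flag.
void Publish(ToySocket& sk, uint32_t portid) {
  sk.portid = portid;
  // Orders the portid write before the flag store (write-barrier analogue).
  std::atomic_thread_fence(std::memory_order_release);
  sk.bound.store(true, std::memory_order_relaxed);
}

// Reader side: returns the portid if the socket is bound, otherwise 0.
uint32_t ReadPortIfBound(const ToySocket& sk) {
  if (!sk.bound.load(std::memory_order_relaxed)) return 0;
  // Orders the flag load before the portid read (read-barrier analogue).
  std::atomic_thread_fence(std::memory_order_acquire);
  return sk.portid;
}
```

The design point is that a reader who observes `bound == true` is guaranteed to see the portid written before it, and a reader who observes `false` never touches `portid` at all. The flag is read once per call, echoing the commit's remark about reading `nlk->bound` only once in `netlink_bind`.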
,
Jun 13 2018
How confident are we these recent CLs will solve the crash? If we are very confident we can merge them directly into 65 to try to stabilize the impacted boards.
,
Jun 13 2018
> How confident are we these recent CLs will solve the crash?
Point of clarity (since this is a large, noisy bug): IIUC this issue mostly (entirely?) doesn't actually produce crashes -- it produces hung threads. The main way this has shown up in crash reports is when someone forces a Chrome restart (Alt-VolUp-X) to escape the hang. (Someone correct me if I'm wrong.) But given the analysis that has happened so far (mostly without me; but I tried to validate what I could), it seems very likely that these hangs should be fixed.
> If we are very confident we can merge them directly into 65 to try to stabilize the impacted boards.
I'd be most worried about regressions. This is pretty critical code, and it'd be a shame to introduce further regressions (the series of CLs even includes, and then reverts/modifies, 2 different attempts that upstream developers made at fixing subtle race conditions). I would usually prefer this get a full cycle of testing and observing any additional reported issues (or lack thereof) to gain confidence. I also don't understand the suggestion for an M65 merge -- isn't M65 long superseded, with M66 out for weeks, and M67 rolling to stable now? Feel free to educate me off-bug if needed.
,
Jun 13 2018
R65 was the AUE milestone for Sandybridge systems (butterfly, parrot, lumpy, stumpy), and it was particularly unfortunate that R65 was hanging/crashing on them more than we would like. The R65 merge would be a one-off push just for these systems; as you point out, 65 is otherwise deprecated. Agreed, we should be confident in these before we do that. We can see how well they fare on 69 for a couple of weeks and look into the 65 aspect after.
,
Jun 13 2018
> R65 was the AUE milestone for Sandybridge systems (butterfly, parrot, lumpy, stumpy), and it was particularly unfortunate that R65 was hanging/crashing on them more than we would like. Those aren't running the 3.18 kernel, which is where the above fixes were targeted (we believe that issue was specific to our kernel 3.18). If there were problems on 3.8 kernels, they were very likely a different issue, and we should probably fork a different bug, like I did for bug 849872 . This one is already extremely noisy, with everybody and their mother/father dumping their issues. (It's possible that even bug 849872 is somewhat of a sidetrack? But at least it was one clearly-identified problem in the mix here.) If you can point me at specific points in this bug that apply to those systems, then I can try to extract details to a new bug. I'm tempted to close this bug soon (still not sure if it needs to be pushed to pre-M69 at all). If archaeology can pull out additional actionable details for independent issues, then we can still file new follow-up bugs.
,
Jun 13 2018
My apologies, I was confusing this with https://bugs.chromium.org/p/chromium/issues/detail?id=844256 which may or may not be related to some of the comments on this centibug, but I agree with your assessment that if there is nothing actionable left here we should close this and move on.
,
Jun 13 2018
OK, let's say Fixed for M-69 (not sure how effective the mitigations for M-67/M-66 were?). Do holler (preferably on a new bug) if a related issue is still hanging around.
,
Jun 13 2018
But if the bug is still present on EOL devices with M-65 (I cannot access bug report 844256), why is this bug designated as Fixed? I own one of these devices and it has been freezing constantly for the last two months. Where can I find updated information about the resolution of this bug for EOL devices with M-65?
,
Jun 13 2018
Giulio, As explained above, this particular bug is focused on devices running a specific kernel (3.18 kernel). We're closing this bug and planning to file new, follow-up bugs if necessary for other devices. There is a Google-restricted bug, <https://bugs.chromium.org/p/chromium/issues/detail?id=844256>, investigating the stability issues reported on recent EOL devices like yours. We don't have an update to share publicly yet and are still working on reproducing those issues reliably. We'll likely post an update in the Chromebook Forum once we have some news to report.
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/5079028f7918373835f76e13e2825e35230524fc commit 5079028f7918373835f76e13e2825e35230524fc Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon Jun 18 18:43:26 2018 UPSTREAM: netlink: Reset portid after netlink_insert failure The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock") breaks the autobind retry mechanism because it doesn't reset portid after a failed netlink_insert. This means that should autobind fail the first time around, then the socket will be stuck in limbo as it can never be bound again since it already has a non-zero portid. Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit c0bb07df7d981e4091432754e30c9c720e2c0c78) BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I5cfee0c833c70f1ad3b82e3f6d4cf5bee189256e Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091452 Reviewed-by: Grant Grundler <grundler@chromium.org> (cherry picked from commit 0cdaad0cc65125c1a0726ed347b627ff662ee77f) Reviewed-on: https://chromium-review.googlesource.com/1104942 [modify] https://crrev.com/5079028f7918373835f76e13e2825e35230524fc/net/netlink/af_netlink.c
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e00564d51de68cca9f008bfc6963d79e9bb4e852 commit e00564d51de68cca9f008bfc6963d79e9bb4e852 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon Jun 18 18:43:33 2018 UPSTREAM: netlink: Use default rhashtable hashfn This patch removes the explicit jhash value for the hashfn parameter of rhashtable. As the key length is a multiple of 4, this means that we will actually end up using jhash2. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 11b58ba146ccd7b105c4962c75f2e744053c85bc) BUG= chromium:821607 , chromium:849872 TEST=build and boot Change-Id: Ifd74f8ccc3be372ede6105fee47d832e95bf73b7 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091453 Reviewed-by: Grant Grundler <grundler@chromium.org> (cherry picked from commit 7c67f79118352aefa4f470136e799c3945ea944e) Reviewed-on: https://chromium-review.googlesource.com/1104943 [modify] https://crrev.com/e00564d51de68cca9f008bfc6963d79e9bb4e852/net/netlink/af_netlink.c
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/3520cd6a3ad97bb052dfa4a0928baab974117a9d commit 3520cd6a3ad97bb052dfa4a0928baab974117a9d Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon Jun 18 18:43:51 2018 UPSTREAM: netlink: Fix autobind race condition that leads to zero port ID The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink: Reset portid after netlink_insert failure") introduced a race condition where if two threads try to autobind the same socket one of them may end up with a zero port ID. This led to kernel deadlocks that were observed by multiple people. This patch reverts that commit and instead fixes it by introducing a separte rhash_portid variable so that the real portid is only set after the socket has been successfully hashed. Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 1f770c0a09da855a2b51af6d19de97fb955eca85) BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I065a53d0d8a897ce648e4a6e99b6fc28e3f46625 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091454 Reviewed-by: Grant Grundler <grundler@chromium.org> (cherry picked from commit 1f735d252607b26f34fd88aae42e2cd6471b7861) Reviewed-on: https://chromium-review.googlesource.com/1104944 [modify] https://crrev.com/3520cd6a3ad97bb052dfa4a0928baab974117a9d/net/netlink/af_netlink.c [modify] https://crrev.com/3520cd6a3ad97bb052dfa4a0928baab974117a9d/net/netlink/af_netlink.h
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e7fd03b8b5253057ce756b9e854d46e2ae9e771e commit e7fd03b8b5253057ce756b9e854d46e2ae9e771e Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon Jun 18 18:43:58 2018 BACKPORT: FROMGIT: netlink: Disable insertions/removals during rehash [ Upstream commit: Not applicable ] The current rhashtable rehash code is buggy and can't deal with parallel insertions/removals without corrupting the hash table. This patch disables it by partially reverting c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock"). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit cf8befcc1a5538b035d478424efcc2d50e66928e git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.0.y) Conflicts: net/netlink/af_netlink.c [rhashtable_remove_fast vs. rhashtable_remove] BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I6063076587c0a9ede57e319989a426ee6f6ebe61 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091455 Reviewed-by: Grant Grundler <grundler@chromium.org> (cherry picked from commit 663fa53dcb6f4844a8c5ba29dc6bb65ebebfadf9) Reviewed-on: https://chromium-review.googlesource.com/1104945 [modify] https://crrev.com/e7fd03b8b5253057ce756b9e854d46e2ae9e771e/net/netlink/af_netlink.c
,
Jun 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/76eb986d89481d1fe9c9319eeeb77bab7d4afccb commit 76eb986d89481d1fe9c9319eeeb77bab7d4afccb Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon Jun 18 18:44:02 2018 UPSTREAM: netlink: Replace rhash_portid with bound On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote: > > store_release and load_acquire are different from the usual memory > barriers and can't be paired this way. You have to pair store_release > and load_acquire. Besides, it isn't a particularly good idea to OK I've decided to drop the acquire/release helpers as they don't help us at all and simply pessimises the code by using full memory barriers (on some architectures) where only a write or read barrier is needed. > depend on memory barriers embedded in other data structures like the > above. Here, especially, rhashtable_insert() would have write barrier > *before* the entry is hashed not necessarily *after*, which means that > in the above case, a socket which appears to have set bound to a > reader might not visible when the reader tries to look up the socket > on the hashtable. But you are right we do need an explicit write barrier here to ensure that the hashing is visible. > There's no reason to be overly smart here. This isn't a crazy hot > path, write barriers tend to be very cheap, store_release more so. > Please just do smp_store_release() and note what it's paired with. It's not about being overly smart. It's about actually understanding what's going on with the code. I've seen too many instances of people simply sprinkling synchronisation primitives around without any knowledge of what is happening underneath, which is just a recipe for creating hard-to-debug races. 
> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr, > > } > > } > > > > - if (!nlk->portid) { > > + if (!nlk->bound) { > > I don't think you can skip load_acquire here just because this is the > second deref of the variable. That doesn't change anything. Race > condition could still happen between the first and second tests and > skipping the second would lead to the same kind of bug. The reason this one is OK is because we do not use nlk->portid or try to get nlk from the hash table before we return to user-space. However, there is a real bug here that none of these acquire/release helpers discovered. The two bound tests here used to be a single one. Now that they are separate it is entirely possible for another thread to come in the middle and bind the socket. So we need to repeat the portid check in order to maintain consistency. > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr, > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND)) > > return -EPERM; > > > > - if (!nlk->portid) > > + if (!nlk->bound) > > Don't we need load_acquire here too? Is this path holding a lock > which makes that unnecessary? Ditto. ---8<--- The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink: Fix autobind race condition that leads to zero port ID") created some new races that can occur due to inconcsistencies between the two port IDs. Tejun is right that a barrier is unavoidable. Therefore I am reverting to the original patch that used a boolean to indicate that a user netlink socket has been bound. Barriers have been added where necessary to ensure that a valid portid and the hashed socket is visible. I have also changed netlink_insert to only return EBUSY if the socket is bound to a portid different to the requested one. 
This combined with only reading nlk->bound once in netlink_bind fixes a race where two threads that bind the socket at the same time with different port IDs may both succeed. Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Nacked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit da314c9923fed553a007785a901fd395b7eb6c19) BUG= chromium:821607 , chromium:849872 TEST=netlink send/recv repeatedly, on many threads - watch for timeouts; similar to this test code: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473 Change-Id: I4baab91ca840fcb07a0844ac9f48dcc71fddd509 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1091506 Reviewed-by: Grant Grundler <grundler@chromium.org> (cherry picked from commit 32a47cf484dc30229a88cb0746076be99799bc3e) Reviewed-on: https://chromium-review.googlesource.com/1104946 [modify] https://crrev.com/76eb986d89481d1fe9c9319eeeb77bab7d4afccb/net/netlink/af_netlink.c [modify] https://crrev.com/76eb986d89481d1fe9c9319eeeb77bab7d4afccb/net/netlink/af_netlink.h
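The insert rule described in the commit message above (return EBUSY only if the socket is already bound to a different port ID) can be illustrated with a toy model. This is a sketch, not the kernel code: `ToySock`, `ToyTable` and `race_bind` are hypothetical names, and a plain Python lock stands in for the kernel's locking and memory barriers, so it shows the outcome of the rule rather than the barrier placement.

```python
import errno
import threading


class ToySock:
    """Stand-in for a netlink socket: a port ID plus a 'bound' flag."""

    def __init__(self):
        self.portid = 0
        self.bound = False


class ToyTable:
    """Stand-in for the netlink hash table, keyed by port ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_portid = {}

    def insert(self, sock, portid):
        """Return 0 on success or a negative errno, kernel style."""
        with self._lock:
            if sock.bound:
                # Already bound: succeed only for the same port ID.
                return 0 if sock.portid == portid else -errno.EBUSY
            if portid in self._by_portid:
                return -errno.EADDRINUSE
            self._by_portid[portid] = sock
            sock.portid = portid
            sock.bound = True  # set last, after the entry is in the table
            return 0


def race_bind(portids):
    """Bind one socket from several threads at once, one port ID per thread."""
    table, sock = ToyTable(), ToySock()
    results = {}
    start = threading.Barrier(len(portids))

    def worker(portid):
        start.wait()  # release all threads together
        results[portid] = table.insert(sock, portid)

    pool = [threading.Thread(target=worker, args=(p,)) for p in portids]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return results, sock
```

With two threads and two different port IDs, exactly one insert returns 0 and the other returns -EBUSY, which is the "both may succeed" race the commit says it closes.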
,
Jun 30 2018
I have TWO Acer C710s. THIS one is: Google Chrome 65.0.3325.209 (Official Build) (64-bit), Revision 0, Platform 10323.67.0 (Official Build) stable-channel parrot, Firmware Version Google_Parrot.2685.37.0, JavaScript V8 6.5.254.43, Flash 29.0.0.113. I would have to come back here using the other unit to post its info. BUT that may not be so relevant, since THIS one only works in Guest mode now (thanks to the YET unfixed BLSOD bug, the Black Login Screen Of Doom update) and it STILL does the freeze-restart. More specifically, both units now freeze and/or restart (and both usually have mail.com open in some tab or window): the other one (logged in, but with a long-standing profile error) much more often, and often on Facebook; this one with possibly increasing frequency but on no site in particular, though often there is a Flash page (or two) in another tab or window. More simply, earlier today, with various tabs open, this one froze and restarted. Then, right after that restart, it did it again while I was logging in to mail.com in one tab and searching Google in another. Something similar has happened before, IIRC. NOTE: I would have put this under the "new" bug per #161, but no link to that bug was left there, and it took rather a long time just to RE-find this bug after that last crash, as Guest mode has no bookmarks or history of course (not even a simple session history for re-opening mistakenly closed tabs, for some bizarre reason).
,
Jul 22
It has now been one FULL day, and Week, and MONTH since question #160 and answer #161 concerning EOL devices. Despite the status of THIS bug, the "High number of Chrome browser hang in M65" bug is certainly NOT fixed, even in GUEST mode with only ONE website loaded (consistently mail.com, BTW). So would somebody in charge please consider:
1. How about updating THIS bug title per #161 to "High number of Chrome browser hang in M65 with kernel 3.18"?
2. How about starting a new "High number of Chrome browser hang in M65 with kernel OTHER than 3.18" bug?
3. How about adding a brief explanation at the beginning of this bug about how to find your kernel version?
4. How about REVERTING our collective EOL devices back to M64 until you find some fix for OUR bug??
Having the text cursor disappear consistently at certain columns in some text boxes, even the address bar, is a bit bothersome. Having to keep the room really quiet and still struggle to hear some popular video sites gets a little inconvenient. Having webpage tasks routinely swell up in memory until the GUI gets sluggish is somewhat annoying. Being "permanently" stuck in Guest mode now, thanks to the BSOD-login bug, is rather distressing. But repeatedly CRASHING now, even in GUEST MODE, is REALLY AGGRAVATING. PLEASE FIX or REVERT, ASAP!
,
Jul 23
Concur: this has been unresolved for more than four months. Why can't EOL devices be reverted to M64? This is sad.
,
Jul 23
Re c#169: you can download the M64 recovery file from this page: https://cros-updates-serving.appspot.com Look up your device by name in the AUE Devices section at the bottom of the page. I think there is a way to prevent it from updating back to M65, but I am not sure what it is.
,
Aug 1
#161: "There is a Google-restricted bug investigating the stability issues reported on recent EOL devices like yours. We don't have an update to share publicly yet and are still working on reproducing those issues reliably." Re: "still working on reproducing those issues reliably." Again, as explained above, just loading and/or logging in on mail.com reliably causes hangs and/or reboots, even in GUEST mode (which should eliminate a great many unit-specific config variables), and even right after a reboot or sometimes a wakeup. But, once logged in, the system may run as normal for many, many hours (except for the annoying unrelated issues in #168). #170: "You can download the recovery file" -- but this is a Chromebook, and the page and/or website gives no instructions about what to do with the file.
,
Aug 1
It appears the problem may be specific to certain hardware. At least so far, with the units we have, we have not seen the problem, but we are procuring more. The recovery image file can be written to a USB drive using the recovery utility: there is a gear icon that should offer the option 'Use local image'.
,
Aug 30
Though this bug is not where the more recent instability issues were being debugged, folks have been watching it, so I want to point out that we have found a suspected bad change and pushed a new R65 with a revert (10323.67.9). So far this appears to have stopped the crash types that were causing problems on SandyBridge devices in R65. If you are still seeing this, you can try out the new version by going to about://help and clicking "Check for updates" on the stable channel. So far this is only live at 1%, but checking for updates manually will give you the update. If we don't find any other issues, we will continue to ramp up to the rest of the SandyBridge fleet.
,
Aug 30
Since there is no "check for updates" button on EOL devices, something that seems to have worked for me was going to chrome://help and then refreshing the page multiple times until I received an update and was asked to restart the system. :-o
,
Aug 30
Thank you for addressing this issue, much appreciated. IIUC it looks like omahaproxy & the update servers will grab version 10323.67.9 / 65.0.3325.209 for these AUE devices but, from what I can tell, the recovery images are still kind of a mixed bag:
- lumpy: 10176.76.0 / 64.0.3282.190
- butterfly, parrot, stumpy: 10323.62.0 / 65.0.3325.184
This may not be all the devices affected, but these are the ones I found in a quick search. Would it be possible & perhaps prudent to update recovery.conf with the fix for these devices and/or make the fixed images available at https://cros-updates-serving.appspot.com/ under the 'Recovery' column for download? Just thinking out loud and trying to anticipate any problems we might run into when this fix gets more attention. Thanx for your efforts and indulgence.
,
Aug 30
Yeah, the recovery images have not been updated yet, but even if they are used, the devices should get an update shortly thereafter. Once we are more confident in the build and it has rolled out further, we can update the recovery images.
,
Aug 31
#c176: That makes total sense, thanx for the feedback and explanation; the rollout strategy is complex for sure and I only know a piece of it. As a point of reference, I recovered my Acer C7 parrot/sandybridge yesterday with 10323.62.0 / 65.0.3325.184. It didn't seem to update after I repeatedly reloaded the 'About Chrome OS' page, so I switched to the beta channel, which is on the same version, and I eventually got a 'Restart' notification. I'm not sure what triggered the update, but I did get it. I have not experienced any 'hangs' since, so I believe the revert fixed it. Thanx again.