btusb: blocked for more than 120 seconds
Issue description
Seen on Scarlet rev2
Feedback report: https://listnr.corp.google.com/product/208/report/84948625653
CHROMEOS_RELEASE_DESCRIPTION 10319.0.0 (Official Build) canary-channel scarlet
CHROMEOS_RELEASE_TRACK canary-channel
CHROMEOS_RELEASE_VERSION 10319.0.0
From ramoops:
[ 240.141248] INFO: task kworker/u13:0:101 blocked for more than 120 seconds.
[ 240.141286] Not tainted 4.4.111-12498-g58165e05c736-dirty #1
[ 240.141299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.141313] kworker/u13:0 D ffffffc0002130ec 0 101 2 0x00000000
[ 240.141427] Workqueue: hci0 hci_power_off [bluetooth]
[ 240.141445] Call trace:
[ 240.141474] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.141498] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8
[ 240.141515] [<ffffffc00092f2a4>] schedule+0x4c/0xb0
[ 240.141535] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.141552] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.141577] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.141597] [<ffffffbffc0ff5f4>] btusb_close+0x64/0xa8 [btusb]
[ 240.141675] [<ffffffbffc002638>] hci_dev_do_close+0x284/0x318 [bluetooth]
[ 240.141745] [<ffffffbffc0055e8>] hci_power_off+0x1c/0x28 [bluetooth]
[ 240.141770] [<ffffffc0002db570>] worker_thread+0x23c/0x70c
[ 240.141798] [<ffffffc00023afa0>] kthread+0xe0/0xf0
[ 240.141816] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
[ 240.141829] task PC stack pid father
[ 240.141933] kworker/u13:0 D ffffffc0002130ec 0 101 2 0x00000000
[ 240.142018] Workqueue: hci0 hci_power_off [bluetooth]
[ 240.142038] Call trace:
[ 240.142056] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.142082] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8
[ 240.142187] [<ffffffc00092f2a4>] schedule+0x4c/0xb0
[ 240.142208] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.142224] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.142297] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.142371] [<ffffffbffc0ff5f4>] btusb_close+0x64/0xa8 [btusb]
[ 240.142442] [<ffffffbffc002638>] hci_dev_do_close+0x284/0x318 [bluetooth]
[ 240.142513] [<ffffffbffc0055e8>] hci_power_off+0x1c/0x28 [bluetooth]
[ 240.142534] [<ffffffc0002db570>] worker_thread+0x23c/0x70c
[ 240.142556] [<ffffffc00023afa0>] kthread+0xe0/0xf0
[ 240.142577] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
[ 240.142617] kworker/2:1 D ffffffc0002130ec 0 113 2 0x00000000
[ 240.142675] Workqueue: pm pm_runtime_work
[ 240.142702] Call trace:
[ 240.142762] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.142784] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8
[ 240.142806] [<ffffffc00092f2a4>] schedule+0x4c/0xb0
[ 240.142825] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.142859] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.142884] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.142920] [<ffffffbffc0febf0>] btusb_suspend+0xa0/0x114 [btusb]
[ 240.142951] [<ffffffc000657b70>] usb_suspend_both+0xd4/0x238
[ 240.142973] [<ffffffc00065829c>] usb_runtime_suspend+0x38/0x70
[ 240.142995] [<ffffffc0007a4568>] rpm_suspend+0x310/0x4b8
[ 240.143013] [<ffffffc0005f65ec>] pm_runtime_work+0x74/0x94
[ 240.143028] [<ffffffc0002db570>] worker_thread+0x23c/0x70c
[ 240.143045] [<ffffffc00023afa0>] kthread+0xe0/0xf0
[ 240.143061] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
[ 240.143427] rs:main Q:Reg D ffffffc0002130ec 0 533 1 0x00400000
[ 240.143458] Call trace:
[ 240.143479] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.143496] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8 [
240.143513] [<ffffffc00092f2a4>] schedule+0x4c/0xb0 [ 240.143530] [<ffffffc000932410>] schedule_timeout+0x48/0x5a0 [ 240.143548] [<ffffffc0009300a8>] io_schedule_timeout+0x70/0xb4 [ 240.143565] [<ffffffc000930610>] bit_wait_io+0x1c/0x68 [ 240.143581] [<ffffffc000930164>] __wait_on_bit+0x78/0xd4 [ 240.143599] [<ffffffc000930244>] out_of_line_wait_on_bit+0x84/0xb0 [ 240.143618] [<ffffffc000390688>] __wait_on_buffer+0x48/0x54 [ 240.143635] [<ffffffc0003917c8>] wait_on_buffer+0x34/0x40 [ 240.143653] [<ffffffc000393138>] bh_submit_read+0x5c/0x90 [ 240.143674] [<ffffffc00044397c>] ext4_ind_map_blocks+0x404/0x9a8 [ 240.143699] [<ffffffc00042fdb0>] ext4_map_blocks+0x8c/0x420 [ 240.143715] [<ffffffc0004459c8>] ext4_mpage_readpages+0x330/0x5d8 [ 240.143733] [<ffffffc0003c48ac>] ext4_readpages+0x40/0x60 [ 240.143754] [<ffffffc000343244>] __do_page_cache_readahead+0x148/0x240 [ 240.143771] [<ffffffc00033cd1c>] filemap_fault+0x19c/0x460 [ 240.143788] [<ffffffc0003c46d0>] ext4_filemap_fault+0x3c/0x5c [ 240.143805] [<ffffffc000351938>] __do_fault+0x64/0xec [ 240.143821] [<ffffffc000350bf0>] handle_mm_fault+0xa10/0x1558 [ 240.143839] [<ffffffc000216a74>] do_page_fault+0x184/0x2ec [ 240.143856] [<ffffffc000214fa0>] do_translation_fault+0x54/0xb8 [ 240.143871] [<ffffffc0002002cc>] do_mem_abort+0x5c/0xd8 [ 240.143886] Exception stack(0xffffffc0e6723ec0 to 0xffffffc0e6723ff0) [ 240.143904] 3ec0: 0000000000fa9a34 0000000000fb1280 000000000093ed6b 0000000000000d0a [ 240.143926] 3ee0: 0000000000fa96a8 0000000000fb1350 0000000000000000 00000000ec7fe850 [ 240.143946] 3f00: 0000000000fa9a44 00000000ec7fe8d8 00000000ec801ef0 0000000000fb1350 [ 240.143965] 3f20: 00000000ebe213ce 00000000ec7fe830 0000000000935f2f 0000000000000000 [ 240.143984] 3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.143999] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.144018] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.144033] 3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.144050] 3fc0: 000000000093ed6a 0000000060070030 0000000000000003 ffffffffffffffff [ 240.144067] 3fe0: 0000000000000000 0000000000000000 [ 240.144082] [<ffffffc000202ac0>] el0_ia+0x18/0x1c [ 240.144448] kschedfreq:4 D ffffffc0002130ec 0 2258 2 0x00000000 [ 240.144476] Call trace: [ 240.144495] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8 [ 240.144512] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8 [ 240.144529] [<ffffffc00092f2a4>] schedule+0x4c/0xb0 [ 240.144548] [<ffffffc000933070>] schedule_hrtimeout_range_clock+0x3b0/0x844 [ 240.144564] [<ffffffc000933538>] schedule_hrtimeout_range+0x34/0x44 [ 240.144579] [<ffffffc000932ac4>] usleep_range+0x78/0xa8 [ 240.144598] [<ffffffc00025a6b4>] cpufreq_sched_thread+0x180/0x1f0 [ 240.144616] [<ffffffc00023afa0>] kthread+0xe0/0xf0 [ 240.144642] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40 [ 240.147796] Kernel panic - not syncing: hung_task: blocked tasks [ 240.147830] CPU: 0 PID: 40 Comm: khungtaskd Not tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.147844] Hardware name: Google Scarlet (DT) [ 240.147858] Call trace: [ 240.147884] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.147909] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.147933] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.147951] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.147967] [<ffffffc00021e1d4>] panic+0xd8/0x224 [ 240.147987] [<ffffffc0002a5bdc>] hung_task_panic+0x0/0x24 [ 240.148008] [<ffffffc00023afa0>] 
kthread+0xe0/0xf0 [ 240.148027] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40 [ 240.148044] CPU1: stopping [ 240.148072] CPU: 1 PID: 2744 Comm: memory_kills_mo Not tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.148085] Hardware name: Google Scarlet (DT) [ 240.148096] Call trace: [ 240.148115] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.148133] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.148151] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.148168] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.148185] [<ffffffc00020db48>] ipi_cpu_stop+0x48/0x68 [ 240.148201] [<ffffffc00020dae4>] set_irq_regs+0x0/0x1c [ 240.148217] [<ffffffc0002006d4>] gic_handle_irq+0xf0/0x17c [ 240.148231] Exception stack(0xffffffc0c21bbec0 to 0xffffffc0c21bbff0) [ 240.148248] bec0: 0000000000d0b1fc 0000000000d443ac 0000000000000000 000000000500ca25 [ 240.148264] bee0: 00000000000000e4 0000000008ee717c 0000000010841520 00000000eaa436f8 [ 240.148281] bf00: 0000000000000200 00000000000000e4 000000000ed4a8f0 00000000eaa436e8 [ 240.148297] bf20: 00000000035a2414 00000000eaa43210 000000000500c7fb 0000000000000000 [ 240.148312] bf40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.148328] bf60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.148344] bf80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.148360] bfa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 240.148376] bfc0: 000000000844dc74 0000000060060010 00000000000000a7 ffffffffffffffff [ 240.148390] bfe0: 0000000000000000 0000000000000000 [ 240.148406] [<ffffffc000202c10>] el0_irq_naked+0x14/0x24 [ 240.148422] CPU3: stopping [ 240.148444] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.148457] Hardware name: Google Scarlet (DT) [ 240.148467] Call trace: [ 240.148486] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.148504] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.148521] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.148538] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.148554] [<ffffffc00020db48>] ipi_cpu_stop+0x48/0x68 [ 240.148570] [<ffffffc00020dae4>] set_irq_regs+0x0/0x1c [ 240.148585] [<ffffffc0002006d4>] gic_handle_irq+0xf0/0x17c [ 240.148598] Exception stack(0xffffffc0ef043dd0 to 0xffffffc0ef043f00) [ 240.148613] 3dc0: 0000000000000000 00000000ffffffff [ 240.148629] 3de0: 0000000000000000 0000000000000002 0000000000000000 0000000000000040 [ 240.148645] 3e00: 000000000000003f 0000000000000000 00000000000f4bdb 00000000000003e8 [ 240.148661] 3e20: 0006422ca8002540 0000000029aaaaab 0000000000000018 0000000000000002 [ 240.148677] 3e40: 0000000000000007 0000000000005800 000000000000002f 000000000000005e [ 240.148693] 3e60: 000000000000005e ffffffc0ee193200 0000000000000000 00000037e8346f63 [ 240.148710] 3e80: 00000037e8252388 0000000000000000 ffffffc001064b58 ffffffc00108c000 [ 240.148726] 3ea0: ffffffc00108ca50 ffffffc001069490 ffffffc00108c000 ffffffc0ef043f30 [ 240.148742] 3ec0: ffffffc0007b79c4 ffffffc0ef043f00 ffffffc0007b79e4 0000000080000145 [ 240.148758] 3ee0: 0000000000000000 ffffffc0ee193200 ffffffffffffffff ffffffc0007b79c4 [ 240.148774] [<ffffffc000202700>] el1_irq+0x80/0xf8 [ 240.148793] [<ffffffc0007b7b1c>] cpuidle_enter+0x30/0x40 [ 240.148812] [<ffffffc0002eed70>] cpu_startup_entry+0x11c/0x2a8 [ 240.148829] [<ffffffc00020d454>] secondary_start_kernel+0x144/0x150 [ 240.148843] [<0000000000400c0c>] 0x400c0c [ 240.148858] CPU5: stopping [ 240.148881] CPU: 5 PID: 0 Comm: swapper/5 Not 
tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.148892] Hardware name: Google Scarlet (DT) [ 240.148902] Call trace: [ 240.148925] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.148941] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.148957] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.148970] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.148986] [<ffffffc00020db48>] ipi_cpu_stop+0x48/0x68 [ 240.148998] [<ffffffc00020dae4>] set_irq_regs+0x0/0x1c [ 240.149009] [<ffffffc0002006d4>] gic_handle_irq+0xf0/0x17c [ 240.149021] Exception stack(0xffffffc0ef04bdd0 to 0xffffffc0ef04bf00) [ 240.149032] bdc0: 0000000000000000 00000000ffffffff [ 240.149045] bde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000040 [ 240.149058] be00: 000000000000003f 0000000000000000 0000000000047905 00000000000003e8 [ 240.149070] be20: 0006422ca8002540 0000000029aaaaab 0000000000000018 0000000000000002 [ 240.149081] be40: 0000000000000025 0000000000009c00 0000000000000015 000000000000002c [ 240.149093] be60: 000000000000003f ffffffc0ee193600 0000000000000000 00000037e83470da [ 240.149105] be80: 00000037e82ff7d5 0000000000000000 ffffffc001064b58 ffffffc00108c000 [ 240.149116] bea0: ffffffc00108ca50 ffffffc001069490 ffffffc00108c000 ffffffc0ef04bf30 [ 240.149130] bec0: ffffffc0007b79c4 ffffffc0ef04bf00 ffffffc0007b79e4 0000000080000145 [ 240.149142] bee0: 0000000000000000 ffffffc0ee193600 ffffffffffffffff ffffffc0007b79c4 [ 240.149154] [<ffffffc000202700>] el1_irq+0x80/0xf8 [ 240.149171] [<ffffffc0007b7b1c>] cpuidle_enter+0x30/0x40 [ 240.149186] [<ffffffc0002eed70>] cpu_startup_entry+0x11c/0x2a8 [ 240.149201] [<ffffffc00020d454>] secondary_start_kernel+0x144/0x150 [ 240.149215] [<0000000000400c0c>] 0x400c0c [ 240.149229] CPU4: stopping [ 240.149250] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.149265] Hardware name: Google Scarlet (DT) [ 240.149274] Call trace: [ 240.149292] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.149310] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.149328] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.149347] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.149366] [<ffffffc00020db48>] ipi_cpu_stop+0x48/0x68 [ 240.149381] [<ffffffc00020dae4>] set_irq_regs+0x0/0x1c [ 240.149395] [<ffffffc0002006d4>] gic_handle_irq+0xf0/0x17c [ 240.149406] Exception stack(0xffffffc0ef047dd0 to 0xffffffc0ef047f00) [ 240.149418] 7dc0: 0000000000000000 00000000ffffffff [ 240.149431] 7de0: 0000000000000000 0000000000000002 0000000000000000 0000000000000040 [ 240.149449] 7e00: 000000000000003f 0000000000000000 0000000000172110 00000000000003e8 [ 240.149467] 7e20: 0006422ca8002540 0000000029aaaaab 0000000000000018 0000000000000002 [ 240.149481] 7e40: 0000000000000001 0000000000007f40 ffffffffffffffff ffffffc001143d38 [ 240.149496] 7e60: 0000000000000028 ffffffc0ee193400 0000000000000000 00000037e83470da [ 240.149508] 7e80: 00000037e81d4fca 0000000000000000 ffffffc001064b58 ffffffc00108c000 [ 240.149519] 7ea0: ffffffc00108ca50 ffffffc001069490 ffffffc00108c000 ffffffc0ef047f30 [ 240.149536] 7ec0: ffffffc0007b79c4 ffffffc0ef047f00 ffffffc0007b79e4 0000000080000145 [ 240.149550] 7ee0: 0000000000000000 ffffffc0ee193400 ffffffffffffffff ffffffc0007b79c4 [ 240.149562] [<ffffffc000202700>] el1_irq+0x80/0xf8 [ 240.149576] [<ffffffc0007b7b1c>] cpuidle_enter+0x30/0x40 [ 240.149590] [<ffffffc0002eed70>] cpu_startup_entry+0x11c/0x2a8 [ 240.149605] [<ffffffc00020d454>] secondary_start_kernel+0x144/0x150 [ 240.149621] [<0000000000400c0c>] 0x400c0c [ 
240.149637] CPU2: stopping [ 240.149661] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.4.111-12498-g58165e05c736-dirty #1 [ 240.149677] Hardware name: Google Scarlet (DT) [ 240.149690] Call trace: [ 240.149708] [<ffffffc0002091ac>] dump_backtrace+0x0/0x16c [ 240.149727] [<ffffffc0002091a0>] show_stack+0x20/0x2c [ 240.149744] [<ffffffc0004cc530>] __dump_stack+0x20/0x28 [ 240.149761] [<ffffffc0004cc4ec>] dump_stack+0x70/0x94 [ 240.149780] [<ffffffc00020db48>] ipi_cpu_stop+0x48/0x68 [ 240.149799] [<ffffffc00020dae4>] set_irq_regs+0x0/0x1c [ 240.149816] [<ffffffc0002006d4>] gic_handle_irq+0xf0/0x17c [ 240.149833] Exception stack(0xffffffc0ef03fdd0 to 0xffffffc0ef03ff00) [ 240.149847] fdc0: 0000000000000000 00000000ffffffff [ 240.149864] fde0: 0000000000000000 0000000000000002 0000000000000000 0000000000000040 [ 240.149880] fe00: 000000000000003f 0000000000000000 00000000000af075 00000000000003e8 [ 240.149902] fe20: 0006422ca8002540 0000000029aaaaab 0000000000000018 0000000000000002 [ 240.149921] fe40: 0000000000000011 000000000000a400 0000000000000059 00000000000000b2 [ 240.149939] fe60: 00000000000000b2 ffffffc0ee193000 0000000000000000 00000037e8346f63 [ 240.149958] fe80: 00000037e8297eee 0000000000000000 ffffffc001064b58 ffffffc00108c000 [ 240.149977] fea0: ffffffc00108ca50 ffffffc001069490 ffffffc00108c000 ffffffc0ef03ff30 [ 240.149993] fec0: ffffffc0007b79c4 ffffffc0ef03ff00 ffffffc0007b79e4 0000000080000145 [ 240.150015] fee0: 0000000000000000 ffffffc0ee193000 ffffffffffffffff ffffffc0007b79c4 [ 240.150037] [<ffffffc000202700>] el1_irq+0x80/0xf8 [ 240.150056] [<ffffffc0007b7b1c>] cpuidle_enter+0x30/0x40 [ 240.150076] [<ffffffc0002eed70>] cpu_startup_entry+0x11c/0x2a8 [ 240.150097] [<ffffffc00020d454>] secondary_start_kernel+0x144/0x150 [ 240.150116] [<0000000000400c0c>] 0x400c0c
Jan 22 2018
Jan 22 2018
I'll try to repro and understand why the kernel hangs in usb_kill_urb()
Jan 22 2018
FYI, there are some unexplained lockup issues here still: https://issuetracker.google.com/70632193 I believe it's possible for RK3399/PCIe to manage to lock things up asynchronously. It's been tough to pinpoint exactly why, but somewhere along the way we've found that certain stalls (e.g., on PCIe reads, where the endpoint is not responding) can cause other tasks to block on memory barriers, since the barrier is trying to enforce some kind of consistency. This sort of behavior doesn't always end up blaming PCIe code, IIRC. Anyway, that's just to say I'm not sure whether this is definitely a BT/USB bug, or yet another confounding symptom of a misbehaving PCIe device/driver, or some other sort of problem.
Jan 23 2018
I couldn't reproduce the issue in hundreds of reboot and suspend/resume cycles. Besides the occurrence reported by Brian the hang has occurred during RunInRebootSequence and RunInDozingStress. Do we have any details on these? Interestingly btusb_close() isn't called on a typical reboot, though it is called when the issue is encountered while running RunInRebootSequence.
Jan 23 2018
So there are two workqueue workers blocked in usb_kill_urb():
[ 240.141313] kworker/u13:0 D ffffffc0002130ec 0 101 2 0x00000000
[ 240.141427] Workqueue: hci0 hci_power_off [bluetooth]
[ 240.141445] Call trace:
[ 240.141474] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.141498] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8
[ 240.141515] [<ffffffc00092f2a4>] schedule+0x4c/0xb0
[ 240.141535] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.141552] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.141577] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.141597] [<ffffffbffc0ff5f4>] btusb_close+0x64/0xa8 [btusb]
[ 240.141675] [<ffffffbffc002638>] hci_dev_do_close+0x284/0x318 [bluetooth]
[ 240.141745] [<ffffffbffc0055e8>] hci_power_off+0x1c/0x28 [bluetooth]
[ 240.141770] [<ffffffc0002db570>] worker_thread+0x23c/0x70c
[ 240.141798] [<ffffffc00023afa0>] kthread+0xe0/0xf0
[ 240.141816] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
[ 240.142617] kworker/2:1 D ffffffc0002130ec 0 113 2 0x00000000
[ 240.142675] Workqueue: pm pm_runtime_work
[ 240.142702] Call trace:
[ 240.142762] [<ffffffc0002130ec>] __switch_to+0x9c/0xa8
[ 240.142784] [<ffffffc00092f6d4>] __schedule+0x3cc/0x8f8
[ 240.142806] [<ffffffc00092f2a4>] schedule+0x4c/0xb0
[ 240.142825] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.142859] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.142884] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.142920] [<ffffffbffc0febf0>] btusb_suspend+0xa0/0x114 [btusb]
[ 240.142951] [<ffffffc000657b70>] usb_suspend_both+0xd4/0x238
[ 240.142973] [<ffffffc00065829c>] usb_runtime_suspend+0x38/0x70
[ 240.142995] [<ffffffc0007a4568>] rpm_suspend+0x310/0x4b8
[ 240.143013] [<ffffffc0005f65ec>] pm_runtime_work+0x74/0x94
[ 240.143028] [<ffffffc0002db570>] worker_thread+0x23c/0x70c
[ 240.143045] [<ffffffc00023afa0>] kthread+0xe0/0xf0
[ 240.143061] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
[ 240.143427] rs:main Q:Reg D ffffffc0002130ec 0 533 1 0x00400000
which would be:
void usb_kill_urb(struct urb *urb)
{
	might_sleep();
	if (!(urb && urb->dev && urb->ep))
		return;
	atomic_inc(&urb->reject);
	usb_hcd_unlink_urb(urb, -ENOENT); <--- call hcd->driver->urb_dequeue to unlink and giveback urb
	wait_event(usb_kill_urb_queue, atomic_read(&urb->use_count) == 0); <--- wait for use_count == 0.
So it's likely the HCD still holds the URB (use_count is incremented on submit and decremented on giveback).
Normally usb_hcd_unlink_urb() would end up calling the host controller driver's urb_dequeue() to give the URBs back, so there might be something wrong in the urb_dequeue() of xhci/ohci/ehci (not sure which one btusb is sitting on).
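For context, the wakeup that usb_kill_urb() is waiting for comes from the giveback path in the HCD core. Roughly — this is a paraphrase of __usb_hcd_giveback_urb() in drivers/usb/core/hcd.c from memory, not an exact quote — giving an URB back is what drops use_count and wakes the queue:
	/* paraphrased sketch, not a verbatim copy of drivers/usb/core/hcd.c */
	usb_unanchor_urb(urb);
	...
	urb->complete(urb);			/* run the completion handler */
	atomic_dec(&urb->use_count);		/* balances the increment done at submit time */
	if (unlikely(atomic_read(&urb->reject)))
		wake_up(&usb_kill_urb_queue);	/* this is what releases usb_kill_urb() */
	usb_put_urb(urb);
So if the HC driver's urb_dequeue() never gives the URB back, use_count never reaches 0 and the wait_event() above blocks forever, which is exactly the hung-task signature in the traces.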
If it's xHCI, it might somehow be related to:
commit d3519b9d9606991a1305596348b6d690bfa3eb27
Author: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue Mar 28 15:55:30 2017 +0300
xhci: Manually give back cancelled URB if we can't queue it for cancel
commit d6169d04097fd9ddf811e63eae4e5cd71e6666e2
Author: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Wed Jan 11 17:10:34 2017 +0200
xhci: fix deadlock at host remove by running watchdog correctly
Jan 23 2018
Jeffy, thanks for the pointers to the xHCI fixes! These do look like potential fixes; however, on Scarlet btusb is handled by the OHCI controller. Possibly it's a similar problem as with xHCI, though I didn't see any related patches.
Jan 23 2018
Maybe we can write a script to do a stress test using the hcitool/hciconfig tools. And for OHCI, I saw a patch that might be related: https://marc.info/?l=linux-usb&m=144048013527334&w=2 — I don't see that patch applied; or maybe it was fixed in another way? Anyway, if we can reproduce this issue, maybe we can add some logs to the OHCI driver to see whether this is the same case that patch tried to fix.
Jan 23 2018
It looks like the factory can reproduce it; maybe we can just ask them to retest with that patch (and perhaps add a few logs to confirm). I uploaded the CLs here: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/880544
Jan 23 2018
And for the two blocked workers:
This one is the HCI device auto power off (2 sec timeout after open):
[ 240.141535] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.141552] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.141577] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.141597] [<ffffffbffc0ff5f4>] btusb_close+0x64/0xa8 [btusb]
[ 240.141675] [<ffffffbffc002638>] hci_dev_do_close+0x284/0x318 [bluetooth]
[ 240.141745] [<ffffffbffc0055e8>] hci_power_off+0x1c/0x28 [bluetooth]
This one is USB suspend (runtime?):
[ 240.142825] [<ffffffc000654734>] usb_kill_urb+0xb0/0x118
[ 240.142859] [<ffffffc000654950>] usb_kill_anchored_urbs+0x40/0x74
[ 240.142884] [<ffffffbffc101b58>] btusb_stop_traffic+0x24/0x40 [btusb]
[ 240.142920] [<ffffffbffc0febf0>] btusb_suspend+0xa0/0x114 [btusb]
[ 240.142951] [<ffffffc000657b70>] usb_suspend_both+0xd4/0x238
[ 240.142973] [<ffffffc00065829c>] usb_runtime_suspend+0x38/0x70
[ 240.142995] [<ffffffc0007a4568>] rpm_suspend+0x310/0x4b8
[ 240.143013] [<ffffffc0005f65ec>] pm_runtime_work+0x74/0x94
It might be possible that the USB device starts suspending while the HCI is doing its auto power off, which then triggers the issue described in:
https://marc.info/?l=linux-usb&m=144048013527334&w=2
And for the log of the reboot issue (https://partnerissuetracker.corp.google.com/u/0/issues/72190941):
<6>[ 240.128004] reboot D ffffffc000213028 0 2886 1 0x00400000
<0>[ 240.128030] Call trace:
<4>[ 240.128051] [<ffffffc000213028>] __switch_to+0x9c/0xa8
<4>[ 240.128070] [<ffffffc00092df7c>] __schedule+0x3cc/0x840
<4>[ 240.128087] [<ffffffc00092db4c>] schedule+0x4c/0xb0
<4>[ 240.128114] [<ffffffc0005f453c>] __pm_runtime_barrier+0xb4/0x124
<4>[ 240.128162] [<ffffffc0005f4464>] pm_runtime_barrier+0x70/0x94
<4>[ 240.128192] [<ffffffc0005e62e8>] device_shutdown+0xac/0x1a8
<4>[ 240.128211] [<ffffffc00023c46c>] kernel_restart_prepare+0x3c/0x48
<4>[ 240.128229] [<ffffffc00023c640>] kernel_restart+0x1c/0x64
<4>[ 240.128245] [<ffffffc00023c8f4>] SyS_reboot+0x198/0x210
It looks like the reboot flow is blocked in pm_runtime_barrier(), waiting for the USB suspend to finish. I guess we can try to reproduce it by rebooting at about [ 30.831297] after boot.
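For reference, the "2 sec timeout after open" above comes from the Bluetooth core's auto power-off delay; roughly (paraphrased from net/bluetooth/hci_core.c and include/net/bluetooth/hci_core.h, not an exact quote):
	/* include/net/bluetooth/hci_core.h */
	#define HCI_AUTO_OFF_TIMEOUT	msecs_to_jiffies(2000)	/* 2 seconds */

	/* hci_power_on(): while the controller is still flagged HCI_AUTO_OFF
	 * (i.e. nothing in userspace has claimed it yet), queue the power_off
	 * work with the 2 s delay */
	queue_delayed_work(hdev->req_workqueue, &hdev->power_off,
			   HCI_AUTO_OFF_TIMEOUT);
That power_off work is the hci_power_off worker seen in the first blocked trace.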
Jan 24 2018
I'd expect the reboot test would require particular timing. And perhaps the suspend/resume test too. For one, USB autosuspend only kicks in after some number of seconds. Also, bluetoothd doesn't quite start "immediately" at boot, so you might need to at least wait for some initial activity to start (or stop?) in order to find the "right" time to catch this bug. Also (FYI mka@), last I checked, I think there are plenty of instances of this crash on the crash reporter, in case it's helpful.
Jan 26 2018
I still could not repro this with a script similar to RunInDozingStress (https://b.corp.google.com/issues/72190644) and a random delay to start around 30s uptime. One thing that strikes me is that every single hang I looked into so far was detected at 240.1xxxxx. I'm not sure what exactly to make of this, but it is almost certainly not a coincidence. Given the 120s delay for detecting a hung task, this could indicate that the problem started at 120s; however, that is 'long' after the system booted and also ~90s later than the actions that supposedly trigger the issue (suspend/reboot). It might or might not be a coincidence that 240s is precisely 2 x the hung task delay.
Jan 27 2018
Hi mka, hmm, I do remember the magic 240 sec; it seems like all these kinds of hang issues end up like that. But I think the hang may have happened way earlier than 120s (judging by the failed suspend and the unfinished reboot); I guess we can try to fake a boot-time hang to confirm the 240 seconds are not related.
Jan 27 2018
I agree, it would be odd if this was always triggered at 120s; the reboots/suspends are done way earlier. I was just wondering about that pattern. I actually already ran into another hang, which is also detected at 240 secs:
[ 240.113247] usb 6-1-port1: Cannot enable. Maybe the USB cable is bad?
[ 240.119224] INFO: task reboot:2483 blocked for more than 120 seconds.
[ 240.119229] Not tainted 4.4.111 #168
[ 240.119231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.119238] reboot D ffffffc0002130ec 0 2483 1 0x00400000
Jan 29 2018
For the 240s, it's indeed "2 x hung task delay":
vi kernel/hung_task.c
static int watchdog(void *dummy)
{
	...
	while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
		timeout = sysctl_hung_task_timeout_secs;
	...
	check_hung_uninterruptible_tasks(timeout);
So it checks for hung tasks every 120s.
and in:
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw;
	...
	if (switch_count != t->last_switch_count) {
		t->last_switch_count = switch_count;
		return;
	}
	trace_sched_process_hang(t);
So it records the first last_switch_count at 120s, finds that switch_count hasn't changed at 240s (2 x 120s), and reports the hang.
Jan 29 2018
According to the log:
<6>[ 1.396896] Bluetooth: hci0: using NVM file: qca/nvm_usb_00000302.bin
...
<4>[ 240.120521] [<ffffffbffc0054fc>] hci_power_off+0x1c/0x28 [bluetooth]
the hang should have happened at about 4s after boot, because:
1/ hci_power_off would only be called about 2s after hci_power_on (and could be canceled if anyone tried to open the hci)
2/ hci_power_on would only be called from hci_register_dev, which is called by btusb's probe at about [ 1.396896] (way earlier than the marvel).
Jan 31 2018
Jeffy, thanks for digging this up. I still had no success reproducing this, neither with trying to emulate RunInRebootSequence nor RunInDozingStress. According to my logs hci_power_off() is only called once at boot around 15s uptime when Bluetooth is disabled, but not on suspend or reboot (w/ BT enabled). The first suspend of RunInDozingStress() occurs around 25s uptime, which shouldn't interfere with hci_power_off() 10s earlier. Maybe the issue only occurs when hci_power_off() is delayed for some reason. I added extra logging to my test script to see if there are occasional longer delays (on my device, which might not be representative ...).
Jan 31 2018
Hi mka, hmm, I made a mistake in #16 — the power_on/power_off works can also be enqueued by ./net/bluetooth/mgmt.c. Anyway, I think if btusb probed at 1.4s and bluetoothd started at about 12s, then we should see power_off called at about 4-5s (auto power off 2s after the btusb probe) and at 14-15s (auto power off 2s after bluetoothd powered it on). Could you share your messages? Maybe I can compare them with the factory's.
Jan 31 2018
Hi Jeffy,
attached are a kernel log and the corresponding /var/log/messages. The kernel log includes additional traces for hci_power_on(), hci_power_off() and an indicator that hci_power_off() has finished ("NO HANG").
The logs don't include suspends or reboots; in earlier tests I started both around 30s with a fuzz factor.
Jan 31 2018
Hi mka, hmm, in the factory log the ath10k BT is registered at:
2018-01-26T10:51:22.527587+00:00 INFO kernel: [ 1.278212] usbcore: registered new interface driver btusb
..
2018-01-26T10:51:22.527593+00:00 INFO kernel: [ 1.340754] Bluetooth: hci0: using NVM file: qca/nvm_usb_00000302.bin
and the ath10k WiFi is enabled at:
2018-01-26T10:51:26.288561+00:00 INFO kernel: [ 5.520648] ath10k_pci 0000:01:00.0: enabling device (0000 -> 0002)
2018-01-26T10:51:26.289537+00:00 INFO kernel: [ 5.521497] ath10k_pci 0000:01:00.0: pci irq msi-x interrupts 8 irq_mode 0 reset_mode 0
But in your log the WiFi timing is normal, while BT comes up much later:
2018-01-30T19:00:59.017035-08:00 INFO kernel: [ 4.440324] ath10k_pci 0000:01:00.0: enabling device (0000 -> 0002)
2018-01-30T19:00:59.017040-08:00 INFO kernel: [ 4.441128] ath10k_pci 0000:01:00.0: pci irq msi-x interrupts 8 irq_mode 0 reset_mode 0
2018-01-30T19:01:07.454538-08:00 INFO kernel: [ 12.881431] usbcore: registered new interface driver btusb
And I see a 1.4s gap here:
2018-01-30T19:00:59.014150-08:00 INFO kernel: [ 0.258328] ff1a0000.serial: ttyS2 at MMIO 0xff1a0000 (irq = 41, base_baud = 1500000) is a 16550A
2018-01-30T19:00:59.014155-08:00 INFO kernel: [ 1.615659] console [ttyS2] enabled
maybe because you enabled the serial console:
[ 0.000000] Kernel command line: cros_secure earlyprintk=ttyS2,115200n8 console=tty1 console=ttyS2,115200n8 loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=PARTUUID=1e9bdab9-09fa-024c-acba-
but the factory didn't:
[ 0.000000] Kernel command line: cros_secure console= loglevel=7 init=/sbin/init cros_secure
But the 1.4s gap doesn't make sense to me...
Feb 6 2018
I reproduced the issue once with the following 'configuration':
- serial console disabled
- no USB to Ethernet adapter or hub connected
- WiFi connected to GoogleGuest
This was with Jeffy's additional logs; however, I commented out the stack dumps since it seemed they could make the problem harder to reproduce:
[ 3.747174] jeffy btusb_stop_traffic, 1105
[ 3.747195] start_ed_unlink, 492, next: (null), prev: (null), ed_rm_list: (null)
[ 3.748259] finish_unlinks, 1026, next: (null), last:ffffff80001410e0, ed_rm_list:ffffff80001410e0
[ 3.748266] MKA: hci_power_off
[ 3.748282] jeffy btusb_stop_traffic, 1105
[ 3.748292] start_ed_unlink, 492, next: (null), prev: (null), ed_rm_list:ffffff80001410e0
Apparently btusb_stop_traffic() is executed concurrently. The call from 3.747174 has not completed when the one at 3.748282 starts, otherwise there would be another trace with line number 1110.
btusb_stop_traffic() calls usb_kill_anchored_urbs() for the different endpoint types, so I suspect the concurrent execution could lead to the race condition described in the patch Jeffy mentioned in #8:
> There is a race condition between
> finish_unlinks->finish_urb() function and
> usb_kill_urb() in ohci controller case. The finish_urb
> calls spin_unlock(&ohci->lock) before
> usb_hcd_giveback_urb() function call, then if during
> this time, usb_kill_urb is called for another endpoint,
> then new ed will be added to ed_rm_list at beginning
> for unlink. and ed_rm_list will point to newly added
> ed.
>
> When finish_urb() is completed in finish_unlinks() and
> ed->td_list becomes empty as in below code (in finish_unlinks() function)
> if (list_empty(&ed->td_list)) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> ed_schedule(ohci, ed);
> }
> *last = ed->ed_next will make ed_rm_list to point to ed->ed_next and
> previously added ed by usb_kill_urb will be left unreferenced by
> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
> finish_unlink to remove added ed from ed_rm_list.
>
> The main reason for hang in this race condtion is addition and removal
> of ed from ed_rm_list in the beginning during usb_kill_urb and later last*
> is modified in finish_unlinks().
> As suggested by Alan Stern, the solution for proper handling of
> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
> any URBs. Then at the end, we can add ed back to the list if necessary.
> This properly handle the updated ohci->ed_rm_list in
> usb_kill_urb().
Since this is still hard to reproduce locally it would probably be best to get feedback from the factory on whether they still see the issue with the patch applied.
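To make the failure mode described in the quoted commit message more concrete, here is a minimal, self-contained C sketch of the list hazard. The struct and helper are simplified stand-ins that only mimic the OHCI names (ed, ed_rm_list, start_ed_unlink, finish_unlinks); this is not the actual driver code:
/*
 * Minimal illustration (not kernel code) of the hazard described above:
 * a cached "last" pointer in finish_unlinks() can drop an ed that
 * usb_kill_urb() just prepended to ed_rm_list.
 */
#include <stdio.h>
#include <stddef.h>

struct ed {
	const char *name;
	struct ed *ed_next;
};

static struct ed *ed_rm_list;

/* what start_ed_unlink() effectively does: prepend the ed to ed_rm_list */
static void start_ed_unlink(struct ed *ed)
{
	ed->ed_next = ed_rm_list;
	ed_rm_list = ed;
}

int main(void)
{
	struct ed ed_a = { "ed A (first unlink)", NULL };
	struct ed ed_b = { "ed B (second unlink, from usb_kill_urb)", NULL };

	/* 1st btusb_stop_traffic(): ed A is queued for unlink */
	start_ed_unlink(&ed_a);

	/* finish_unlinks() starts walking the list; it caches a pointer to
	 * the "previous next pointer", initially the list head itself */
	struct ed **last = &ed_rm_list;
	struct ed *ed = *last;			/* ed == &ed_a */

	/* finish_urb() drops ohci->lock here; meanwhile the 2nd
	 * btusb_stop_traffic() calls usb_kill_urb() and prepends ed B */
	start_ed_unlink(&ed_b);			/* ed_rm_list -> B -> A */

	/* back in finish_unlinks(): ed A's td_list is empty, so it does
	 * "*last = ed->ed_next" -- but *last is the list HEAD, so this
	 * overwrites the head with A's next (NULL) and silently drops B */
	*last = ed->ed_next;
	ed->ed_next = NULL;

	printf("ed_rm_list after finish_unlinks: %s\n",
	       ed_rm_list ? ed_rm_list->name : "(empty -- ed B was lost)");
	/* With ed B gone from the list, its URBs are never given back, so
	 * usb_kill_urb() waits on use_count forever -> the hung task above. */
	return 0;
}
As the quoted message says, the fix is to take the ed off ed_rm_list before finishing any URBs and re-add it at the end if necessary, so a concurrently prepended ed can no longer be dropped by the stale *last assignment.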
Feb 7 2018
Hi mka,
thanks for the log,
So it's likely we are hitting the race:
1/ The first time btusb_stop_traffic() is called (likely by USB runtime auto-suspend):
[ 3.747174] jeffy btusb_stop_traffic, 1105
[ 3.747195] start_ed_unlink, 492, next: (null), prev: (null), ed_rm_list: (null) <--- start unlink ed, acquire and release the ohci->lock
2/ doing the unlink while holding ohci->lock:
[ 3.748259] finish_unlinks, 1026, next: (null), last:ffffff80001410e0, ed_rm_list:ffffff80001410e0
3/ In finish_urb(), we release ohci->lock and give back the first URB:
spin_unlock (&ohci->lock);
usb_hcd_giveback_urb(ohci_to_hcd(ohci), urb, status);
spin_lock (&ohci->lock);
but we didn't finish the function, because there is no finish_urb log:
	/* if URB is done, clean up */
	if (urb_priv->td_cnt >= urb_priv->length) {
		modified = completed = 1;
		finish_urb(ohci, urb, 0);
		if (ed == ed_to_kill)
			printk("%s, %d, finish_urb\n",
			       __func__, __LINE__); <--- should see this log if finish_urb finished
	}
I guess the second start_ed_unlink() cut in, stole the lock for a while, and changed the ed_to_kill.
4/ The second btusb_stop_traffic() is called (by auto power off):
[ 3.748266] MKA: hci_power_off
[ 3.748282] jeffy btusb_stop_traffic, 1105
[ 3.748292] start_ed_unlink, 492, next: (null), prev: (null), ed_rm_list:ffffff80001410e0 <---- try to unlink another urb, acquire and release the ohci->lock
So the second start_ed_unlink() was called before the first finish_urb() finished, and the remainder of finish_unlinks() overrode the new ed_rm_list.
Unless something went wrong and the finish_urb() of the first finish_unlinks() was never even called...
Oops, I should have added a log before finish_urb() to check that...
Anyway, I think the factory should have the test result now.
Feb 7 2018
For the record, there are these private bugs that correspond to this issue too: https://issuetracker.google.com/72190644 https://issuetracker.google.com/72190941
Mar 3 2018
We think this is fixed here: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/880543