Crash on stable 8350.21.1 or later => shill::ExternalTask::OnTaskDied-8f16ef5d |
|||
Issue descriptionCrash with signature 'shill::ExternalTask::OnTaskDied-8f16ef5d' Has 26k+ crashes in the field, bulk on M52 stable: https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20stable_signature%3D%27shill%3A%3AExternalTask%3A%3AOnTaskDied-8f16ef5d%27%20OMIT%20RECORD%20IF%20SUM(ProductData.Key%3D%27exec_name%27)%20%3D%200%20OR%20SUM(ProductData.Value%3D%27shill%27)%20%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=#samplereports
,
Oct 17 2016
Thanks for the pointers. I'll follow up with you this week on #1-#4. For #5, its hard to conclude anything because usb->cellular dongles could also be involved (I assume those lead to the ppp daemon in shill being started). The bug affects a large and wide set of devices, most of which don't have LTE as per go/crconn (samus, snow, gnawty, peppy).
,
Nov 15 2016
+ ejcaruso who is working on improving task caller visibility.
,
Nov 15 2016
If this is happening in the L2TP driver, there's only really one option for what the external task is, and where it's set up from:
bool L2TPIPSecDriver::SpawnL2TPIPSecVPN(Error* error) {
...
std::unique_ptr<ExternalTask> external_task_local(
new ExternalTask(control_,
process_manager_,
weak_ptr_factory_.GetWeakPtr(),
Bind(&L2TPIPSecDriver::OnL2TPIPSecVPNDied,
weak_ptr_factory_.GetWeakPtr())));
...
external_task_ = std::move(external_task_local);
...
}
I'm not super sure if my task caller visibility stuff will help here since it looks like we already have a handle on where the problems are. If this is happening in the L2TP driver I'm not sure why we aren't just looking at L2TPIPSec::Notify.
If you really need to get information about where the death task is posted from, calls to DestroyLater just move the PostTask(death_callback_) out from EventDispatcher to ExternalTask since it does that weird double-dispatch. It might be more useful to expose a method you can get the death_callback_ from and then register that with the EventDispatcher directly. For example, base::RunLoop does this:
// Convenience method to get a closure that safely calls Quit (has no effect
// if the RunLoop instance is gone).
//
// Example:
// RunLoop run_loop;
// PostTask(run_loop.QuitClosure());
// run_loop.Run();
base::Closure QuitClosure();
,
Feb 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/connectivity/shill/+/13ac89430c37c63e4d619a17d9ea9d2d9739fd62 commit 13ac89430c37c63e4d619a17d9ea9d2d9739fd62 Author: Ben Chan <benchan@chromium.org> Date: Mon Feb 27 20:03:28 2017 shill: vpn: fix external task cleanup in L2TPIPSecDriver Upon termination of the 'l2tpipsec_vpn' external task launched by L2TPIPSecDriver, the following callback sequence is invoked, which leads to the destruction of the ExternalTask instance during the invocation of ExternalTask::OnTaskDied(): ExternalTask::OnTaskDied() -> L2TPIPSecDriver::OnL2TPIPSecVPNDied() -> L2TPIPSecDriver::FailService() -> L2TPIPSecDriver::Cleanup() -> ExternalTask::~ExternalTask() This CL changes the code to use ExternalTask::DestroyLater() to mitigate the issue. BUG= chromium:656410 TEST=Run unit tests. TEST=Run network_VPNConnect autotest. Change-Id: I0cc849b3586cfca282651d4d4df7627b458af6c9 Reviewed-on: https://chromium-review.googlesource.com/443884 Commit-Ready: Ben Chan <benchan@chromium.org> Tested-by: Ben Chan <benchan@chromium.org> Reviewed-by: Kevin Cernekee <cernekee@chromium.org> [modify] https://crrev.com/13ac89430c37c63e4d619a17d9ea9d2d9739fd62/vpn/l2tp_ipsec_driver.cc
,
Feb 27 2017
The crash should have been fixed by CL:443884. We can monitor crash reports after the next dev push and then determine if this particular crash vanishes.
,
Mar 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/connectivity/shill/+/e16449c881131df81196deb9dfbe385bed7b5833 commit e16449c881131df81196deb9dfbe385bed7b5833 Author: Ben Chan <benchan@chromium.org> Date: Wed Mar 01 13:04:05 2017 shill: vpn: fix memory leaks in L2TPIPSecDriverTest tests This CL fixes memory leaks in L2TPIPSecDriverTest tests detected by LeakSanitizer. Specifically, the ExternalTask instance initially held by a L2TPIPSecDriver instance could be scheduled to be destroyed after the L2TPIPSecDriver instance is destroyed (see CL:443884). To avoid leaking any ExternalTask instance when the L2TPIPSecDriverTest finishes, we modify L2TPIPSecDriverTest to ensure any pending destruction task of ExternalTask is executed when tearing down L2TPIPSecDriverTest. BUG= chromium:656410 TEST=`USE='asan clang' emerge-chell shill` Change-Id: I313442c7c01d47d5ab6ff65cd17eba774171aa30 Reviewed-on: https://chromium-review.googlesource.com/448159 Commit-Ready: Ben Chan <benchan@chromium.org> Tested-by: Ben Chan <benchan@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> [modify] https://crrev.com/e16449c881131df81196deb9dfbe385bed7b5833/vpn/l2tp_ipsec_driver_unittest.cc
,
Dec 12
Reopen if not really fixed. But crash stats say yes, as does comment #6. |
|||
►
Sign in to add a comment |
|||
Comment 1 by cernekee@chromium.org
, Oct 16 2016Continuing the conversation from the CL: This does not look like a NULL pointer dereference to me, and I do not think it is related to the callback pointer. Looking at the minidump_stackwalk and the disasm for report ID 11463ed900000000: Thread 0 (crashed) 0 shill + 0x2afc36 rax = 0x0000000000000000 rdx = 0x00007fd7abac8300 rcx = 0x00007fd7a9b0dc01 rbx = 0x00007fd7abb7de70 rsi = 0x0000000000000000 rdi = 0x2e6b726f7774654e rbp = 0x00007ffd2af25fe0 rsp = 0x00007ffd2af25e20 r8 = 0x00007fd7abac82e0 r9 = 0x0000000000000000 r10 = 0x0000000000000000 r11 = 0x0000000000000246 r12 = 0x0000000000000003 r13 = 0x00007ffd2af25e30 r14 = 0x00007ffd2af25e38 r15 = 0x00007ffd2af26280 rip = 0x00007fd7aa9f7c36 Found by: given as instruction pointer in context void std::swap<shill::RPCTask*>(shill::RPCTask*&, shill::RPCTask*&)(): /usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/move.h:178 2afc1e: 48 8b 7b 18 mov 0x18(%rbx),%rdi shill::ExternalTask::OnTaskDied(int)(): /build/winky/var/cache/portage/chromeos-base/shill/out/Default/../../../../../../../tmp/portage/chromeos-base/shill-0.0.3-r75/work/shill-0.0.3/aosp/system/connectivity/shill/external_task.cc:110 2afc22: c7 43 40 00 00 00 00 movl $0x0,0x40(%rbx) void std::swap<shill::RPCTask*>(shill::RPCTask*&, shill::RPCTask*&)(): /usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/move.h:179 2afc29: 48 c7 43 18 00 00 00 movq $0x0,0x18(%rbx) 2afc30: 00 std::unique_ptr<shill::RPCTask, std::default_delete<shill::RPCTask> >::reset(shill::RPCTask*)(): /usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:343 2afc31: 48 85 ff test %rdi,%rdi 2afc34: 74 06 je 2afc3c <std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x726c> std::default_delete<shill::RPCTask>::operator()(shill::RPCTask*) const(): /usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:76 (discriminator 1) 2afc36: 48 8b 07 mov (%rdi),%rax 2afc39: ff 50 08 callq *0x8(%rax) At 2afc36, %rdi should contain a pointer to an rpc_task so that the unique_ptr internals can call `delete` on it. But instead it contains the ASCII string "Network." It passes unique_ptr's NULL check, and crashes on dereference. Couple things to consider: 1) SIGCHLD delivery is asynchronous. AsynchronousSignalHandler is supposed to deliver the signal on the main thread so there are no concurrency issues. I don't think this is the problem, but it might be worth giving it a second look. 2) Is it possible that our ExternalTask object is getting deleted from the death_callback_? This seems unsafe. For instance, L2TPIPSecDriver::OnL2TPIPSecVPNDied -> FailService -> Cleanup -> external_task_.reset(). Also note that Cleanup can trigger metric reporting via service.cc, and the metrics code could be allocating a string that starts with our bad pointer value "Network." 3) If you can run shill under valgrind, connect to an L2TP VPN, then `kill -9` the child process, you might be able to see any possible illegal accesses. I can walk you through the L2TP setup this week. 4) base::Unretained is a possible red flag for dangling pointers, although this usage looks safe to me because no callbacks should ever happen after StopProcess is called. 5) You were asking about UMA stats for VPN the other day. Along those lines, maybe it would be useful to figure out whether this bug disproportionately affects cellular devices, or maybe devices that are used a lot in a corporate environment (VPN). And see whether it is reported on any non-cellular devices. I think Sameer has a spreadsheet showing which Chromebooks support which connectivity options.