New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 656410 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Dec 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Crash on stable 8350.21.1 or later => shill::ExternalTask::OnTaskDied-8f16ef5d

Project Member Reported by kirtika@chromium.org, Oct 16 2016

Issue description

Continuing the conversation from the CL:

This does not look like a NULL pointer dereference to me, and I do not think it is related to the callback pointer.  Looking at the minidump_stackwalk and the disasm for report ID 11463ed900000000:

Thread 0 (crashed)
 0  shill + 0x2afc36
    rax = 0x0000000000000000   rdx = 0x00007fd7abac8300
    rcx = 0x00007fd7a9b0dc01   rbx = 0x00007fd7abb7de70
    rsi = 0x0000000000000000   rdi = 0x2e6b726f7774654e
    rbp = 0x00007ffd2af25fe0   rsp = 0x00007ffd2af25e20
     r8 = 0x00007fd7abac82e0    r9 = 0x0000000000000000
    r10 = 0x0000000000000000   r11 = 0x0000000000000246
    r12 = 0x0000000000000003   r13 = 0x00007ffd2af25e30
    r14 = 0x00007ffd2af25e38   r15 = 0x00007ffd2af26280
    rip = 0x00007fd7aa9f7c36
    Found by: given as instruction pointer in context


void std::swap<shill::RPCTask*>(shill::RPCTask*&, shill::RPCTask*&)():
/usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/move.h:178
  2afc1e:       48 8b 7b 18             mov    0x18(%rbx),%rdi
shill::ExternalTask::OnTaskDied(int)():
/build/winky/var/cache/portage/chromeos-base/shill/out/Default/../../../../../../../tmp/portage/chromeos-base/shill-0.0.3-r75/work/shill-0.0.3/aosp/system/connectivity/shill/external_task.cc:110
  2afc22:       c7 43 40 00 00 00 00    movl   $0x0,0x40(%rbx)
void std::swap<shill::RPCTask*>(shill::RPCTask*&, shill::RPCTask*&)():
/usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/move.h:179
  2afc29:       48 c7 43 18 00 00 00    movq   $0x0,0x18(%rbx)
  2afc30:       00
std::unique_ptr<shill::RPCTask, std::default_delete<shill::RPCTask> >::reset(shill::RPCTask*)():
/usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:343
  2afc31:       48 85 ff                test   %rdi,%rdi
  2afc34:       74 06                   je     2afc3c <std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x726c>
std::default_delete<shill::RPCTask>::operator()(shill::RPCTask*) const():
/usr/lib/gcc/x86_64-cros-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:76 (discriminator 1)
  2afc36:       48 8b 07                mov    (%rdi),%rax
  2afc39:       ff 50 08                callq  *0x8(%rax)


At 2afc36, %rdi should contain a pointer to an rpc_task so that the unique_ptr internals can call `delete` on it.  But instead it contains the ASCII string "Network."  It passes unique_ptr's NULL check, and crashes on dereference.

Couple things to consider:

1) SIGCHLD delivery is asynchronous.  AsynchronousSignalHandler is supposed to deliver the signal on the main thread so there are no concurrency issues.  I don't think this is the problem, but it might be worth giving it a second look.

2) Is it possible that our ExternalTask object is getting deleted from the death_callback_?  This seems unsafe.  For instance, L2TPIPSecDriver::OnL2TPIPSecVPNDied -> FailService -> Cleanup -> external_task_.reset().  Also note that Cleanup can trigger metric reporting via service.cc, and the metrics code could be allocating a string that starts with our bad pointer value "Network."

3) If you can run shill under valgrind, connect to an L2TP VPN, then `kill -9` the child process, you might be able to see any possible illegal accesses.  I can walk you through the L2TP setup this week.

4) base::Unretained is a possible red flag for dangling pointers, although this usage looks safe to me because no callbacks should ever happen after StopProcess is called.

5) You were asking about UMA stats for VPN the other day.  Along those lines, maybe it would be useful to figure out whether this bug disproportionately affects cellular devices, or maybe devices that are used a lot in a corporate environment (VPN).  And see whether it is reported on any non-cellular devices.  I think Sameer has a spreadsheet showing which Chromebooks support which connectivity options.
Thanks for the pointers. I'll follow up with you this week on #1-#4.

For #5, its hard to conclude anything because usb->cellular dongles could also be involved (I assume those lead to the ppp daemon in shill being started). The bug affects a large and wide set of devices, most of which don't have LTE as per go/crconn (samus, snow, gnawty, peppy).

Comment 3 by kirtika@google.com, Nov 15 2016

Cc: ejcaruso@chromium.org
+ ejcaruso who is working on improving task caller visibility. 

If this is happening in the L2TP driver, there's only really one option for what the external task is, and where it's set up from:

bool L2TPIPSecDriver::SpawnL2TPIPSecVPN(Error* error) {
...
  std::unique_ptr<ExternalTask> external_task_local(
      new ExternalTask(control_,
                       process_manager_,
                       weak_ptr_factory_.GetWeakPtr(),
                       Bind(&L2TPIPSecDriver::OnL2TPIPSecVPNDied,
                            weak_ptr_factory_.GetWeakPtr())));
...
    external_task_ = std::move(external_task_local);
...
}

I'm not super sure if my task caller visibility stuff will help here since it looks like we already have a handle on where the problems are. If this is happening in the L2TP driver I'm not sure why we aren't just looking at L2TPIPSec::Notify.

If you really need to get information about where the death task is posted from, calls to DestroyLater just move the PostTask(death_callback_) out from EventDispatcher to ExternalTask since it does that weird double-dispatch. It might be more useful to expose a method you can get the death_callback_ from and then register that with the EventDispatcher directly. For example, base::RunLoop does this:

  // Convenience method to get a closure that safely calls Quit (has no effect
  // if the RunLoop instance is gone).
  //
  // Example:
  //   RunLoop run_loop;
  //   PostTask(run_loop.QuitClosure());
  //   run_loop.Run();
  base::Closure QuitClosure();
Project Member

Comment 5 by bugdroid1@chromium.org, Feb 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/aosp/platform/system/connectivity/shill/+/13ac89430c37c63e4d619a17d9ea9d2d9739fd62

commit 13ac89430c37c63e4d619a17d9ea9d2d9739fd62
Author: Ben Chan <benchan@chromium.org>
Date: Mon Feb 27 20:03:28 2017

shill: vpn: fix external task cleanup in L2TPIPSecDriver

Upon termination of the 'l2tpipsec_vpn' external task launched by
L2TPIPSecDriver, the following callback sequence is invoked, which leads
to the destruction of the ExternalTask instance during the invocation of
ExternalTask::OnTaskDied():

    ExternalTask::OnTaskDied()
      -> L2TPIPSecDriver::OnL2TPIPSecVPNDied()
        -> L2TPIPSecDriver::FailService()
          -> L2TPIPSecDriver::Cleanup()
            -> ExternalTask::~ExternalTask()

This CL changes the code to use ExternalTask::DestroyLater() to mitigate
the issue.

BUG= chromium:656410 
TEST=Run unit tests.
TEST=Run network_VPNConnect autotest.

Change-Id: I0cc849b3586cfca282651d4d4df7627b458af6c9
Reviewed-on: https://chromium-review.googlesource.com/443884
Commit-Ready: Ben Chan <benchan@chromium.org>
Tested-by: Ben Chan <benchan@chromium.org>
Reviewed-by: Kevin Cernekee <cernekee@chromium.org>

[modify] https://crrev.com/13ac89430c37c63e4d619a17d9ea9d2d9739fd62/vpn/l2tp_ipsec_driver.cc

The crash should have been fixed by CL:443884. We can monitor crash reports after the next dev push and then determine if this particular crash vanishes.
Project Member

Comment 7 by bugdroid1@chromium.org, Mar 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/aosp/platform/system/connectivity/shill/+/e16449c881131df81196deb9dfbe385bed7b5833

commit e16449c881131df81196deb9dfbe385bed7b5833
Author: Ben Chan <benchan@chromium.org>
Date: Wed Mar 01 13:04:05 2017

shill: vpn: fix memory leaks in L2TPIPSecDriverTest tests

This CL fixes memory leaks in L2TPIPSecDriverTest tests detected by
LeakSanitizer. Specifically, the ExternalTask instance initially held by
a L2TPIPSecDriver instance could be scheduled to be destroyed after the
L2TPIPSecDriver instance is destroyed (see CL:443884). To avoid leaking
any ExternalTask instance when the L2TPIPSecDriverTest finishes, we
modify L2TPIPSecDriverTest to ensure any pending destruction task of
ExternalTask is executed when tearing down L2TPIPSecDriverTest.

BUG= chromium:656410 
TEST=`USE='asan clang' emerge-chell shill`

Change-Id: I313442c7c01d47d5ab6ff65cd17eba774171aa30
Reviewed-on: https://chromium-review.googlesource.com/448159
Commit-Ready: Ben Chan <benchan@chromium.org>
Tested-by: Ben Chan <benchan@chromium.org>
Reviewed-by: Ben Chan <benchan@chromium.org>

[modify] https://crrev.com/e16449c881131df81196deb9dfbe385bed7b5833/vpn/l2tp_ipsec_driver_unittest.cc

Status: Fixed (was: Started)
Reopen if not really fixed. But crash stats say yes, as does comment #6.

Sign in to add a comment