Issue 702388

Starred by 3 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




minidumps don't seem to handle stack overflows well

Project Member Reported by diand...@chromium.org, Mar 16 2017

Issue description

Looking at crash: 8a438fed80000000

It appears that we got a stack overflow.

--

Since we have the "stack guards" fix (see bug #665083), the memory ranges in /proc/maps are now correct. Here's /proc/maps from the minidump:

a92e1000-a92e2000 r-xp 00000000 00:00 0          [sigpage]
a92e2000-a92e3000 r--p 00020000 fe:00 16054      /lib/ld-2.23.so
a92e3000-a92e4000 rw-p 00021000 fe:00 16054      /lib/ld-2.23.so
be61c000-bee1c000 rw-p 00000000 00:00 0 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

--

...but the minidump still doesn't actually contain data from the stack.  

Running "info reg" shows:

sp             0xbe61bd28       0xbe61bd28


And trying to look at memory in gdb shows:

(gdb) x /32x $sp
0xbe61bd28:     0x00000000      0x00000000      0x00000000      0x00000000
0xbe61bd38:     0x00000000      0x00000000      0x00000000      0x00000000
0xbe61bd48:     0x00000000      0x00000000      0x00000000      0x00000000
0xbe61bd58:     0x00000000      0x00000000      0x00000000      0x00000000
...
...
0xbe61bfd8:     0x00000000      0x00000000      0x00000000      0x00000000
0xbe61bfe8:     0x00000000      0x00000000      0x00000000      0x00000000
0xbe61bff8:     0x00000000      0x00000000      Cannot access memory at address 0xbe61c000
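A quick way to check where sp falls relative to these mappings is a small lookup over the /proc/maps lines. This is a hypothetical Python sketch, not breakpad code: the helper names are made up, and the only inputs are the maps and the sp value quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mapping:
    start: int  # inclusive
    end: int    # exclusive, as printed in /proc/<pid>/maps
    perms: str
    name: str   # pathname or pseudo-name like "[stack]"; "" if anonymous


def parse_maps_line(line: str) -> Mapping:
    # Fields: address perms offset dev inode [pathname]
    fields = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    name = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], name)


def find_mapping(maps: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the mapping containing addr, or None if addr is unmapped."""
    for m in maps:
        if m.start <= addr < m.end:
            return m
    return None


def nearest_mapping_above(maps: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the lowest mapping that starts above addr, if any."""
    above = [m for m in maps if m.start > addr]
    return min(above, key=lambda m: m.start) if above else None


MAPS = """\
a92e1000-a92e2000 r-xp 00000000 00:00 0          [sigpage]
a92e2000-a92e3000 r--p 00020000 fe:00 16054      /lib/ld-2.23.so
a92e3000-a92e4000 rw-p 00021000 fe:00 16054      /lib/ld-2.23.so
be61c000-bee1c000 rw-p 00000000 00:00 0
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]
"""

maps = [parse_maps_line(l) for l in MAPS.splitlines() if l.strip()]
sp = 0xbe61bd28  # from "info reg" above

print(find_mapping(maps, sp))                 # None
stack = nearest_mapping_above(maps, sp)
print(hex(stack.start), stack.start - sp)     # 0xbe61c000 728
```

Any lookup keyed on "the mapping that contains sp" comes back empty here: sp sits 728 (0x2d8) bytes below the start of the anonymous be61c000 region.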

--

It looks to me (not knowing anything about how minidumps are created) like the minidump writer tried to dump memory near the stack pointer but then stopped when it got to a "boundary". Any memory it couldn't read shows up as 0x0.


Assuming I've guessed the minidump behavior correctly, it is most notably unhelpful for debugging stack overflows. Any time we "allocate" space on the stack first and then write to it, the SP will be "outside" the memory range allocated to the stack. Trying to save off this memory will always hit a boundary once it gets to the real stack memory, and dumping stops there.
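The guessed behavior can be written down as a tiny model. This is a hypothetical sketch of the guess above, not breakpad's actual code: it assumes the writer copies up to `max_len` bytes upward from sp, but only from the single mapping that contains sp. The function name, the 32 KiB window, and the in-range sp 0xbe61c100 are made up for illustration.

```python
from typing import List, Tuple

# (start, end) pairs with an exclusive end, as in /proc/<pid>/maps.
Range = Tuple[int, int]


def captured_stack_range(sp: int, maps: List[Range], max_len: int) -> Range:
    """Hypothetical model of the guessed minidump behavior.

    Copies up to max_len bytes starting at sp, but never past the end of
    the mapping that contains sp. Returns (start, length).
    """
    for start, end in maps:
        if start <= sp < end:
            return sp, min(max_len, end - sp)
    # sp lies in no mapping (e.g. it was moved below the stack before the
    # faulting store), so nothing at all is captured.
    return sp, 0


# Subset of the mappings from this crash, plus an assumed capture window.
MAPS = [(0xa92e3000, 0xa92e4000), (0xbe61c000, 0xbee1c000), (0xffff0000, 0xffff1000)]
WINDOW = 32 * 1024

start, length = captured_stack_range(0xbe61bd28, MAPS, WINDOW)  # crash's sp
print(hex(start), length)  # 0xbe61bd28 0
start, length = captured_stack_range(0xbe61c100, MAPS, WINDOW)  # hypothetical sp
print(hex(start), length)  # 0xbe61c100 32768
```

Under this model, an sp that has already been moved below the stack yields an empty capture no matter how large the window is.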

--

Maybe someone can point to the code that actually creates the minidump and we can confirm my guesses are correct?  Then we can see if we can think of a better way to handle stack overflows.

--

NOTE: we will still do the right thing if the straw that overflows the stack was a barf instruction (a single store-multiple), like:

  push    {r4, r5, r6, r7, r8, r9, r10, r11, lr}

If we get lucky and that overflows the stack, we'll get a SIGSEGV but the stack pointer will still be in the correct place. In the case of 8a438fed80000000, though, we did the allocation and the stores in two steps rather than with a single barf instruction.
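The difference between the two cases can be sketched as a toy model of where sp ends up when the fault is delivered. Assumptions, all hypothetical: the lowest valid stack address is the be61c000 start from the maps above; a faulting store-multiple with writeback (like the `push` above) is modeled as leaving sp unchanged, which is the behavior described in this note; the allocate-then-store case is modeled as a store at [sp]; the starting sp and the 0x1000 frame size are made-up numbers.

```python
STACK_LOW = 0xbe61c000  # lowest mapped stack address in this crash


def barf_push(sp: int, nregs: int):
    """Toy model of `push {...}` (store-multiple with writeback).

    If the block of stores would fall below STACK_LOW, the instruction
    faults and (per the assumption above) sp keeps its old value.
    """
    new_sp = sp - 4 * nregs  # 4 bytes per 32-bit register
    if new_sp < STACK_LOW:
        return "SIGSEGV", sp
    return "ok", new_sp


def sub_then_store(sp: int, frame_size: int):
    """Toy model of `sub sp, sp, #frame_size` followed by a store at [sp].

    The subtraction itself never faults, so sp has already moved when the
    store to the unmapped page raises SIGSEGV.
    """
    sp -= frame_size
    if sp < STACK_LOW:
        return "SIGSEGV", sp
    return "ok", sp


sp = STACK_LOW + 0x10  # assume only 16 bytes of stack are left
sig, fault_sp = barf_push(sp, 9)         # 9 registers, as in the push above
print(sig, hex(fault_sp))                # SIGSEGV 0xbe61c010
sig, fault_sp = sub_then_store(sp, 0x1000)
print(sig, hex(fault_sp))                # SIGSEGV 0xbe61b010
```

In the first case the reported sp is still inside the stack mapping; in the second it is below it, which is the situation in this crash.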
 

Comment 1 by laszio@chromium.org, Mar 16 2017

Cc: laszio@chromium.org

Comment 2 by ivanpe@chromium.org, Mar 16 2017

Cc: jperaza@chromium.org

Comment 3 by laszio@chromium.org, Mar 17 2017

Blocking: 665083
I believe this blocks crbug/665083, unless someone can reproduce the problem.

@3: Not convinced it blocks. The "__divdi3" was crashing on a push and the actual sp would have been in the right range. Assuming my theories (guesses) are right, I think we'll be able to get a stack crawl.

...but until we have proof, doesn't hurt to leave the blocker.

Comment 5 by vapier@chromium.org, Mar 17 2017

Chrome itself takes care of creating the minidump (with the help of breakpad libs) when chrome crashes. maybe thestig@ will have pointers to where that logic lives in the chromium tree.

do you see this behavior when standalone code overflows its stack?

When a chrome process is crashing, the minidump is generated by breakpad/src/client/linux/minidump_writer/minidump_writer.cc with breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc.

FWIW, if you inspect a minidump generated as a result of a stack overflow with minidump_stackwalk, it actually says the minidump doesn't contain the stack contents of the crashing thread.

~/chrome/src % out/gn/minidump_stackwalk 0341f999-4734-299d-7e9b221d-0d09d3eb.dmp /tmp/sym
2017-03-17 19:35:51: minidump.cc:1425: ERROR: MinidumpThread has a memory region problem, 0x7ffe74e1ffd0+0x0, RVA 0x0x518
2017-03-17 19:35:51: minidump_processor.cc:255: ERROR: No memory region for 0341f999-4734-299d-7e9b221d-0d09d3eb.dmp:0/23 id 0x1b4ff
2017-03-17 19:35:51: stackwalker_amd64.cc:286: ERROR: Can't get caller frame without memory or stack

Comment 8 by laszio@chromium.org, Mar 17 2017

Blocking: -665083
@4: I was worried because the new crash has a huge frame and would probably dominate all the crashes. The compiler for chrome migrated from gcc to clang two weeks ago. I was wondering whether differences in inlining caused the huge frame, which would probably lower the probability of crashes being debuggable.

Anyway, a debuggable crash, the same as what you found in R54 and R56, started to appear in R57 after the kernel fix: 6f23039480000000 :)

Comment 9 by vapier@chromium.org, Mar 17 2017

i would guess the missing stack contents are due to the sp being outside any known maps, and due to the maps not having a [stack] label.  which means from userspace, the stack is indistinguishable from a heap mapping, and we explicitly don't want to dump heaps.

we could try and add some logic like "if no stack is found, and SP is within X KiB of a map, then assume that map is the stack".  but it seems like it'd be better to just go with doug's backports that fix the guard page & stack mappings, and then this issue doesn't come up anymore.
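The suggested fallback could look roughly like the Python sketch below. It is not breakpad code; the function name, the `slack` parameter (the "X KiB"), and the choice to only look for a writable mapping starting just above sp (because the stack grows down) are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# (start, end_exclusive, perms, name), as parsed from /proc/<pid>/maps.
Mapping = Tuple[int, int, str, str]


def guess_stack_mapping(sp: int, maps: List[Mapping], slack: int) -> Optional[Mapping]:
    """Pick the mapping to treat as the crashing thread's stack (hypothetical)."""
    # 1. Normal case: sp lies inside some mapping.
    for m in maps:
        if m[0] <= sp < m[1]:
            return m
    # 2. Fallback: sp is unmapped (e.g. it was moved past the bottom of the
    #    stack before a store faulted). Stacks grow down, so take the closest
    #    writable mapping that starts no more than `slack` bytes above sp.
    candidates = [m for m in maps if "w" in m[2] and sp < m[0] <= sp + slack]
    return min(candidates, key=lambda m: m[0]) if candidates else None


MAPS = [
    (0xa92e3000, 0xa92e4000, "rw-p", "/lib/ld-2.23.so"),
    (0xbe61c000, 0xbee1c000, "rw-p", ""),
    (0xffff0000, 0xffff1000, "r-xp", "[vectors]"),
]
sp = 0xbe61bd28

m = guess_stack_mapping(sp, MAPS, slack=4 * 1024)
print(hex(m[0]), hex(m[1]))                      # 0xbe61c000 0xbee1c000
print(guess_stack_mapping(sp, MAPS, slack=512))  # None
```

With 4 KiB of slack the anonymous region is picked; with 512 bytes it is not, since sp is 728 bytes away. A dumper using this would have to copy from the start of the chosen mapping rather than from sp, and it carries the risk mentioned above of dumping what is really a heap mapping.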

vapier@: Unless I messed up, this is different because 8a438fed80000000 was supposed to be _after_ the fix. Unless it's the standard "wrong kernel version" issue (bug #590757).

...maybe it's just that, though...
...so is the issue here that sometimes the kernel is getting confused and not labeling the stack as "stack"?

i was def looking at reports in issue 665083 (e.g. 849a296480000000)

looking at http://crash/8a438fed80000000:
Chrome version = 59.0.3040.0
CrOS version = 9369.0.0 (Official Build) canary-channel veyron_minnie

thread[0]
MDRawThread
  thread_id                   = 0x418f
MDRawContextARM
  iregs[13]  sp        = 0xbe61bd28
No stack

Stream MD_LINUX_MAPS:
a92e3000-a92e4000 rw-p 00021000 fe:00 16054      /lib/ld-2.23.so
be61c000-bee1c000 rw-p 00000000 00:00 0 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

so in this crash, the stack is not labeled as [stack], and sp is not within the stack region.

was that build supposed to contain the kernel fixes? or are we seeing delayed upload crashes (e.g. issue 590757)?

Yeah, maybe it was just issue 590757? The reason I didn't think it was is that we got a full 8MB allocated to the stack (not 8MB - 4K), which seemed to indicate that we were on the new kernel.
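The size argument is easy to check with plain arithmetic on the anonymous mapping from this crash. Assumptions, not read from the crash: an 8 MiB stack rlimit and 4 KiB pages. Per this comment, a full 8 MiB suggested the new kernel and 8 MiB - 4 KiB would have pointed at the old one.

```python
PAGE = 4 * 1024                 # assumed page size
STACK_RLIMIT = 8 * 1024 * 1024  # assumed RLIMIT_STACK (8 MiB)


def classify_stack_size(start: int, end: int) -> str:
    """Compare a stack mapping's size against the two expected values."""
    size = end - start
    if size == STACK_RLIMIT:
        return "full 8 MiB"
    if size == STACK_RLIMIT - PAGE:
        return "8 MiB - 4 KiB"
    return f"other ({size:#x} bytes)"


print(classify_stack_size(0xbe61c000, 0xbee1c000))  # full 8 MiB
```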

Another one to look at is de6da26640000000, maybe?

Maybe the whole problem here is how many of the reports seem to be missing the [stack] annotation?  It seems like that's not related to the broken stack guard problem???
