New issue
Advanced search Search tips

Issue 706744 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows , Mac
Pri: 3
Type: Bug



Sign in to add a comment

minidump contains incorrect stack with symbol_level=1

Project Member Reported by jochen@chromium.org, Mar 30 2017

Issue description

from https://codereview.chromium.org/2782603002#msg7 and https://codereview.chromium.org/2782603002#msg12

it looks like when navigating to chrome://crash on a win7 instance, and looking at the resulting minidump using cdb (from depot tools), the stack is incorrect.
 
Cc: brucedaw...@chromium.org
fyi +Bruce. We tracked down a bad stackwalk dump to using the latest cdb+dbgeng.dll from the Win 10 SDK on a Win7 VM. It just goes completely off the rails on Win7, but works fine on 10. I didn't file a bug against Microsoft yet.
Uggh.

The latest Windows Performance Toolkit doesn't even *run* on Windows 7, but having dbgeng.dll run on Windows 7 but give incorrect results seems particularly horrid.

I guess we could package the toolchain with two versions and then get_toolchain_if_necessary.py could grab the appropriate version depending on the OS. Layers of hacks...

Yeah, it's pretty awful. 

A while back I had to check in the "last dbghelp that worked on XP" to use on bots. :(

We need to keep the current most-recent windbg/dbgeng for fastlink, right? If we can find a version that works for 7 and mostly works otherwise (i.e. on Win10 too, but doesn't support /fastlink) so that it's good enough for bots, then I'd lean towards just packaging up both of them with the win7-safe one in a different subdirectory. Then we can use that for swarming/isolates that are sent to bots, vs. used by developers interactively. But, I don't think we should have dynamic behaviour in toolchain downloading, as that is already a bit too fiddly.

I'll see if I can identify an SDK version that has a Win7 friendly dbgeng.

I had the Win7 SDK installed so I tried that one ("6.12.0002.633 AMD64" whatever that is).

It takes forever (unusually long hang, even for windbg) and then fails to extract any data from content.dll.pdb for a test crash.

So something between there and newest Win10. I guess. :/ Log below.

Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [c:\users\scott\Desktop\9db361a0-b406-470a-be94-9b6362dd1e03.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available

WARNING: Minidump contains unknown stream type 0x15
WARNING: Minidump contains unknown stream type 0x16
Symbol search path is: ..;srv*C:\symbolcache*http://msdl.microsoft.com/download/symbols;srv*C:\symbolcache*https://chromium-browser-symsrv.commondatastorage.googleapis.com
Executable search path is: 
Windows 7 Version 7601 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Thu Mar 30 10:16:23.000 2017 (UTC - 7:00)
System Uptime: not available
Process Uptime: 0 days 0:00:03.000
................................................................
................................................................
.........................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(f68.d5c): Access violation - code c0000005 (first/second chance not available)
eax=00000000 ebx=0018ba94 ecx=b2e9615b edx=dddddf29 esi=00000002 edi=00000000
eip=77a8014d esp=0018ba44 ebp=0018bae0 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
ntdll!NtWaitForMultipleObjects+0x15:
77a8014d 83c404          add     esp,4
0:000> !sym noisy
noisy mode - symbol prompts on
0:000> .reload
..
DBGHELP: c:\symbolcache\ntdll.dll\4CE7BA58180000\ntdll.dll - OK
DBGENG:  c:\symbolcache\ntdll.dll\4CE7BA58180000\ntdll.dll - Mapped image memory
DBGHELP: ..\wntdll.pdb - file not found
DBGHELP: ..\dll\wntdll.pdb - file not found
DBGHELP: ..\symbols\dll\wntdll.pdb - file not found
DBGHELP: ntdll - public symbols  
         c:\symbolcache\wntdll.pdb\DCCFF2D483FA4DEE81DC04552C73BB5E2\wntdll.pdb
..............................................................
................................................................
.........................................................
0:000> .ecxr
eax=00000000 ebx=00000001 ecx=b2e9615b edx=dddddf29 esi=012a1717 edi=00000000
eip=129e040c esp=0018c234 ebp=0018c238 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
DBGHELP: ..\content.dll - OK
DBGENG:  ..\content.dll - Mapped image memory
*** WARNING: Unable to verify checksum for content.dll
DBGHELP: content - public symbols & lines 
         ..\content.dll.pdb
content+0x29e040c:
129e040c c70000000000    mov     dword ptr [eax],0    ds:002b:00000000=????????
0:000> k5
ChildEBP RetAddr  
0018c238 129e68e3 content+0x29e040c
0018c814 129f2ac7 content+0x29e68e3
0018ca74 129e6f9c content+0x29f2ac7
0018cffc 129ef27e content+0x29e6f9c
0018d100 129bcc40 content+0x29ef27e

Oh wait, shoot. I have is_win_fastlink=true set. I'll try again.
Good news (maybe?), it works fine on Win7 as long as is_win_fastlink isn't set (i.e. with the packaged Win10 SDK cdb, built on Win10, but running on Win7 VM).

But... maybe it's bad news because I don't think that should be getting set on the try run that failed per https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin_chromium_rel_ng%2F410592%2F%2B%2Frecipes%2Fsteps%2Fgenerate_build_files__with_patch_%2F0%2Fstdout .
Maybe I'm confusing myself by using a local debug build. I'll try to mimic the exact settings where that bots failing instead.
OK, so I can reproduce in a release build mimicking those settings.

However, none of the cdb.exes from the Win7 SDK, the Win8.1 SDK, nor the Win10 SDK seem to work, they all print the wrong stack when run on Win7.

Given that, I copied the dump from Win7 to my Win10 machine, and lo! it's "broken" there too.

So now I think the fault actually lies in content_shell_crash_service (which is still breakpad, and so is using MiniDumpWriteDump() and so depends on the version of dbghelp.dll. i.e. it's at capture time, not at dump time where the problem is.

Comment 9 by jochen@chromium.org, Mar 30 2017

Labels: OS-Mac
Hum, I had hoped that when we started to user crashpad in chrome that contrnt_shell would be moved to the new system as well

Apparently that hasn't happened :/
Cc: mark@chromium.org
No, it hasn't happened yet. I think this is independent of that the more I dig into it, but we definitely need to switch content_shell to using Crashpad and get rid of the last remnants of Breakpad on the client side.
Summary: minidump contains incorrect stack with symbol_level=1 (was: minidump contains incorrect stack on win7)
OK, I now think this is worse than it originally appeared, and might be affecting symbol_level=1 in general (nothing to do with Win7). I can repro on Win10 with the following steps:

rmdir /s/q out\testy

gn gen out\testy --args="dcheck_always_on=true ffmpeg_branding=\"Chrome\" is_component_build=false is_debug=false proprietary_codecs=true symbol_level=1 target_cpu=\"x86\""

ninja -C out\testy content_shell content_shell_crash_service copy_cdb_to_output

python content\shell\tools\breakpad_integration_test.py --build-dir=out\testy --verbose --binary out\testy\content_shell.exe

I did have successful runs before on Win10, but I think I confused myself by using accidentally using symbol_level=2, where maybe things are fine.

I also tried simply running that content_shell.exe as windbg -o content_shell.exe and manually navigating to chrome://crash, and the stack is wrong there too.

Comment 12 by mark@chromium.org, Mar 30 2017

 Bug 466890 : content_shell’s still on Breakpad. It was annoying to switch while we had GYP and GN builds going in parallel. That problem is behind us now, so it should be “easy.”
Following on from #c11, changing (only) symbol_level=1 to symbol_level=2, the test works fine on both Win10 and Win7. So, yeah. (╯°□°)╯︵ ┻━┻
would crashpad work here with symbol_level = 1 as well? Alternatively, I guess we should run the bots with symbol_level = 2.

fwiw, gdb also can't use linux binaries that are compiled with symbol_level = 1. yay.
Is the stack walking failing or the symbol lookup? I would think that stack walking on x86 would mostly be a matter of following the EBP chain, but the stack walking code does like to be "clever" and will sometimes outsmart itself.

There is some metadata to help with stack walking that is stored in the PDB but one would hope that that would always be stored, instead of relying on symbol_level = 2. But, I guess if the metadata is generated by the compiler then it is conceivable that it might be omitted entirely and then not be available at link time in the symbol_level = 1 case. I think that symbol_level = 1 is probably not a well tested scenario at Microsoft.

> I notice that despite saying ".lines" to cdb you're not getting any line number information.

That is expected for symbol_level = 1

Anyway, just wanted to clarify whether it was symbol lookup that is failing or stack walking. They are completely separate steps.

I doubt that crashpad versus breakpad is the problem. All that these tools need to do (for the purposes of stack walking) is record the contents of the stack (just a memcpy), module information, and registers. These are all-or-nothing operations - either you get the contents of the stack or not, and either you get the module information (and therefore PDBs load) or you don't. The stack walking is all done at cdb/windbg/dbghelp/VS time, by looking at the stack contents (and registers, and some metadata) to generate a call stack, and looking at PDBs to generate symbols.

> > I notice that despite saying ".lines" to cdb you're not getting any line number information.
> That is expected for symbol_level = 1

Oh, right.

> Anyway, just wanted to clarify whether it was symbol lookup that is failing or stack walking

The very top of the stack is wrong, so I don't think it has anything to do with stack walking. Just looking up eip in this case ought to be in CrashIntentionally(), but it's being reported as at content_shell!mojo::internal::InterfacePtrState<media::mojom::RemoterFactory>::ConfigureProxyIfNecessary+0x2ad.

Yup, I agree Breakpad vs. Crashpad doesn't matter here. Between #c8 and #c11 I realized Win7 wasn't relevant, so dbghelp doesn't matter either. And I think both should be able to reliably grab EIP. :)

I thought it might be related to inlining, but CrashIntentionally() is specifically marked NOINLINE.

The top two entries (the incorrect ones) have quite large offsets from their entry point:

2:030> k3
 # ChildEBP RetAddr  
00 00b3dd44 034406d4 content_shell!mojo::internal::InterfacePtrState<media::mojom::RemoterFactory>::ConfigureProxyIfNecessary+0x2ad
01 00b3de10 034468ed content_shell!content::RenderFrameImpl::MaybeEnableMojoBindings+0x2d3
02 00b3defc 03440ad0 content_shell!content::RenderFrameImpl::PrepareRenderViewForNavigation+0x60

so I wonder if might have a problem similar to the one we had in dump_syms. Perhaps there's only a symbol emitted for the entry point to the function, but not for any secondary blocks of the function. Then if they get interleaved, the function is attributed.

On the other hand `x *!*CrashIntentionally*` doesn't report anything, so maybe it's just ... busted, and dropping some symbols.
Er,

> Then if they get interleaved, the function is attributed.

*mis-*attributed.


Anyway, back to the problem at hand, I don't really know what to suggest for this bot now.

symbol_level=2 is a non-starter because:
=> We can't use fastlink when it's being shuttled to swarming, because:
=> We can't use goma + symbol_level=2 without fastlink, because:
=> We can't have a win CQ/trybot that's not using goma because it'll collapse under load

There might be some combination of noinline/extern/disable optimize that makes it sorta-work. I guess I'll play with that for a bit to see if I can find a workaround.

Or we could split it out to a separate bot/test that builds less, but with symbol_level=2, but that seems like quite a pain, since we really want to test content_shell.

Or, time to ship clang&lld I guess. :/
yeah, the test is supposed to exercise how layout tests symbolize crashes, so ideally this would just work on all the bots :/
Any thoughts on what has triggered this? It seems like it should have been happening all the time.

Is it possible that this is an incremental linking bug, or triggered by some other linker setting? /opt:ref? /opt:icf? There really aren't that many that seem plausible.

The most recent incremental linking bug was apparently associated with cfg (control-flow-guard), FWIW.
Aha!

It's that the function's in an anonymous namespace. Moving just CrashIntentionally() out to content (rather than content::`anon`), makes CrashIntentionally() correct, but MaybeHandleDebugURL() still incorrect.

Moving both out, makes the stack as expected.

That almost makes some kind of sense (i.e. I could see why they'd drop anon namespace functions from the symbols in this mode). So moving either just CrashIntentionally() out, or possibly MaybeHandleDebugURL() and its dependents, along with a comment about symbol_level=1 and this bug seems like a relatively happy resolution.
Not sure about what caused it. I just tried turning off /OPT:REF, /OPT:ICF, and /guard:cf but didn't to make any difference.

It's plausible we just weren't exercising any crash tests on symbol_level=1 bots before Jochen tried to land this CL. :(
Okay, yeah, that sort of makes sense. Static functions and anonymous namespace functions are invisible for linking purposes, and symbol_level == 1 says to not generate symbols at compile time, which means that all the linker has to work with is the names that are exported by the compiler, which is only the ones needed for linking purposes.

That explains why it only happens here rather than everywhere, although it's still surprising that we haven't hit this before. This should probably be documented somewhere - maybe in the .gn file that sets symbol_level = 1.

Cc: -scottmg@chromium.org
Owner: scottmg@chromium.org
Status: WontFix (was: Unconfirmed)
I'm going to close this and blithely pretend all is well with the world and I didn't get go down a bunch of ratholes to get to this point. :)
Project Member

Comment 25 by bugdroid1@chromium.org, Mar 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/de6de9d3f907de08bce627cb096586c65eabc89a

commit de6de9d3f907de08bce627cb096586c65eabc89a
Author: brucedawson <brucedawson@chromium.org>
Date: Fri Mar 31 20:49:58 2017

Clarify the implications of symbol_level = 1

It is obvious in hindsight that symbol_level = 1 means that functions
with internal linkage won't show up in back traces.

R=scottmg@chromium.org
BUG= 706744 

Review-Url: https://codereview.chromium.org/2785263003
Cr-Commit-Position: refs/heads/master@{#461218}

[modify] https://crrev.com/de6de9d3f907de08bce627cb096586c65eabc89a/build/config/compiler/compiler.gni

Sign in to add a comment