minidump contains incorrect stack with symbol_level=1 |
|||||
Issue descriptionfrom https://codereview.chromium.org/2782603002#msg7 and https://codereview.chromium.org/2782603002#msg12 it looks like when navigating to chrome://crash on a win7 instance, and looking at the resulting minidump using cdb (from depot tools), the stack is incorrect.
,
Mar 30 2017
Uggh. The latest Windows Performance Toolkit doesn't even *run* on Windows 7, but having dbgeng.dll run on Windows 7 but give incorrect results seems particularly horrid. I guess we could package the toolchain with two versions and then get_toolchain_if_necessary.py could grab the appropriate version depending on the OS. Layers of hacks...
,
Mar 30 2017
Yeah, it's pretty awful. A while back I had to check in the "last dbghelp that worked on XP" to use on bots. :( We need to keep the current most-recent windbg/dbgeng for fastlink, right? If we can find a version that works for 7 and mostly works otherwise (i.e. on Win10 too, but doesn't support /fastlink) so that it's good enough for bots, then I'd lean towards just packaging up both of them with the win7-safe one in a different subdirectory. Then we can use that for swarming/isolates that are sent to bots, vs. used by developers interactively. But, I don't think we should have dynamic behaviour in toolchain downloading, as that is already a bit too fiddly. I'll see if I can identify an SDK version that has a Win7 friendly dbgeng.
,
Mar 30 2017
I had the Win7 SDK installed so I tried that one ("6.12.0002.633 AMD64" whatever that is).
It takes forever (unusually long hang, even for windbg) and then fails to extract any data from content.dll.pdb for a test crash.
So something between there and newest Win10. I guess. :/ Log below.
Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [c:\users\scott\Desktop\9db361a0-b406-470a-be94-9b6362dd1e03.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available
WARNING: Minidump contains unknown stream type 0x15
WARNING: Minidump contains unknown stream type 0x16
Symbol search path is: ..;srv*C:\symbolcache*http://msdl.microsoft.com/download/symbols;srv*C:\symbolcache*https://chromium-browser-symsrv.commondatastorage.googleapis.com
Executable search path is:
Windows 7 Version 7601 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Thu Mar 30 10:16:23.000 2017 (UTC - 7:00)
System Uptime: not available
Process Uptime: 0 days 0:00:03.000
................................................................
................................................................
.........................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(f68.d5c): Access violation - code c0000005 (first/second chance not available)
eax=00000000 ebx=0018ba94 ecx=b2e9615b edx=dddddf29 esi=00000002 edi=00000000
eip=77a8014d esp=0018ba44 ebp=0018bae0 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
ntdll!NtWaitForMultipleObjects+0x15:
77a8014d 83c404 add esp,4
0:000> !sym noisy
noisy mode - symbol prompts on
0:000> .reload
..
DBGHELP: c:\symbolcache\ntdll.dll\4CE7BA58180000\ntdll.dll - OK
DBGENG: c:\symbolcache\ntdll.dll\4CE7BA58180000\ntdll.dll - Mapped image memory
DBGHELP: ..\wntdll.pdb - file not found
DBGHELP: ..\dll\wntdll.pdb - file not found
DBGHELP: ..\symbols\dll\wntdll.pdb - file not found
DBGHELP: ntdll - public symbols
c:\symbolcache\wntdll.pdb\DCCFF2D483FA4DEE81DC04552C73BB5E2\wntdll.pdb
..............................................................
................................................................
.........................................................
0:000> .ecxr
eax=00000000 ebx=00000001 ecx=b2e9615b edx=dddddf29 esi=012a1717 edi=00000000
eip=129e040c esp=0018c234 ebp=0018c238 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
DBGHELP: ..\content.dll - OK
DBGENG: ..\content.dll - Mapped image memory
*** WARNING: Unable to verify checksum for content.dll
DBGHELP: content - public symbols & lines
..\content.dll.pdb
content+0x29e040c:
129e040c c70000000000 mov dword ptr [eax],0 ds:002b:00000000=????????
0:000> k5
ChildEBP RetAddr
0018c238 129e68e3 content+0x29e040c
0018c814 129f2ac7 content+0x29e68e3
0018ca74 129e6f9c content+0x29f2ac7
0018cffc 129ef27e content+0x29e6f9c
0018d100 129bcc40 content+0x29ef27e
,
Mar 30 2017
Oh wait, shoot. I have is_win_fastlink=true set. I'll try again.
,
Mar 30 2017
Good news (maybe?), it works fine on Win7 as long as is_win_fastlink isn't set (i.e. with the packaged Win10 SDK cdb, built on Win10, but running on Win7 VM). But... maybe it's bad news because I don't think that should be getting set on the try run that failed per https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin_chromium_rel_ng%2F410592%2F%2B%2Frecipes%2Fsteps%2Fgenerate_build_files__with_patch_%2F0%2Fstdout .
,
Mar 30 2017
Maybe I'm confusing myself by using a local debug build. I'll try to mimic the exact settings where that bots failing instead.
,
Mar 30 2017
OK, so I can reproduce in a release build mimicking those settings. However, none of the cdb.exes from the Win7 SDK, the Win8.1 SDK, nor the Win10 SDK seem to work, they all print the wrong stack when run on Win7. Given that, I copied the dump from Win7 to my Win10 machine, and lo! it's "broken" there too. So now I think the fault actually lies in content_shell_crash_service (which is still breakpad, and so is using MiniDumpWriteDump() and so depends on the version of dbghelp.dll. i.e. it's at capture time, not at dump time where the problem is.
,
Mar 30 2017
Hum, I had hoped that when we started to user crashpad in chrome that contrnt_shell would be moved to the new system as well Apparently that hasn't happened :/
,
Mar 30 2017
No, it hasn't happened yet. I think this is independent of that the more I dig into it, but we definitely need to switch content_shell to using Crashpad and get rid of the last remnants of Breakpad on the client side.
,
Mar 30 2017
OK, I now think this is worse than it originally appeared, and might be affecting symbol_level=1 in general (nothing to do with Win7). I can repro on Win10 with the following steps: rmdir /s/q out\testy gn gen out\testy --args="dcheck_always_on=true ffmpeg_branding=\"Chrome\" is_component_build=false is_debug=false proprietary_codecs=true symbol_level=1 target_cpu=\"x86\"" ninja -C out\testy content_shell content_shell_crash_service copy_cdb_to_output python content\shell\tools\breakpad_integration_test.py --build-dir=out\testy --verbose --binary out\testy\content_shell.exe I did have successful runs before on Win10, but I think I confused myself by using accidentally using symbol_level=2, where maybe things are fine. I also tried simply running that content_shell.exe as windbg -o content_shell.exe and manually navigating to chrome://crash, and the stack is wrong there too.
,
Mar 30 2017
Bug 466890 : content_shell’s still on Breakpad. It was annoying to switch while we had GYP and GN builds going in parallel. That problem is behind us now, so it should be “easy.”
,
Mar 30 2017
Following on from #c11, changing (only) symbol_level=1 to symbol_level=2, the test works fine on both Win10 and Win7. So, yeah. (╯°□°)╯︵ ┻━┻
,
Mar 30 2017
would crashpad work here with symbol_level = 1 as well? Alternatively, I guess we should run the bots with symbol_level = 2. fwiw, gdb also can't use linux binaries that are compiled with symbol_level = 1. yay.
,
Mar 30 2017
Is the stack walking failing or the symbol lookup? I would think that stack walking on x86 would mostly be a matter of following the EBP chain, but the stack walking code does like to be "clever" and will sometimes outsmart itself. There is some metadata to help with stack walking that is stored in the PDB but one would hope that that would always be stored, instead of relying on symbol_level = 2. But, I guess if the metadata is generated by the compiler then it is conceivable that it might be omitted entirely and then not be available at link time in the symbol_level = 1 case. I think that symbol_level = 1 is probably not a well tested scenario at Microsoft. > I notice that despite saying ".lines" to cdb you're not getting any line number information. That is expected for symbol_level = 1 Anyway, just wanted to clarify whether it was symbol lookup that is failing or stack walking. They are completely separate steps. I doubt that crashpad versus breakpad is the problem. All that these tools need to do (for the purposes of stack walking) is record the contents of the stack (just a memcpy), module information, and registers. These are all-or-nothing operations - either you get the contents of the stack or not, and either you get the module information (and therefore PDBs load) or you don't. The stack walking is all done at cdb/windbg/dbghelp/VS time, by looking at the stack contents (and registers, and some metadata) to generate a call stack, and looking at PDBs to generate symbols.
,
Mar 30 2017
> > I notice that despite saying ".lines" to cdb you're not getting any line number information. > That is expected for symbol_level = 1 Oh, right. > Anyway, just wanted to clarify whether it was symbol lookup that is failing or stack walking The very top of the stack is wrong, so I don't think it has anything to do with stack walking. Just looking up eip in this case ought to be in CrashIntentionally(), but it's being reported as at content_shell!mojo::internal::InterfacePtrState<media::mojom::RemoterFactory>::ConfigureProxyIfNecessary+0x2ad. Yup, I agree Breakpad vs. Crashpad doesn't matter here. Between #c8 and #c11 I realized Win7 wasn't relevant, so dbghelp doesn't matter either. And I think both should be able to reliably grab EIP. :) I thought it might be related to inlining, but CrashIntentionally() is specifically marked NOINLINE. The top two entries (the incorrect ones) have quite large offsets from their entry point: 2:030> k3 # ChildEBP RetAddr 00 00b3dd44 034406d4 content_shell!mojo::internal::InterfacePtrState<media::mojom::RemoterFactory>::ConfigureProxyIfNecessary+0x2ad 01 00b3de10 034468ed content_shell!content::RenderFrameImpl::MaybeEnableMojoBindings+0x2d3 02 00b3defc 03440ad0 content_shell!content::RenderFrameImpl::PrepareRenderViewForNavigation+0x60 so I wonder if might have a problem similar to the one we had in dump_syms. Perhaps there's only a symbol emitted for the entry point to the function, but not for any secondary blocks of the function. Then if they get interleaved, the function is attributed. On the other hand `x *!*CrashIntentionally*` doesn't report anything, so maybe it's just ... busted, and dropping some symbols.
,
Mar 30 2017
Er, > Then if they get interleaved, the function is attributed. *mis-*attributed. Anyway, back to the problem at hand, I don't really know what to suggest for this bot now. symbol_level=2 is a non-starter because: => We can't use fastlink when it's being shuttled to swarming, because: => We can't use goma + symbol_level=2 without fastlink, because: => We can't have a win CQ/trybot that's not using goma because it'll collapse under load There might be some combination of noinline/extern/disable optimize that makes it sorta-work. I guess I'll play with that for a bit to see if I can find a workaround. Or we could split it out to a separate bot/test that builds less, but with symbol_level=2, but that seems like quite a pain, since we really want to test content_shell. Or, time to ship clang&lld I guess. :/
,
Mar 30 2017
yeah, the test is supposed to exercise how layout tests symbolize crashes, so ideally this would just work on all the bots :/
,
Mar 30 2017
Any thoughts on what has triggered this? It seems like it should have been happening all the time. Is it possible that this is an incremental linking bug, or triggered by some other linker setting? /opt:ref? /opt:icf? There really aren't that many that seem plausible. The most recent incremental linking bug was apparently associated with cfg (control-flow-guard), FWIW.
,
Mar 30 2017
Aha! It's that the function's in an anonymous namespace. Moving just CrashIntentionally() out to content (rather than content::`anon`), makes CrashIntentionally() correct, but MaybeHandleDebugURL() still incorrect. Moving both out, makes the stack as expected. That almost makes some kind of sense (i.e. I could see why they'd drop anon namespace functions from the symbols in this mode). So moving either just CrashIntentionally() out, or possibly MaybeHandleDebugURL() and its dependents, along with a comment about symbol_level=1 and this bug seems like a relatively happy resolution.
,
Mar 30 2017
Not sure about what caused it. I just tried turning off /OPT:REF, /OPT:ICF, and /guard:cf but didn't to make any difference. It's plausible we just weren't exercising any crash tests on symbol_level=1 bots before Jochen tried to land this CL. :(
,
Mar 30 2017
Okay, yeah, that sort of makes sense. Static functions and anonymous namespace functions are invisible for linking purposes, and symbol_level == 1 says to not generate symbols at compile time, which means that all the linker has to work with is the names that are exported by the compiler, which is only the ones needed for linking purposes. That explains why it only happens here rather than everywhere, although it's still surprising that we haven't hit this before. This should probably be documented somewhere - maybe in the .gn file that sets symbol_level = 1.
,
Mar 30 2017
I'm going to close this and blithely pretend all is well with the world and I didn't get go down a bunch of ratholes to get to this point. :)
,
Mar 31 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2511b73c11a55d8bb8a4c730d0cb308fbc821780 commit 2511b73c11a55d8bb8a4c730d0cb308fbc821780 Author: jochen <jochen@chromium.org> Date: Fri Mar 31 18:24:15 2017 Enable content shell crash integration test on Windows BUG=688737, 706744 R=dpranke@chromium.org Review-Url: https://codereview.chromium.org/2782603002 Cr-Commit-Position: refs/heads/master@{#461175} [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/build/vs_toolchain.py [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/build/win/BUILD.gn [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/content/renderer/render_frame_impl.cc [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/content/shell/BUILD.gn [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/content/shell/tools/breakpad_integration_test.py [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/testing/buildbot/chromium.win.json [modify] https://crrev.com/2511b73c11a55d8bb8a4c730d0cb308fbc821780/tools/perf/chrome_telemetry_build/BUILD.gn
,
Mar 31 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/de6de9d3f907de08bce627cb096586c65eabc89a commit de6de9d3f907de08bce627cb096586c65eabc89a Author: brucedawson <brucedawson@chromium.org> Date: Fri Mar 31 20:49:58 2017 Clarify the implications of symbol_level = 1 It is obvious in hindsight that symbol_level = 1 means that functions with internal linkage won't show up in back traces. R=scottmg@chromium.org BUG= 706744 Review-Url: https://codereview.chromium.org/2785263003 Cr-Commit-Position: refs/heads/master@{#461218} [modify] https://crrev.com/de6de9d3f907de08bce627cb096586c65eabc89a/build/config/compiler/compiler.gni |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by scottmg@chromium.org
, Mar 30 2017