Set Next Statement not working correctly with clang
Project Member Reported by email@example.com, Dec 11
I've recently noticed that "Set Next Statement" context menu in Visual Studio isn't working well with clang. Often after I move it there'll be a crash in an implicit constructor call. As a simple repro, run debug browser_tests.exe --single_process --gtest_filter=CaptivePortalBrowserTest.Login and then in this code: "SetUpCaptivePortalService(browser()->profile(), GURL(kMockCaptivePortalTestUrl));" add a breakpoint. When it gets hit, you can move the statement somewhere else (like repeat the EXPECT_EQ call) and then move it again to the SetUpCaptivePortalService call. It'll crash in the std::string constructor. This is interfering with my debugging, so marking as P1.
TL;DR - if clang adjusts what instruction its line records point to be two instructions then this will work. The relevant source code is here: // Double-check that the captive portal service isn't enabled by default for // browser tests. EXPECT_EQ(CaptivePortalService::DISABLED_FOR_TESTING, CaptivePortalService::get_state_for_testing()); CaptivePortalService::set_state_for_testing( CaptivePortalService::SKIP_OS_CHECK_FOR_TESTING); EnableCaptivePortalDetection(browser()->profile(), true); // Set the captive portal service to use URLRequestMockCaptivePortalJob's // mock URL, by default. SetUpCaptivePortalService(browser()->profile(), GURL(kMockCaptivePortalTestUrl)); The machine code around the SetUpCaptivePortalService call is here: 0285668C E8 3F 02 00 00 call CaptivePortalBrowserTest::EnableCaptivePortalDetection (028568D0h) 02856691 8D 4E 5C lea ecx,[esi+5Ch] 02856694 8D 05 5B FC 5C 00 lea eax,[string "http://mock.captive.portal.test/"... (05CFC5Bh)] 0285669A 83 EC 04 sub esp,4 0285669D 89 04 24 mov dword ptr [esp],eax 028566A0 E8 7B 72 61 FE call base::BasicStringPiece<...>::BasicStringPiece (0E6D920h) 028566A5 8D 8E 94 00 00 00 lea ecx,[esi+94h] 028566AB 8B 56 5C mov edx,dword ptr [esi+5Ch] 028566AE 8B 7E 60 mov edi,dword ptr [esi+60h] 028566B1 83 EC 08 sub esp,8 028566B4 89 14 24 mov dword ptr [esp],edx 028566B7 89 7C 24 04 mov dword ptr [esp+4],edi 028566BB 89 46 10 mov dword ptr [esi+10h],eax 028566BE FF 15 A8 28 DF 0A call dword ptr [__imp_GURL::GURL (0ADF28A8h)] 028566C4 8B 4E 40 mov ecx,dword ptr [esi+40h] 028566C7 89 46 0C mov dword ptr [esi+0Ch],eax 028566CA E8 91 73 61 FE call InProcessBrowserTest::browser (0E6DA60h) 028566CF 89 C1 mov ecx,eax 028566D1 E8 AA 73 61 FE call Browser::profile (0E6DA80h) 028566D6 8D 8E 94 00 00 00 lea ecx,[esi+94h] 028566DC 83 EC 08 sub esp,8 028566DF 8B 56 40 mov edx,dword ptr [esi+40h] 028566E2 89 4E 08 mov dword ptr [esi+8],ecx 028566E5 89 D1 mov ecx,edx 028566E7 89 04 24 mov dword ptr [esp],eax 028566EA 8B 46 08 mov eax,dword ptr [esi+8] 028566ED 89 44 24 04 mov dword ptr [esp+4],eax 028566F1 E8 6A 02 00 00 call CaptivePortalBrowserTest::SetUpCaptivePortalService (02856960h) Crucially, if you set a breakpoint on the call to SetUpCaptivePortalService or if you Set Next Statement to that line of code then you get dropped onto 0285669A. That means that the stack gets adjusted properly (which is good) but it means that ECX and EAX are not set up. Those are set up starting on 02856691 and if you Set Next Statement from EXPECT_EQ to SetUpCaptivePortalService then they have the wrong values. They are supposed to be: EAX = 005CFC5B ECX = 0019D89C but they are actually: EAX = 00000000 ECX = 0019D8B4 EAX and ECX are supposed to hold the address of the string used for the constructor's input and the address of the object. These seem like important things. The bad value of EAX means that the string will be treated as a NULL string instead of "http://mock.captive.portal.test/" and the incorrect value for ECX turns into a stack tromp - stack corruption - which can only end in tears. It looks like the problem would be avoided if the debug information for that line was offset by two instructions. Debugging techniques like this make me nervous but it is awesome if they work in VS and we should try to support them as well as VS does. My gn args are below. The top four are probably important (fastlink was causing me problems so I used lld) and the bottom six are less so, but will help with making the builds fast. use_lld = true is_debug = true is_component_build = true target_cpu = "x86" enable_nacl = false remove_webcore_debug_symbols=true use_jumbo_build = true goma_dir = "C:\src\goma\goma-win64" use_goma = true symbol_level = 2
You should never do this in real life, but it’s actually possible to hex edit your PDB with the correct value to confirm this fixes it. I don’t actually expect you to do this, but I’m just putting it here in case you’re like me and like doing things you’re not supposed to do. The trick is to use llvm-pdbutil with the bytes subcommand which has some options for dumping line tables (among other things) but showing you the offset in the pdb where those bytes are. Since you know the cpp file and the address of the current instruction it breaks on, you just have to filter by that object file (use `llvm-pdbutil dump -modules | grep foo.obj` to find the index) and then search for the address in the binary dump. That said, this is probably already enough info to investigate. Thanks!
Ping: any update on an eta for a fix? This is interfering with my debugging. Thanks.
rnk is out until next Tue. zturner, do you know if anyone in lexan land looked at this or is in a position to look this week? Or should we wait for rnk's return?
Nobody has looked at this yet. I can try to look, but I’m also trying to figure out incremental pdb linking, which is also pretty important (and probably something nobody else can do). inglorion@ could also maybe look at this
I filed https://bugs.llvm.org/show_bug.cgi?id=35975 as an llvm-side tracking bug.
I started work on the LLVM bug and the details are at https://bugs.llvm.org/show_bug.cgi?id=35975. The high-level summary is that we may be able to fix this in most cases, but I suspect there will always be corner cases where set next statement fails because LLVM -O0 codegen doesn't have any protections to prevent it from using registers across statement boundaries. Fixing the source location for this particular code pattern is doable, but it will take a bit of effort and it's not clear if it is top priority. I aim to come back to this next week.
Hello, any update? I'm hitting this often, thank you.
No, last week got eaten by a sinus infection and two ski trips on each end. This week there's a CFI summit that I need to be at. I'll try to shop this around the team.
This has been open for 2 months and IMO is a major regression from switch to clang. I hope this can be fixed asap.
I'm not sure "set next statement" is something clang can reliably support. There is nothing ensuring that registers are not allocated across statement boundaries at -O0. Even fixing this particular issue basically requires rewriting a large portion of the -O0 fastisel code generation strategy. The "local value area" concept is pretty deeply embedded in fastisel. I prototyped something yesterday that passes most tests, but I discovered some design flaws in it that might sink it. Hard to say yet.
I'm not familiar with the clang internals, but given that it's functionality that we had and is very useful when debugging, I would like us to do whatever it takes to undo the the regression.
I came up with a patch last Thursday that improves the situation: https://reviews.llvm.org/D43093 However, I expect it to take longer than normal to get into Chromium. The patch is small, but it makes invasive changes to fastisel. I had to do a lot of iteration to get it to pass the test suite, and I suspect more issues will arise when we start testing it on larger codebases. I also haven't measured if it has any compile time impact yet. The purpose of fastisel is to be fast, so if there are significant regressions, I think I'll have to try another approach.
The clang roll landed, so this should work now. Please reopen if it doesn't.
Sign in to add a comment