New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 805409 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

"content_shell_crash_test (with patch)" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Jan 24 2018

Issue description

"content_shell_crash_test (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiVjb250ZW50X3NoZWxsX2NyYXNoX3Rlc3QgKHdpdGggcGF0Y2gpDA.

The chromium-try-flakes app is able to file bugs for individual tests when the test launcher is uploading results to the Test Results Server. If recent flakes above are caused by failing tests and you would like to have them filed as invidual bugs, please read more at https://goo.gl/QJKXV4.

This flaky test/step was previously tracked in issue 798579.
 
Cc: rhalavati@chromium.org
As far as I checked:
 - For Linux failures are all because of not enough shard capacity and expiration.
 - For Windows7 cases, it's all the scenario that the test intentionally has crashed, but a reference to CrashIntentionally is not found in stack:
https://chromium-swarm.appspot.com/task?id=3b3f6d36097ac710&refresh=10&show_raw=1
Cc: a...@chromium.org
Avi, 

Do you have an idea why this is, or who has?
Status: Available (was: Untriaged)

Comment 4 by a...@chromium.org, Jan 24 2018

Cc: thakis@chromium.org
+ thakis as compiler peep

We're crashing intentionally, via a call to CrashIntentionally(), which is marked NOINLINE. We need to not inline is so that "CrashIntentionally" shows up in the stack walk. Yet it's not.

Has anything changed compiler-wise?

Comment 5 by thakis@chromium.org, Jan 24 2018

iirc there's an (old) crbug somewhere that NOINLINE isn't always honored. Is CrashIntentionally() in a separate CL as the caller?

(i'm not aware of any recent changes in any case)

Comment 6 by a...@chromium.org, Jan 24 2018

"Is CrashIntentionally() in a separate CL as the caller?"

CL? Compilation unit? No, both are in content/renderer/render_frame_impl.cc in an anonymous namespace, as they're implementation details of a RenderFrameImpl member function.

Comment 7 by thakis@chromium.org, Jan 24 2018

Different TU, sorry. Maybe try moving it to a separate TU if you're in a rush.

Maybe the increase in linux flakes made us notice the Windows problem for the first time? If so, and we're not in a rush, file a dedicated bug for the Windows thing and we (compiler folks) will try to look at it.

Comment 8 by a...@chromium.org, Jan 24 2018

rhalavati, is this new in Windows, or has this changed recently?

Comment 9 by kbr@chromium.org, Jan 25 2018

Components: Build
There's a deliberate attempt to avoid inlining of this function using the NOINLINE directive:

https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?q=CrashIntentionally&sq=package:chromium&l=823

This is a new regression and it's flaking all over the place on win7_chromium_rel_ng:

https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/?limit=200

specifically, for example:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88066
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88064
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88063
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88057
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88054
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88053
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88052
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88048
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88046

I don't know why it's not failing on the associated waterfall bots:

https://ci.chromium.org/buildbot/chromium.win/Win7%20Tests%20%28dbg%29%281%29/?limit=200
https://ci.chromium.org/buildbot/chromium.win/Win10%20Tests%20x64/?limit=200

This is urgent; it's dramatically affecting the CQ. Did Clang roll recently?

Comment 10 by kbr@chromium.org, Jan 25 2018

Cc: jochen@chromium.org

Comment 11 by kbr@chromium.org, Jan 25 2018

Owner: kbr@chromium.org
Status: Assigned (was: Available)
I'll try moving these functions to a different translation unit.

Re comment 9: No, the last clang roll was 3 weeks ago (cf https://chromium.googlesource.com/chromium/src/+log/master/tools/clang/scripts/update.py). If it started flaking recently, something else changed and my memory about NOINLINE being ignored is probably incorrect.

Comment 13 by kbr@chromium.org, Jan 25 2018

Cc: st...@chromium.org
Hard to tell. Probing back 1600 builds on win7_chromium_rel_ng still shows evidence of it:
https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/86607

but that's only as far back as yesterday. Chromium developers send a *lot* of tryjobs. +stgao as FYI

chromium-try-flakes isn't discovering all of these flakes:
https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiVjb250ZW50X3NoZWxsX2NyYXNoX3Rlc3QgKHdpdGggcGF0Y2gpDA

maybe because it usually fails both with and without the patch.

Comment 14 by kbr@chromium.org, Jan 25 2018

I also can't find this test on test-results.appspot.com.

Re comment 8: I just saw it on Win7.
Cc: seanmccullough@chromium.org
seanmccullough@ - is it possible to make this builder step upload results to test-results.appspot.com?
Labels: -Sheriff-Chromium

Comment 18 by st...@chromium.org, Jan 25 2018

chromium-try-flakes relies on test-results.appspot.com, so it won't be able to detect such flakes if results are not uploaded there.

Comment 19 by kbr@chromium.org, Jan 25 2018

It's possible I made a mistake in my query of test-results.appspot.com. Can anyone else find the results being uploaded there? chromium-try-flakes is seeing at least some of the flakes.

https://chromium-review.googlesource.com/885901 is inbound attempting to work around this bug. The CQ seemed to be disabled last night.

Project Member

Comment 20 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/538bdbcc1a8c67077ca10c8d68fb6efff7d40958

commit 538bdbcc1a8c67077ca10c8d68fb6efff7d40958
Author: Ken Russell <kbr@chromium.org>
Date: Thu Jan 25 18:13:07 2018

Move crash helper functions to another file.

This should help prevent the compiler from attempting to inline them.

Bug:  805409 
Change-Id: I282c12f09048fef3cb24b8c1b1d14c1a0ca705f0
Reviewed-on: https://chromium-review.googlesource.com/885901
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Jochen Eisinger <jochen@chromium.org>
Reviewed-by: Avi Drissman <avi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#531945}
[modify] https://crrev.com/538bdbcc1a8c67077ca10c8d68fb6efff7d40958/content/renderer/BUILD.gn
[add] https://crrev.com/538bdbcc1a8c67077ca10c8d68fb6efff7d40958/content/renderer/crash_helpers.cc
[add] https://crrev.com/538bdbcc1a8c67077ca10c8d68fb6efff7d40958/content/renderer/crash_helpers.h
[modify] https://crrev.com/538bdbcc1a8c67077ca10c8d68fb6efff7d40958/content/renderer/render_frame_impl.cc

Comment 21 by kbr@chromium.org, Jan 26 2018

Cc: zturner@chromium.org scottmg@chromium.org wfh@chromium.org
Unfortunately that didn't fix the problem. Here's a tryjob that picked up the Chromium revision in which that CL landed:

https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/88769

It still failed content_shell_crash_test here:

https://chromium-swarm.appspot.com/task?id=3b481015cb228310&refresh=10&show_raw=1

There have been some internal email threads about Windows debug information and some work still to be done. Could this failure be related? Can someone download the above isolate and see whether that's likely the issue?

I'm going to disable this test on Windows for the moment; it's affecting the CQ too badly.

Re comment 21: That thread was about tail call optimization. If we were to tailcall into the NOINLINE function, we'd miss the function below the NOINLINE function from the stack, not the NOINLINE function itself, right? CrashIntentionally() doesn't call anything itself.
I'll see if I can repro.
It works OK locally on Win10.

At first I was thinking that it might be because the pdb isn't getting onto the bot. That might be a problem; I don't see anything obvious in https://cs.chromium.org/chromium/src/content/shell/BUILD.gn where the pdb gets packaged.

But looking at a local run compared with the output in https://chromium-swarm.appspot.com/task?id=3b43a6dda2d9b510&refresh=10&show_raw=1 it looks to me like cdb isn't even running. There should at a minimum be some logo header, etc. from cdb when it starts running.

I haven't been able to find a swarming run where it did work (does anyone know of one?).

Looking at https://cs.chromium.org/chromium/src/testing/buildbot/chromium.win.json I'm not confident that this test is even run on either of the waterfall bots linked to in #c9.
Ah, it is run on Win 7 Tests 1 though

https://ci.chromium.org/buildbot/chromium.win/Win7%20Tests%20%281%29/76166

and that run includes

https://chromium-swarm.appspot.com/task?id=3b3ebfc8fc3d9610&refresh=10&show_raw=1

which shows it working, but without output from cdb. So I guess some of stdout or stderr are dropped on swarming.
The swarming output doesn't include the .dmp unfortunately, so can't check that vs. reality.
Aha, but on a Win7 VM, reproing the failing swarming run fails:

python swarming.py reproduce -S chromium-swarm.appspot.com 3b43a6dda2d9b510

whereas the successful one from the bot passes:

python swarming.py reproduce -S chromium-swarm.appspot.com 3b3ebfc8fc3d9610

Progress! Maybe.
Running python swarming.py reproduce -S chromium-swarm.appspot.com 3b43a6dda2d9b510 directly quietly fails.

But running the isolate script test that I believe it runs direct causes a cdb crash dialog. I'm not 100% certain that this is the problem as I can't see how/where the script would be suppressing that crash dialog, but that might be the problem.
cdbcrash.png
136 KB View Download
Two more pieces of information in support of it just being a cdb crash:
- Loading that dump manually in windbg displays the right stack
- Running the manual command line per the screenshot above with the same retrieved swarming inputs works as expected on Win10.

The crash looks like it's in kernelbase.dll, so I think this is likely that the cdb that it's trying to run is just failing on the trybot machines.

But I don't know exactly what we should do to fix it. From #c25, the successful slave (vm73-m4) superficially looks very similar to the failing one (vm1372-m4). I'm not sure how those machines are imaged/provisioned/updated though, so it's possible one is a slightly different patchlevel than the other.

Comment 31 by wfh@chromium.org, Jan 26 2018

are the trybots running the latest updates for windows?
I take it back somewhat, I think that's semi-reasonable as those are redist.

Fiddling around a little more, it seems that an older cdb (10.0.14321.1024) runs, whereas the one we're using now (10.0.15063.468) doesn't on this Win7. So I'd guess an compiler update (not clang, but the Microsoft part) probably is when this broke. (?)
My suspect for when this broke would have been way back at:

https://bugs.chromium.org/p/chromium/issues/detail?id=683729#c158

(noting that that was flipping on a toolchain packaged at https://bugs.chromium.org/p/chromium/issues/detail?id=683729#c146 which mentions debuggers from 15063.468)

But looking at the CL where it landed: https://chromium-review.googlesource.com/c/chromium/src/+/665637 content_shell_crash_test ran https://ci.chromium.org/buildbot/tryserver.chromium.win/win_chromium_rel_ng/534228 and allegedly worked https://chromium-swarm.appspot.com/task?id=38a07ef64ecf8e10&refresh=10&show_raw=1

It ran on vm19-m4.

I wonder if the "low numbered" vmNN-m4 (73, 19) that were successful are somehow in a different image bucket than the "high numbered " vm1372-m4 that failed?
I found the failing slaves that kbr linked to above:
vm1372-m4
vm1320-m4
vm170-m4
vm1292-m4
vm311-m4
vm88-m4
vm1411-m4
vm169-m4
vm15-m4

So the numbering seems irrelevant (as you might expect). I'm going to try to figure out if win 7 tests 1 shares bots with win7_chromium_rel_ng or not.

Comment 35 by kbr@chromium.org, Jan 26 2018

scottmg@: yes, sorry about https://cs.chromium.org/chromium/src/build/win/BUILD.gn?type=cs&q=cdb.exe&sq=package:chromium&l=24 . I didn't know how else to express the redistributable dependencies of cdb. If there's a better way I'm happy to fix it.

Do you think we're missing a DLL dependency? If the cdb from the newer Windows toolchain is zipped up manually and run on one of the older Win7 VMs, does it run?

Comment 36 by kbr@chromium.org, Jan 26 2018

Cc: kbr@chromium.org
Owner: scottmg@chromium.org
Also, can I ask you to take this given that it looks like a Windows toolchain problem? Thanks in advance.

Comment 37 by kbr@chromium.org, Jan 26 2018

win7_chromium_rel_ng almost surely shares bots with "Win7 Tests (1)". These tests are Swarmed, and they surely use the same Swarming dimensions.

(I don't work on Windows any more, and I'm in a different CC now, but I'm nerd sniped at this point, so yeah :)

Nah, I don't have a better solution, other than offloading to ops to have a working cdb on the machine.

If I install windbg/cdb via the Windows SDK 10.0.15063.468, cdb runs fine from that install.

So either we're missing a dll, or it updates something in Windows during install...
Oh wait! I got confused amongst all the directories and VMs...

I believe the correct versions what work/don't are:
- 10.0.15063.468 works
- 10.0.16299.15 doesn't.

I don't know what 16299.15 is or where it came from. The normal download at https://developer.microsoft.com/en-us/windows/hardware/download-windbg gives me the 15063.468 sdk. Did we package the Store Preview version for some reason? It wouldn't surprise me that that doesn't work on 7.
Oh ffs Microsoft, that page doesn't actually download the latest sdk.

Apparently https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/.
Installing Debugging Tools for Windows from 10.0.16299.91 (which is what the link above gets me today) on the Win7 vm installs fine, and runs OK. The packaged version however, continues to not run (that's 10.0.16299.15). I'm not sure if it's the difference in version, or that it updated Windows to make the .91 one work.

...

Reverting the VM to pre-install, and copying the 16299.91 cdb on to it (not installing it) runs fine. So it seems plausible that .15 is just broken on Win7.
Got it. We're "just" missing good ol' api-ms-win-eventing-provider-l1-1-0.dll.

Of course.

(The fact that we use an x86 cdb instead of x64 made things extra confusing, but anyway... The upside is that it shouldn't need a toolchain repackage to fix, which is good because I believe the magic toolchain repackager (Bruce) is still on vacation.)

Comment 43 by a...@chromium.org, Jan 26 2018

BTW, if it's that, can we revert 538bdbcc1a8c67077ca10c8d68fb6efff7d40958?
Status: Started (was: Assigned)
avi: Yeah, I think we should be able to. Let me land a fix first https://crrev.com/c/889701 and confirm things are OK first, and then we can revert that separately.
Cc: brucedaw...@chromium.org
kbr: Just to confirm, you didn't end up disabling anything per #c21, right? I see the test included in "analyze" output in https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/89655 but just wanted to confirm it wasn't disabled in only some specific configs.

+bruce fyi. No packaging problem after all, but cdb-on-win7 is a thing I could imagine causing problems on a future toolchain update, so just so you're aware of this bug.

Comment 46 by kbr@chromium.org, Jan 26 2018

scottmg: correct, nothing was disabled yet. I attempted to do so in https://chromium-review.googlesource.com/888166 but it got kicked out of the CQ due to the infra outage. I think may still be worth putting the crash functions in a separate file.

Thanks very much for helping even though this isn't your job any more.

Project Member

Comment 47 by bugdroid1@chromium.org, Jan 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b91f674a20ce246aa646bb7ecabd217d113c5611

commit b91f674a20ce246aa646bb7ecabd217d113c5611
Author: Scott Graham <scottmg@chromium.org>
Date: Fri Jan 26 23:53:13 2018

Add missing cdb dll, causing failures on win7 crash integration tests

(Link to working run on Win7 bot:
https://chromium-swarm.appspot.com/task?id=3b4d2229fcb0aa10&refresh=10&show_raw=1)

Bug:  805409 ,  804452 
Change-Id: Ieea7271856276162483e9a4240452701a1e062e7
Reviewed-on: https://chromium-review.googlesource.com/889701
Commit-Queue: Scott Graham <scottmg@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#532094}
[modify] https://crrev.com/b91f674a20ce246aa646bb7ecabd217d113c5611/build/win/BUILD.gn
[modify] https://crrev.com/b91f674a20ce246aa646bb7ecabd217d113c5611/build/win/copy_cdb_to_output.py

Status: Fixed (was: Started)

Sign in to add a comment