New issue
Advanced search Search tips

Issue 790935 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

shutdown DCHECKs in interactive_ui_tests' renderers are swallowed

Project Member Reported by grt@chromium.org, Dec 1 2017

Issue description

Suddenly as of r520525 or thereabouts (https://ci.chromium.org/buildbot/chromium.win/Win7%20Tests%20(dbg)(1)/64917), various interactive_ui_tests have started timing out as a result of something throwing up an empty MessageBox with the descriptive title "Error"; see attached screenshot. A typical failure can be found at https://chromium-swarm.appspot.com/task?id=3a261778b09db410&refresh=10&show_raw=1, though the log doesn't provide any useful information.

"Error" is the default caption for a call to ::MessageBox, so I think what we have is a call to MessageBox(nullptr, nullptr, nullptr, 0); (0 == MB_OK).


 
ss_20171130212839_517.png
55.7 KB View Download

Comment 1 by grt@chromium.org, Dec 1 2017

Owner: grt@chromium.org
Status: Started (was: Available)
I believe that the message box is shown by RaiseHardErrorMsg (https://cs.chromium.org/search/?q=RaiseHardErrorMsg&sq=package:chromium&type=cs) since it is shown with MB_SERVICE_NOTIFICATION (csrss is the owning process). If I'm correct, it should not be shown on the bots since they should be running all tests in headless mode. This seems to have regressed in swarming; see issue 790985.

While I have a hunch as to why the dialog is being shown, I don't know why the underlying error is happening, nor if we would get any diagnostics should CHROME_HEADLESS=1 be set. I am investigating.

Comment 2 by grt@chromium.org, Dec 1 2017

Cc: sky@chromium.org wfh@chromium.org
Labels: -Pri-1 -Type-Bug-Regression Pri-2 Type-Bug
Summary: DCHECKs in renderers are very difficult to diagnose in interactive_ui_tests (was: Spurious "Error" MessageBox leading to interactive_ui_tests timeouts on Windows Debug)
+sky@ (general guru) and wfh@ (sandbox guru)

I was mistaken. It's a DCHECK being hit in a sandboxed renderer (specifically, WebServiceWorkerRegistrationImpl::OnConnectionError; see  issue 791041 ). The details of the failure are somehow eaten by the sandbox. I was only able to see what was happening if I repro'd locally with:

out\Deb\interactive_ui_tests.exe --gtest_filter=PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection --single_process  --no-sandbox

(over, and over, and over again)

It took a long, long time for me put the pieces together to get to this point. Did I simply miss a PSA?

Since I just burned a full day on this one failure, I wonder how we can make this sort of thing on the bots more actionable in the future. Yes, I learned a lot in the process (CHROME_HEADLESS regressed, --no-sandbox is needed to see failures in renderers, ETW can be magical at times), but I don't plan to be the eternal guardian of interactive_ui_tests. Thoughts?

(dropping to P2 since https://chromium-review.googlesource.com/c/chromium/src/+/803516 should fix the tests)
Cc: jam@chromium.org
The DCHECK message box behavior on bots bit someone else recently and wasted at least a day or two of their time as well. This affects other multiprocess content test suites, not just interactive_ui_tests.

It would be nice if there were a way for us to plumb sandboxed process logs to the console and disable the message box behavior, at least behind a flag so the bots could use it.

Another option might be to capture a screenshot of the bot on test failure and include it as part of the logs, but a) that's probably harder and b) it's probably a huge waste of space most of the time.

Comment 4 by sky@chromium.org, Dec 1 2017

grt did add logic to take a screenshot in certain cases (always on top window present before/after test) and I would expect that logic to trigger here.

jam knows more about crashes in renderers. I do agree that for tests we would really like the logs of the renderers to get dumped or at least to see some mention that the renderer is crashing unexpectedly and ideally a stack.

Comment 5 by grt@chromium.org, Dec 1 2017

Once issue 790985 is fixed, we should no longer have message boxes appearing for DCHECK failures on the bots. With that out of the way, I'm not sure what would happen to the details in this case. I'll try it out.

Comment 6 by jam@chromium.org, Dec 1 2017

re crashes in renderers: we do print the stack trace in browser tests when that happens. We have a test for this :) ContentBrowserTest.RendererCrashCallStack

Comment 7 by grt@chromium.org, Dec 4 2017

For this particular DCHECK (in WebServiceWorkerRegistrationImpl::OnConnectionError; see  issue 791041 ), I find that the failure does appear in the log with CHROME_HEADLESS=1, but it doesn't fail the test. Methodology (in case someone wants to repeat my findings):

- sync to position r520889 (or anywhere between r520479 and r521008)
- build debug interactive_ui_tests
- set CHROME_HEADLESS=1
- out\Deb\interactive_ui_tests.exe --gtest_filter=PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection --gtest_repeat=10

the output that streams by sometimes has:
[17120:13220:1204/120105.054:FATAL:web_service_worker_registration_impl.cc(225)] Check failed: LifecycleState::kDetached == state_ (2 vs. 1)
Backtrace:
	base::debug::StackTrace::StackTrace [0x101FE856+102]
	base::debug::StackTrace::StackTrace [0x101FD3F3+35]
	logging::LogMessage::~LogMessage [0x10273F55+149]
	content::WebServiceWorkerRegistrationImpl::OnConnectionError [0x15F90B49+457]
....

but the test never fails. When I add in "--brave-new-test-launcher --test-launcher-bot-mode" (as is used on the bots), this output is swallowed:

out\Deb\interactive_ui_tests.exe --cfi-diag=0 --snapshot-output-dir=c:\src\chromium\src\out\Deb  --gtest_filter=PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection --gtest_repeat=10 --brave-new-test-launcher --test-launcher-bot-mode
IMPORTANT DEBUGGING NOTE: each test is run inside its own process.
For debugging a test inside a debugger, use the
--gtest_filter=<your_test_name> flag along with either
--single_process (to run the test in one launcher/browser process) or
--single-process (to do the above, and also run Chrome in single-process mode).
Using sharding settings from environment. This is shard 0/1
Using 1 parallel jobs.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4553 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4496 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4653 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4511 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4549 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4547 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4440 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4633 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4688 ms)
SUCCESS: all tests passed.
[1/1] PlatformNotificationServiceBrowserTest.WebNotificationOptionsReflection (4544 ms)
SUCCESS: all tests passed.
Summary of all test iterations:
End of the summary.

How can we have late-process DCHECK failures in subprocesses lead to overall test failure?
I added a DCHECK(false) to the start of this test and I found that I got call stacks and "1 test crashed" both with and without "--brave-new-test-launcher --test-launcher-bot-mode".

However --test-launcher-bot-mode apparently tells the test runner to retry the tests if it fails. So if you have a flaky test then this will cause it to be reported as passing. Could our controversial test retry logic be the only explanation needed for why the test never shows up as failing?

Comment 9 by grt@chromium.org, Dec 6 2017

In your case, the CHECK is in the main test process -- the one running the test -- and will naturally be handled properly. In the WebServiceWorkerRegistrationImpl case, however, it's in a sandboxed renderer. To replicate, put a DCHECK(false) somewhere in a renderer codepath. Maybe during renderer startup the error would be handled (since the test wouldn't be able to do anything at all). For the failure I was looking at, I think the DCHECK was hit during shutdown in a racy way such that the test passed no matter what. 
Labels: Hotlist-DesktopUIToolingRequired Hotlist-DesktopUIChecked
***UI Mass Triage ***

Just checking whether there is any pending work here? If not please feel free to close this issue.
Owner: ----
Status: Available (was: Started)
I don't know that anything has changed to resolve the issue described in comments 7 and 9. I don't have cycles to work on this any more. If the status quo is good enough, this can be closed out. I suspect that we're missing seeing certain failures, though.
Summary: shutdown DCHECKs in interactive_ui_tests' renderers are swallowed (was: DCHECKs in renderers are very difficult to diagnose in interactive_ui_tests)
(updating the summary a bit)

Sign in to add a comment