New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 916461 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug



Sign in to add a comment

DCHECK hit in FieldTrialList::CreateTrialsFromCommandLine

Project Member Reported by grt@chromium.org, Dec 19

Issue description

The mini_installer_tests target starts and stops Chrome fairly quickly. It is tripping over this DCHECK on line 828 of base/metrics/field_trial.cc in utility procs:

    bool result = CreateTrialsFromSwitchValue(switch_value);
    UMA_HISTOGRAM_BOOLEAN("ChildProcess.FieldTrials.CreateFromShmemSuccess",
                          result);
    DCHECK(result);

If I were to guess, I'd wager that creation fails because the parent proc is or has been torn down. Looks like this DCHECK was added in r431974 by lawrencewu@.

asvitkine@: I'll send up a CL to remove this DCHECK since I don't think it's helping -- we have the histogram above to tell us that it is happening in the field. Feel free to take this over if you think something else needs to be done. Thanks.
 
"If I were to guess, I'd wager that creation fails because the parent proc is or has been torn down."

That seems surprising to me. I'd imagine given we share the handle explicitly with child processes, it should be refcounted internally by Windows and not destroyed if the parent goes away. I'd be interested to learn more...

Can you clarify in which context this is happening? What kind of child process is this and what's launching it? Is it actually Chrome itself?

The histogram shows that the failure case happens very infrequently:
0.00000160422 of the cases

https://uma.googleplex.com/p/chrome/histograms?endDate=20181217&dayCount=7&histograms=ChildProcess.FieldTrials.CreateFromShmemSuccess&fixupData=true&showMax=true&filters=platform%2Ceq%2CW%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

So if we have a reliable repro, I think it would be good to understand the cause/fix it rather than just disabling the DCHECK.

Hi Alexei. The problem is in DeserializeSharedMemoryHandleMetadata here: https://cs.chromium.org/chromium/src/base/metrics/field_trial.cc?type=cs&q=file:field_trial%5C.cc+duplicatehandle&sq=package:chromium. The handle can't be duplicated if the parent process is on its way out the door. I binary-searched my way to this:

[30684:26116:1220/110901.651:FATAL:field_trial.cc(1242)] Check failed: DuplicateHandle(parent_handle, handle, GetCurrentProcess(), &handle, 0, FALSE, DUPLICATE_SAME_ACCESS). : Access is denied. (0x5)

Here's an example command line from a crashing child process:

"C:\Program Files (x86)\Chromium\Application\chrome.exe" --type=utility --field-trial-handle=1760,6282298966140217550,13955979057982902029,131072 --lang=en-US --no-sandbox --enable-logging --ignore-certificate-errors --log-level=0 --v=1 --ignore-certificate-errors --user-data-dir="c:\users\grt\appdata\local\temp\tmpmdrht0" --enable-logging --log-level=0 --v=1 --service-request-channel-token=14650047044558677779 --mojo-platform-channel-handle=5364 /prefetch:8

It is created by the browser process. Based on ProcMon logs, it appears that my hunch was correct, and that the browser is in the midst of process termination when the DCHECK is hit.

Here are steps to repro this locally:

1) build mini_installer_tests from a normal checkout the same way it's being built on the bots:

tools\mb\mb.bat gen -b win10_chromium_x64_rel_ng -m tryserver.chromium.win out\try
autoninja -C out\try mini_installer_tests

2) run the tests from an elevated cmd prompt (warning: you'll need to uninstall any other Chromium build(s) on your machine, and will likely want to quit chrome.exe before running -- the test will terminate all chrome.exe processes):

cd out\try
python ..\..\chrome\test\mini_installer\test_installer.py --output-dir=..\..\..

3) load one of the crash dumps that gets deposited in your main src directory into your debugger of choice.

If you pick up https://chromium-review.googlesource.com/c/chromium/src/+/1386205, you'll see the child proc's log messages in the test output.
Thanks! If the parent process is on being destroyed, then is there any reason to have the child continue to run vs. just exiting? Can we turn the DCHECK into an exit() call?
Ultimately, the child should be destroyed by virtue of the job object owned by the parent going away. I don't know if we should assume that every failure to duplicate is because the parent is in teardown. Could you ask a content owner to weigh-in?
Cc: jam@chromium.org
"Ultimately, the child should be destroyed by virtue of the job object owned by the parent going away. I don't know if we should assume that every failure to duplicate is because the parent is in teardown."

If there are legitimate failures that are not in that category, then I'd like to know about them since continuing execution without trials would be a problem then. I agree that if the whole process hierarchy is being shut down, then I see less of a problem with that.

I also think that if this is indeed that case (parent process is shutting down), then it is better to exit the child in this specific point than in some arbitrary place down the line. This code runs very early during child start up, so if we exit at this point, it's basically equivalent to the child not doing any more work (not doing anything it was launched to do). Which seems reasonable if parent is going away.

If there are cases where this happens for other reasons than parent going away - then presumably the exit() would let us uncover this - i.e. via test failures or problem reports from users. So I'd rather we do that than just remove the DCHECK. Do all the tests you were trying pass fine if we do an exit there? (I don't have a working Windows machine right now so can't verify... need to order a new one.)

Also, do you know why we started seeing this recently? Did something change? Did we not run the tests on bots before? Did it start with a specific revision?

"Could you ask a content owner to weigh-in?"

Sure, +jam for his thoughts.


These tests haven't been running for all that long. There have been undiagnosed flakes for a while. I've just recently started trying to iron them out. (Apologize for brevity -- writing from phone)
Hey Alexei. This DCHECK is slowing down Chromium development for no benefit while we discuss the proper way to deal with this situation. Would you please send the https://chromium-review.googlesource.com/c/chromium/src/+/1384025 to the CQ? The sooner we silence  issue 916389 , the better.

Re: "...continuing execution without trials would be a problem...". There is nothing in the production code to prevent execution without trials. If there should be, its introduction should be independent of this particular DCHECK.

Each line of this code from FieldTrialList::DeserializeSharedMemoryHandleMetadata could fail for one reason or another:

    base::ProcessId parent_pid = base::GetParentProcessId(GetCurrentProcess());
    HANDLE parent_handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, parent_pid);
    DuplicateHandle(parent_handle, handle, GetCurrentProcess(), &handle, 0,
                    FALSE, DUPLICATE_SAME_ACCESS);

I can't say with any certainty that death-of-parent is the only reason they'd fail. If you think that a failure in DeserializeSharedMemoryHandleMetadata for any reason should lead to process termination, then care needs to be taken to not emit a crash dump in "normal" cases such as parent termination. One way to do this is to never crash and always base::Process::TerminateCurrentProcessImmediately. With my tiny view into the issues here, that seems fine to me. If this were to happen in cases where the browser is not shutting down, would we see a rise in some "child process died" metric? Perhaps the termination should use a distinct process exit code so that the browser could UMA specifically that the child died because it couldn't initialize field trials.
Sounds good to disabling the DCHECK for now since we couldn't agree on a solution that fixes the issue yet. cq'd the change.
Project Member

Comment 9 by bugdroid1@chromium.org, Dec 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a8f33aee59bc7840fedcd7e169acb487ed75da7b

commit a8f33aee59bc7840fedcd7e169acb487ed75da7b
Author: Greg Thompson <grt@chromium.org>
Date: Fri Dec 21 15:53:26 2018

Remove DCHECK on CreateTrialsFromSwitchValue failures.

This DCHECK is hit in utility processes during test_installer
runs. The attempt to duplicate the parent proc's handle in
DeserializeSharedMemoryHandleMetadata fails because the browser
process is being torn down.

BUG= 916389 ,916461
R=asvitkine@chromium.org

Change-Id: I27c03717c0272bdf309f78e0a1724d7d0f205c57
Reviewed-on: https://chromium-review.googlesource.com/c/1384025
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#618539}
[modify] https://crrev.com/a8f33aee59bc7840fedcd7e169acb487ed75da7b/base/metrics/field_trial.cc

Components: -Internals>Metrics Internals>Metrics>Variations
"Hey Alexei. This DCHECK is slowing down Chromium development for no benefit while we discuss the proper way to deal with this situation. Would you please send the https://chromium-review.googlesource.com/c/chromium/src/+/1384025 to the CQ? The sooner we silence  issue 916389 , the better."

Done. CL is landed. :)

"Re: "...continuing execution without trials would be a problem...". There is nothing in the production code to prevent execution without trials. If there should be, its introduction should be independent of this particular DCHECK."

Well, the DCHECKs are intended to cover that. Obviously we have a case here where one fires so therefore an issue exists that we need to resolve.
Ultimately given a number of different functionality and sometimes security and privacy related features are enabled through trials (usually when they are first being developed), it's important that the field trial state of parent and child processes is consistent. Ideally, there would be no scenario under which this would fail. (As an example, for a trial like NetworkService, having child processes not get the same trial state as the parent will lead to various breakage.)

"I can't say with any certainty that death-of-parent is the only reason they'd fail."

Right, hence why I'm interested to see if any tests fail on the bots or we get bug reports from the wild if we do the termination. If it's only death-of-parent, I think doing child termination is reasonable. If there are other cases, then I'd like to know about them and understand them to decide what to do about them.

Hi Alexei. I dug a bit deeper into the DCHECK, and figure that it can only happen when the browser is running elevated. This is largely not the case for user sessions, but it is due to the way that the test launches Chrome. I'm trying to modify the test (which must run elevated) to launch Chrome non-elevated so that it's closer to testing the actual user experience. This isn't as simple as it sounds.

For ordinary non-elevated Chrome launches, the handle duplication code isn't hit. If I can make the change to the test mentioned above, the DCHECK (which should maybe be a CHECK?) could be brought back.

Sign in to add a comment