Issue 671035

Starred by 2 users

Issue metadata

Status: Fixed
Owner: ----
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 3
Type: Bug



gpu crash in run-webkit-tests

Project Member Reported by robho...@gmail.com, Dec 4 2016

Issue description

[robert@mwenge WebKit (667128-3)]$ cat ../../out/Release/args.gn 
# Build arguments go here. Examples:
is_component_build = true
is_debug = false
dcheck_always_on = true
enable_nacl = false
# See "gn args <out_dir> --list" for available build arguments.


[robert@mwenge WebKit (667128-3)]$ ninja -C ../../out/Release chrome; ninja -C ../../out/Release blink_tests;

[robert@mwenge WebKit (669867-3)]$ Tools/Scripts/run-webkit-tests --no-show-results --no-retry --new-test-results LayoutTests/fast/table/percent-*
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, release>
View the test results at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/results.html
View the archived results dashboard at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/dashboard.html
Baseline search path: linux -> win -> generic
Using Release build
Pixel tests enabled
Regular timeout: 6000, slow test timeout: 30000
Command line: /home/robert/Dev/blink/src/out/Release/content_shell --run-layout-test --enable-crash-reporter --crash-dumps-dir=/home/robert/Dev/blink/src/out/Release/crash-dumps -

Found 24 tests; running 24, skipping 0.
                  
Running 1 content_shell.                                                    

[2/24] fast/table/percent-height-border-box-content-in-cell-3.html failed unexpectedly (gpu crashed)
[3/24] fast/table/percent-height-border-box-content-in-cell.html failed unexpectedly (gpu[32454:32472:1204/194538.363787:25355218948:ERROR:browser_gpu_channel_host_factory.cc(113)] crashed)
[11/24] fast/table/percent-height-content-in-fixed-height-content-box-sized-cell.html failed unexpectedly (gpu crashed)  
[13/24] fast/table/percent-height-form-elements-in-cell.html failed unexpectedly (gpu crashed)   
[20/24] fast/table/percent-heights.html failed unexpectedly (gpu crashed)     
[22/24] fast/table/percent-widths-stretch-vertical.html failed unexpectedly (gpu crashed)
[23/24] fast/table/percent-widths-stretch.html failed unexpectedly (gpu crashed)
                          
17 tests ran as expected, 7 didn't:

Each crash looks like:

STDERR: [31931:31931:1204/194412.330908:25269186060:FATAL:memory.cc(22)] Out of memory. size=131072
STDERR: #0 0x7fa9c857658e base::debug::StackTrace::StackTrace()
STDERR: #1 0x7fa9c859ad7b logging::LogMessage::~LogMessage()
STDERR: #2 0x7fa9c85d32c4 base::(anonymous namespace)::OnNoMemory()
STDERR: #3 0x7fa9c85b0237 base::FieldTrialList::CreateTrialsFromSharedMemoryHandle()
STDERR: #4 0x7fa9c85b017a base::FieldTrialList::CreateTrialsFromDescriptor()
STDERR: #5 0x7fa9c85affe7 base::FieldTrialList::CreateTrialsFromCommandLine()
STDERR: #6 0x7fa9c97d3d41 content::(anonymous namespace)::InitializeFieldTrialAndFeatureList()
STDERR: #7 0x7fa9c97d4a41 content::ContentMainRunnerImpl::Run()
STDERR: #8 0x7fa9c97d33f0 content::ContentMain()
STDERR: #9 0x00000046554b main
STDERR: #10 0x7fa9c0dfa830 __libc_start_main
STDERR: #11 0x000000465441 <unknown>
STDERR: 
STDERR: Received signal 6
STDERR: #0 0x7fa9c8576127 base::debug::(anonymous namespace)::StackDumpSignalHandler()
STDERR: #1 0x7fa9ca27e3e0 <unknown>
STDERR: #2 0x7fa9c0e0f428 gsignal
STDERR: #3 0x7fa9c0e1102a abort
STDERR: #4 0x7fa9c8574362 base::debug::BreakDebugger()
STDERR: #5 0x7fa9c859b072 logging::LogMessage::~LogMessage()
STDERR: #6 0x7fa9c85d32c4 base::(anonymous namespace)::OnNoMemory()
STDERR: #7 0x7fa9c85b0237 base::FieldTrialList::CreateTrialsFromSharedMemoryHandle()
STDERR: #8 0x7fa9c85b017a base::FieldTrialList::CreateTrialsFromDescriptor()
STDERR: #9 0x7fa9c85affe7 base::FieldTrialList::CreateTrialsFromCommandLine()
STDERR: #10 0x7fa9c97d3d41 content::(anonymous namespace)::InitializeFieldTrialAndFeatureList()
STDERR: #11 0x7fa9c97d4a41 content::ContentMainRunnerImpl::Run()
STDERR: #12 0x7fa9c97d33f0 content::ContentMain()
STDERR: #13 0x00000046554b main
STDERR: #14 0x7fa9c0dfa830 __libc_start_main
STDERR: #15 0x000000465441 <unknown>
STDERR:   r8: ffff8ce7635afd18  r9: ffff8ce7635afd08 r10: 0000000000000008 r11: 0000000000000206
STDERR:  r12: 00000ce428b1f9a0 r13: 00007ffcc54478f0 r14: 00007ffcc5446e70 r15: 00007ffcc5446e60
STDERR:   di: 0000000000007cbb  si: 0000000000007cbb  bp: 00000000ffffffff  bx: 0000000000000000
STDERR:   dx: 0000000000000006  ax: 0000000000000000  cx: 00007fa9c0e0f428  sp: 00007ffcc54468b8
STDERR:   ip: 00007fa9c0e0f428 efl: 0000000000000206 cgf: 0000000000000033 erf: 0000000000000000
STDERR:  trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
STDERR: [end of stack trace]

This started happening on my most recent 'git pull'.



 

Comment 1 by ajha@chromium.org, Dec 5 2016

Labels: M-57
Tagging with canary milestone for further triaging.

Comment 2 by robho...@gmail.com, Dec 5 2016

Cc: junov@chromium.org
Hi junov - this started happening to me all of a sudden. Can you suggest anything to troubleshoot or work around it?

Comment 3 by robho...@gmail.com, Dec 5 2016

Description: Show this description

Comment 4 by robho...@gmail.com, Dec 5 2016

Cc: lawrencewu@chromium.org
I think https://chromium.googlesource.com/chromium/src/+/c4fe88004d0457cf00b1731c1e974c96a3cd649e might be causing this for me.

lawrencewu@ - any suggestions what I should do?

Comment 5 by lawrencewu@chromium.org

Cc: asvitk...@chromium.org
That is the generic error thrown when we can't map the fd backing shared memory for field trial state passed over the command line (via the --field-trial-handle flag) from the browser process for whatever reason. If you just want to work around it locally, you can disable kUseSharedMemoryForFieldTrials in field_trial.cc. I'll take a look at this tomorrow too, but if you can help me debug that would be great, too. I have several theories:

1) The field trial handle is not actually being inherited or shared. Since these are renderer tests I think this is the most likely case (because there may not be a browser process spawning the renderer or something). You can try adding a logging statement in CopyFieldTrialStateToFlags and check that the readonly_allocator_handle_ actually has a value, and follow it down to where it gets whitelisted by being added to |fds_to_map| in child_process_launcher.cc. If that code doesn't even run, then that means the handle wasn't passed appropriately.

1a) Since this is the gpu process - I think - and gpu processes are launched differently, I may not have appended the handle here properly. Maybe check argv for this process and see if --field-trial-handle is even being passed through the cmd line.

2) The system is actually out of memory. Seems unlikely since the mapped size is not very big.

3) mmap is failing for some other mysterious reason. Also unlikely since this is new code and I most likely just made a mistake here.
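
For theory 1a, checking argv can be sketched in a few lines. This is plain C++ rather than Chromium's CommandLine class, and SplitProcCmdline and FindSwitchValue are hypothetical helper names used only for illustration. It assumes the argument list comes from the NUL-separated contents of /proc/<pid>/cmdline on Linux.

```cpp
#include <string>
#include <vector>

// Splits the NUL-separated contents of /proc/<pid>/cmdline into arguments.
std::vector<std::string> SplitProcCmdline(const std::string& raw) {
  std::vector<std::string> args;
  std::string current;
  for (char c : raw) {
    if (c == '\0') {
      args.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  if (!current.empty())
    args.push_back(current);
  return args;
}

// Looks for --<name> or --<name>=<value> in args. Returns true if the switch
// is present and stores its value (empty when there is none) in *value.
// Hypothetical helper; not part of Chromium.
bool FindSwitchValue(const std::vector<std::string>& args,
                     const std::string& name,
                     std::string* value) {
  const std::string bare = "--" + name;
  const std::string prefix = bare + "=";
  for (const std::string& arg : args) {
    if (arg == bare) {
      value->clear();
      return true;
    }
    if (arg.compare(0, prefix.size(), prefix) == 0) {
      *value = arg.substr(prefix.size());
      return true;
    }
  }
  return false;
}
```

If FindSwitchValue(args, "field-trial-handle", &value) returns false for the crashing process, the handle was never appended to its command line, which would point at the launch path rather than at mmap.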

Comment 6 by lawrencewu@chromium.org

Hey robhogan@, I'm getting the following error when I'm trying to run the webkit tests:

lawrencewu@lawrencewu:~/chromium/src$ third_party/WebKit/Tools/Scripts/run-webkit-tests --no-show-results --new-test-results third_party/WebKit/LayoutTests/fast/table/percent-* -t Default
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, debug>
View the test results at file:///usr/local/google/home/lawrencewu/chromium/src/out/Default/layout-test-results/results.html
View the archived results dashboard at file:///usr/local/google/home/lawrencewu/chromium/src/out/Default/layout-test-results/dashboard.html
Baseline search path: linux -> win -> generic
Using Debug build
Pixel tests enabled
Regular timeout: 18000, slow test timeout: 90000
Command line: /usr/local/google/home/lawrencewu/chromium/src/out/Default/content_shell --run-layout-test --enable-crash-reporter --crash-dumps-dir=/usr/local/google/home/lawrencewu/chromium/src/out/Default/crash-dumps -

Found 20 tests; running 20, skipping 0.
                  
Running 1 content_shell.                                                                        

Failed to start the content_shell process: 
content_shell took too long to startup.
[1/20] fast/table/percent-height-content-in-fixed-height-border-box-sized-cell-with-collapsed-border-on-table.html failed unexpectedly (content_shell crashed [pid=23515])
Failed to start the content_shell process: 
content_shell took too long to startup.

Do you know how to fix this?

Comment 7 by robho...@gmail.com, Dec 6 2016

I get that when I run:

[robert@mwenge WebKit (548616)]$ Tools/Scripts/run-webkit-tests --no-show-results --additional-drt-flag=--no-zygote LayoutTests/fast/lists/
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, release>
View the test results at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/results.html
View the archived results dashboard at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/dashboard.html
Baseline search path: no-zygote -> linux -> win -> generic
Using Release build
Pixel tests enabled
Regular timeout: 6000, slow test timeout: 30000
Command line: /home/robert/Dev/blink/src/out/Release/content_shell --no-zygote --run-layout-test --enable-crash-reporter --crash-dumps-dir=/home/robert/Dev/blink/src/out/Release/crash-dumps -

Found 117 tests; running 117, skipping 0.
                  
Running 1 content_shell.                                                    

Failed to start the content_shell process: 
content_shell took too long to startup.
[1/117] fast/lists/001-vertical.html failed unexpectedly (content_shell crashed [pid=19078])
Failed to start the content_shell process: 
content_shell took too long to startup.
[2/117] fast/lists/001.html failed unexpectedly (content_shell crashed [pid=19081])
Failed to start the content_shell process: 
content_shell took too long to startup.
[3/117] fast/lists/002-vertical.html failed unexpectedly (content_shell crashed [pid=19084])
^CInterrupted, exiting ...

The only other thing I can suggest is to try a Release build rather than a Debug.

Comment 8 by lawrencewu@chromium.org

Hmm, okay, I'll try that out.

Comment 9 by robho...@gmail.com, Dec 6 2016

I'm returning early here, because there's no global_ set:

void FieldTrialList::CopyFieldTrialStateToFlags(
    const char* field_trial_handle_switch,
    CommandLine* cmd_line) {
  // TODO(lawrencewu): Ideally, having the global would be guaranteed. However,
  // content browser tests currently don't create a FieldTrialList because they
  // don't run ChromeBrowserMainParts code where it's done for Chrome.
  LOG(ERROR) << "Got here" << global_;
  if (!global_)
    return;

So it looks like (1) is the answer?

Comment 10 by lawrencewu@chromium.org

Yep, looks like it. Do you know where that function is being called from? We probably want to create a FieldTrialList singleton right before so the global_ is defined, like we do here: https://cs.chromium.org/chromium/src/content/app/content_main_runner.cc?sq=package:chromium&rcl=1481026846&l=772.
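
The pattern under discussion can be illustrated with a simplified stand-in. TrialRegistry below is a hypothetical class, not Chromium's FieldTrialList, and the "1" handle value is a placeholder. The sketch only shows why the copy step silently does nothing when no instance exists, and why creating one beforehand changes the outcome.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a class that registers itself in a static pointer.
// Hypothetical; this is not Chromium's FieldTrialList.
class TrialRegistry {
 public:
  TrialRegistry() { global_ = this; }
  ~TrialRegistry() {
    if (global_ == this)
      global_ = nullptr;
  }

  // Appends --<handle_switch>=<handle> to cmd_line when an instance exists.
  // Returns false on the early-return path, where no instance was created.
  static bool CopyStateToFlags(const std::string& handle_switch,
                               std::vector<std::string>* cmd_line) {
    if (!global_)
      return false;
    cmd_line->push_back("--" + handle_switch + "=" + global_->handle_);
    return true;
  }

 private:
  static TrialRegistry* global_;
  std::string handle_ = "1";  // Placeholder value for illustration.
};

TrialRegistry* TrialRegistry::global_ = nullptr;
```

Calling CopyStateToFlags before any TrialRegistry is constructed leaves the command line untouched, so a child process launched from it never sees the switch. Constructing the instance first makes the same call append it.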

I'm still building the Release build right now...~10000/25000.

Comment 11 by robho...@gmail.com

lawrencewu@ - thanks for looking into this. For now, I'm just working around it by disabling kUseSharedMemoryForFieldTrials. A few minutes looking through the code was enough to convince me I don't have the domain knowledge to be much use here.

Let me know if there's anything you want me to experiment with. Patch snippets to apply would be easiest!
Hi lawrencewu - are you working on this? Weird that I'm the only one affected by it, so I'm guessing that lowers the severity somewhat. :)

Any thoughts on what I need to do to fix my env and work around this permanently?

Comment 15 by r...@opera.com, Dec 13 2016

I'm also seeing these gpu crashers all over the place trying to run layout tests in debug.

Comment 16 by lawrencewu@chromium.org

Hey robhogan@, sorry for the delay! Please try pulling and trying again and making sure you have this commit: https://crrev.com/d17b445e932a10c875cea034b037b19d30e47a8c

The description is kind of out-of-date, but basically I changed the way we pass the field trial handle on POSIX so that we signal to the child process via a command line switch if the shared memory segment containing the field trial data has been initialized (--field-trial-handle=1). I believe it wasn't being initialized in these tests, so we were trying to read an invalid descriptor. *Hopefully* this should fix your error.
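
A rough POSIX sketch of the failure mode and of the guard described above. It does not reproduce the actual commit: TryMapReadOnly, TrialSource and ChooseTrialSource are hypothetical names, and the idea is only that a child should attempt the mapping when the parent has signalled that the segment exists. Mapping an invalid descriptor fails, which is the condition the crash log reported as "Out of memory. size=131072".

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Tries to map `size` bytes of `fd` read-only. Returns 0 on success (the
// mapping is released again immediately) or the errno reported by mmap.
int TryMapReadOnly(int fd, std::size_t size) {
  void* addr = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED)
    return errno;
  munmap(addr, size);
  return 0;
}

enum class TrialSource { kNone, kSharedMemory, kMapFailed };

// Hypothetical child-side guard: skip the mapping entirely unless the parent
// signalled (for example via a command line switch) that the shared memory
// segment was initialized.
TrialSource ChooseTrialSource(bool handle_switch_present,
                              int fd,
                              std::size_t size) {
  if (!handle_switch_present)
    return TrialSource::kNone;
  return TryMapReadOnly(fd, size) == 0 ? TrialSource::kSharedMemory
                                       : TrialSource::kMapFailed;
}
```

With an invalid descriptor such as -1, TryMapReadOnly(-1, 131072) returns a non-zero errno, while ChooseTrialSource(false, -1, 131072) never calls mmap at all.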

Comment 17 by robho...@gmail.com

Yes, that has fixed it for me. rune@ - OK to close?

Comment 18 by r...@opera.com, Dec 16 2016

Status: Fixed (was: Unconfirmed)
Yup. WFM with that commit.
