gpu crash in run-webkit-tests
Issue description
[robert@mwenge WebKit (667128-3)]$ cat ../../out/Release/args.gn
# Build arguments go here. Examples:
is_component_build = true
is_debug = false
dcheck_always_on = true
enable_nacl = false
# See "gn args <out_dir> --list" for available build arguments.
[robert@mwenge WebKit (667128-3)]$ ninja -C ../../out/Release chrome; ninja -C ../../out/Release blink_tests;
[robert@mwenge WebKit (669867-3)]$ Tools/Scripts/run-webkit-tests --no-show-results --no-retry --new-test-results LayoutTests/fast/table/percent-*
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, release>
View the test results at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/results.html
View the archived results dashboard at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/dashboard.html
Baseline search path: linux -> win -> generic
Using Release build
Pixel tests enabled
Regular timeout: 6000, slow test timeout: 30000
Command line: /home/robert/Dev/blink/src/out/Release/content_shell --run-layout-test --enable-crash-reporter --crash-dumps-dir=/home/robert/Dev/blink/src/out/Release/crash-dumps -
Found 24 tests; running 24, skipping 0.
Running 1 content_shell.
[2/24] fast/table/percent-height-border-box-content-in-cell-3.html failed unexpectedly (gpu crashed)
[3/24] fast/table/percent-height-border-box-content-in-cell.html failed unexpectedly (gpu[32454:32472:1204/194538.363787:25355218948:ERROR:browser_gpu_channel_host_factory.cc(113)] crashed)
[11/24] fast/table/percent-height-content-in-fixed-height-content-box-sized-cell.html failed unexpectedly (gpu crashed)
[13/24] fast/table/percent-height-form-elements-in-cell.html failed unexpectedly (gpu crashed)
[20/24] fast/table/percent-heights.html failed unexpectedly (gpu crashed)
[22/24] fast/table/percent-widths-stretch-vertical.html failed unexpectedly (gpu crashed)
[23/24] fast/table/percent-widths-stretch.html failed unexpectedly (gpu crashed)
17 tests ran as expected, 7 didn't:
Each crash looks like:
STDERR: [31931:31931:1204/194412.330908:25269186060:FATAL:memory.cc(22)] Out of memory. size=131072
STDERR: #0 0x7fa9c857658e base::debug::StackTrace::StackTrace()
STDERR: #1 0x7fa9c859ad7b logging::LogMessage::~LogMessage()
STDERR: #2 0x7fa9c85d32c4 base::(anonymous namespace)::OnNoMemory()
STDERR: #3 0x7fa9c85b0237 base::FieldTrialList::CreateTrialsFromSharedMemoryHandle()
STDERR: #4 0x7fa9c85b017a base::FieldTrialList::CreateTrialsFromDescriptor()
STDERR: #5 0x7fa9c85affe7 base::FieldTrialList::CreateTrialsFromCommandLine()
STDERR: #6 0x7fa9c97d3d41 content::(anonymous namespace)::InitializeFieldTrialAndFeatureList()
STDERR: #7 0x7fa9c97d4a41 content::ContentMainRunnerImpl::Run()
STDERR: #8 0x7fa9c97d33f0 content::ContentMain()
STDERR: #9 0x00000046554b main
STDERR: #10 0x7fa9c0dfa830 __libc_start_main
STDERR: #11 0x000000465441 <unknown>
STDERR:
STDERR: Received signal 6
STDERR: #0 0x7fa9c8576127 base::debug::(anonymous namespace)::StackDumpSignalHandler()
STDERR: #1 0x7fa9ca27e3e0 <unknown>
STDERR: #2 0x7fa9c0e0f428 gsignal
STDERR: #3 0x7fa9c0e1102a abort
STDERR: #4 0x7fa9c8574362 base::debug::BreakDebugger()
STDERR: #5 0x7fa9c859b072 logging::LogMessage::~LogMessage()
STDERR: #6 0x7fa9c85d32c4 base::(anonymous namespace)::OnNoMemory()
STDERR: #7 0x7fa9c85b0237 base::FieldTrialList::CreateTrialsFromSharedMemoryHandle()
STDERR: #8 0x7fa9c85b017a base::FieldTrialList::CreateTrialsFromDescriptor()
STDERR: #9 0x7fa9c85affe7 base::FieldTrialList::CreateTrialsFromCommandLine()
STDERR: #10 0x7fa9c97d3d41 content::(anonymous namespace)::InitializeFieldTrialAndFeatureList()
STDERR: #11 0x7fa9c97d4a41 content::ContentMainRunnerImpl::Run()
STDERR: #12 0x7fa9c97d33f0 content::ContentMain()
STDERR: #13 0x00000046554b main
STDERR: #14 0x7fa9c0dfa830 __libc_start_main
STDERR: #15 0x000000465441 <unknown>
STDERR: r8: ffff8ce7635afd18 r9: ffff8ce7635afd08 r10: 0000000000000008 r11: 0000000000000206
STDERR: r12: 00000ce428b1f9a0 r13: 00007ffcc54478f0 r14: 00007ffcc5446e70 r15: 00007ffcc5446e60
STDERR: di: 0000000000007cbb si: 0000000000007cbb bp: 00000000ffffffff bx: 0000000000000000
STDERR: dx: 0000000000000006 ax: 0000000000000000 cx: 00007fa9c0e0f428 sp: 00007ffcc54468b8
STDERR: ip: 00007fa9c0e0f428 efl: 0000000000000206 cgf: 0000000000000033 erf: 0000000000000000
STDERR: trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
STDERR: [end of stack trace]
This started happening on my most recent 'git pull'.
,
Dec 5 2016
Hi junov - this started happening to me all of a sudden. Can you suggest anything to troubleshoot or work around it?
,
Dec 5 2016
,
Dec 5 2016
I think https://chromium.googlesource.com/chromium/src/+/c4fe88004d0457cf00b1731c1e974c96a3cd649e might be causing this for me. lawrencewu@ - any suggestions what I should do?
,
Dec 6 2016
That is the generic error thrown when, for whatever reason, we can't map the fd backing the shared memory for field trial state, which the browser process passes over the command line (via the --field-trial-handle flag). If you just want to work around it locally, you can disable kUseSharedMemoryForFieldTrials in field_trial.cc. I'll take a look at this tomorrow too, but if you can help me debug, that would be great. I have several theories:
1) The field trial handle is not actually being inherited or shared. Since these are renderer tests I think this is the most likely case (because there may not be a browser process spawning the renderer or something). You can try adding a logging statement in CopyFieldTrialStateToFlags and check that the readonly_allocator_handle_ actually has a value, and follow it down to where it gets whitelisted by being added to |fds_to_map| in child_process_launcher.cc. If that code doesn't even run, then that means the handle wasn't passed appropriately.
1a) Since this is the gpu process - I think - and gpu processes are launched differently, I may not have appended the handle here properly. Maybe check argv for this process and see if --field-trial-handle is even being passed through the cmd line.
2) The system is actually out of memory. Seems unlikely since the mapped size is not very big.
3) mmap is failing for some other mysterious reason. Also unlikely since this is new code and I most likely just made a mistake here.
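One way to do the argv check from theory 1a on Linux is to read /proc/<pid>/cmdline for the crashing process, which holds its arguments separated by NUL bytes. The following is only a rough sketch: SplitCmdline, HasSwitch and ReadCmdline are hypothetical helper names made up for illustration, not Chromium functions, and the only name taken from this thread is the --field-trial-handle switch.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Splits the NUL-separated contents of /proc/<pid>/cmdline into arguments.
// (Hypothetical helper, not part of Chromium.)
std::vector<std::string> SplitCmdline(const std::string& raw) {
  std::vector<std::string> args;
  std::string current;
  for (char c : raw) {
    if (c == '\0') {
      args.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  // Keep a trailing argument that was not NUL-terminated.
  if (!current.empty())
    args.push_back(current);
  return args;
}

// Returns true if `name` appears either bare or as "name=value".
bool HasSwitch(const std::vector<std::string>& args, const std::string& name) {
  const std::string with_value = name + "=";
  for (const std::string& arg : args) {
    if (arg == name || arg.compare(0, with_value.size(), with_value) == 0)
      return true;
  }
  return false;
}

// Reads the raw command line of a running process (Linux only).
// Returns an empty string if the file cannot be opened.
std::string ReadCmdline(int pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/cmdline",
                   std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Calling HasSwitch(SplitCmdline(ReadCmdline(pid)), "--field-trial-handle") with the pid of the gpu process would then show whether the switch reached that process at all. The process has to still be alive when it is inspected, so in practice this may need a debugger breakpoint or a sleep early in startup.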
,
Dec 6 2016
Hey robhogan@, I'm getting the following error when I'm trying to run the webkit tests:
lawrencewu@lawrencewu:~/chromium/src$ third_party/WebKit/Tools/Scripts/run-webkit-tests --no-show-results --new-test-results third_party/WebKit/LayoutTests/fast/table/percent-* -t Default
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, debug>
View the test results at file:///usr/local/google/home/lawrencewu/chromium/src/out/Default/layout-test-results/results.html
View the archived results dashboard at file:///usr/local/google/home/lawrencewu/chromium/src/out/Default/layout-test-results/dashboard.html
Baseline search path: linux -> win -> generic
Using Debug build
Pixel tests enabled
Regular timeout: 18000, slow test timeout: 90000
Command line: /usr/local/google/home/lawrencewu/chromium/src/out/Default/content_shell --run-layout-test --enable-crash-reporter --crash-dumps-dir=/usr/local/google/home/lawrencewu/chromium/src/out/Default/crash-dumps -
Found 20 tests; running 20, skipping 0.
Running 1 content_shell.
Failed to start the content_shell process:
content_shell took too long to startup.
[1/20] fast/table/percent-height-content-in-fixed-height-border-box-sized-cell-with-collapsed-border-on-table.html failed unexpectedly (content_shell crashed [pid=23515])
Failed to start the content_shell process:
content_shell took too long to startup.
Do you know how to fix this?
,
Dec 6 2016
I get that when I run:
[robert@mwenge WebKit (548616)]$ Tools/Scripts/run-webkit-tests --no-show-results --additional-drt-flag=--no-zygote LayoutTests/fast/lists/
Using port 'linux-trusty'
Test configuration: <trusty, x86_64, release>
View the test results at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/results.html
View the archived results dashboard at file:///home/robert/Dev/blink/src/out/Release/layout-test-results/dashboard.html
Baseline search path: no-zygote -> linux -> win -> generic
Using Release build
Pixel tests enabled
Regular timeout: 6000, slow test timeout: 30000
Command line: /home/robert/Dev/blink/src/out/Release/content_shell --no-zygote --run-layout-test --enable-crash-reporter --crash-dumps-dir=/home/robert/Dev/blink/src/out/Release/crash-dumps -
Found 117 tests; running 117, skipping 0.
Running 1 content_shell.
Failed to start the content_shell process:
content_shell took too long to startup.
[1/117] fast/lists/001-vertical.html failed unexpectedly (content_shell crashed [pid=19078])
Failed to start the content_shell process:
content_shell took too long to startup.
[2/117] fast/lists/001.html failed unexpectedly (content_shell crashed [pid=19081])
Failed to start the content_shell process:
content_shell took too long to startup.
[3/117] fast/lists/002-vertical.html failed unexpectedly (content_shell crashed [pid=19084])
^CInterrupted, exiting ...
The only other thing I can suggest is to try a Release build rather than a Debug build.
,
Dec 6 2016
Hmm, okay, I'll try that out.
,
Dec 6 2016
I'm returning early here, because there's no global_ set:
void FieldTrialList::CopyFieldTrialStateToFlags(
    const char* field_trial_handle_switch,
    CommandLine* cmd_line) {
  // TODO(lawrencewu): Ideally, having the global would be guaranteed. However,
  // content browser tests currently don't create a FieldTrialList because they
  // don't run ChromeBrowserMainParts code where it's done for Chrome.
  LOG(ERROR) << "Got here" << global_;
  if (!global_)
    return;
So it looks like (1) is the answer?
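To make the consequence of that early return concrete, here is a toy model of the parent side. It is not Chromium code: ToyTrialList and CopyStateToFlags are hypothetical stand-ins, and the "switch=value" format is an assumption for illustration. The point is only that when no global list exists, nothing is appended to the child's command line.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the process-wide FieldTrialList singleton.
struct ToyTrialList {
  int readonly_handle;  // Stand-in for the shared memory handle.
};

// Appends "<switch_name>=<handle>" to cmd_line only when a global list
// exists. Returns true if the switch was appended, false on early return.
bool CopyStateToFlags(const ToyTrialList* global,
                      const std::string& switch_name,
                      std::vector<std::string>* cmd_line) {
  if (!global)
    return false;  // Mirrors the early return: nothing reaches the child.
  cmd_line->push_back(switch_name + "=" +
                      std::to_string(global->readonly_handle));
  return true;
}
```

In this model a child launched after the early return has no handle information on its command line, so any code in the child that assumes a handle was passed is working with a descriptor that was never set up.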
,
Dec 6 2016
Yep, looks like it. Do you know where that function is being called from? We probably want to create a FieldTrialList singleton right before so the global_ is defined, like we do here: https://cs.chromium.org/chromium/src/content/app/content_main_runner.cc?sq=package:chromium&rcl=1481026846&l=772. I'm still building the Release build right now...~10000/25000.
,
Dec 6 2016
,
Dec 6 2016
I can see it getting initialized at: https://cs.chromium.org/chromium/src/content/app/content_main_runner.cc?rcl=1481026846&l=306
CopyFieldTrialStateToFlags is getting called from: https://cs.chromium.org/chromium/src/content/browser/browser_child_process_host_impl.cc?rcl=0&l=220
,
Dec 6 2016
lawrencewu@ - thanks for looking into this. For now, I'm just working around it by disabling kUseSharedMemoryForFieldTrials. A few minutes looking through the code was enough to convince me I don't have the domain knowledge to be much use here. Let me know if there's anything you want me to experiment with. Patch snippets to apply would be easiest!
,
Dec 13 2016
Hi lawrencewu - are you working on this? Weird that I'm the only one affected by it, so I'm guessing that lowers the severity somewhat. :) Any thoughts on what I need to do to fix my env and work around this permanently?
,
Dec 13 2016
I'm also seeing these gpu crashers all over the place trying to run layout tests in debug.
,
Dec 14 2016
Hey robhogan@, sorry for the delay! Please pull, make sure you have this commit, and try again: https://crrev.com/d17b445e932a10c875cea034b037b19d30e47a8c The description is kind of out-of-date, but basically I changed the way we pass the field trial handle on POSIX so that we signal to the child process via a command line switch whether the shared memory segment containing the field trial data has been initialized (--field-trial-handle=1). I believe it wasn't being initialized in these tests, so we were trying to read an invalid descriptor. *Hopefully* this should fix your error.
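The "invalid descriptor" failure mode can be reproduced outside Chromium with plain POSIX calls. This is a minimal sketch, not the code from the commit: TryMapReadOnly is a hypothetical helper, and the 131072 byte size is simply the size reported in the crash log at the top of this issue.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Tries to map `size` bytes of `fd` read-only as shared memory.
// Returns 0 on success (and unmaps again), otherwise the errno from mmap.
// (Hypothetical helper for illustration, not part of Chromium.)
int TryMapReadOnly(int fd, std::size_t size) {
  void* addr = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED)
    return errno;
  munmap(addr, size);
  return 0;
}
```

Calling TryMapReadOnly(-1, 131072) fails with EBADF rather than ENOMEM, which fits the earlier comment that the "Out of memory" message is a generic error raised when the mapping fails, not evidence that the machine ran out of memory.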
,
Dec 14 2016
Yes, that has fixed it for me. rune@ - OK to close?
,
Dec 16 2016
Yup. WFM with that commit.
Comment 1 by ajha@chromium.org
, Dec 5 2016