New issue
Advanced search Search tips

Issue 659804 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Flaky startup crashes in MusThreadProxy::GpuRefreshNativeDisplays when running with --mash

Project Member Reported by jamescook@chromium.org, Oct 26 2016

Issue description

Chrome ToT r427689
OS ToT 8935 (developer build from today)
On edgar, x86_64 braswell

Run mash a few times, killing it with Ctrl-C on the terminal. It usually works the first time, but eventually you'll get into a state where it crashes on startup:

localhost chrome # sudo -u chronos ./chrome --mash --user-data-dir=/home/chronos --homedir=/ --no-sandbox
[17950:17953:1026/142448:700710153:INFO:child_process_host.cc(202)] Launched child process pid=17955, instance=tracing, name=service:tracing, user_id=d0c3bc2e-a776-4e6c-8b5f-1450c461c7dd
[17950:17954:1026/142448:700716471:INFO:child_process_host.cc(202)] Launched child process pid=17956, instance=mash_session, name=service:chrome_mash, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17955:17955:1026/142448:700754733:ERROR:interface_registry.cc(107)] Failed to locate a binder for interface: tracing::mojom::Factory requested by: service:service_manager exposed by: service:tracing
[17950:17954:1026/142448:700761313:INFO:child_process_host.cc(202)] Launched child process pid=17959, instance=app_driver, name=service:chrome_mash, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17950:17954:1026/142448:700766084:INFO:child_process_host.cc(202)] Launched child process pid=17960, instance=ash, name=service:chrome_mash, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17950:17953:1026/142448:700768623:INFO:child_process_host.cc(202)] Launched child process pid=17961, instance=quick_launch, name=service:chrome_mash, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17950:17954:1026/142448:700771215:INFO:child_process_host.cc(202)] Launched child process pid=17962, instance=browser, name=service:content_browser, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17955:17955:1026/142448:700858477:ERROR:interface_registry.cc(107)] Failed to locate a binder for interface: tracing::mojom::Factory requested by: service:quick_launch exposed by: service:tracing
[17950:17953:1026/142448:700910376:INFO:child_process_host.cc(202)] Launched child process pid=17969, instance=font_service, name=service:chrome_mash, user_id=505C0EE9-3013-43C0-82B0-A84F50CF8D84
[17955:17955:1026/142448:700965514:ERROR:interface_registry.cc(107)] Failed to locate a binder for interface: tracing::mojom::Factory requested by: service:font_service exposed by: service:tracing
[17955:17955:1026/142448:700985807:ERROR:interface_registry.cc(107)] Failed to locate a binder for interface: tracing::mojom::Factory requested by: service:ash exposed by: service:tracing
[17950:17954:1026/142448:700995878:INFO:child_process_host.cc(202)] Launched child process pid=17978, instance=ui, name=service:chrome_mash, user_id=94e293f9-eff4-4b36-87a2-0fabb2dec838
[17960:17960:1026/142448:701000971:ERROR:wm_shell_mus.cc(330)] Not implemented reached in virtual void ash::mus::WmShellMus::RecordUserMetricsAction(ash::UserMetricsAction)
[17960:17960:1026/142448:701003045:ERROR:shell_delegate_mus.cc(208)] Not implemented reached in virtual ash::MediaDelegate* ash::ShellDelegateMus::CreateMediaDelegate()
[17960:17960:1026/142448:701003095:ERROR:shell_delegate_mus.cc(214)] Not implemented reached in virtual std::unique_ptr<ash::PaletteDelegate> ash::ShellDelegateMus::CreatePaletteDelegate()
[17960:17960:1026/142448:701003158:ERROR:wm_shell_mus.cc(406)] Not implemented reached in virtual void ash::mus::WmShellMus::AddDisplayObserver(ash::WmDisplayObserver*)
[17955:17955:1026/142449:701084519:ERROR:interface_registry.cc(107)] Failed to locate a binder for interface: tracing::mojom::Factory requested by: service:ui exposed by: service:tracing
[17950:17953:1026/142449:701976853:INFO:child_process_host.cc(202)] Launched child process pid=18004, instance=test_ime_driver, name=service:chrome_mash, user_id=94e293f9-eff4-4b36-87a2-0fabb2dec838
Received signal 11 SEGV_MAPERR 000000000088
#0 0x7fd5c3ca3a76 base::debug::StackTrace::StackTrace()
#1 0x7fd5c3ca3e70 [17978:18002:1026/142449:702046066:ERROR:vaapi_wrapper.cc(1217)] Could not get a valid VA display
[17978:18002:1026/142449:702046464:ERROR:generic_v4l2_device.cc(109)] SetDevicePollInterrupt(): write() failed: Bad file descriptor
base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7fd5c0582180 <unknown>
#3 0x7fd5c3d2f8aa base::Thread::IsRunning()
#4 0x7fd5c1a8b6e3 ui::MusThreadProxy::GpuRefreshNativeDisplays()
#5 0x7fd5c1a7f029 ui::DrmDisplayHostManager::UpdateDisplays()
#6 0x7fd5c49a72d2 ui::UpdateDisplayConfigurationTask::Run()
#7 0x7fd5c499fc5c ui::DisplayConfigurator::ForceInitialConfigure()
#8 0x7fd5c3769090 display::PlatformScreenOzone::Init()
#9 0x7fd5c8e375c1 ui::ws::WindowManagerWindowTreeFactorySet::OnWindowManagerWindowTreeFactoryReady()
#10 0x7fd5c8e5ff47 ui::ws::WindowManagerWindowTreeFactory::SetWindowTree()
#11 0x7fd5c8e6003c ui::ws::WindowManagerWindowTreeFactory::CreateWindowTree()
#12 0x7fd5c1b7aafc ui::mojom::WindowManagerWindowTreeFactoryStubDispatch::Accept()
#13 0x7fd5c439d7fd mojo::InterfaceEndpointClient::HandleValidatedMessage()
#14 0x7fd5c439b005 mojo::FilterChain::Accept()
#15 0x7fd5c439d40c mojo::InterfaceEndpointClient::HandleIncomingMessage()
#16 0x7fd5c43a47f3 mojo::internal::MultiplexRouter::ProcessIncomingMessage()
#17 0x7fd5c43a6b85 mojo::internal::MultiplexRouter::Accept()
#18 0x7fd5c439b005 mojo::FilterChain::Accept()
#19 0x7fd5c43979c7 mojo::Connector::ReadSingleMessage()
#20 0x7fd5c4397ae3 mojo::Connector::ReadAllAvailableMessages()
#21 0x7fd5c4397d53 mojo::Connector::OnHandleReadyInternal()
#22 0x7fd5c43af39b mojo::Watcher::OnHandleReady()
#23 0x7fd5c3d654e6 base::debug::TaskAnnotator::RunTask()
#24 0x7fd5c3cd0d82 base::MessageLoop::RunTask()
#25 0x7fd5c3cd127e base::MessageLoop::DeferOrRunPendingTask()
#26 0x7fd5c3cd5018 base::MessageLoop::DoWork()
#27 0x7fd5c3cd5c49 base::MessagePumpLibevent::Run()
#28 0x7fd5c3cd21ee base::MessageLoop::RunHandler()
#29 0x7fd5c3d01f58 base::RunLoop::Run()
#30 0x7fd5c17d4e7e MashRunner::StartChildApp()
#31 0x7fd5c17d2725 _ZN4base8internal7InvokerINS0_9BindStateIM10MashRunnerFvN4mojo16InterfaceRequestIN15service_manager5mojom7ServiceEEEEJNS0_17UnretainedWrapperIS3_EEEEEFvS9_EE3RunEPNS0_13BindStateBaseEOS9_
#32 0x7fd5c2e11ecf service_manager::ChildProcessMainWithCallback()
#33 0x7fd5c17d4b7a MashRunner::RunChild()
#34 0x7fd5c17d4d27 MashMain()
#35 0x7fd5c17cf2dd ChromeMain
#36 0x7fd5bf1d1fb6 __libc_start_main
#37 0x7fd5c17cf0d4 <unknown>
  r8: 0000000000000000  r9: 00000d61634155f0 r10: fffffffc0151f40c r11: fffff29d620762cc
 r12: 00007fff8cdba800 r13: 00000d61634c2b98 r14: 00000d6163558750 r15: 00007fd5c1a85360
  di: 0000000000000000  si: 00007fff8cdba800  bp: 00007fff8cdba7c0  bx: 00000d61634e0a80
  dx: 0000000000000000  ax: 0000000000000000  cx: 00000d616355b640  sp: 00007fff8cdba570
  ip: 00007fd5c3d2f8aa efl: 0000000000010202 cgf: 0000000000000033 erf: 0000000000000004
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000088
[end of stack trace]

Rebooting seems to fix the issue.

To rjkroege for triage.

 
Also, when you are in this state "start ui" will come up normally and display the lock screen.

Cc: kylec...@chromium.org penghuang@chromium.org sadrul@chromium.org
+a few other people

I'm now blocked on this crash.

It happens whenever I try to start chrome either via "start ui" or via "sudo -u chronos ./chrome --mash --user-data-dir=/home/chronos --homedir=/ --no-sandbox".

I'm on edgar with a ToT r431874 build of chrome.

Platform is a test image with M56 dev channel 8953.0.0 from a couple weeks ago (the most recent dev channel for this board).

Linux localhost 3.18.0-13425-g95a3f1d #1 SMP PREEMPT Tue Nov 1 03:02:55 PDT 2016 x86_64 Intel(R) Celeron(R) CPU N3060 @ 1.60GHz GenuineIntel GNU/Linux

#0 0x7f2e36e7adf6 base::debug::StackTrace::StackTrace()
#1 0x7f2e36e7b1f0 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7f2e33e68180 <unknown>
#3 0x7f2e352482a5 ui::MusThreadProxy::GpuRefreshNativeDisplays()
#4 0x7f2e3523ea39 [19194:19214:1114/161551:1078672322:ERROR:vaapi_wrapper.cc(1217)] Could not get a valid VA display
ui::DrmDisplayHostManager::UpdateDisplays()
#5 0x7f2e3788b249 ui::UpdateDisplayConfigurationTask::Run()
#6 0x7f2e37885135 ui::DisplayConfigurator::ForceInitialConfigure()
#7 0x7f2e36a5daec display::PlatformScreenOzone::Init()
#8 0x7f2e36a50d8b ui::Service::OnStart()
#9 0x7f2e37b79b1f service_manager::ServiceContext::OnStart()
#10 0x7f2e35253f65 service_manager::mojom::ServiceStubDispatch::AcceptWithResponder()
#11 0x7f2e370c0fef mojo::InterfaceEndpointClient::HandleValidatedMessage()
#12 0x7f2e370c5331 mojo::internal::MultiplexRouter::ProcessIncomingMessage()
#13 0x7f2e370c69d5 mojo::internal::MultiplexRouter::Accept()
#14 0x7f2e370be017 mojo::Connector::ReadSingleMessage()
#15 0x7f2e370be133 mojo::Connector::ReadAllAvailableMessages()
#16 0x7f2e370cbb58 mojo::Watcher::OnHandleReady()
#17 0x7f2e36f0e0c5 base::debug::TaskAnnotator::RunTask()
#18 0x7f2e36e9bbef base::MessageLoop::RunTask()
#19 0x7f2e36e9c07e base::MessageLoop::DeferOrRunPendingTask()
#20 0x7f2e36e9e528 base::MessageLoop::DoWork()
#21 0x7f2e36e9eaf9 base::MessagePumpLibevent::Run()
#22 0x7f2e36e9b96a base::MessageLoop::RunHandler()
#23 0x7f2e36ebe8fa base::RunLoop::Run()
#24 0x7f2e34ff63bf MashRunner::StartChildApp()
#25 0x7f2e34ff4cc4 _ZN4base8internal7InvokerINS0_9BindStateIM10MashRunnerFvN4mojo16InterfaceRequestIN15service_manager5mojom7ServiceEEEEJNS0_17UnretainedWrapperIS3_EEEEEFvS9_EE3RunEPNS0_13BindStateBaseEOS9_
#26 0x7f2e362c5dd4 service_manager::ChildProcessMainWithCallback()
#27 0x7f2e34ff6097 MashRunner::RunChild()
#28 0x7f2e34ff6227 MashMain()
#29 0x7f2e34ff317d ChromeMain
#30 0x7f2e32ad9fb6 __libc_start_main
#31 0x7f2e34ff2f74 <unknown>

Comment 3 by sadrul@chromium.org, Nov 15 2016

When you press ctrl+c to terminate chrome, it does not always kill all the processes. Probably something like that is happening here? If an older mus/gpu process is up and running, it probably retains some lock on some system resource, and the new mus/gpu process is unable to talk to it, causing this failure?

Some potentially interesting logs in the first backtrace:

[17978:18002:1026/142449:702046066:ERROR:vaapi_wrapper.cc(1217)] Could not get a valid VA display
[17978:18002:1026/142449:702046464:ERROR:generic_v4l2_device.cc(109)] SetDevicePollInterrupt(): write() failed: Bad file descriptor

Could those be relevant?
I think this could be my fault from https://codereview.chromium.org/2484223006/. Taking a look now.
Reverting that CL does fix the crash. The CL doesn't revert cleanly anymore though.

I'm still not getting any display output on device after reverting. It doesn't crash but DrmNativeDisplayDelegate is returning 0 displays now. I'm trying to figure out what is going on with that now.
I *think* I understand the original problem and crash. We always had a race a between gpu service initialization and display initialization. Normally gpu service init would win and it wasn't a problem. Sometimes display initialization would win and MusThreadProxy::StartDrmThread() wouldn't have happened yet. Maybe the delay in gpu service initialization was caused by a lock on system resources that wasn't cleared properly, as sadrul suggested.

I moved display initialization to happen earlier, causing the crash to happen every time. The right solution is to wait for GPU initialization to finish before starting display initialization, I think?
kylechar@ I'm not sure that I agree. But that's probably because I'm not understanding correctly. Maybe we need some kind of design note that clarifies the start up ordering of the different threads and subsystems in musws and musgpu?
Project Member

Comment 8 by bugdroid1@chromium.org, Nov 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7

commit 3e7dcd6974ae2895ba6cc2f38141c90338d20fc7
Author: kylechar <kylechar@chromium.org>
Date: Wed Nov 16 06:24:18 2016

Fix GPU service and display initialization race.

When running mustash on device with Ozone, display initialization was
happening before GPU service initialization was finished. The DRM thread
wasn't ready when DrmNativeDisplayDelegate went to use it.

The problem was made much worse by crrev.com/2484223006 when display
initialization was moved so it started earlier. With this patch, we
start display initialization when it's known to be safe.

BUG= 659804 

Review-Url: https://codereview.chromium.org/2498073003
Cr-Commit-Position: refs/heads/master@{#432388}

[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/service.cc
[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/service.h
[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/ws/test_utils.cc
[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/ws/test_utils.h
[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/ws/window_server.cc
[modify] https://crrev.com/3e7dcd6974ae2895ba6cc2f38141c90338d20fc7/services/ui/ws/window_server_delegate.h

Status: Fixed (was: Assigned)
This particular problem is fixed. Unfortunately mustash running with Ozone DRM still doesn't work, as DRM finds 0 displays. See  crbug.com/665544 .

Comment 10 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 11 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 12 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 14 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
Components: -Internals>MUS Internals>Services>WindowService

Sign in to add a comment