Project: chromium
Starred by 4 users
Status: Fixed
Closed: Oct 2013
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

issue 495906
issue 323391
issue 328249
issue 357644
issue 695643
issue 713947
issue 715658

[telemetry] Telemetry tests intermittently timing out
Project Member Reported by, Oct 18 2013
Some time during October 17, the Telemetry steps on the Linux GPU bots started timing out intermittently. The problem's happening on two separate bots and apparently only on Linux debug configurations, so it looks like a software rather than a hardware problem.

The two builders:

The first builds which demonstrated the problem:

This needs to be triaged right away. We mustn't let flakiness like this remain in the tree.

Comment 1 by, Oct 18 2013
Another example which shows a different Telemetry based step timing out:

Comment 2 by, Oct 18 2013
Here's a screenshot of the bot's results, since these will scroll off the bottom of the machine's logs within about a day. Please try to ignore the solid red at the top, which was caused by a Blink roll affecting a bad test. The webgl_conformance step even inside the red section was intermittently timing out and getting killed by the bot's harness.

Screen Shot 2013-10-18 at 11.44.27 AM.png (533 KB)
Comment 3 by, Oct 18 2013

Comment 4 by, Oct 18 2013
Spotted a similar failure on Linux Perf (1) running page_cycler.bloat
Comment 5 by, Oct 18 2013
It seems difficult to pin down a CL which started this. Looking back in the build logs for the Linux Debug (NVIDIA) bot on the GPU waterfall, this build is quite far before the failures started:

and this is the first one showing this failure:

This corresponds to a regression range of 229124:229202:

I don't see any suspect CLs in that regression range which would cause either the browser or Telemetry to hang. I think we need to try to directly reproduce this problem locally.

Comment 6 by, Oct 18 2013
I also don't see any suspect CLs in the tools workspace in that range:

Comment 7 by, Oct 18 2013
After discussion with the infra team, it's almost certain that one of the chrome sub-processes is hanging, failing to exit, and as a consequence Telemetry and the rest of the harnesses hang forever.

We're going to try to catch this failure as it happens on the bot, and figure out what exactly is hanging.

Comment 8 by, Oct 18 2013
The watcher script is in the home dir on the affected bot; it's capturing output into pstrees-log.txt in the same dir, and it's running now on the bot.

Comment 9 by, Oct 18 2013
It's been confirmed that chrome's utility process is the rogue sub-process which is sticking around too long.

In the pstree output, the harness's python process had two child processes which didn't show up in the ps output. It seems clear they were waiting for the utility process to exit. While the webgl_conformance_tests step was stuck, killing the utility process by hand got it unstuck and allowed the run to complete successfully.

Unassigning from dtu and trying to find a reason why this started failing recently, and an owner.

Comment 10 by, Oct 18 2013
+jorgelo who knows a lot about the utility process.

Will post more details shortly.

Comment 11 by, Oct 18 2013
Labels: Restrict-View-Google
I'm restricting the view of this bug in order to post details from the affected bot.

Comment 12 by, Oct 18 2013
Here are two ps/pstree outputs. One is from while (one of) the test steps was running -- I'm not positive it was the webgl_conformance_tests step -- and the other is from after the test claimed it had completed, but the step was hung. The harness's python process claims to have two child python processes, which seem to be dead -- they don't show up in ps's output, and I don't know why they're showing up in pstree. There's a stray Chrome utility process as PID 8012; the utility process is definitely the problem. After gathering this data, I killed that PID, which unstuck the step and allowed it to complete successfully. The build this happened on:

Note that the webgl_conformance_tests step took 17 minutes to run instead of the usual 4. It unstuck immediately upon killing the stray utility process.

(attachment, 44.1 KB)
(attachment, 29.5 KB)
Infra team, how does the test runner wait for its children to finish? Does it grep for a certain pattern, or does it just wait for its (process-tree) children?
Comment 14 by, Oct 18 2013
It waits for its children to return.

Source is here:

Line 948 -> _RunGTestCommand() on line 88

From there it calls RunCommand, line 740 of

The actual 'wait for process to end' is on line 960.
For those of us who don't hack on the browser itself that much, what exactly is the Chrome Utility process anyway? And should we file a blocking bug for "Chrome utility process fails to exit cleanly"?
Comment 16 by, Oct 18 2013
Here are:

 - Combined ps/pstree output during the most recent run while the utility process was stuck (with jorgelo's suggested format, including the parent PID)

 - A backtrace from the utility process while it was stuck, showing that it's stuck inside tcmalloc (!) deep inside process spawning.

5.4 KB View Download
24.1 KB View Download
How many times was the backtrace run, and did it give the same output? In other words, are we actually stuck inside tcmalloc, or are we stuck inside some loop higher up that happened to be calling tcmalloc when the stacktrace was taken?
We ran it a couple of times with the same output. Things seem to have deadlocked, or to be blocked on the lock.
If you look at the ps output, you see that the cmdline for the deadlocked process includes chrome, but the actual process name is "dconf worker".

dconf apparently launches a worker thread to read config info, which is exactly what's happening in the backtrace: it chains up to clone().
+cc Julien because he knows more about Linux process creation than I ever will.

Julien: we're seeing a runaway "dconf worker" thread which looks to be spawned from a utility process, when running Telemetry tests. The thread seems to be blocked in a spinlock, and the backtrace chains all the way up to clone().
Comment 21 by, Oct 19 2013
Now that I look at it again, the failure on perf doesn't look the same. That one failed at the very beginning of the reference test run, not at the end of a run. So this failure could still just be limited to Debug builds.
Comment 22 by, Oct 19 2013
Yes, this is the very typical bug that happens when someone calls a non-async-signal-safe function between fork() and execve().

If, when you fork(), another thread is holding a lock (for instance because it's allocating memory), that lock remains held in the child process. If the child process then tries to acquire the same lock before execve(), you have a deadlock.

glib/gspawn.c:947 calls opendir(), which is a bug. We need to fix that.
Comment 23 by, Oct 19 2013
Even broader regression range. If a code change caused this, it's almost certainly in this range:

Still not seeing anything obvious.

The GPU bots used to build with linux_use_tcmalloc=0 before the switch to recipes. Given that the thread in the backtrace above is stuck in an infinite loop in tcmalloc, perhaps a workaround would be to turn it off again.

Comment 24 by, Oct 19 2013
The lock in tcmalloc is really just a symptom, not the bug.

The bug is either in glib (which shouldn't call opendir()), or in the utility process, which shouldn't call into glib while multithreaded.

Has anyone already captured a backtrace of where the utility process calls into glib? We might be able to work around this by making sure we're not multithreaded there.
We haven't yet. Ken is running more tracing to see if we can better identify the deadlocked process's lifetime.
Comment 26 by, Oct 19 2013
Did we rule out this patch:
229174	[MediaGalleries] Add utility process handling for iphoto library parsing.
Comment 27 by, Oct 19 2013
So yeah, glib's fork_exec_with_pipes() function is broken and has been so for a while.

- We may be able to work around the immediate bug by somehow making sure to set G_SPAWN_LEAVE_DESCRIPTORS_OPEN, but that would be very brittle.

- A much better workaround is to figure out what is calling into glib and make sure we move glib's initialization much earlier in the process lifetime, before we have any threads.

gfx::GtkInitFromCommandLine() should be done very early in the utility process.
Comment 28 by, Oct 19 2013
On second thought, I don't see anything that would affect linux in the patch. I just got distracted by the fact that it said "utility process."
Comment 29 by, Oct 19 2013
I think the workaround might be to initialize glib in UtilityMain() before creating threads instead of doing so in UtilityThreadImpl::OnLoadPlugins().

I don't know anything about these things, so I'll defer to mpcomplete@ to see if it makes any sense.

If someone can reproduce the bug and can try the suggested workaround, that would be helpful.
Comment 30 by, Oct 19 2013
jorgelo@ suggested increasing the sampling rate to 10 times per second to be able to see the process's lifecycle better. I haven't yet seen a run where this problem happened, but interestingly, I *do* have logs from a run where a content_browsertests sub-process leaked in the same way. Attached is a file containing four ps/pstree snapshots:

1) When the leaked process (pid 15015) first started, as a child of content_browsertests
2) When its parent process (pid 15009) is becoming defunct
3) When its parent process has exited, and it's been reparented to init (pid 1)
4) Just before the machine rebooted, showing the orphaned child still in the same state

Interestingly, this sub-process didn't cause the python harnesses to hang. The main difference seems to be that the content_browsertests executable was sandboxed and the utility process wasn't.

I believe that the switch to Linux Aura will solve this problem conclusively by removing Chrome's dependency on Gtk and, presumably, glib. Elliot, can you confirm that this will be the case? If so, I don't think we should waste any time working around this flakiness, but instead invest that effort in finishing the switch to Aura and live with the flakiness in the meantime.

Comment 31 by, Oct 19 2013
Forgot the attachment.

90.7 KB View Download
If you git blame OnLoadPlugins(), you'll see who moved GLib init there ;-)

Let me take a look on Monday to see why I had to move it.
Comment 33 by, Oct 19 2013
It looks like Gtk initialization was moved into OnLoadPlugins in 2011. It's definitely not the reason why things started being flaky now, and I would not ask to move it earlier in process initialization given that Aura, and removal of Gtk dependencies, is the product's direction.

Comment 34 by, Oct 19 2013
jorgelo: I don't think there is a clear CL to blame.

GLib is broken with a subtle bug that I described earlier. The bug is inherently racy: tracking down why a thread now sometimes happens to be allocating memory exactly when glib forks, or even worse, why it's precisely a case where tcmalloc needs to hold a lock, is tricky or impossible. As far as we know, a compiler update might be all it took for the bug to suddenly trigger.

(I did git log / git blame around and I didn't see anything obvious)

kbr: I'm confident in my diagnosis, so if it's OK to just move gfx::GtkInitFromCommandLine() to UtilityMain() (i.e. if the fix is trivial), we should do that.

This is a real bug that will sometimes trigger on users' machines. ;-)

That's why we need to check.
Comment 36 by, Oct 21 2013
The GTK dependency will be removed...from chrome proper. For most people, there'll still be that small gtk shim for opening dialogs and getting desktop setting information.

I don't know if removing glib will ever be plausible. Message loops, proxy configuration, the internationalization stuff, talking to system menu and appindicator... It'll be hard enough to pull gtk out of the main binary. I don't want to think about removing glib.
To answer a previous question that Ken and I had: there's indeed one test in the webgl_conformance suite that spawns an instance of the utility process.
Comment 38 by, Oct 21 2013
Which one?

Comment 39 by, Oct 21 2013
Jorge, I'm assigning this bug to you to follow up on Julien's suggestion to move the Glib initialization earlier. I'm not qualified to make this change as I gather there were dependent sandboxing changes made on top of it. Please make this a top priority. The Linux GPU bots have suddenly turned flaky because of it and this is not an acceptable situation.

It's not clear whether moving GTK init earlier is possible. GTK init needs FS access, and when we fork from the Zygote that's not possible. I'm gonna cross-check my logs; maybe we can make the webgl tests not start the utility process.
Comment 41 by, Oct 21 2013
But OnLoadPlugins() still runs after the normal initialization, no?

It looks like we should just pre-init glib any time the sandbox is not engaged, which will include any case where OnLoadPlugins() might run.

We may even be able to do better: don't we know at startup that an instance of the utility process might need to load plugins?

I'm not sure what OnLoadPlugins does, but I don't think we know it at startup. rsesek wrote that code initially; maybe he knows.
MetricsService kicks off an initialization task (about 10s after startup, IIRC) to probe the list of plugins for UMA reporting. The list of installed plugins needs to be probed for compatibility (i.e. is this really an NPAPI module, can we get its version info, etc.). This used to be done in the browser process, which meant loading somebody else's code into the browser process address space. That caused crashes from buggy code, so I moved the plugin loading out of process via the utility process (unsandboxed). OnLoadPlugins is the IPC handler for the message the browser sends to attempt to load/probe a plugin. It will be called from UtilityMain via the UtilityThreadImpl.
Comment 44 by, Oct 21 2013
So the question is: how can we know in UtilityMain if we're the "MetricsService" code?

We need to be able to initialize glib iff the current utility process is "MetricsService". Is there a command line flag?
That we don't know; we don't know which utility IPC message we're gonna get. However, we do know whether we're sandboxed or not, so we could init GTK iff we're not sandboxed, before starting threads.
Sorry, should have been more clear. The first time that plugins are needed via PluginServiceImpl::GetPlugins(), the browser process will spawn this utility process on POSIX systems, to load the plugin list. Typically this is done by MetricsService at startup, but any web page that has a plugin on it could also cause this. The MetricsService lives in the browser process and is responsible for UMA uploads.
Comment 47 by, Oct 21 2013
I thought we did not re-use the utility process and ran one instance for every "operation". Shouldn't we be able to know which utility process we're running and for what?

Why is the utility process still generic, able to accept any of the IPCs, in that case? This same problem also prevents doing any meaningful sandboxing of the utility process in LinuxSandbox::InitializeSandbox().

Is there a fundamental reason for this or is it something that needs to be done?
The utility process is one-off: in UtilityThreadImpl::OnLoadPlugins(), it calls ReleaseProcessIfNeeded() to shut it down when all plugins are loaded. The ChildProcessHost will create another if plugins need to be reloaded. There's no reason why the utility process couldn't be specialized on a per-need basis using a --utility-operation=load-plugins flag, for example. Note that this is the only use in content/ of the utility process, but there are other uses in chrome/ via the ChromeContentUtilityClient.
Project Member Comment 49 by, Oct 23 2013
r230287 | | 2013-10-23T02:03:17.986382Z

Changed paths:

Utility process: initialize GTK when the sandbox is disabled.

TEST=Open the Web Store, utility process launches correctly.

Review URL:
Comment 50 by, Oct 23 2013
Thanks for that fix. It did improve the reliability of these bots, but the same problem resurfaced in the browser process. I'll take the bug back and fix that.

Comment 51 by, Oct 23 2013
Status: Started

Chrome hung after running the context_lost_tests through Telemetry. When it did, the following chrome processes were active:

    1  5917 chrome          /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/chrome --enable-gpu-benchmarking --disable-domain-blocking-for-3d-apis --disable-background-networking --metrics-recording-only --no-first-run --no-default-browser-check --no-proxy-server --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/27.0.1453.111 Safari/537.22 --disable-component-extensions-with-background-pages --remote-debugging-port=40778 --enable-crash-reporter-for-testing --window-size=1280,1024 --ppapi-flash-path=/b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/third_party/adobe/flash/binaries/ppapi/linux_x64/ --enable-net-benchmarking --user-data-dir=/tmp/tmpJQAoQ2 --type=sandbox-ipc
    1  5918 chrome_sandbox  /opt/chromium/chrome_sandbox /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/chrome --type=zygote --noerrdialogs --user-data-dir=/tmp/tmpJQAoQ2 --ppapi-flash-path=/b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/third_party/adobe/flash/binaries/ppapi/linux_x64/
 5918  5919 chrome          /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/chrome --type=zygote --noerrdialogs --user-data-dir=/tmp/tmpJQAoQ2 --ppapi-flash-path=/b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/third_party/adobe/flash/binaries/ppapi/linux_x64/
 5919  5923 nacl_helper     /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/nacl_helper
 5919  5924 chrome          /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/chrome --type=zygote --noerrdialogs --user-data-dir=/tmp/tmpJQAoQ2 --ppapi-flash-path=/b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/third_party/adobe/flash/binaries/ppapi/linux_x64/
    1  5928 dconf worker    /b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/out/Debug/chrome --enable-gpu-benchmarking --disable-domain-blocking-for-3d-apis --disable-background-networking --metrics-recording-only --no-first-run --no-default-browser-check --no-proxy-server --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/27.0.1453.111 Safari/537.22 --disable-component-extensions-with-background-pages --remote-debugging-port=40778 --enable-crash-reporter-for-testing --window-size=1280,1024 --ppapi-flash-path=/b/build/slave/GPU_Linux__dbg___NVIDIA_/build/src/third_party/adobe/flash/binaries/ppapi/linux_x64/ --enable-net-benchmarking --user-data-dir=/tmp/tmpJQAoQ2

Stack trace from the sandbox-ipc process:

Thread 1 (Thread 0x7f2b4e18f9c0 (LWP 5917)):
#0  0x00007f2b5410d2d8 in __GI___poll (fds=0x7fff38a23720, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:83
#1  0x00007f2b59f09323 in content::SandboxIPCProcess::Run (this=0x7fff38a23f20)
    at ../../content/browser/renderer_host/
#2  0x00007f2b59f08d7f in content::RenderSandboxHostLinux::Init (this=0x7f2b4df8e5c0, sandbox_path=...)
    at ../../content/browser/renderer_host/
#3  0x00007f2b59b47776 in content::(anonymous namespace)::SetupSandbox (parsed_command_line=...)
    at ../../content/browser/
#4  0x00007f2b59b48caf in content::BrowserMainLoop::EarlyInitialization (this=0x7f2b4e0b7020)
    at ../../content/browser/
#5  0x00007f2b59b51dc6 in content::BrowserMainRunnerImpl::Initialize (this=0x7f2b4df91200, parameters=...)
    at ../../content/browser/
#6  0x00007f2b59b46fda in content::BrowserMain (parameters=...) at ../../content/browser/
#7  0x00007f2b59b16de7 in content::RunNamedProcessTypeMain (process_type=..., main_function_params=..., 
    delegate=0x7fff38a24d30) at ../../content/app/
#8  0x00007f2b59b17fb6 in content::ContentMainRunnerImpl::Run (this=0x7f2b4e0e47a0)
    at ../../content/app/
#9  0x00007f2b59b16277 in content::ContentMain (argc=16, argv=0x7fff38a24e98, delegate=0x7fff38a24d30)
    at ../../content/app/
#10 0x00007f2b6698e9bd in ChromeMain (argc=16, argv=0x7fff38a24e98) at ../../chrome/app/
#11 0x00007f2b6698e97c in main (argc=16, argv=0x7fff38a24e98) at ../../chrome/app/

Stack trace from the nacl_helper process:
Thread 1 (process 5923):
#0  0x00007f89cbb8b210 in ?? ()
#1  0x00007f89ccb11da9 in ?? ()
#2  0x00007fff1e9b7240 in ?? ()
#3  0x0000000000000800 in ?? ()
#4  0x00007fff1e9b7280 in ?? ()
#5  0x0000000300000000 in ?? ()
#6  0x00007fff1e9b6ae0 in ?? ()
#7  0xffff8075c3436114 in ?? ()
#8  0x00007fff1e9b6ae0 in ?? ()
#9  0x0000000000000000 in ?? ()

The stack trace from the browser process (pid 5928) shows that it's stuck in the same way the utility process used to be:
Thread 1 (Thread 0x7f2b4c74b700 (LWP 5928)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
#1  0x00007f2b669a8c21 in base::internal::SpinLockDelay (w=0x7f2b69f48b48, value=2, loop=102855)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock_linux-inl.h:97
#2  0x00007f2b669a8a17 in SpinLock::SlowLock (this=0x7f2b69f48b48)
    at ../../third_party/tcmalloc/chromium/src/base/
#3  0x00007f2b66993159 in SpinLock::Lock (this=0x7f2b69f48b48)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock.h:75
#4  0x00007f2b669a4329 in MallocBlock::Initialize (this=0x2a7a2013a000, size=32808, type=-271733872)
    at ../../third_party/tcmalloc/chromium/src/
#5  0x00007f2b669a4b36 in MallocBlock::Allocate (size=32808, type=-271733872)
    at ../../third_party/tcmalloc/chromium/src/
#6  0x00007f2b669a2079 in DebugAllocate (size=32808, type=-271733872)
    at ../../third_party/tcmalloc/chromium/src/
#7  0x00007f2b669a5ac3 in debug_cpp_alloc (size=32808, new_type=-271733872, nothrow=true)
    at ../../third_party/tcmalloc/chromium/src/
#8  0x00007f2b669a5b4f in do_debug_malloc_or_debug_cpp_alloc (size=32808)
    at ../../third_party/tcmalloc/chromium/src/
#9  0x00007f2b68b07fa5 in tc_malloc (size=32808)
    at ../../third_party/tcmalloc/chromium/src/
#10 0x00007f2b540e051b in __alloc_dir (fd=11, close_fd=true, flags=<optimized out>, statp=<optimized out>)
    at ../sysdeps/unix/opendir.c:202
#11 0x00007f2b55d60b8a in fdwalk (cb=<optimized out>, data=<optimized out>)
    at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:947
#12 do_exec (child_err_report_fd=13, stdin_fd=-1, stdout_fd=15, stderr_fd=17, 
    working_directory=<optimized out>, argv=0x2a7a2012f620, envp=0x0, close_descriptors=1, search_path=1, 
    stdout_to_null=0, stderr_to_null=0, child_inherits_stdin=0, file_and_argv_zero=0, child_setup=0, 
    user_data=0x0) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:1061
#13 0x00007f2b55d613b0 in fork_exec_with_pipes (intermediate_child=0, working_directory=0x0, 
    argv=0x2a7a2012f620, envp=0x0, close_descriptors=1, search_path=1, stdout_to_null=0, stderr_to_null=0, 
    child_inherits_stdin=0, file_and_argv_zero=0, child_setup=0, user_data=0x0, child_pid=0x7f2b4c74a868, 
    standard_input=0x0, standard_output=0x7f2b4c74a860, standard_error=0x7f2b4c74a864, error=0x7f2b4c74a9a8)
    at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:1324
#14 0x00007f2b55d61a74 in g_spawn_sync (working_directory=<optimized out>, argv=<optimized out>, 
    envp=<optimized out>, flags=<optimized out>, child_setup=<optimized out>, user_data=0x0, 
    standard_output=0x7f2b4c74a938, standard_error=0x7f2b4c74a940, exit_status=0x7f2b4c74a94c, 
    error=0x7f2b4c74a9a8) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:285
#15 0x00007f2b55d62179 in g_spawn_command_line_sync (command_line=<optimized out>, 
    standard_output=0x7f2b4c74a938, standard_error=0x7f2b4c74a940, exit_status=0x7f2b4c74a94c, 
    error=0x7f2b4c74a9a8) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:722
#16 0x00007f2b5205a789 in get_session_address_dbus_launch (error=0x7f2b4c74a9a8)
    at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1060
#17 0x00007f2b5205c1da in get_session_address_platform_specific (error=0x7f2b4c74a9a8)
    at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1451
#18 g_dbus_address_get_for_bus_sync (bus_type=G_BUS_TYPE_SESSION, cancellable=<optimized out>, 
    error=0x7f2b4c74aa88) at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1534
#19 0x00007f2b52064fae in get_uninitialized_connection (bus_type=G_BUS_TYPE_SESSION, cancellable=0x0, 
    error=0x7f2b4c74aa88) at /build/buildd/glib2.0-2.32.3/./gio/gdbusconnection.c:6750
#20 0x00007f2b5206c4bb in g_bus_get_sync (bus_type=<optimized out>, cancellable=0x0, error=0x7f2b4c74aa88)
    at /build/buildd/glib2.0-2.32.3/./gio/gdbusconnection.c:6807
#21 0x00007f2b4c750caa in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#22 0x00007f2b4c751087 in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#23 0x00007f2b55d23d53 in g_main_dispatch (context=0x7f2b4e0dfb60)
    at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:2539
#24 g_main_context_dispatch (context=0x7f2b4e0dfb60) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3075
#25 0x00007f2b55d240a0 in g_main_context_iterate (dispatch=1, block=<optimized out>, context=0x7f2b4e0dfb60, 
    self=<optimized out>) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3146
#26 g_main_context_iterate (context=0x7f2b4e0dfb60, block=<optimized out>, dispatch=1, self=<optimized out>)
    at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3083
#27 0x00007f2b55d2449a in g_main_loop_run (loop=0x2a7a20128fc0)
    at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3340
#28 0x00007f2b4c75098b in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#29 0x00007f2b55d459e5 in g_thread_proxy (data=0x7f2b4dfa6de0)
    at /build/buildd/glib2.0-2.32.3/./glib/gthread.c:801
#30 0x00007f2b5566ee9a in start_thread (arg=0x7f2b4c74b700) at pthread_create.c:308
#31 0x00007f2b54118ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#32 0x0000000000000000 in ?? ()

Gtk/Glib initialization in the browser process must be moved earlier, before any threads are started.

Project Member Comment 52 by, Oct 23 2013
r230547 | | 2013-10-23T23:19:34.466362Z

Changed paths:

Move Gtk initialization earlier in BrowserMain.

In order to work around bugs in glib, it must be initialized before
any threads are started in the process. If this is not done then the
process will intermittently hang upon exit.


Review URL:
Comment 53 by, Oct 24 2013
This just flaked again in . The webgl_conformance_tests step got stuck because of the utility process. Here's the pstree output:

 2262  6078 python          python -u /b/build/scripts/slave/ --target Debug --build-dir /b/build/slave/Linux_Debug__NVIDIA_/build/src/out --no-xvfb --factory-prope
rties {"blamelist": ["", "", "", "", "", ""], "branch": 
"src", "build_config": "Debug", "buildername": "Linux Debug (NVIDIA)", "buildnumber": 20764, "generate_gtest_json": true, "mastername": "chromium.gpu", "project": "", "recipe
": "gpu", "repository": "svn://", "revision": "230583", "scheduler": "gpu", "slavename": "gpulin2", "test_results_server": "test-resu", "workdir": "/b/build/slave/Linux_Debug__NVIDIA_"} --build-properties {"blamelist": ["", "", "", "p", "", ""], "branch": "src", "build_config": "Debug", "buildername": "Linux Debug (NVIDIA)", "buildnumber": 20764, 
"generate_gtest_json": true, "mastername": "chromium.gpu", "project": "", "recipe": "gpu", "repository": "svn://", "revision": "23058
3", "scheduler": "gpu", "slavename": "gpulin2", "test_results_server": "", "workdir": "/b/build/slave/Linux_Debug__NVIDIA_"} --annotate=gtest --test-t
ype=webgl_conformance_tests --generate-json-file --results-directory=/b/build/slave/Linux_Debug__NVIDIA_/build/gtest-results/webgl_conformance_tests --build-number=20764 --bu
ilder-name=Linux Debug (NVIDIA) --run-python-script /b/build/slave/Linux_Debug__NVIDIA_/build/src/content/test/gpu/run_gpu_test webgl_conformance --show-stdout --output-forma
t=gtest --browser=debug --webgl-conformance-version=1.0.1
    1  6268 dconf worker    /b/build/slave/Linux_Debug__NVIDIA_/build/src/out/Debug/chrome --type=utility --channel=6101.3.1909115831 --lang=en-US --no-sandbox --utility-allo
wed-dir --enable-crash-reporter=,Ubuntu 12.04.3 LTS --noerrdialogs --user-data-dir=/tmp/tmpcPSmhH

and a gdb stack trace for the utility process's only thread at the time:

Thread 1 (Thread 0x7fac42554700 (LWP 6268)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
#1  0x00007fac5ccbbd81 in base::internal::SpinLockDelay (w=0x7fac60274b48, value=2, loop=138061) at ../../third_party/tcmalloc/chromium/src/base/spinlock_linux-inl.h:97
#2  0x00007fac5ccbbb77 in SpinLock::SlowLock (this=0x7fac60274b48) at ../../third_party/tcmalloc/chromium/src/base/
#3  0x00007fac5cca62b9 in SpinLock::Lock (this=0x7fac60274b48) at ../../third_party/tcmalloc/chromium/src/base/spinlock.h:75
#4  0x00007fac5ccb7489 in MallocBlock::Initialize (this=0x1aa0cabbf000, size=32808, type=-271733872) at ../../third_party/tcmalloc/chromium/src/
#5  0x00007fac5ccb7c96 in MallocBlock::Allocate (size=32808, type=-271733872) at ../../third_party/tcmalloc/chromium/src/
#6  0x00007fac5ccb51d9 in DebugAllocate (size=32808, type=-271733872) at ../../third_party/tcmalloc/chromium/src/
#7  0x00007fac5ccb8c23 in debug_cpp_alloc (size=32808, new_type=-271733872, nothrow=true) at ../../third_party/tcmalloc/chromium/src/
#8  0x00007fac5ccb8caf in do_debug_malloc_or_debug_cpp_alloc (size=32808) at ../../third_party/tcmalloc/chromium/src/
#9  0x00007fac5ee2abf5 in tc_malloc (size=32808) at ../../third_party/tcmalloc/chromium/src/
#10 0x00007fac4a3df51b in __alloc_dir (fd=10, close_fd=true, flags=<optimized out>, statp=<optimized out>) at ../sysdeps/unix/opendir.c:202
#11 0x00007fac4c05fb8a in fdwalk (cb=<optimized out>, data=<optimized out>) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:947
#12 do_exec (child_err_report_fd=11, stdin_fd=-1, stdout_fd=13, stderr_fd=15, working_directory=<optimized out>, argv=0x1aa0cabbc4a0, envp=0x0, close_descriptors=1, 
    search_path=1, stdout_to_null=0, stderr_to_null=0, child_inherits_stdin=0, file_and_argv_zero=0, child_setup=0, user_data=0x0)
    at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:1061
#13 0x00007fac4c0603b0 in fork_exec_with_pipes (intermediate_child=0, working_directory=0x0, argv=0x1aa0cabbc4a0, envp=0x0, close_descriptors=1, search_path=1, 
    stdout_to_null=0, stderr_to_null=0, child_inherits_stdin=0, file_and_argv_zero=0, child_setup=0, user_data=0x0, child_pid=0x7fac42553868, standard_input=0x0, 
    standard_output=0x7fac42553860, standard_error=0x7fac42553864, error=0x7fac425539a8) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:1324
#14 0x00007fac4c060a74 in g_spawn_sync (working_directory=<optimized out>, argv=<optimized out>, envp=<optimized out>, flags=<optimized out>, child_setup=<optimized out>, 
    user_data=0x0, standard_output=0x7fac42553938, standard_error=0x7fac42553940, exit_status=0x7fac4255394c, error=0x7fac425539a8)
    at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:285
#15 0x00007fac4c061179 in g_spawn_command_line_sync (command_line=<optimized out>, standard_output=0x7fac42553938, standard_error=0x7fac42553940, 
    exit_status=0x7fac4255394c, error=0x7fac425539a8) at /build/buildd/glib2.0-2.32.3/./glib/gspawn.c:722
#16 0x00007fac48359789 in get_session_address_dbus_launch (error=0x7fac425539a8) at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1060
#17 0x00007fac4835b1da in get_session_address_platform_specific (error=0x7fac425539a8) at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1451
#18 g_dbus_address_get_for_bus_sync (bus_type=G_BUS_TYPE_SESSION, cancellable=<optimized out>, error=0x7fac42553a88)
    at /build/buildd/glib2.0-2.32.3/./gio/gdbusaddress.c:1534
#19 0x00007fac48363fae in get_uninitialized_connection (bus_type=G_BUS_TYPE_SESSION, cancellable=0x0, error=0x7fac42553a88)
    at /build/buildd/glib2.0-2.32.3/./gio/gdbusconnection.c:6750
#20 0x00007fac4836b4bb in g_bus_get_sync (bus_type=<optimized out>, cancellable=0x0, error=0x7fac42553a88) at /build/buildd/glib2.0-2.32.3/./gio/gdbusconnection.c:6807
#21 0x00007fac42559caa in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#22 0x00007fac4255a087 in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#23 0x00007fac4c022d53 in g_main_dispatch (context=0x7fac443de890) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:2539
#24 g_main_context_dispatch (context=0x7fac443de890) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3075
#25 0x00007fac4c0230a0 in g_main_context_iterate (dispatch=1, block=<optimized out>, context=0x7fac443de890, self=<optimized out>)
    at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3146
#26 g_main_context_iterate (context=0x7fac443de890, block=<optimized out>, dispatch=1, self=<optimized out>) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3083
#27 0x00007fac4c02349a in g_main_loop_run (loop=0x1aa0caba75c0) at /build/buildd/glib2.0-2.32.3/./glib/gmain.c:3340
#28 0x00007fac4255998b in ?? () from /usr/lib/x86_64-linux-gnu/gio/modules/
#29 0x00007fac4c0449e5 in g_thread_proxy (data=0x7fac44295de0) at /build/buildd/glib2.0-2.32.3/./glib/gthread.c:801
#30 0x00007fac4b96de9a in start_thread (arg=0x7fac42554700) at pthread_create.c:308
#31 0x00007fac4a417ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#32 0x0000000000000000 in ?? ()

Is it possible the GTK initialization was not moved up early enough, and another thread has already been started by the time it's reached?
Comment 54 by, Oct 24 2013
Note that the reason this build succeeded was that I logged on to the machine and killed the Chrome process.

Are we sure this bot has the changes? There's really nothing else happening: creating a MessageLoop, setting the current (and hopefully only) thread name, and then starting GTK.
Comment 56 by, Oct 24 2013
I'm 99.9% sure it has the changes. This was the first build after r230547 which flaked. r230547 addressed the remaining GTK initialization issue in the browser process.

Is it at least less frequent?
Comment 58 by, Oct 24 2013
The hangs are occurring less frequently but still far too often: roughly once every four hours, which is not acceptable for bots intended to run on the main waterfall.

Within the past couple of weeks this bot was running reliably for 24 hours at a stretch. It's very strange that this problem started just recently with no apparent related code changes.

Running with --utility-cmd-prefix='xterm -e gdb --args' confirms that the utility process is single-threaded at the time GTK is initialized, so jorge, I'm taking this bug back from you.

GLib/GTK initialization does start *two* threads: one apparently for libgio, and the other for libdconfsettings. The libdconfsettings thread is started first, and interestingly, it's the one that's stuck when the utility process hang occurs. Maybe the two threads race during startup, and if the libgio thread gets farther along, it performs a malloc() that leaves the libdconfsettings thread wedged.

Here's some output from gdb showing the state of the utility process right after the call to gfx::GtkInitFromCommandLine:

[New Thread 0x7fffdd644700 (LWP 31390)]
[New Thread 0x7fffdce43700 (LWP 31391)]

Thread 3 (Thread 0x7fffdce43700 (LWP 31391)):
#0  0x00007fffe5d0da43 in __GI___poll (fds=<optimized out>, 
    nfds=<optimized out>, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007fffe7924ff6 in ?? () from /lib/x86_64-linux-gnu/
#2  0x00007fffe792545a in g_main_loop_run ()
   from /lib/x86_64-linux-gnu/
#3  0x00007fffe3a8a5e6 in ?? () from /usr/lib/x86_64-linux-gnu/
#4  0x00007fffe79469b5 in ?? () from /lib/x86_64-linux-gnu/
#5  0x00007fffe726fe9a in start_thread (arg=0x7fffdce43700)
    at pthread_create.c:308
#6  0x00007fffe5d193fd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffdd644700 (LWP 31390)):
#0  0x00007fffe5d0da43 in __GI___poll (fds=<optimized out>, 
    nfds=<optimized out>, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007fffe7924ff6 in ?? () from /lib/x86_64-linux-gnu/
#2  0x00007fffe792545a in g_main_loop_run ()
   from /lib/x86_64-linux-gnu/
#3  0x00007fffdd64998b in ?? ()
   from /usr/lib/x86_64-linux-gnu/gio/modules/
#4  0x00007fffe79469b5 in ?? () from /lib/x86_64-linux-gnu/
#5  0x00007fffe726fe9a in start_thread (arg=0x7fffdd644700)
    at pthread_create.c:308
#6  0x00007fffe5d193fd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fffdfba1980 (LWP 31287)):
#0  content::UtilityMain (parameters=...)
    at ../../content/utility/
#1  0x00007fffeb71d327 in content::RunNamedProcessTypeMain (
    process_type="utility", main_function_params=..., delegate=0x7fffffffe0b0)
    at ../../content/app/
#2  0x00007fffeb71e4f6 in content::ContentMainRunnerImpl::Run (
    this=0x7fffdfaf5820) at ../../content/app/
#3  0x00007fffeb71c7b7 in content::ContentMain (argc=6, argv=0x7fffffffe218, 
    delegate=0x7fffffffe0b0) at ../../content/app/
#4  0x0000555555b1b99d in ChromeMain (argc=6, argv=0x7fffffffe218)
    at ../../chrome/app/
#5  0x0000555555b1b95c in main (argc=6, argv=0x7fffffffe218)
    at ../../chrome/app/

Maybe we do need to fix the issue in GLib?
Comment 60 by, Oct 24 2013
Very frustrating.

A few bad options:

- Maybe with environment tweaking we could make sure that GLib doesn't start threads

- Initializing DBUS ourselves earlier, prior to even starting Chrome, with the proper environment set up, might take care of the DBUS issue. As Ken pointed out, there is also a "dconfsettings" issue.

- Fix GLib's gspawn()
Comment 61 by, Oct 24 2013
Regarding initializing dbus: are you talking about glib's internal dbus stuff? Because we (regretfully) use that in the linux port. In the gtk port, libdbusmenu-gtk talks to the glib dbus stuff behind our backs, and in the aura port, we need to do some setup before handing a glib dbus object to libdbusmenu-glib.

If you call dbus_g_thread_init(), do things get fixed?
Theoretically, posix_spawn() is a thread-safe alternative to fork()/exec(). It has a much more restrictive feature-set, but sometimes that's all that is needed.

I don't have first-hand experience with the implementation of posix_spawn() in glibc, though, so it's possible that the promised thread-safety doesn't actually hold. But for now, let's assume that it is safer than fork()/exec().

It might be possible to re-implement g_spawn() in terms of posix_spawn(). But I haven't drilled down in detail to see how difficult it would be to make this change. And of course, any other code that tries to fork() in a multi-threaded environment is likely to run into the exact same problem. So, fixing g_spawn() might not be sufficient.
Comment 63 by, Oct 24 2013
erg: I mean starting the dbus daemon. If we started the dbus daemon ourselves, wouldn't GLib re-use it instead of trying to start one with a broken fork()/exec()?
Comment 64 by, Oct 24 2013
The code path taken inside glib's initialization attempts to establish a G_BUS_TYPE_SESSION dbus connection inside g_dbus_address_get_for_bus_sync. That leads to the later fork/exec, whose child illegally calls malloc() and gets stuck in a system call, because another thread was holding that futex at the moment the child was forked.

Before attempting to spawn the dbus-launch binary (inside get_session_address_dbus_launch), g_dbus_address_get_for_bus_sync queries the environment variable DBUS_SESSION_BUS_ADDRESS. Would it be possible for Chrome to initialize dbus itself and set this environment variable to the resulting address before calling gtk_init?

Comment 65 by, Oct 24 2013
kbr: when I said initialize DBUS earlier, I meant even before Chrome.

Basically in your test script, just do "eval `dbus-launch`" and "export DBUS_SESSION_BUS_ADDRESS" "export DBUS_SESSION_BUS_PID"

Hopefully that would at least get rid of glib trying to launch dbus.

But the dconf issue will remain.
Comment 66 by, Oct 24 2013
If we are going to go down that route then I would want to add code to Chrome to do it, not in our test harness, because Chrome would still be susceptible to the bug. Or are you suggesting to try this as a workaround and see whether it solves the problem before going further?

Comment 67 by, Oct 24 2013
My expectation is that somehow the headless mode is not setting DBUS_SESSION_BUS_ADDRESS, but I would expect this to always be set on users' machines.

Maybe what we should add to Chrome is a DCHECK that DBUS_SESSION_BUS_ADDRESS is set? But yeah, I would be curious to know whether this fixes anything. Have you already checked whether that environment variable pointed at a responsive dbus before starting Chrome?

Perhaps dbus got OOM-killed because the machine is overloaded?
Comment 68 by, Oct 25 2013
Aha! Thank you for pointing that out.

Yes, DEBUS_SESSION_BUS_ADDRESS is not set in the environment running the buildbot scripts. It is set when logging on to the machine's console and launching the terminal. It isn't set when ssh'ing in to any Linux machine.

I'll ask the infrastructure team whether this is something we can change globally on the bots or whether I should work around this in our test launching scripts.

Comment 69 by, Oct 25 2013
...and, to be clear, DBUS_SESSION_BUS_ADDRESS isn't set either. (Typo in previous comment.)

Comment 70 by, Oct 25 2013
Just to try things out: what happens if you do something along the lines of:

eval `dbus-launch`

in the script ? Does that fix things?
Project Member Comment 71 by, Oct 25 2013
r230897 | | 2013-10-25T01:20:21.626979Z

Changed paths:

Add DBUS_SESSION_BUS_ADDRESS to allowed slave env vars.,,

Review URL:
Comment 72 by, Oct 25 2013
@jln: trying your suggestion in . I have a feeling DBUS_SESSION_BUS_ADDRESS isn't set at all in the buildbot slave's parent environment.

Project Member Comment 73 by, Oct 25 2013
r230904 | | 2013-10-25T01:54:11.642863Z

Changed paths:

Try to work around bug in glib by ensuring DBUS_SESSION_BUS_ADDRESS is set.


Review URL:
Project Member Comment 74 by, Oct 25 2013
r231085 | | 2013-10-25T19:32:23.100834Z

Changed paths:

Added GLib/DBus workaround to under --do-dbus-workaround option.

It's likely we'll want to do this on all the bots and just remove
this command line argument altogether, but I'm concerned about
destabilizing the existing bots that don't reboot between each run.
For the moment, use this argument only on the GPU bots.

It's known to be needed generally, and not specifically for Telemetry
tests, due to failures like this one in content_gl_tests:


Review URL:
Project Member Comment 75 by, Oct 26 2013
r231172 | | 2013-10-26T01:04:46.241648Z

Changed paths:

Remove dbus-launch workaround now that it's been propagated to


Review URL:
Comment 76 by, Oct 26 2013
Status: Fixed
Thank you all for your help tracking down the root cause of this issue and coming up with both good code changes and workarounds. These bots are now running reliably again.

Project Member Comment 77 by, Nov 19 2013
r235869 | | 2013-11-19T01:17:02.196328Z

Changed paths:

Spawn DBus by default as a workaround on all Linux bots.

It seems likely that the intermittent hangs of browser_tests with
Linux Aura have the same root cause as Issue 309093. Since that
workaround worked well on the affected bots, enable it for all Linux bots.


Review URL:
Comment 78 by, Nov 26 2013
Lei is reporting that the memory.FYI bots (which do not reboot) are accumulating leftover dbus-daemon processes.
Comment 79 by, Nov 26 2013
Blocking: chromium:323391
Comment 80 by, Nov 26 2013
I've taken Issue 323391 regarding the dbus-daemon processes sticking around.

Comment 81 by, Dec 13 2013
Blocking: chromium:328249
Comment 82 by, Mar 28 2014
Blocking: chromium:357644
Comment 83 by, Apr 15 2015
Labels: -Restrict-View-Google
There's no confidential information in this bug report. Unrestricting.

Blocking: chromium:495906
If I'm understanding the bug trail here correctly, you identified a real issue in glib that bites real Chromium and Chrome releases (and probably a lot of other clients too), and added a workaround to your test runner so the bot runs are reliable, but the bug in the real shipping code remains, and could still bite real users.

(Actually: it's unlikely to bite very many interactive users who run with a full desktop environment that's already started dbus, but it's very likely to bite users who try to use buildbot+Xvfb+Chrome exactly the way the Chromium project does, say, to use karma or protractor to automate tests of some other web software.)

That last sentence applies to us; we do test web software in karma+Chrome and were bitten by this exact bug. It's actually pretty easy to reproduce with no bots necessary, if you just start Xvfb+Chrome in a loop.

Any chance of fixing this on the Chrome side or did you get any traction opening bugs with glib or dbus where the actual issue is? Any chance of reopening this bug report until the actual bug in the shipping software, not just the test runner, is fixed; or should I file a new bug?
Comment 86 by, Jul 7 2015
#85: I'm sorry for not following through and filing a bug against glib at the time this was fixed. We won't reopen this bug, but I'll figure out how to file a bug against glib -- or if you have any suggestions on how to do so, please email me. Thanks.

Please see for instructions on how to submit a glib bug report: "If you encounter a bug, misfeature, or missing feature in GLib, please file a bug report on"
Comment 88 Deleted
Comment 89 Deleted
Comment 90 Deleted
Comment 91 Deleted
Comment 92 by, Jun 16 2016
Labels: Restrict-AddIssueComment-EditIssue
It turns out that a bug was filed a while ago under the GNOME project as . Please comment on that bug to indicate the severity.

There is no workaround possible in Chromium aside from writing a wrapper script to ensure that DBUS_SESSION_BUS_ADDRESS is set before starting the browser.

I'm closing this bug to further comments. Please push the GNOME developers for a fix.

Blocking: 713947
Blocking: 695643
Blocking: 715658