dbus autolaunch causes chrome to hang |
||||
Issue descriptionGeneral problem: some third party libraries we depend on (glibc and friends) do a sequence of fork() -> malloc() -> exec(). This can cause deadlocks (full explanation in Issue 695643 ) if another thread is in a malloc() at the time of the fork(). This is because chromium's tcmalloc doesn't have the atfork handler which was later introduced in tcmalloc [1]. I looked into that and backporting the fix is non-trivial. chromium's tcmalloc is lagging behind and has some other security patches on top, so a backport would be a stability risk. Also, the real problem here is that these libraries shouldn't just do anything else other than an exec() after a fork(). tcmalloc is only one of the many possible problems here, and changing it will not solve the others. There is a very specific instance of this problem: dbus. the dbus client library has this annonying feature of auto-launching dbus when it's not there in the desktop session. This happens very frequently in test environment which run under xvfb and don't have a full session manager with dbus. This struck internal and external developers in a number of occasions, such as: - Issue 693668 - https://github.com/w3c/web-platform-tests/issues/5406 - https://bugs.chromium.org/p/chromedriver/issues/detail?id=1699 A proposal here is to workaround this specific problem by disabling dbus autolaunch. My understanding is that dbus is not a hard requirement for chrome, and we can live without it. If this is the case, let's figure out what is the best way to prevent dbus autolaunch. I have two proposals, but don't know too much about it to judge the side effects , would appreciate more knowledgeable people to jump in. Option 1) if getenv(DBUS_SESSION_BUS_ADDRESS) == "" setenv(DBUS_SESSION_BUS_ADDRESS, "/dev/null") This will effectively disable the autolaunch feature when chrome is invoked outside a desktop session (e.g., if you launch chrome from ssh or a cron job). There is a potential side effect here: in this case, the dbus auto launch would have first tried to connect to one of the existing sessions under ~/.dbus/ and eventually found one if there was a real desktop session somewhere. forcing the DBUS_SESSION_BUS_ADDRESS to /dev/null will prevent the dbus-launch to early out sooner. Option 2) Force a dbus connection in browser_main_loop or somewhere else very early, before we start any other thread. This will kick the auto-launch in a controlled environment when we know that no other thread can be causing a malloc (Because there is no other thread). I don't know if this will prevent any future dbus-launch attempt. [1] https://github.com/gperftools/gperftools/issues/499
,
Apr 26 2017
Bonus question, if the proposals above don't fly, is there any cmdline flag we could use to detect that we are in a testing harness, so we could clear DBUS_SESSION_BUS_ADDRESS only in testing cases?
,
Apr 26 2017
Most of Chromium's tests pass the --test-type flag into the main process, and that's used to tell whether or not we're in a testing environment. See Issue 537776 .
,
Apr 26 2017
One point: in gspawn.c: https://github.com/GNOME/glib/blob/master/glib/gspawn.c Fixing this bug would amount to refactoring fork_exec_with_pipes and do_exec so that if "close_descriptors" is true, it calls a helper function to do the fdwalk and gather the results *before* it calls fork(). At least, I think that's the correct fix.
,
Apr 27 2017
kbr@ I think I have missed some other bug here, where is that gspawn called from? Last time I checked dbus (see Issue 695643 ) it has its own implementation of "fork+pipe+exec", where they end up malloc-ing via opendir/readdir [1] entirely from libdbus. libdbus doesn't seem to use gspawn, but maybe something else does? (in which case now we have two problems to deal with, not just one :/) > Fixing this bug would amount to refactoring fork_exec_with_pipes and do_exec so that if "close_descriptors" is true, it calls a helper function to do the fdwalk and gather the results *before* it calls fork(). > At least, I think that's the correct fix. Well not quite. You don't really want to close all file descriptor before fork()-ing, because doing so will just tear down all resources in the parent process that called gspawn (say, chrome), which essentially would mean: using gspawn() causes the caller process to become unusable, which is quite undesirable :) The common practice is to close() after fork() to make sure that the child process doesn't leak / keep refcounted resources and capabilities from the parent process. close() itself is harmless. The real problem here is: - fork()s should be made only in a controlled environment (say before bringing up any threads) - if one (say libdbus in our case) wants to take the risk and fork() at any random point (when there are other threads) at very least they have to be careful and not using any function that uses malloc() (which is quite tricky because lot of gblibc functions do these days). Specifically with reference to the example above, the less-harmful thing to do would be IMHO just for(i in [3, MAX_FDs]) close(i). (Which ironically, is what dbus does as a fallback in NON-Linux cases (See https://cgit.freedesktop.org/dbus/dbus/tree/dbus/dbus-sysdeps-unix.c?id=d42d167cc3ba0d20aa891a75aa1cd80ec5f490f1#n4413)) The bug we are hitting with dbus is because on Linux they are attempting to do something too clever and they use opendir(/proc/self/fds) to reduce the number of close() calls. See https://cgit.freedesktop.org/dbus/dbus/tree/dbus/dbus-sysdeps-unix.c?id=d42d167cc3ba0d20aa891a75aa1cd80ec5f490f1#n4368 [1] https://cgit.freedesktop.org/dbus/dbus/tree/dbus/dbus-sysdeps-unix.c?id=d42d167cc3ba0d20aa891a75aa1cd80ec5f490f1#n4358
,
Apr 27 2017
I filed a bug upstream against libdbus: https://bugs.freedesktop.org/show_bug.cgi?id=100843
,
Apr 27 2017
And there seem to be a similar discussion in glib about gspawn: https://bugzilla.gnome.org/show_bug.cgi?id=746604 I had some back and forth with the libdbus maintainer (See https://bugs.freedesktop.org/show_bug.cgi?id=100843). I think his ultimate opinion is "this is bad, but I'd stick with whatever glib does". I feel that both libdbus and glib make very aggressive assumptions (i.e. everybody uses glib's malloc) and fork()+malloc() is IMHO a really unorthodox practice. However, in all honesty, I can understand the position of a library (dbus) maintainer to just stick with whatever glib does. On other news, I asked about the idea above of clobbering DBUS_SESSION_BUS_ADDRESS when not set, copy-pasting from comment 5 there: ----- One thing that this could easily break, though: recent versions of libdbus and other major D-Bus implementations (sd-bus, GDBus) will try "$XDG_RUNTIME_DIR/bus" (which we hope will become the dominant way to locate a session bus over the next few years) before falling back to X11 autolaunch. At the moment, we expect distributions that make use of this feature to set DBUS_SESSION_BUS_ADDRESS="unix:path=$XDG_RUNTIME_DIR/bus". However, eventually we would like to be able to stop setting that variable, which would break your proposed code. ----- So, weighting everything together, my personal opinion is that we should, in order of priority and realism: 1. definitely set DBUS_SESSION_BUS_ADDRESS to "disabled:" if not set when having a --test-type flag 2. consider (but I'd like other opinions here) just doing this regardless of --test-type. 3. adding some checks around third-party code using fork() (i.e. introducing some ScopedAllowFork() only in places where we actually expect it) and forbidding the use of any library that just does that. 4. considering backporting that tcmalloc atfork, but that honestly is quite some work and would make me quite nervous about the stability risks.
,
Apr 28 2017
#5: what I'm suggesting is to change the interface to do_exec so that the caller does the opendir/fdwalk before fork(), but just collects the file descriptors, and passes that array of "fds to call fcntl (fd, F_SETFD, FD_CLOEXEC) upon" to do_exec. Then do_exec (after the fork) would iterate down that array making that call on each.
,
Apr 28 2017
Primiano and I talked offline and he pointed out that my proposed solution would introduce a race condition where a multithreaded program could open a new fd between the time the fdwalk was done, and when fork() was called, and that would be leaked. Ignore my suggestion as there's a better one in https://bugzilla.gnome.org/show_bug.cgi?id=746604#c8 .
,
Apr 28 2017
Also, per #5: the stack trace showing the problem in gspawn is from this linked bug comment: https://bugs.chromium.org/p/chromium/issues/detail?id=309093#c51
,
May 2 2017
Re #10, Ah thanks. If I read that stacktrace correctly I think that is just the old version of the dbus client glib which was using g_spawn, while now libdbus migrated to have its own implementation (which is almost copy/pasted from gspawn). Essentially I am saying that: - I would expect https://bugs.chromium.org/p/chromium/issues/detail?id=309093#c51 to NOT show up anymore on trusty in favor of Issue 695643 (i.e. the same problem has moved from gspawn to libdbus by virtue of moving to trusty) - the DBUS_SESSION_BUS_ADDRESS="disabled:" workaround should apply also here. - We should still fix both libdbus and gspawn, as gspawn could be used by some other library that we don't know about yet. I'll send out a patch to that as I get some spare bandwidth
,
May 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8511820ec8280caacbd4f81f3ecd13b6c61681b0 commit 8511820ec8280caacbd4f81f3ecd13b6c61681b0 Author: primiano <primiano@chromium.org> Date: Mon May 08 15:09:51 2017 Linux: Disable DBus auto-launch This is a workaround (ETA ~ 2-3 years) for libdbus not being multi-threading friendly and causing random hangs when running chrome outside of Linux desktop environments. Background: ----------- Typically, Linux desktop environments set the DBUS_SESSION_BUS_ADDRESS environment variable. This variable allows the dbus client library to directly connect to the existing bus, which is started by the desktop environment or systemd. When this variable is missing, the dbus client library will fallback to auto-launch mode [1], which causes 4 nested fork() + exec() calls. Doing this has two problems: (i) slows down startup; (ii) can hang the browser if the fork() happens while another thread is in a malloc() (Chrome's tcmalloc has no at-fork handlers). This situation (no env variable) is very common in test scenarios (browsertests, chromedriver, etc). Change introduced by this CL: ----------------------------- This CL sets the bus address env variable to "disabled:" if not set. This effectively shuts down the dbus auto-launch. If necessary, this behavior can be restored by setting, before launching chrome, DBUS_SESSION_BUS_ADDRESS="autolaunch:" . This workaround will be necessary until libdbus and gspawn are fixed to be multi-threading friendly [2,3] and that fix rolls into the various distributions. The change is introduced in the main embedder rather than in the google-chrome wrapper, as several binaries can be affected by this, for instance: - browser tests (http://crbug.com/693668) - chrome --headless - webdriver/selenium which seem to directly invoke "chrome" see https://github.com/SeleniumHQ/docker-selenium/issues/87 [1] https://dbus.freedesktop.org/doc/dbus-launch.1.html [2] https://bugs.freedesktop.org/show_bug.cgi?id=100843 [3] https://bugs.chromium.org/p/chromedriver/issues/detail?id=1699 BUG= 715658 , 695643 , 713947 TEST=strace -ff -o trace chrome; grep dbus-launch trace* Review-Url: https://codereview.chromium.org/2861163002 Cr-Commit-Position: refs/heads/master@{#469987} [modify] https://crrev.com/8511820ec8280caacbd4f81f3ecd13b6c61681b0/services/service_manager/embedder/main.cc
,
May 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1e78cb7863da28bb3411286cdbcc4fb4510ce173 commit 1e78cb7863da28bb3411286cdbcc4fb4510ce173 Author: iclelland <iclelland@chromium.org> Date: Mon May 08 18:41:31 2017 Revert of Linux: Disable DBus auto-launch (patchset #1 id:1 of https://codereview.chromium.org/2861163002/ ) Reason for revert: Speculative revert -- the TSan bots have been reporting a data race when setting Envvars (in this case, appending to the python path to start a websocket server). The race appeared immediately after this patch landed, so it may be legitimate. Reverting this to see if it clears the failures up; if so, we'll probably just have to serialize the calls to setenv. Filed crbug.com/719633 for this as well. Original issue's description: > Linux: Disable DBus auto-launch > > This is a workaround (ETA ~ 2-3 years) for libdbus not being multi-threading > friendly and causing random hangs when running chrome outside of Linux > desktop environments. > > Background: > ----------- > Typically, Linux desktop environments set the DBUS_SESSION_BUS_ADDRESS > environment variable. This variable allows the dbus client library to > directly connect to the existing bus, which is started by the desktop > environment or systemd. > When this variable is missing, the dbus client library will fallback > to auto-launch mode [1], which causes 4 nested fork() + exec() calls. > Doing this has two problems: (i) slows down startup; (ii) can hang > the browser if the fork() happens while another thread is in a malloc() > (Chrome's tcmalloc has no at-fork handlers). > This situation (no env variable) is very common in test scenarios > (browsertests, chromedriver, etc). > > Change introduced by this CL: > ----------------------------- > This CL sets the bus address env variable to "disabled:" if not set. > This effectively shuts down the dbus auto-launch. If necessary, this > behavior can be restored by setting, before launching chrome, > DBUS_SESSION_BUS_ADDRESS="autolaunch:" . > This workaround will be necessary until libdbus and gspawn are fixed > to be multi-threading friendly [2,3] and that fix rolls into the > various distributions. > The change is introduced in the main embedder rather than in the > google-chrome wrapper, as several binaries can be affected by this, > for instance: > - browser tests (http://crbug.com/693668) > - chrome --headless > - webdriver/selenium which seem to directly invoke "chrome" > see https://github.com/SeleniumHQ/docker-selenium/issues/87 > > [1] https://dbus.freedesktop.org/doc/dbus-launch.1.html > [2] https://bugs.freedesktop.org/show_bug.cgi?id=100843 > [3] https://bugs.chromium.org/p/chromedriver/issues/detail?id=1699 > > BUG= 715658 , 695643 , 713947 > TEST=strace -ff -o trace chrome; grep dbus-launch trace* > > Review-Url: https://codereview.chromium.org/2861163002 > Cr-Commit-Position: refs/heads/master@{#469987} > Committed: https://chromium.googlesource.com/chromium/src/+/8511820ec8280caacbd4f81f3ecd13b6c61681b0 TBR=satorux@google.com,thestig@chromium.org,jam@chromium.org,satorux@chromium.org,primiano@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG= 715658 , 695643 , 713947 Review-Url: https://codereview.chromium.org/2869843003 Cr-Commit-Position: refs/heads/master@{#470059} [modify] https://crrev.com/1e78cb7863da28bb3411286cdbcc4fb4510ce173/services/service_manager/embedder/main.cc
,
May 9 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2fc330d0b93d4bfd7bd04b9fdd3102e529901f91 commit 2fc330d0b93d4bfd7bd04b9fdd3102e529901f91 Author: primiano <primiano@chromium.org> Date: Tue May 09 13:07:12 2017 Reland of Linux: Disable DBus auto-launch (patchset #1 id:1 of https://codereview.chromium.org/2869843003/ ) Reason for reland: Adding TSan suppression. The race is independent of this CL, see crbug.com/719633 Original issue's description: > Revert of Linux: Disable DBus auto-launch (patchset #1 id:1 of https://codereview.chromium.org/2861163002/ ) > > Reason for revert: > Speculative revert -- the TSan bots have been reporting a data race when setting Envvars (in this case, appending to the python path to start a websocket server). The race appeared immediately after this patch landed, so it may be legitimate. Reverting this to see if it clears the failures up; if so, we'll probably just have to serialize the calls to setenv. > > Filed crbug.com/719633 for this as well. > > Original issue's description: > > Linux: Disable DBus auto-launch > > > > This is a workaround (ETA ~ 2-3 years) for libdbus not being multi-threading > > friendly and causing random hangs when running chrome outside of Linux > > desktop environments. > > > > Background: > > ----------- > > Typically, Linux desktop environments set the DBUS_SESSION_BUS_ADDRESS > > environment variable. This variable allows the dbus client library to > > directly connect to the existing bus, which is started by the desktop > > environment or systemd. > > When this variable is missing, the dbus client library will fallback > > to auto-launch mode [1], which causes 4 nested fork() + exec() calls. > > Doing this has two problems: (i) slows down startup; (ii) can hang > > the browser if the fork() happens while another thread is in a malloc() > > (Chrome's tcmalloc has no at-fork handlers). > > This situation (no env variable) is very common in test scenarios > > (browsertests, chromedriver, etc). > > > > Change introduced by this CL: > > ----------------------------- > > This CL sets the bus address env variable to "disabled:" if not set. > > This effectively shuts down the dbus auto-launch. If necessary, this > > behavior can be restored by setting, before launching chrome, > > DBUS_SESSION_BUS_ADDRESS="autolaunch:" . > > This workaround will be necessary until libdbus and gspawn are fixed > > to be multi-threading friendly [2,3] and that fix rolls into the > > various distributions. > > The change is introduced in the main embedder rather than in the > > google-chrome wrapper, as several binaries can be affected by this, > > for instance: > > - browser tests (http://crbug.com/693668) > > - chrome --headless > > - webdriver/selenium which seem to directly invoke "chrome" > > see https://github.com/SeleniumHQ/docker-selenium/issues/87 > > > > [1] https://dbus.freedesktop.org/doc/dbus-launch.1.html > > [2] https://bugs.freedesktop.org/show_bug.cgi?id=100843 > > [3] https://bugs.chromium.org/p/chromedriver/issues/detail?id=1699 > > > > BUG= 715658 , 695643 , 713947 > > TEST=strace -ff -o trace chrome; grep dbus-launch trace* > > > > Review-Url: https://codereview.chromium.org/2861163002 > > Cr-Commit-Position: refs/heads/master@{#469987} > > Committed: https://chromium.googlesource.com/chromium/src/+/8511820ec8280caacbd4f81f3ecd13b6c61681b0 > Review-Url: https://codereview.chromium.org/2869843003 > Cr-Commit-Position: refs/heads/master@{#470059} > Committed: https://chromium.googlesource.com/chromium/src/+/1e78cb7863da28bb3411286cdbcc4fb4510ce173 BUG= 715658 , 695643 , 713947 , 719633 TBR=satorux@google.com,thestig@chromium.org,jam@chromium.org Review-Url: https://codereview.chromium.org/2865283002 Cr-Commit-Position: refs/heads/master@{#470301} [modify] https://crrev.com/2fc330d0b93d4bfd7bd04b9fdd3102e529901f91/build/sanitizers/tsan_suppressions.cc [modify] https://crrev.com/2fc330d0b93d4bfd7bd04b9fdd3102e529901f91/services/service_manager/embedder/main.cc
,
May 12 2017
PSA discussing the fix: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/-cs1V-tz06A Thank so much Primiano!
,
May 15 2017
Workaround landed in chrome. please reopen if this happens again. |
||||
►
Sign in to add a comment |
||||
Comment 1 by kbr@chromium.org
, Apr 26 2017