Weird telemetry_unittests CrOS+Amd64 only failure due to missing SUID bit on chrome-sandbox |
||||||
Issue descriptionExample failed try job : https://ci.chromium.org/p/chromium/builders/luci.chromium.try/chromeos-amd64-generic-rel/125113 https://chromium-review.googlesource.com/c/chromium/src/+/1305891/14 makes only this very specific bot timeout reproducibly on telemetry.core.tracing_controller_unittest.StartupTracingTest.testCloseBrowserBeforeTracingIsStopped only... I've been trying to debug this for a few days now and added reentrancy detection (could have been an explanation) but no luck. Not sure if it's an issue with the platform or my CL but I now have to move on and will sadly be forced to disable this feature on CrOS+Arm64... Example failure log : [95/96] telemetry.core.tracing_controller_unittest.StartupTracingTest.testCloseBrowserBeforeTracingIsStopped failed unexpectedly 254.1829s: Chose browser: PossibleCrOSBrowser(browser_type=cros-chrome) Created trace config file in /tmp/chrome-trace-config.json (Re)starting the ui (logs the user out) Restarting Chrome (pid=17665) with remote port dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--ignore-background-tasks,--enable-gpu-benchmarking,--deny-permission-prompts,--autoplay-policy=no-user-gesture-required,--disable-background-networking,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--mute-audio,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\,*/chromeos/login/*=2\,chrome_browser_main_posix=2,--allow-failed-policy-fetch-for-test,--disable-gaia-services,--trace-config-file=/tmp/chrome-trace-config.json" array:string: Failed with BrowserConnectionGoneException while starting the browser backend. *************** BROWSER STANDARD OUTPUT *************** Cannot get standard output on CrOS *********** END OF BROWSER STANDARD OUTPUT ************ ********************* BROWSER LOG ********************* No log file ***************** END OF BROWSER LOG ****************** ********************* SYMBOLIZED MINIDUMP ********************* ***************** END OF SYMBOLIZED MINIDUMP ****************** Closing browser (pid=17823) ... No devtools clients for issuing clock sync. Remove trace config file in /tmp/chrome-trace-config.json Created trace config file in /tmp/chrome-trace-config.json (Re)starting the ui (logs the user out) Browser is closed. No devtools clients for issuing clock sync. Remove trace config file in /tmp/chrome-trace-config.json Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/decorators.py", line 219, in wrapper func(*args, **kwargs) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller_unittest.py", line 174, in testCloseBrowserBeforeTracingIsStopped with self.possible_browser.BrowserSession(self.browser_options) as browser: File "/b/s/w/ir/.swarming_module/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/possible_browser.py", line 92, in BrowserSession browser = self.Create() File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_finder.py", line 105, in Create browser_backend, self._platform_backend, startup_args) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 48, in __init__ self._browser_backend.Start(startup_args, startup_url=startup_url) File "/b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py", line 80, in Start self.BindDevToolsClient() File "/b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py", line 118, in BindDevToolsClient raise exceptions.BrowserConnectionGoneException(self.browser, e) BrowserConnectionGoneException: Timed out while waiting 240s for _GetDevToolsClient. Found Minidump: False Stack Trace: ******************************************************************************** Cannot get stack trace on CrOS ******************************************************************************** Standard output: ******************************************************************************** Cannot get standard output on CrOS ******************************************************************************** System log: (Not implemented)
,
Nov 7
Sorry I did mean amd64. I saw the comment but it makes no sense to me... seems like a platform/bot configuration issue and I've run out of cycles to spend trying to debug it. I'll be fine without this feature on CrOS64 but filled this since I figured platform owners might want to investigate. Feel free to dupe into the other one if it's the same thing; in my mind the other one was about the lack of appropriate logging for this error. The error itself then remains a mystery...
,
Nov 7
I'm inclined to agree that it's test/bot issue. The more I look at it... the more I think that specific file permissions check has never passed. I'll disable the specific telemetry test for now, and I'll try to find time to figure out why it's failing. Apologies for the disruption. And I'll keep the two bugs separate since they're covering different things (this weird telemetry failure vs debugging of telemetry tests on that bot)
,
Nov 7
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5db1ab8dc0279dc7d77b731e4815a57114dd8642 commit 5db1ab8dc0279dc7d77b731e4815a57114dd8642 Author: Gabriel Charette <gab@chromium.org> Date: Wed Nov 07 05:14:06 2018 Add a trace event for the duration of ScopedBlockingCalls This helps identify sections of a trace event that were blocked on external dependencies and shed some light on the "descheduled" portion of an event. This CL is non-trivial for 3 reasons: 1) This new TRACE_EVENT kicks off tracing earlier than before in some configurations. This means that main() of basic utilities all of a sudden needed AtExitManagers (first patch set of this CL). But that then spread out to many main()'s and was painful. Since the Singleton is only required in tracing to know whether it's enabled : this CL specializes Singleton/TraceEventETWExport to avoid instantiating the instance only to notice it's not enabled (guaranteed default state). 2) We do not want tracing events while a thread is idle (i.e. when it's sleeping on its own WaitableEvent/ConditionVariable while waiting for more tasks). Ideally we simply wouldn't record the trace event when it's not nested within another event but tracing doesn't have a notion of "current event depth". As such this CL opts for a declarative model where owners of the few WaitableEvents/ConditionVariables used to sleep-while-idle can flag those as irrelevant for tracing. 3) Possibility of reentrancy if the trace event macros perform a blocking call themselves. Added DCHECKs to confirm this doesn't occur (was initial suspected cause of CrOS+amd64 only failures but it kept failing despite that and this feature had to be disabled there..) Bonus: * Singleton::GetIfExists() avoids instantiating TraceEventETWExport when switches::kTraceExportEventsToETW isn't present. * The declarative model brought forth by (2) also enables not instantiating debug::ScopedEventWaitActivity objects while a thread is merely sleeping. This will avoid polluting crash metadata with wait-acitvity for sleeping threads. * In a follow-up CL we will be able to remove base::internal::ScopedClearBlockingObserverForTesting which was previously necessary specifically to avoid a situation in TaskScheduler unit tests where WaitableEvents used for the test logic would instantiate ScopedBlockingCalls and interfere with the very logic under test. (not done in this one as it's large enough on its own) Bug: 899897, 902514 Change-Id: I93b63e83ac6bd9e5e9d4e29a9b86c62c73738835 Reviewed-on: https://chromium-review.googlesource.com/c/1305891 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: danakj <danakj@chromium.org> Reviewed-by: François Doray <fdoray@chromium.org> Reviewed-by: Peter Kasting <pkasting@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Avi Drissman <avi@chromium.org> Reviewed-by: Misha Efimov <mef@chromium.org> Cr-Commit-Position: refs/heads/master@{#605966} [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/memory/singleton.h [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/message_loop/message_pump_default.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/condition_variable.h [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/condition_variable_posix.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/condition_variable_win.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/waitable_event.h [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/waitable_event_posix.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/synchronization/waitable_event_win.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/task/task_scheduler/scheduler_worker.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/threading/scoped_blocking_call.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/threading/scoped_blocking_call.h [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/trace_event/trace_event_etw_export_win.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/base/trace_event/trace_event_etw_export_win.h [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/cc/raster/single_thread_task_graph_runner.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/content/public/test/nested_message_pump_android.cc [modify] https://crrev.com/5db1ab8dc0279dc7d77b731e4815a57114dd8642/content/renderer/categorized_worker_pool.cc
,
Nov 7
Hrmm, it looks like this failure would've been avoided prior to https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1112816/ The change (and the one linked in its description) implies we'd don't need the suid bit on the sandbox binary in CrOS, but the failures you were seeing were caused by its absence. I don't have much context here, so hopefully +some others do. Should we revert https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1112816/?
,
Nov 7
,
Nov 7
no, we do not want to reenable a set*id sandbox. supporting a namespace-only/set*id-less approach is a general Linux thing ( issue 312380 ), and one we were able to more actively leverage in CrOS. you'll want to find a way to do what you need w/out set*id access.
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/40c2f1577996de8510142d8868a65621baaba1e2 commit 40c2f1577996de8510142d8868a65621baaba1e2 Author: Andrey Malets <malets@yandex-team.ru> Date: Wed Jan 16 14:29:40 2019 Fix thread cache after clone() in CanCreateProcessInNewUserNS Since CanCreateProcessInNewUserNS uses base::ForkWithFlags() to create new process, which in turn uses raw clone(), no function registered through pthread_atfork() gets called in child. This leads to invalid state when g_thread_id in platform_thread_posix.cc is stale and PlatformThread::CurrentId() returns wrong thread id on Linux. Despite the fact that the child process in CanCreateProcessInNewUserNS has very short lifetime, it may use PlatformThread APIs when tracing was enabled at the time of the call (this is the case when valid --trace-config-file= argument was passed to Chrome at startup). Also add a DCHECK to ensure we do not treat child crash and normal exit with code 1 the same (this would have been very helpful to debug the error originally). Bug: 899897, 902514 Change-Id: I8e2c82c8030537f5a62b725970d4ea5ec5948659 Reviewed-on: https://chromium-review.googlesource.com/c/1407830 Commit-Queue: Robert Sesek <rsesek@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#623228} [modify] https://crrev.com/40c2f1577996de8510142d8868a65621baaba1e2/base/process/launch_posix.cc [modify] https://crrev.com/40c2f1577996de8510142d8868a65621baaba1e2/base/threading/platform_thread_internal_posix.h [modify] https://crrev.com/40c2f1577996de8510142d8868a65621baaba1e2/base/threading/platform_thread_posix.cc [modify] https://crrev.com/40c2f1577996de8510142d8868a65621baaba1e2/base/threading/scoped_blocking_call.cc [modify] https://crrev.com/40c2f1577996de8510142d8868a65621baaba1e2/sandbox/linux/services/credentials.cc
,
Jan 16
(6 days ago)
Fixed by malex@yandex; woohoo, thanks!
,
Jan 16
(6 days ago)
that must have been fun to debug. nice job! |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by bpastene@chromium.org
, Nov 7