Tast tests progressively use more CPU and threads |
|||||
Issue descriptionI added more logging to help track this down, but it's just saying that the app launched and is displaying. This bug will be for tracking the data about the failures so hopefully I can nail down the cause. wayland_terminal, celes, 11230.0.0, https://stainless.corp.google.com/browse/chromeos-autotest-results/255315260-chromeos-test/ wayland_terminal, celes, 11229.0.0 wayland_terminal, sand, 11229.0.0 wayland_launcher, sand, 11229.0.0 - this one is extra weird, the container logs indicate launching and then after the screenshot times out, it gets the keypress for closing it. So it apparently has focus...but is not rendering to the screen somehow. x11_terminal, x11_launcher relm, 11229.0.0 - interesting as this one from the launcher screenshot shows a spinner on a square shelf icon (but not with out color); that spinner is still up when wayland_demo launches after this. And we have a 'Failed opening display' in the container log file too. We also have this in the logs for a startup failure with sommelier: Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/cyrillic Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to find transport for Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/Type1 Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/100dpi Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/75dpi Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/cyrillic Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to find transport for Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/Type1 Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/100dpi Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address /usr/share/fonts/X11/75dpi Nov 04 22:16:02 penguin sommelier[141]: No protocol specified Nov 04 22:16:02 penguin sommelier[141]: xdpyinfo: unable to open display ":0". Nov 04 22:16:02 penguin sommelier[141]: No protocol specified Nov 04 22:16:02 penguin sommelier[141]: xrdb: Resource temporarily unavailable Nov 04 22:16:02 penguin sommelier[141]: xrdb: Can't open display ':0' Nov 04 22:16:02 penguin sommelier[141]: Child exited with status: 1 The font errors are normal and are seen in successful runs too.
,
Nov 16
I was able to reproduce the above failure with Wayland and determined it's a CPU utilization issue with tast itself (I was looking at the screen, and the Wayland app launched just fine). If I run the full set of tests that it executes in the lab, which is: arc.SettingsBridge cryptohome.Login cryptohome.LoginGuest graphics.ScreenshotCLI graphics.ScreenshotChrome platform.Histograms platform.MLServiceBootstrap security.GPUSandboxed security.OpenFDs security.SELinuxFileLabelWithChrome security.Sandboxed session.LogoutCleanup ui.Exceptions ui.SupervisedUserCrash ui.VirtualKeyboardOmnibox ui.VirtualKeyboardTyping video.PlayH264 video.PlayVP8 video.PlayVP9 video.WebRTCCamera video.WebRTCCameraPerf video.WebRTCPeerConnectionWithCameraH264 video.WebRTCPeerConnectionWithCameraH264Perf video.WebRTCPeerConnectionWithCameraVP8 video.WebRTCPeerConnectionWithCameraVP8Perf vm.CrostiniFiles vm.CrostiniStartEverything Then I can see the 'cros' process (which is from /usr/local/libexec/tast/bundles/local_pushed/cros) continually use more and more CPU as each test executes until it's at 100% CPU. If I also look at the thread count for that process, I can see it start at 11 and then creep up to 25 before it completes. The specific tests where the thread count goes up isn't consistent...and sometimes the thread count goes up by 2 or 4 in a very short period of time. The thread count never goes down when I was checking. This then ends up causing CPU issues for the Crostini test and a screenshot can take over 10 seconds to complete...and usually it needs 2 screenshots until it gets a good one. So this excess CPU usage then causes the CrostiniStartEverything test to fail. Any ideas for what's wrong here? I'm very hesitant to just increase the screenshot timeout in the Crostini tests to mask some other larger issue that's potentially causing many of our tests to run slower than they should in the lab.
,
Nov 17
Aaaaand, I think I see my bug (well, at least one of them).
Sending SIGQUIT to the "cros" process shows a bunch of goroutines that look like this:
goroutine 472 [select]:
chromiumos/tast/local/chrome.(*browserWatcher).start.func1(0xc420418300)
/home/derat/trunk/src/platform/tast-tests/src/chromiumos/tast/local/chrome/watcher.go:39 +0xf0
created by chromiumos/tast/local/chrome.(*browserWatcher).start
/home/derat/trunk/src/platform/tast-tests/src/chromiumos/tast/local/chrome/watcher.go:36 +0x3f
In tast/local/chrome/watcher.go:
func (bw *browserWatcher) start() {
go func() {
bw.check()
for {
select {
case <-bw.done:
break
case <-time.After(checkBrowserInterval):
if !bw.check() {
break
}
}
}
}()
}
Those "break"s should be "return"s.
And yeah, fixing that makes the cros process not monotonically increase in CPU usage through the course of testing. Sorry about that.
I'll dig a bit more to see if there's anything else that's leaking. (Coincidentally, we were just discussing tracking goroutine count throughout testing to try to catch leaks.)
,
Nov 17
,
Nov 17
Glad you found it so quick! This should fix some of the last regular failures we have with Crostini tests. :)
,
Nov 17
It's unfortunate to be at the end of the list. Maybe there will be fewer failures in video tests now, too. :-P
,
Nov 17
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/d39d92d697c007a096eedbd8618315ae20f42b9a commit d39d92d697c007a096eedbd8618315ae20f42b9a Author: Daniel Erat <derat@chromium.org> Date: Sat Nov 17 03:31:26 2018 tast-tests: Fix a goroutine leak in the chrome package. Fix an issue where the chrome package was leaking the goroutines that it used to watch for Chrome crashes. This resulted in steadily-increasing CPU usage by the "cros" test bundle process on the DUT as more tests were run. BUG= chromium:902380 TEST=manual: ran a bunch of tests while watching the cros process Change-Id: Ibd5da43769afcca0b33ae95ba437e0dee4aa265e Reviewed-on: https://chromium-review.googlesource.com/c/1340735 Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Jeffrey Kardatzke <jkardatzke@google.com> Reviewed-by: Stephen Barber <smbarber@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> [modify] https://crrev.com/d39d92d697c007a096eedbd8618315ae20f42b9a/src/chromiumos/tast/local/chrome/watcher.go
,
Nov 17
This should be fixed now, but I've also uploaded https://crrev.com/c/1340555 as a potential approach for detecting issues like this in the future. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by jkardatzke@chromium.org
, Nov 9