New issue
Advanced search Search tips

Issue 902380 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Tast tests progressively use more CPU and threads

Project Member Reported by jkardatzke@google.com, Nov 6

Issue description

I added more logging to help track this down, but it's just saying that the app launched and is displaying. This bug will be for tracking the data about the failures so hopefully I can nail down the cause.

wayland_terminal, celes, 11230.0.0, https://stainless.corp.google.com/browse/chromeos-autotest-results/255315260-chromeos-test/

wayland_terminal, celes, 11229.0.0

wayland_terminal, sand, 11229.0.0

wayland_launcher, sand, 11229.0.0 - this one is extra weird, the container logs indicate launching and then after the screenshot times out, it gets the keypress for closing it. So it apparently has focus...but is not rendering to the screen somehow.

x11_terminal, x11_launcher relm, 11229.0.0 - interesting as this one from the launcher screenshot shows a spinner on a square shelf icon (but not with out color); that spinner is still up when wayland_demo launches after this.  And we have a 'Failed opening display' in the container log file too.  We also have this in the logs for a startup failure with sommelier:

Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/cyrillic
Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to find transport for
Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/Type1
Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/100dpi
Nov 04 22:16:01 penguin sommelier[145]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/75dpi
Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/cyrillic
Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to find transport for
Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/Type1
Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/100dpi
Nov 04 22:16:01 penguin sommelier[141]: _FontTransOpen: Unable to Parse address  /usr/share/fonts/X11/75dpi
Nov 04 22:16:02 penguin sommelier[141]: No protocol specified
Nov 04 22:16:02 penguin sommelier[141]: xdpyinfo:  unable to open display ":0".
Nov 04 22:16:02 penguin sommelier[141]: No protocol specified
Nov 04 22:16:02 penguin sommelier[141]: xrdb: Resource temporarily unavailable
Nov 04 22:16:02 penguin sommelier[141]: xrdb: Can't open display ':0'
Nov 04 22:16:02 penguin sommelier[141]: Child exited with status: 1

The font errors are normal and are seen in successful runs too.

 
I broke out the X issue into another bug,  crbug.com/903876  as I think it's different than the Wayland failures (if they actually are failures and not something wrong with the screenshotting code...but if the screenshot code was at fault, I would expect to see failures with the X11 demo app too in this way).
Cc: nya@chromium.org derat@chromium.org smbar...@chromium.org
Summary: Tast tests progressively use more CPU and threads (was: Screenshot tests are failing sometimes)
I was able to reproduce the above failure with Wayland and determined it's a CPU utilization issue with tast itself (I was looking at the screen, and the Wayland app launched just fine). If I run the full set of tests that it executes in the lab, which is:

arc.SettingsBridge cryptohome.Login cryptohome.LoginGuest graphics.ScreenshotCLI graphics.ScreenshotChrome platform.Histograms platform.MLServiceBootstrap security.GPUSandboxed security.OpenFDs security.SELinuxFileLabelWithChrome security.Sandboxed session.LogoutCleanup ui.Exceptions  ui.SupervisedUserCrash ui.VirtualKeyboardOmnibox ui.VirtualKeyboardTyping video.PlayH264 video.PlayVP8 video.PlayVP9 video.WebRTCCamera video.WebRTCCameraPerf video.WebRTCPeerConnectionWithCameraH264 video.WebRTCPeerConnectionWithCameraH264Perf video.WebRTCPeerConnectionWithCameraVP8 video.WebRTCPeerConnectionWithCameraVP8Perf vm.CrostiniFiles vm.CrostiniStartEverything

Then I can see the 'cros' process (which is from /usr/local/libexec/tast/bundles/local_pushed/cros) continually use more and more CPU as each test executes until it's at 100% CPU.  If I also look at the thread count for that process, I can see it start at 11 and then creep up to 25 before it completes.

The specific tests where the thread count goes up isn't consistent...and sometimes the thread count goes up by 2 or 4 in a very short period of time.  The thread count never goes down when I was checking.

This then ends up causing CPU issues for the Crostini test and a screenshot can take over 10 seconds to complete...and usually it needs 2 screenshots until it gets a good one. So this excess CPU usage then causes the CrostiniStartEverything test to fail.

Any ideas for what's wrong here?  I'm very hesitant to just increase the screenshot timeout in the Crostini tests to mask some other larger issue that's potentially causing many of our tests to run slower than they should in the lab.
Cc: hidehiko@chromium.org
Components: -OS>Systems>Containers Tests>Tast
Labels: -Pri-2 Pri-1
Owner: derat@chromium.org
Status: Started (was: Assigned)
Aaaaand, I think I see my bug (well, at least one of them).

Sending SIGQUIT to the "cros" process shows a bunch of goroutines that look like this:

goroutine 472 [select]:                                                                
chromiumos/tast/local/chrome.(*browserWatcher).start.func1(0xc420418300)           
        /home/derat/trunk/src/platform/tast-tests/src/chromiumos/tast/local/chrome/watcher.go:39 +0xf0
created by chromiumos/tast/local/chrome.(*browserWatcher).start                                       
        /home/derat/trunk/src/platform/tast-tests/src/chromiumos/tast/local/chrome/watcher.go:36 +0x3f

In tast/local/chrome/watcher.go:

func (bw *browserWatcher) start() {
    go func() {
        bw.check()
        for {
            select {
            case <-bw.done:
                break
            case <-time.After(checkBrowserInterval):
                if !bw.check() {
                    break
                }
            }
        }
    }()
}

Those "break"s should be "return"s.

And yeah, fixing that makes the cros process not monotonically increase in CPU usage through the course of testing. Sorry about that.

I'll dig a bit more to see if there's anything else that's leaking. (Coincidentally, we were just discussing tracking goroutine count throughout testing to try to catch leaks.)
Cc: jkardatzke@chromium.org
Glad you found it so quick!  This should fix some of the last regular failures we have with Crostini tests. :)
It's unfortunate to be at the end of the list. Maybe there will be fewer failures in video tests now, too. :-P
Project Member

Comment 7 by bugdroid1@chromium.org, Nov 17

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/d39d92d697c007a096eedbd8618315ae20f42b9a

commit d39d92d697c007a096eedbd8618315ae20f42b9a
Author: Daniel Erat <derat@chromium.org>
Date: Sat Nov 17 03:31:26 2018

tast-tests: Fix a goroutine leak in the chrome package.

Fix an issue where the chrome package was leaking the
goroutines that it used to watch for Chrome crashes. This
resulted in steadily-increasing CPU usage by the "cros" test
bundle process on the DUT as more tests were run.

BUG= chromium:902380 
TEST=manual: ran a bunch of tests while watching the cros
     process

Change-Id: Ibd5da43769afcca0b33ae95ba437e0dee4aa265e
Reviewed-on: https://chromium-review.googlesource.com/c/1340735
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Jeffrey Kardatzke <jkardatzke@google.com>
Reviewed-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/d39d92d697c007a096eedbd8618315ae20f42b9a/src/chromiumos/tast/local/chrome/watcher.go

Status: Fixed (was: Started)
This should be fixed now, but I've also uploaded https://crrev.com/c/1340555 as a potential approach for detecting issues like this in the future.

Sign in to add a comment