New issue
Advanced search Search tips

Issue 853406 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Tast test bundle panicked after writing test error to closed channel

Project Member Reported by derat@chromium.org, Jun 16 2018

Issue description

In the setzer-release #2291 run at http://stainless/browse/chromeos-autotest-results/208618800-chromeos-test/, I think that a bug in Tast's test-running code resulted in a panic after the arc.IntentForward test timed out that caused the test bundle process to abort without running other tests.

Summarizing the log:

03:25:33  arc.IntentForward starts
03:26:34  arc.IntentForward reports "Failed to open a web page" error
03:26:35  arc.IntentForward reports "Failed to open Downloads" error
03:26:35  test times out and Test.Run reports error: "test.go:99: Test timed out: context deadline exceeded"
03:26:36  platform.CheckProcesses starts
03:26:36  platform.CheckProcesses ends
03:26:36  power.CheckStatus starts
03:26:36  bundle crashes with "exit status 2 (panic: send on closed channel [recovered] ..."

Here's the actual error:

        panic: send on closed channel

goroutine 21 [running]:
chromiumos/tast/testing.(*State).Errorf(0xc4200b0f00, 0x5d4b94753593, 0x9, 0xc42005fb80, 0x1, 0x1)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/state.go:125 +0xf3
chromiumos/tast/testing.(*Test).Run.func2.1(0xc4200b0f00, 0xc420158070)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:88 +0xb5
panic(0x5d4b948bb9c0, 0x5d4b94972d20)
        /usr/lib/go/x86_64-cros-linux-gnu/src/runtime/panic.go:502 +0x22d
chromiumos/tast/testing.(*State).Logf(0xc4200b0f00, 0x5d4b9475a373, 0x15, 0xc42005fe40, 0x1, 0x1)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/state.go:111 +0xfa
chromiumos/tast/testing.ContextLogf(0x5d4b94979720, 0xc42006a2c0, 0x5d4b9475a373, 0x15, 0xc42005fe40, 0x1, 0x1)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/state.go:181 +0xce
chromiumos/tast/local/bundles/cros/arc.IntentForward.func1(0x5d4b94979720, 0xc42006a2c0, 0xc42013e300, 0x5d4b9476ceb9, 0x3d, 0x0, 0x0)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/src/chromiumos/tast/local/bundles/cros/arc/intent_forward.go:51 +0x120
chromiumos/tast/local/bundles/cros/arc.IntentForward(0xc4200b0f00)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/src/chromiumos/tast/local/bundles/cros/arc/intent_forward.go:77 +0x4fc
chromiumos/tast/testing.(*Test).Run.func2(0xc4200b0f00, 0xc420158070, 0xc420142870)
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:92 +0x66
created by chromiumos/tast/testing.(*Test).Run
        /build/setzer/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r76/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:85 +0xb8) (status 5)

arc.IntentForward called testing.State.Logf after its output channel had been closed by runTest() in chromiumos/tast/bundle/bundle.go. Writing to a closed channel causes a panic, which crashed the entire bundle here.

Test.Run() contains the following to try to recover from panics that occur while a test is running:

    // Tests call runtime.Goexit() to make the current goroutine exit immediately
    // (after running defer blocks) on failure.
    done := make(chan bool, 1)
    go func() {
        defer func() {
            if r := recover(); r != nil {
                s.Errorf("Panic: %v", r)
            }
            done <- true
        }()
        tst.Func(s)
    }()

    select {
    case <-done:
        // The goroutine running the test finished.
    case <-s.ctx.Done():
        s.Errorf("Test timed out: %v", s.ctx.Err())
        // TODO(derat): Might need to make sure it's dead somehow...
    }

I think that that worked here, but the recovery code itself calls Errorf(), which results in another write to the channel, which causes another panic, which has nobody to catch it now... :-(

I suspect that Test.Run should probably instead be responsible for closing the channel that it's using (I remember wavering on this, not sure if there were cases where I'd want to reuse the same channel across multiple tests). I don't think that that would solve this, though, since the test function is running asynchronously here and hasn't necessarily stopped when Test.Run() returns (as happened here).

The deeper problem (which I'm aware of; see the TODO above) is that I don't have a good way to kill tests that ignore their deadlines right now. Tests run in goroutines, and there's no way to kill a goroutine. This is probably mostly harmless in this case, but it could be bad if a test continued running and did things that stepped on later tests' toes.

The big hammer would be for the test bundle to fork-exec itself to run each test. It can be sure that a test is dead if it does this, but it'll result in additional overhead -- each remote test will need to reestablish the SSH connection, for instance. That might not be a huge deal.

It *is* possible to exit at any point in a goroutine by calling runtime.Goexit() (https://golang.org/pkg/runtime/#Goexit) -- this is what testing.State.Fatal() does, for example. It might be reasonable for State.Context() (and functions in other test support packages?) to make the test abort if it continues using its context after it's hit its deadline.

To start out with, I'll probably fix the panic-during-recovery bug in Test.Run() and add some light measures to try to abort rogue tests.

Running each test in its own process is still on the table, but I'd rather not focus on the hostile-test use case -- if we end up with lots of runaway tests, we probably need to be more careful in how we write tests. :-)
 
full.txt
10.1 KB View Download

Comment 1 by derat@chromium.org, Jun 16 2018

Status: Started (was: Assigned)
https://crrev.com/c/1103672 fixes the panic.
Project Member

Comment 2 by bugdroid1@chromium.org, Jun 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/46ea2fc66ffb17fd23402bf5b82feb304abe5960

commit 46ea2fc66ffb17fd23402bf5b82feb304abe5960
Author: Daniel Erat <derat@chromium.org>
Date: Sun Jun 17 00:09:08 2018

tast: Fix panic when logging panic from runaway test.

Fix a subtle issue where testing.Test.Run's code to recover
from an in-test panic could write to a closed output
channel, resulting in an unhandled panic that would bring
down the entire test bundle.

BUG= chromium:853406 
TEST=added a unit test

Change-Id: I315b1e404b76b4ca56642ce7b2ef1821db84c068
Reviewed-on: https://chromium-review.googlesource.com/1103672
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/46ea2fc66ffb17fd23402bf5b82feb304abe5960/src/chromiumos/tast/testing/test_test.go
[modify] https://crrev.com/46ea2fc66ffb17fd23402bf5b82feb304abe5960/src/chromiumos/tast/testing/test.go
[modify] https://crrev.com/46ea2fc66ffb17fd23402bf5b82feb304abe5960/src/chromiumos/tast/testing/state.go
[modify] https://crrev.com/46ea2fc66ffb17fd23402bf5b82feb304abe5960/src/chromiumos/tast/bundle/bundle.go

Project Member

Comment 3 by bugdroid1@chromium.org, Jun 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa

commit 2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa
Author: Daniel Erat <derat@chromium.org>
Date: Fri Jun 22 13:00:17 2018

tast: Fix races in reporting of test timeouts.

Make the goroutines that run tests responsible for closing
their output channels. Output channels were formerly closed
by the main goroutine, which caused races where runaway
tests could panic after writing to closed channels.
https://tour.golang.org/concurrency/4 warns against doing
this: "Only the sender should close a channel, never the
receiver. Sending on a closed channel will cause a panic."

This change also makes test bundles responsible for
reporting test timeouts rather than testing.Test.Run.

BUG= chromium:853406 
TEST=updated unit tests; also manually verified that a
     timeout error is reported for a test that ignores its
     deadline and that the bundle doesn't panic if the test
     writes to its output channel after the next test has
     started

Change-Id: I04a40480b01f4e1ffab5c9c1f9faf800c43984dc
Reviewed-on: https://chromium-review.googlesource.com/1111014
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa/src/chromiumos/tast/testing/test_test.go
[modify] https://crrev.com/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa/src/chromiumos/tast/bundle/bundle_test.go
[modify] https://crrev.com/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa/src/chromiumos/tast/testing/state.go
[modify] https://crrev.com/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa/src/chromiumos/tast/bundle/bundle.go
[modify] https://crrev.com/2c1ed3911f61f8d1b267c1844bcaf152ac8b4afa/src/chromiumos/tast/testing/test.go

Comment 4 by derat@chromium.org, Jun 25 2018

Status: Fixed (was: Started)
Filed issue 856386 to track trying to kill runaway tests.

Sign in to add a comment