New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 728216 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: iOS
Pri: 1
Type: Bug



Sign in to add a comment

ios unit tests frequently fails with TIMED_OUT on swarming bots

Project Member Reported by sdefresne@chromium.org, May 31 2017

Issue description

Recently ios unit tests started to fail with the following error during swarming:
shard #0 timed out, took too much time to complete

For example:
https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/227369
https://chromium-swarm.appspot.com/task?id=367828b8bfd5c910&refresh=10&show_raw=1

The test apparently took 22 minutes, but if we look at a successful run (on a different CL, but neither CL significantly change the duration of the tests), the run that succeeded took 2 minutes:

https://chromium-swarm.appspot.com/task?id=36730bf58780cd10&refresh=10&show_raw=1

Since neither CLs change the tests in any significant way, I assume this is because some machine are more overloaded than other and is a problem in Infra. Is iOS over its quota of work? Is it because the second CL was sent to CQ while US was up and thus when the resources have higher usage? Can something be done as this cause the downstream ios waterfall to be closed regularly?
 

Comment 1 by sczs@chromium.org, May 31 2017

Cc: liaoyuke@chromium.org smut@chromium.org
smut@ took a quick look yesterday and thinks its not infra related since there haven't been any recent changes.
liaoyuke@ thinks it might have something to do with iossim, so I'm cc'ing him as well.

Comment 2 by estaab@chromium.org, May 31 2017

Components: -Infra Infra>Client>iOS
Cc: -liaoyuke@chromium.org
Owner: liaoyuke@chromium.org
Status: Assigned (was: Untriaged)
Cc: -stkhapugin@chromium.org jif@chromium.org

Comment 5 by sczs@chromium.org, Jun 1 2017

Cc: -sczs@chromium.org michaeldo@chromium.org
Status: Started (was: Assigned)
Looking at this right now.
There is a possibility this is related to our http_server impl trying to resolve bonjour names for our servers. (https://touch-swarming.appspot.com/task?id=367cdfabfaf43210&refresh=10&show_raw=1)

If we can land the new HttpServer this week, I think we should wait to see if it will reduce this flake. Otherwise, I have a proposed change that may help here: https://codereview.chromium.org/2914253002/.

Comment 8 by s...@google.com, Jun 1 2017

Issue 728105 has been merged into this issue.
liaoyuke@ is looking into this. I don't believe my comment is the same as this issue.
Cc: baxley@chromium.org justincohen@chromium.org peterlaurens@chromium.org rohitrao@chromium.org
Here is my preliminary investigation results.

I was able to reproduce the hang by repeatedly running ios_chrome_unittests locally (it may take up to two hours to repro).

First of all, I removed all the usages of NSPipe, unfortunately, it's still hanging. Then I tried running xcodebuild -xctestrun directly from terminal, and it's still hanging. (I have attached the script at the end of this email)

The above two mentioned experiments denies that the issue is caused by NSTask or NSPipe, so apparently, there is a bug with xcodebuild -xctestrun, and here are some related posts I found online:

https://github.com/lionheart/openradar-mirror/issues/15412
http://devcenter.bitrise.io/ios/known-xcode-issues/#xcode-8-xcodebuild-test-hangs-at-the-very-end-of-the-tests
https://forums.developer.apple.com/thread/39281

What next?
I have the following three thoughts that I would like to verify:
1. The fact that we're reusing the same UDID might be the cause, so I would like to try creating a new simulator instance for each run, and clean it up after the test finishes.
2. Try launching ios_chrome_unittests.app directly using xcrun simctl launch, and see if it makes a difference.
3. Reach out to Mobile Harness and Facebook and see if they have observed similar issues.

Let me know if you have other thoughts.

Thanks,
Yuke

script:
#!/bin/sh

UDID[0]="1E9ABF3A-2C1B-444D-B503-456EC3D388A2"
UDID[1]="CFA3F1B5-548B-48E5-93FA-392FD4B544C2"
UDID[2]="3658CA61-5FB4-40EC-A2EE-7A70A0878306"
UDID[3]="52D4A67D-41AE-4903-A0FE-5A37CAF671AA"

for iter in {1..300}
do
  echo "$iter"
  udid_index=$(( RANDOM % 4 ))
  xcrun killall Simulator
  xcodebuild -xctestrun "/Users/liaoyuke/xctestrun_config.txt" -destination "platform=iOS Simulator,id=${UDID[$udid_index]}" test-without-building
done

xctestrun_config.txt:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>TestTargetName</key>
	<dict>
		<key>OnlyTestIdentifiers</key>
		<array/>
		<key>TestBundlePath</key>
		<string>/Users/liaoyuke/bling/src/out/Debug-iphonesimulator/ios_chrome_unittests.app</string>
		<key>TestHostPath</key>
		<string>/Users/liaoyuke/bling/src/out/Debug-iphonesimulator/ios_chrome_unittests.app</string>
		<key>TestingEnvironmentVariables</key>
		<dict>
			<key>DYLD_FRAMEWORK_PATH</key>
			<string>__PLATFORMS__/iPhoneSimulator.platform/Developer/Library/Frameworks</string>
			<key>DYLD_LIBRARY_PATH</key>
			<string>__PLATFORMS__/iPhoneSimulator.platform/Developer/Library</string>
			<key>IDEiPhoneInternalTestBundleName</key>
			<string>ios_chrome_unittests.app</string>
		</dict>
	</dict>
</dict>
</plist>
update:

1. Creating new simulator instances during each test run doesn't help.

2. I was able to launch ios_chrome_unittests.app using xcrun simctl launch, however, this command simply launches the the tests, but not wait for the tests to finish running or pipe the output, so I'm blocked here.
It turns out that the hang in the middle of tests running is not only happening on ios_chrome_unittests, it's also happening on net_unittests, gfx_unittests etc, across different bots and tests.

while for the components_unittests, I just confirmed it is a different issue, hanging without output at all. However, it only happened once during the past 200 builds, which is quite rare, so I'm not too concerned with this issue, I'll focus on fixing the hang in the middle of tests running issue first.
Project Member

Comment 13 by bugdroid1@chromium.org, Jun 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a27564e55ec5fd0ff168c23d927c68dec145871c

commit a27564e55ec5fd0ff168c23d927c68dec145871c
Author: baxley <baxley@chromium.org>
Date: Wed Jun 07 00:16:13 2017

Remove unit tests from ios-simulator that are timing out.

ios_chrome_unittests and net_unittests are timing out a lot
on the bots, so remove them from the CQ and main.

BUG= 728216 

Review-Url: https://codereview.chromium.org/2925663005
Cr-Commit-Position: refs/heads/master@{#477477}

[modify] https://crrev.com/a27564e55ec5fd0ff168c23d927c68dec145871c/ios/build/bots/tests/common_tests.json

Cc: liaoyuke@chromium.org
Components: Internals>TaskScheduler
Owner: fdoray@chromium.org
Took a quick look and in this case, the task tracker says
 can_run_tasks_ = true
 num_tasks_running_  = 0

So we're basically already idle and yet we're waiting for RunUntilIdle to return. We're going to be waiting for a long time due to a race in the following lines in ScopedTaskEnvironment::RunUntilIdle:
// The closure registered above may never run if the TaskScheduler queue
// starts empty. Post a TaskScheduler tasks to make sure that the queue
// doesn't start empty.
PostTask(FROM_HERE, BindOnce(&DoNothing))

I hacked in a PlatformThread::Sleep 50ms instead of DoNothing, and that seems to have addressed the hang, so it's very likely to be the race (at the expense of slowing the test down).

Because the tests are running blazingly fast, when the system is already idle, and then this PostTask will complete before we hit run_loop.Run(), triggering the condition and keeping the RunLoop running.

We will need another way to check that we're already idle.
Owner: robliao@chromium.org
Project Member

Comment 16 by bugdroid1@chromium.org, Jun 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/77dac22383dca465e09cfbbb61fadfce83c9eede

commit 77dac22383dca465e09cfbbb61fadfce83c9eede
Author: baxley <baxley@chromium.org>
Date: Wed Jun 07 02:47:09 2017

Add recently removed iOS CQ unit tests back to FYI.

Create new fyi_tests.json so we can more easily run different
suites on multiple fyi configurations.

BUG= 728216 

Review-Url: https://codereview.chromium.org/2922253004
Cr-Commit-Position: refs/heads/master@{#477528}

[modify] https://crrev.com/77dac22383dca465e09cfbbb61fadfce83c9eede/ios/build/bots/chromium.fyi/ios-simulator.json
[add] https://crrev.com/77dac22383dca465e09cfbbb61fadfce83c9eede/ios/build/bots/tests/fyi_tests.json

One more race (and likely the casual one).

In
ScopedTaskEnvironment::TestTaskTracker::PerformRunTask

if (num_tasks_running_ == 1 &&
    GetNumPendingUndelayedTasksForTesting() == 1 && queue_empty_closure_)

num_tasks_running_ may == 1 but GetNumPendingUndelayedTasksForTesting() could be > 1 because this number is not decremented until after the PerformRunTask returns. 

If there are two tasks in flight running at the same time, we may context switch after we return but before calling DecrementNumPendingUndelayedTasks(), leaving other threads to see the old undelayed count. As a result, when the last task rolls through, it still thinks there are undelayed tasks when there aren't any.
This is causing failures to many test suites on the iOS waterfall (https://build.chromium.org/p/chromium.mac/builders/ios-simulator/builds/18214) like components_unittests, ios_web_unittests, ios_net_unittests.

I guess that disabling all the test suites is probably not sustainable. robliao: any idea how we can fix those race conditions?
We are hitting two different timeout issues on the waterfall:

1) This issue, which manifests as tests running normally, but hanging before finishing.  These tests have some log output.

2) An iossim issue, in which the tests never seem to run at all.  These tests have no log output.

This bug only covers #1, which to date has only been an issue in ios_chrome_unitests and net_unittests, specifically.  All of the other test suites seem to be the second problem, which is  Issue 694906 .

If anyone sees other test suites with log output that indicates that they hung after starting to run tests, please post links here and we can see if they are also hitting race conditions.
I can think of 2 solutions:

1. Add a counter of pending undelayed tasks in ScopedTaskEnvironment::TestTaskTracker instead of using the one from TaskTracker. Protect all accesses to this counter with |lock_| to avoid subtle race conditions like this one. This require adding a way for a derived class of TaskTracker to be notified when a task is posted.
2. Call DecrementNumPendingUndelayedTasks() at the end of TaskTracker::PerformRunTask() instead of at the end of TaskTracker::RunNextTask(). This fixes this specific error but is less robust to other potential errors.

WDYT?
Project Member

Comment 21 by bugdroid1@chromium.org, Jun 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/88d6f53c69299ac5a899d64904fa795861817736

commit 88d6f53c69299ac5a899d64904fa795861817736
Author: rohitrao <rohitrao@chromium.org>
Date: Wed Jun 07 15:23:33 2017

[ios] Disable WebResourceUtilTest due to risk of timeouts.

 crbug.com/728216  details an issue that could cause
ScopedTaskEnvironment::RunUntilIdle() to hang forever.  This CL disables the
tests that have a chance of hanging until a fix is instituted.

BUG= 728216 

Review-Url: https://codereview.chromium.org/2926133002
Cr-Commit-Position: refs/heads/master@{#477654}

[modify] https://crrev.com/88d6f53c69299ac5a899d64904fa795861817736/ios/chrome/browser/web_resource/web_resource_util_unittest.cc

Comment 22 by gab@chromium.org, Jun 7 2017

Re.
> So we're basically already idle and yet we're waiting for RunUntilIdle to 
> return. We're going to be waiting for a long time due to a race in the 
> following lines in ScopedTaskEnvironment::RunUntilIdle:
> // The closure registered above may never run if the TaskScheduler queue
> // starts empty. Post a TaskScheduler tasks to make sure that the queue
> // doesn't start empty.
> PostTask(FROM_HERE, BindOnce(&DoNothing))

I don't think there's a race there. It's find for the QuitClosure to be called ahead of RunLoop::Run() -- RunLoop explicitly supports that.

The DoNothing() just ensures the TaskTracker hits the OnQueueEmptyClosure() again if it was already empty.

Catching up with other posts...
After working with a few approaches, we've come to the consensus that a virtual method call at the end of TaskTracker::RunNextTask should work and be surgical enough to maintain the existing guarantees of the other counters without a lot of rewriting.

virtual void TaskTracker::OnRunTaskCompleted() {}
Does nothing in the production TaskTracker

For ScopedTaskEnvironment, this is where we check number of pending tasks before posting the quit closure.

A change for this will be out soon.

Project Member

Comment 24 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/48de879ed5cb3ea81d10710caaf33eec1e2c5d08

commit 48de879ed5cb3ea81d10710caaf33eec1e2c5d08
Author: Robert Liao <robliao@chromium.org>
Date: Thu Jun 08 00:00:12 2017

Fix Task Counting Race in ScopedTaskEnvironment::TestTaskTracker

This change introduces OnRunNextTaskCompleted(), called right before
RunNextTask() returns and after all of the task metrics accounting is
complete.

This allows ScopedTaskEnvironment::TestTaskTracker to check the
number of undelayed tasks and number of running tasks in one spot
after all the counters are updated.

Previously, the code performed a lookahead on the number of undelayed
tasks, expected that the number running and number undelayed to be 1
before terminating.

This is racy because there may be multiple tasks in flight and the
number of running tasks could be 1 before all of the undelayed task
accounting is complete (e.g. it could see 2 and then the number will
be 0 immediately after that).

BUG= 728216 

Change-Id: I669f8da1f43c3a40500dbcab1f96d3677b18f2d8
Reviewed-on: https://chromium-review.googlesource.com/527334
Reviewed-by: Gabriel Charette <gab@chromium.org>
Reviewed-by: Francois Doray <fdoray@chromium.org>
Commit-Queue: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#477814}
[modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/task_scheduler/task_tracker.cc
[modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/task_scheduler/task_tracker.h
[modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/test/scoped_task_environment.cc

I'm going to let the change bake for a bit, and then I'll issue reverts one at a time to reenable the tests.
Project Member

Comment 26 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2fccbd0ee411a58f1092c2c036d614da16ca4730

commit 2fccbd0ee411a58f1092c2c036d614da16ca4730
Author: baxley <baxley@chromium.org>
Date: Thu Jun 08 04:35:57 2017

Revert of Add recently removed iOS CQ unit tests back to FYI. (patchset #2 id:20001 of https://codereview.chromium.org/2922253004/ )

Reason for revert:
Now that ios_chrome_unittests are back in common_tests.json, fyi bots have an exception for two instances of ios_chrome_unittests. Same with net_unittests.

Original issue's description:
> Add recently removed iOS CQ unit tests back to FYI.
>
> Create new fyi_tests.json so we can more easily run different
> suites on multiple fyi configurations.
>
> BUG= 728216 
>
> Review-Url: https://codereview.chromium.org/2922253004
> Cr-Commit-Position: refs/heads/master@{#477528}
> Committed: https://chromium.googlesource.com/chromium/src/+/77dac22383dca465e09cfbbb61fadfce83c9eede

TBR=rohitrao@chromium.org,smut@google.com
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 728216 

Review-Url: https://codereview.chromium.org/2924163004
Cr-Commit-Position: refs/heads/master@{#477888}

[modify] https://crrev.com/2fccbd0ee411a58f1092c2c036d614da16ca4730/ios/build/bots/chromium.fyi/ios-simulator.json
[delete] https://crrev.com/916a8290de5e9e41873282f01e11ffa7606c1531/ios/build/bots/tests/fyi_tests.json

Project Member

Comment 27 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce

commit e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce
Author: robliao <robliao@chromium.org>
Date: Thu Jun 08 23:57:43 2017

Revert of [ios] Disable WebResourceUtilTest due to risk of timeouts. (patchset #1 id:1 of https://codereview.chromium.org/2926133002/ )

Reason for revert:
The underlying cause of  http://crbug.com/728216  should be fixed. These tests should be good to go now.

Original issue's description:
> [ios] Disable WebResourceUtilTest due to risk of timeouts.
>
>  crbug.com/728216  details an issue that could cause
> ScopedTaskEnvironment::RunUntilIdle() to hang forever.  This CL disables the
> tests that have a chance of hanging until a fix is instituted.
>
> BUG= 728216 
>
> Review-Url: https://codereview.chromium.org/2926133002
> Cr-Commit-Position: refs/heads/master@{#477654}
> Committed: https://chromium.googlesource.com/chromium/src/+/88d6f53c69299ac5a899d64904fa795861817736

TBR=baxley@chromium.org,liaoyuke@chromium.org,rohitrao@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 728216 

Review-Url: https://codereview.chromium.org/2927223002
Cr-Commit-Position: refs/heads/master@{#478126}

[modify] https://crrev.com/e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce/ios/chrome/browser/web_resource/web_resource_util_unittest.cc

Looks like https://touch-swarming.appspot.com/task?id=36a56f9514533910&refresh=10&show_raw=1 failed with TIMED_OUT.

The application started as there are some output, but then blocked. Does it mean the underlying issue is not completely fixed?
There may still be issues with iossim/NSTask/NSPipe, covered in 709374 I believe.
The NSPipe/NSTask issue in 709374 cause hangs before any output is generated IIUC. In the log added in #28, some output is generated before the app hangs:

Uploading ios_web_inttests.app package contents... DONE.
Installing '(null)'
Install: CreatingStagingDirectory (5%)

Install: ExtractingPackage (15%)

Install: InspectingPackage (20%)

Install: TakingInstallLock (20%)

Install: PreflightingApplication (30%)

Install: InstallingEmbeddedProfile (30%)

Install: VerifyingApplication (40%)

Install: CreatingContainer (50%)

Install: InstallingApplication (60%)

Install: PostflightingApplication (70%)

Install: SandboxingApplication (80%)

Install: GeneratingApplicationMap (90%)

Install: Complete
idevice-app-runner --udid 5d2215775d4988c7e95d58294cf9fab032e2fd3f --start com.google.gtest.ios-web-inttests

Detected presence of a debugger, running without test timeouts.
[23638:3079:0609/043058.322098:100432828341905:WARNING:resource_bundle.cc(353)] locale_file_path.empty() for locale en-US
[==========] Running 35 tests from 8 test cases.
[----------] Global test environment set-up.
[----------] 1 test from BrowserStateWebViewPartitionTest
[ RUN      ] BrowserStateWebViewPartitionTest.LocalStorage
[       OK ] BrowserStateWebViewPartitionTest.LocalStorage (1012 ms)
[----------] 1 test from BrowserStateWebViewPartitionTest (1013 ms total)

[----------] 11 tests from HistoryStateOperationsTest
[ RUN      ] HistoryStateOperationsTest.NoOpPushUnresolvable
[       OK ] HistoryStateOperationsTest.NoOpPushUnresolvable (1953 ms)
[ RUN      ] HistoryStateOperationsTest.NoOpReplaceUnresolvable
[       OK ] HistoryStateOperationsTest.NoOpReplaceUnresolvable (1956 ms)
[ RUN      ] HistoryStateOperationsTest.NoOpPushDifferentScheme
[       OK ] HistoryStateOperationsTest.NoOpPushDifferentScheme (2061 ms)
[ RUN      ] HistoryStateOperationsTest.NoOpRelaceDifferentScheme
[       OK ] HistoryStateOperationsTest.NoOpRelaceDifferentScheme (2010 ms)
[ RUN      ] HistoryStateOperationsTest.NoOpPushDifferentOrigin
[       OK ] HistoryStateOperationsTest.NoOpPushDifferentOrigin (1938 ms)
[ RUN      ] HistoryStateOperationsTest.NoOpReplaceDifferentOrigin
[       OK ] HistoryStateOperationsTest.NoOpReplaceDifferentOrigin (2070 ms)
[ RUN      ] HistoryStateOperationsTest.StateReplacement
[       OK ] HistoryStateOperationsTest.StateReplacement (612 ms)
[ RUN      ] HistoryStateOperationsTest.StateReplacementBackForward
[       OK ] HistoryStateOperationsTest.StateReplacementBackForward (865 ms)
[ RUN      ] HistoryStateOperationsTest.PushState
[       OK ] HistoryStateOperationsTest.PushState (819 ms)
[ RUN      ] HistoryStateOperationsTest.ReplaceStatePostRequest
[       OK ] HistoryStateOperationsTest.ReplaceStatePostRequest (698 ms)
[ RUN      ] HistoryStateOperationsTest.ReplaceStateNoHashChangeEvent
[       OK ] HistoryStateOperationsTest.ReplaceStateNoHashChangeEvent (594 ms)
[----------] 11 tests from HistoryStateOperationsTest (15591 ms total)

[----------] 3 tests from WindowLocationTest
[ RUN      ] WindowLocationTest.WindowLocationAssignUnresolvable
[       OK ] WindowLocationTest.WindowLocationAssignUnresolvable (1357 ms)
[ RUN      ] WindowLocationTest.WindowLocationReplaceUnresolvable
[       OK ] WindowLocationTest.WindowLocationReplaceUnresolvable (1994 ms)
[ RUN      ] WindowLocationTest.WindowLocationSetToDOMString

^ app hangs here.
Oops, the hangs I was referring to are simulator only -- as well as hanging before any output is visible.  Agree with sdefresne@.

liaoyuke@ can you reproduce this new hang and get a trace?

Filed #28/#30 as Issue 731740.

That is a separate issue that I've seen before, so I think that test is just busted and needs to be disabled, investigated, and fixed.

I'm pretty confident that it's unrelated to the other timeout issues we've been seeing.
Another example of test that TIMED_OUT but still displayed output can be seen at https://touch-swarming.appspot.com/task?id=36b55116c2674810&refresh=10&show_raw=1.

Maybe a different bug, so filled  issue 732388  for this new occurrence in case it is a different root cause.
Do we have enough confidence in the change to revert the waterfall changes?
Status: Fixed (was: Started)
Marking fixed. The tests should be enabled and we haven't received any hang reports.

Sign in to add a comment