ios unit tests frequently fails with TIMED_OUT on swarming bots |
||||||||||
Issue descriptionRecently ios unit tests started to fail with the following error during swarming: shard #0 timed out, took too much time to complete For example: https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/227369 https://chromium-swarm.appspot.com/task?id=367828b8bfd5c910&refresh=10&show_raw=1 The test apparently took 22 minutes, but if we look at a successful run (on a different CL, but neither CL significantly change the duration of the tests), the run that succeeded took 2 minutes: https://chromium-swarm.appspot.com/task?id=36730bf58780cd10&refresh=10&show_raw=1 Since neither CLs change the tests in any significant way, I assume this is because some machine are more overloaded than other and is a problem in Infra. Is iOS over its quota of work? Is it because the second CL was sent to CQ while US was up and thus when the resources have higher usage? Can something be done as this cause the downstream ios waterfall to be closed regularly?
,
May 31 2017
,
Jun 1 2017
,
Jun 1 2017
,
Jun 1 2017
,
Jun 1 2017
Looking at this right now.
,
Jun 1 2017
There is a possibility this is related to our http_server impl trying to resolve bonjour names for our servers. (https://touch-swarming.appspot.com/task?id=367cdfabfaf43210&refresh=10&show_raw=1) If we can land the new HttpServer this week, I think we should wait to see if it will reduce this flake. Otherwise, I have a proposed change that may help here: https://codereview.chromium.org/2914253002/.
,
Jun 1 2017
Issue 728105 has been merged into this issue.
,
Jun 2 2017
liaoyuke@ is looking into this. I don't believe my comment is the same as this issue.
,
Jun 5 2017
Here is my preliminary investigation results. I was able to reproduce the hang by repeatedly running ios_chrome_unittests locally (it may take up to two hours to repro). First of all, I removed all the usages of NSPipe, unfortunately, it's still hanging. Then I tried running xcodebuild -xctestrun directly from terminal, and it's still hanging. (I have attached the script at the end of this email) The above two mentioned experiments denies that the issue is caused by NSTask or NSPipe, so apparently, there is a bug with xcodebuild -xctestrun, and here are some related posts I found online: https://github.com/lionheart/openradar-mirror/issues/15412 http://devcenter.bitrise.io/ios/known-xcode-issues/#xcode-8-xcodebuild-test-hangs-at-the-very-end-of-the-tests https://forums.developer.apple.com/thread/39281 What next? I have the following three thoughts that I would like to verify: 1. The fact that we're reusing the same UDID might be the cause, so I would like to try creating a new simulator instance for each run, and clean it up after the test finishes. 2. Try launching ios_chrome_unittests.app directly using xcrun simctl launch, and see if it makes a difference. 3. Reach out to Mobile Harness and Facebook and see if they have observed similar issues. Let me know if you have other thoughts. Thanks, Yuke script: #!/bin/sh UDID[0]="1E9ABF3A-2C1B-444D-B503-456EC3D388A2" UDID[1]="CFA3F1B5-548B-48E5-93FA-392FD4B544C2" UDID[2]="3658CA61-5FB4-40EC-A2EE-7A70A0878306" UDID[3]="52D4A67D-41AE-4903-A0FE-5A37CAF671AA" for iter in {1..300} do echo "$iter" udid_index=$(( RANDOM % 4 )) xcrun killall Simulator xcodebuild -xctestrun "/Users/liaoyuke/xctestrun_config.txt" -destination "platform=iOS Simulator,id=${UDID[$udid_index]}" test-without-building done xctestrun_config.txt: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>TestTargetName</key> <dict> <key>OnlyTestIdentifiers</key> <array/> <key>TestBundlePath</key> <string>/Users/liaoyuke/bling/src/out/Debug-iphonesimulator/ios_chrome_unittests.app</string> <key>TestHostPath</key> <string>/Users/liaoyuke/bling/src/out/Debug-iphonesimulator/ios_chrome_unittests.app</string> <key>TestingEnvironmentVariables</key> <dict> <key>DYLD_FRAMEWORK_PATH</key> <string>__PLATFORMS__/iPhoneSimulator.platform/Developer/Library/Frameworks</string> <key>DYLD_LIBRARY_PATH</key> <string>__PLATFORMS__/iPhoneSimulator.platform/Developer/Library</string> <key>IDEiPhoneInternalTestBundleName</key> <string>ios_chrome_unittests.app</string> </dict> </dict> </dict> </plist>
,
Jun 5 2017
update: 1. Creating new simulator instances during each test run doesn't help. 2. I was able to launch ios_chrome_unittests.app using xcrun simctl launch, however, this command simply launches the the tests, but not wait for the tests to finish running or pipe the output, so I'm blocked here.
,
Jun 5 2017
It turns out that the hang in the middle of tests running is not only happening on ios_chrome_unittests, it's also happening on net_unittests, gfx_unittests etc, across different bots and tests. while for the components_unittests, I just confirmed it is a different issue, hanging without output at all. However, it only happened once during the past 200 builds, which is quite rare, so I'm not too concerned with this issue, I'll focus on fixing the hang in the middle of tests running issue first.
,
Jun 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a27564e55ec5fd0ff168c23d927c68dec145871c commit a27564e55ec5fd0ff168c23d927c68dec145871c Author: baxley <baxley@chromium.org> Date: Wed Jun 07 00:16:13 2017 Remove unit tests from ios-simulator that are timing out. ios_chrome_unittests and net_unittests are timing out a lot on the bots, so remove them from the CQ and main. BUG= 728216 Review-Url: https://codereview.chromium.org/2925663005 Cr-Commit-Position: refs/heads/master@{#477477} [modify] https://crrev.com/a27564e55ec5fd0ff168c23d927c68dec145871c/ios/build/bots/tests/common_tests.json
,
Jun 7 2017
Took a quick look and in this case, the task tracker says can_run_tasks_ = true num_tasks_running_ = 0 So we're basically already idle and yet we're waiting for RunUntilIdle to return. We're going to be waiting for a long time due to a race in the following lines in ScopedTaskEnvironment::RunUntilIdle: // The closure registered above may never run if the TaskScheduler queue // starts empty. Post a TaskScheduler tasks to make sure that the queue // doesn't start empty. PostTask(FROM_HERE, BindOnce(&DoNothing)) I hacked in a PlatformThread::Sleep 50ms instead of DoNothing, and that seems to have addressed the hang, so it's very likely to be the race (at the expense of slowing the test down). Because the tests are running blazingly fast, when the system is already idle, and then this PostTask will complete before we hit run_loop.Run(), triggering the condition and keeping the RunLoop running. We will need another way to check that we're already idle.
,
Jun 7 2017
,
Jun 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/77dac22383dca465e09cfbbb61fadfce83c9eede commit 77dac22383dca465e09cfbbb61fadfce83c9eede Author: baxley <baxley@chromium.org> Date: Wed Jun 07 02:47:09 2017 Add recently removed iOS CQ unit tests back to FYI. Create new fyi_tests.json so we can more easily run different suites on multiple fyi configurations. BUG= 728216 Review-Url: https://codereview.chromium.org/2922253004 Cr-Commit-Position: refs/heads/master@{#477528} [modify] https://crrev.com/77dac22383dca465e09cfbbb61fadfce83c9eede/ios/build/bots/chromium.fyi/ios-simulator.json [add] https://crrev.com/77dac22383dca465e09cfbbb61fadfce83c9eede/ios/build/bots/tests/fyi_tests.json
,
Jun 7 2017
One more race (and likely the casual one).
In
ScopedTaskEnvironment::TestTaskTracker::PerformRunTask
if (num_tasks_running_ == 1 &&
GetNumPendingUndelayedTasksForTesting() == 1 && queue_empty_closure_)
num_tasks_running_ may == 1 but GetNumPendingUndelayedTasksForTesting() could be > 1 because this number is not decremented until after the PerformRunTask returns.
If there are two tasks in flight running at the same time, we may context switch after we return but before calling DecrementNumPendingUndelayedTasks(), leaving other threads to see the old undelayed count. As a result, when the last task rolls through, it still thinks there are undelayed tasks when there aren't any.
,
Jun 7 2017
This is causing failures to many test suites on the iOS waterfall (https://build.chromium.org/p/chromium.mac/builders/ios-simulator/builds/18214) like components_unittests, ios_web_unittests, ios_net_unittests. I guess that disabling all the test suites is probably not sustainable. robliao: any idea how we can fix those race conditions?
,
Jun 7 2017
We are hitting two different timeout issues on the waterfall: 1) This issue, which manifests as tests running normally, but hanging before finishing. These tests have some log output. 2) An iossim issue, in which the tests never seem to run at all. These tests have no log output. This bug only covers #1, which to date has only been an issue in ios_chrome_unitests and net_unittests, specifically. All of the other test suites seem to be the second problem, which is Issue 694906 . If anyone sees other test suites with log output that indicates that they hung after starting to run tests, please post links here and we can see if they are also hitting race conditions.
,
Jun 7 2017
I can think of 2 solutions: 1. Add a counter of pending undelayed tasks in ScopedTaskEnvironment::TestTaskTracker instead of using the one from TaskTracker. Protect all accesses to this counter with |lock_| to avoid subtle race conditions like this one. This require adding a way for a derived class of TaskTracker to be notified when a task is posted. 2. Call DecrementNumPendingUndelayedTasks() at the end of TaskTracker::PerformRunTask() instead of at the end of TaskTracker::RunNextTask(). This fixes this specific error but is less robust to other potential errors. WDYT?
,
Jun 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/88d6f53c69299ac5a899d64904fa795861817736 commit 88d6f53c69299ac5a899d64904fa795861817736 Author: rohitrao <rohitrao@chromium.org> Date: Wed Jun 07 15:23:33 2017 [ios] Disable WebResourceUtilTest due to risk of timeouts. crbug.com/728216 details an issue that could cause ScopedTaskEnvironment::RunUntilIdle() to hang forever. This CL disables the tests that have a chance of hanging until a fix is instituted. BUG= 728216 Review-Url: https://codereview.chromium.org/2926133002 Cr-Commit-Position: refs/heads/master@{#477654} [modify] https://crrev.com/88d6f53c69299ac5a899d64904fa795861817736/ios/chrome/browser/web_resource/web_resource_util_unittest.cc
,
Jun 7 2017
Re. > So we're basically already idle and yet we're waiting for RunUntilIdle to > return. We're going to be waiting for a long time due to a race in the > following lines in ScopedTaskEnvironment::RunUntilIdle: > // The closure registered above may never run if the TaskScheduler queue > // starts empty. Post a TaskScheduler tasks to make sure that the queue > // doesn't start empty. > PostTask(FROM_HERE, BindOnce(&DoNothing)) I don't think there's a race there. It's find for the QuitClosure to be called ahead of RunLoop::Run() -- RunLoop explicitly supports that. The DoNothing() just ensures the TaskTracker hits the OnQueueEmptyClosure() again if it was already empty. Catching up with other posts...
,
Jun 7 2017
After working with a few approaches, we've come to the consensus that a virtual method call at the end of TaskTracker::RunNextTask should work and be surgical enough to maintain the existing guarantees of the other counters without a lot of rewriting.
virtual void TaskTracker::OnRunTaskCompleted() {}
Does nothing in the production TaskTracker
For ScopedTaskEnvironment, this is where we check number of pending tasks before posting the quit closure.
A change for this will be out soon.
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/48de879ed5cb3ea81d10710caaf33eec1e2c5d08 commit 48de879ed5cb3ea81d10710caaf33eec1e2c5d08 Author: Robert Liao <robliao@chromium.org> Date: Thu Jun 08 00:00:12 2017 Fix Task Counting Race in ScopedTaskEnvironment::TestTaskTracker This change introduces OnRunNextTaskCompleted(), called right before RunNextTask() returns and after all of the task metrics accounting is complete. This allows ScopedTaskEnvironment::TestTaskTracker to check the number of undelayed tasks and number of running tasks in one spot after all the counters are updated. Previously, the code performed a lookahead on the number of undelayed tasks, expected that the number running and number undelayed to be 1 before terminating. This is racy because there may be multiple tasks in flight and the number of running tasks could be 1 before all of the undelayed task accounting is complete (e.g. it could see 2 and then the number will be 0 immediately after that). BUG= 728216 Change-Id: I669f8da1f43c3a40500dbcab1f96d3677b18f2d8 Reviewed-on: https://chromium-review.googlesource.com/527334 Reviewed-by: Gabriel Charette <gab@chromium.org> Reviewed-by: Francois Doray <fdoray@chromium.org> Commit-Queue: Robert Liao <robliao@chromium.org> Cr-Commit-Position: refs/heads/master@{#477814} [modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/task_scheduler/task_tracker.cc [modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/task_scheduler/task_tracker.h [modify] https://crrev.com/48de879ed5cb3ea81d10710caaf33eec1e2c5d08/base/test/scoped_task_environment.cc
,
Jun 8 2017
I'm going to let the change bake for a bit, and then I'll issue reverts one at a time to reenable the tests.
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2fccbd0ee411a58f1092c2c036d614da16ca4730 commit 2fccbd0ee411a58f1092c2c036d614da16ca4730 Author: baxley <baxley@chromium.org> Date: Thu Jun 08 04:35:57 2017 Revert of Add recently removed iOS CQ unit tests back to FYI. (patchset #2 id:20001 of https://codereview.chromium.org/2922253004/ ) Reason for revert: Now that ios_chrome_unittests are back in common_tests.json, fyi bots have an exception for two instances of ios_chrome_unittests. Same with net_unittests. Original issue's description: > Add recently removed iOS CQ unit tests back to FYI. > > Create new fyi_tests.json so we can more easily run different > suites on multiple fyi configurations. > > BUG= 728216 > > Review-Url: https://codereview.chromium.org/2922253004 > Cr-Commit-Position: refs/heads/master@{#477528} > Committed: https://chromium.googlesource.com/chromium/src/+/77dac22383dca465e09cfbbb61fadfce83c9eede TBR=rohitrao@chromium.org,smut@google.com # Not skipping CQ checks because original CL landed more than 1 days ago. BUG= 728216 Review-Url: https://codereview.chromium.org/2924163004 Cr-Commit-Position: refs/heads/master@{#477888} [modify] https://crrev.com/2fccbd0ee411a58f1092c2c036d614da16ca4730/ios/build/bots/chromium.fyi/ios-simulator.json [delete] https://crrev.com/916a8290de5e9e41873282f01e11ffa7606c1531/ios/build/bots/tests/fyi_tests.json
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce commit e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce Author: robliao <robliao@chromium.org> Date: Thu Jun 08 23:57:43 2017 Revert of [ios] Disable WebResourceUtilTest due to risk of timeouts. (patchset #1 id:1 of https://codereview.chromium.org/2926133002/ ) Reason for revert: The underlying cause of http://crbug.com/728216 should be fixed. These tests should be good to go now. Original issue's description: > [ios] Disable WebResourceUtilTest due to risk of timeouts. > > crbug.com/728216 details an issue that could cause > ScopedTaskEnvironment::RunUntilIdle() to hang forever. This CL disables the > tests that have a chance of hanging until a fix is instituted. > > BUG= 728216 > > Review-Url: https://codereview.chromium.org/2926133002 > Cr-Commit-Position: refs/heads/master@{#477654} > Committed: https://chromium.googlesource.com/chromium/src/+/88d6f53c69299ac5a899d64904fa795861817736 TBR=baxley@chromium.org,liaoyuke@chromium.org,rohitrao@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG= 728216 Review-Url: https://codereview.chromium.org/2927223002 Cr-Commit-Position: refs/heads/master@{#478126} [modify] https://crrev.com/e8d7f36c67e1fc0d5f4b37682bc72c5be565a3ce/ios/chrome/browser/web_resource/web_resource_util_unittest.cc
,
Jun 9 2017
Looks like https://touch-swarming.appspot.com/task?id=36a56f9514533910&refresh=10&show_raw=1 failed with TIMED_OUT. The application started as there are some output, but then blocked. Does it mean the underlying issue is not completely fixed?
,
Jun 9 2017
There may still be issues with iossim/NSTask/NSPipe, covered in 709374 I believe.
,
Jun 9 2017
The NSPipe/NSTask issue in 709374 cause hangs before any output is generated IIUC. In the log added in #28, some output is generated before the app hangs: Uploading ios_web_inttests.app package contents... DONE. Installing '(null)' Install: CreatingStagingDirectory (5%) Install: ExtractingPackage (15%) Install: InspectingPackage (20%) Install: TakingInstallLock (20%) Install: PreflightingApplication (30%) Install: InstallingEmbeddedProfile (30%) Install: VerifyingApplication (40%) Install: CreatingContainer (50%) Install: InstallingApplication (60%) Install: PostflightingApplication (70%) Install: SandboxingApplication (80%) Install: GeneratingApplicationMap (90%) Install: Complete idevice-app-runner --udid 5d2215775d4988c7e95d58294cf9fab032e2fd3f --start com.google.gtest.ios-web-inttests Detected presence of a debugger, running without test timeouts. [23638:3079:0609/043058.322098:100432828341905:WARNING:resource_bundle.cc(353)] locale_file_path.empty() for locale en-US [==========] Running 35 tests from 8 test cases. [----------] Global test environment set-up. [----------] 1 test from BrowserStateWebViewPartitionTest [ RUN ] BrowserStateWebViewPartitionTest.LocalStorage [ OK ] BrowserStateWebViewPartitionTest.LocalStorage (1012 ms) [----------] 1 test from BrowserStateWebViewPartitionTest (1013 ms total) [----------] 11 tests from HistoryStateOperationsTest [ RUN ] HistoryStateOperationsTest.NoOpPushUnresolvable [ OK ] HistoryStateOperationsTest.NoOpPushUnresolvable (1953 ms) [ RUN ] HistoryStateOperationsTest.NoOpReplaceUnresolvable [ OK ] HistoryStateOperationsTest.NoOpReplaceUnresolvable (1956 ms) [ RUN ] HistoryStateOperationsTest.NoOpPushDifferentScheme [ OK ] HistoryStateOperationsTest.NoOpPushDifferentScheme (2061 ms) [ RUN ] HistoryStateOperationsTest.NoOpRelaceDifferentScheme [ OK ] HistoryStateOperationsTest.NoOpRelaceDifferentScheme (2010 ms) [ RUN ] HistoryStateOperationsTest.NoOpPushDifferentOrigin [ OK ] HistoryStateOperationsTest.NoOpPushDifferentOrigin (1938 ms) [ RUN ] HistoryStateOperationsTest.NoOpReplaceDifferentOrigin [ OK ] HistoryStateOperationsTest.NoOpReplaceDifferentOrigin (2070 ms) [ RUN ] HistoryStateOperationsTest.StateReplacement [ OK ] HistoryStateOperationsTest.StateReplacement (612 ms) [ RUN ] HistoryStateOperationsTest.StateReplacementBackForward [ OK ] HistoryStateOperationsTest.StateReplacementBackForward (865 ms) [ RUN ] HistoryStateOperationsTest.PushState [ OK ] HistoryStateOperationsTest.PushState (819 ms) [ RUN ] HistoryStateOperationsTest.ReplaceStatePostRequest [ OK ] HistoryStateOperationsTest.ReplaceStatePostRequest (698 ms) [ RUN ] HistoryStateOperationsTest.ReplaceStateNoHashChangeEvent [ OK ] HistoryStateOperationsTest.ReplaceStateNoHashChangeEvent (594 ms) [----------] 11 tests from HistoryStateOperationsTest (15591 ms total) [----------] 3 tests from WindowLocationTest [ RUN ] WindowLocationTest.WindowLocationAssignUnresolvable [ OK ] WindowLocationTest.WindowLocationAssignUnresolvable (1357 ms) [ RUN ] WindowLocationTest.WindowLocationReplaceUnresolvable [ OK ] WindowLocationTest.WindowLocationReplaceUnresolvable (1994 ms) [ RUN ] WindowLocationTest.WindowLocationSetToDOMString ^ app hangs here.
,
Jun 9 2017
Oops, the hangs I was referring to are simulator only -- as well as hanging before any output is visible. Agree with sdefresne@. liaoyuke@ can you reproduce this new hang and get a trace?
,
Jun 9 2017
Filed #28/#30 as Issue 731740. That is a separate issue that I've seen before, so I think that test is just busted and needs to be disabled, investigated, and fixed. I'm pretty confident that it's unrelated to the other timeout issues we've been seeing.
,
Jun 12 2017
Another example of test that TIMED_OUT but still displayed output can be seen at https://touch-swarming.appspot.com/task?id=36b55116c2674810&refresh=10&show_raw=1. Maybe a different bug, so filled issue 732388 for this new occurrence in case it is a different root cause.
,
Jun 12 2017
Do we have enough confidence in the change to revert the waterfall changes?
,
Jun 16 2017
Marking fixed. The tests should be enabled and we haven't received any hang reports. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by sczs@chromium.org
, May 31 2017