android telemetry_perf_unittests returning INVALID_TEST_RESULTS |
|||||
Issue descriptionAccording to go/top-cq-flakes, telemetry_perf_unittests on Android is responsible for the vast majority of INVALID_TEST_RESULTS in the last few days. Digging into a specific example: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/166261 We see that 'with_patch', 'without_patch' and 'retry with patch' are all failing with the same error, which suggests a problem either with tip of tree, or possibly device/ADB related. Digging into the failing tasks: https://chromium-swarm.appspot.com/task?id=4256c7d5c3bd3310&refresh=10&show_raw=1 [ran on bot: build395-m1--device1] https://chromium-swarm.appspot.com/task?id=4256f69b0ccd9f10&refresh=10&show_raw=1 [ran on bot: build269-m4--device5] https://chromium-swarm.appspot.com/task?id=4256de1969fbd010&refresh=10&show_raw=1 [ran on bot: build406-m1--device1] The first error occurs on the same test, on the same shard, in all three cases: system_health.memory_mobile/browse:media:googleplaystore:2019 This story just landed on 1/10/2019: https://chromium-review.googlesource.com/c/chromium/src/+/1405268 Seems likely related?
,
Jan 14
Yes, I just added that story where it's timing out. Seems like it might be having trouble finding the particular selector on the web page to click. I'm looking into it.
,
Jan 14
Two things are going wrong here: 1. system_health.memory_mobile/browse:media:googleplaystore:2019 is failing 2. The whole test suite is timing out, causing INVALID_TEST_RESULTS Even in a case where we have failing tests, we really don't want the whole test suite to time out, since INVALID_TEST_RESULTS means that the infrastructure cannot track test results helpfully. I think that the reason that the whole test suite times out is that we retry system_health.memory_mobile/browse:media:googleplaystore:2019 three times, and for each retry, we get stuck waiting for 60 seconds for some command to come back. I see four possible solutions: A. Increase the timeout for this job. B. Reduce the number of story retries. If it fails 3 times in a row, do we really need to try it a fourth time to know it is going to fail? C. Reduce the wait time for that "Timeout after 60s while waiting for JavaScript:window.__mouseClickActionDone" 60 seconds is probably way longer than we need. D. When a timeout signal is sent, we should respond to it appropriately by stopping the job and then quickly writing out the test-results.json before we are force terminated. The only real solution is D since it is the only solution that will stand the test of time.
,
Jan 14
Thanks for investigating! > A. Increase the timeout for this job. Given that we are hitting the timeout when everything is behaving as intended, it does seem like the current timeout is too low. > B. Reduce the number of story retries. If it fails 3 times in a row, do we really need to try it a fourth time to know it is going to fail? If possible, I'd like to keep this consistent with other test suites and perform the same number of retries at the test suite layer. We have a lot of other retry layers built in to the CQ, and I'd like to begin to remove those retry layers in favor of test-suite retries. See https://docs.google.com/document/d/1O9nzVMA6rEe2-rhjsni_8wS9Lw4OnFpK2NE0wWB5VaY/edit > C. Reduce the wait time for that "Timeout after 60s while waiting for JavaScript:window.__mouseClickActionDone" 60 seconds is probably way longer than we need. Agreed. > D. When a timeout signal is sent, we should respond to it appropriately by stopping the job and then quickly writing out the test-results.json before we are force terminated. Agreed.
,
Jan 14
,
Jan 16
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5138baf72452ce8c040b5ae8991d1ac00da106e9 commit 5138baf72452ce8c040b5ae8991d1ac00da106e9 Author: Madeleine Barowsky <mbarowsky@chromium.org> Date: Wed Jan 16 00:25:10 2019 Disable browse:media:googleplaystore:2019 story temporarily. Bug: 921662 Change-Id: Iec3eb9c16a1ba3535dee44dc7ce7b5a23b9df62a Reviewed-on: https://chromium-review.googlesource.com/c/1410366 Commit-Queue: Madeleine Barowsky <mbarowsky@chromium.org> Reviewed-by: Caleb Rouleau <crouleau@chromium.org> Cr-Commit-Position: refs/heads/master@{#622917} [modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/benchmark.csv [modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/page_sets/data/system_health_mobile.json [modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/page_sets/system_health/browsing_stories.py
,
Jan 16
(6 days ago)
,
Jan 16
(6 days ago)
,
Jan 16
(6 days ago)
Since disabling my added story yesterday, the test suite has continued to time out: https://ci.chromium.org/b/8924168612932676112 I also did testing locally to try to reproduce the story failure, but it passed even on tip-of-tree. So it seems like an infrastructure issue?
,
Jan 16
(6 days ago)
'with patch' run: https://chromium-swarm.appspot.com/task?id=426fcf0e6beb0710&refresh=10&show_raw=1 'without patch' run: https://chromium-swarm.appspot.com/task?id=426fe16b84951910&refresh=10&show_raw=1 Both timed out. Both have the same error: """ NoAdbError: [Errno 2] No such file or directory Failed to delete temporary file /data/local/tmp/temp_file-238de51387ace: [Errno 2] No such file or directory Exception raised when cleaning story run: """
,
Jan 16
(6 days ago)
This test is responsible for a 80+ build backlog on one of the android CQ bots: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel?limit=200 Uploaded https://chromium-review.googlesource.com/c/chromium/src/+/1415436 for the timeout bump.
,
Jan 16
(6 days ago)
And any "No such file or directory" errors you may see at the end of a TIMED_OUT shard are possibly a consequence of the task not quitting after swarming sends it a SIGTERM. It stays around and continues to run while swarming starts cleaning up the working dir from underneath it. It only quits after an eventual SIGKILL. (We've seen this happen in other places.) See https://chromium.googlesource.com/infra/luci/luci-py/+/master/appengine/swarming/doc/Bot.md#graceful-termination_aka-the-sigterm-and-sigkill-dance for more details.
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72 commit a9f87d1e43bbd275754eea2c55b3f3905d5a2a72 Author: Ben Pastene <bpastene@chromium.org> Date: Wed Jan 16 21:06:45 2019 Bump timeout of telemetry_perf_unittests shards on android-marshmallow-arm64-rel. From 16min to 30min. The suite has some very uneven sharding. Some shard finish in 300s, some in 900s: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/169047 No try to alleviate the CQ bot. NOTRY=true Bug: 921662 Change-Id: Id44f10013604aab164164ff55a154d5dd54be058 Reviewed-on: https://chromium-review.googlesource.com/c/1415436 Reviewed-by: Caleb Rouleau <crouleau@chromium.org> Commit-Queue: Ben Pastene <bpastene@chromium.org> Cr-Commit-Position: refs/heads/master@{#623363} [modify] https://crrev.com/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72/testing/buildbot/chromium.android.json [modify] https://crrev.com/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72/testing/buildbot/test_suite_exceptions.pyl
,
Jan 16
(6 days ago)
Thanks for the dance link and the explanation and the timeout increase! I will teach telemetry_perf_unittests the steps soon.
,
Jan 16
(6 days ago)
Ben, could we also shard against more devices, which would make increasing the timeout not as important? This test suite is going to grow over time since we are trying to add more of the Telemetry test cases to it.
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/a9729f33472aa371f0e16e70dbada1d96ee20427 commit a9729f33472aa371f0e16e70dbada1d96ee20427 Author: Caleb Rouleau <crouleau@chromium.org> Date: Wed Jan 16 22:11:41 2019 [Telemetry] Control default action timeout in a single constant. It seems like 60 seconds, which is what everyone seems to be using, is a bit longer than it should be on modern computers. A high timeout means that we are more likely to hit suite-level timeouts and we are more likely to waste time waiting for a broken story rather than moving on to stories that work correctly. In a followup CL, I will see if I can reduce this timeout without breaking a lot of tests. Any tests that break can be edited to not use the default. Bug: chromium:921662 Change-Id: Ib9ba75da6b2041200d07467306fa583aeb6e4613 Reviewed-on: https://chromium-review.googlesource.com/c/1413774 Commit-Queue: Caleb Rouleau <crouleau@chromium.org> Reviewed-by: John Chen <johnchen@chromium.org> [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/key_event.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/scroll.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/load_media.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/mouse_click.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/swipe.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/tap.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/page_action.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/drag.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/pinch.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/action_runner.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/scroll_bounce.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/seek.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/repeatable_scroll.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/wait.py [modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/navigate.py
,
Jan 16
(6 days ago)
Looks like the timeout bump helped. Builds are now passing the test, with shard #1 finishing in 990s (just over its previous 960s timeout): https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/170072 As for increasing the shard count, I'd urge against it if we can (at least in the short-term). The device pool it launches tests into is usually pretty overloaded. We're hitting ~18min pending times in that pool today: https://chromium-swarm.appspot.com/task?id=4271948366e88910. We theoretically shoot for 0 pending times :( so we're definitely a bit overloaded as is. We can start talking about increasing the pool size, but a deployment like that usually takes 1-2 quarters.
,
Jan 17
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85 commit 9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Thu Jan 17 00:48:56 2019 Roll src/third_party/catapult c329a12f2660..a9729f33472a (1 commits) https://chromium.googlesource.com/catapult.git/+log/c329a12f2660..a9729f33472a git log c329a12f2660..a9729f33472a --date=short --no-merges --format='%ad %ae %s' 2019-01-16 crouleau@chromium.org [Telemetry] Control default action timeout in a single constant. Created with: gclient setdep -r src/third_party/catapult@a9729f33472a The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG=chromium:921662 TBR=sullivan@chromium.org Change-Id: I984223f22c89188c95256682e9e5c306d525869c Reviewed-on: https://chromium-review.googlesource.com/c/1416352 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#623470} [modify] https://crrev.com/9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85/DEPS
,
Jan 17
(5 days ago)
I don't think the root cause has been addressed yet, and it seems like it has to do with offline pages. When my story timed out, it was because it encountered DevTools errors trying to load any resources [1]: DevTools console [network]: Failed to load resource: the server responded with a status of 404 () After I removed my story, we saw a different story, system_health.memory_mobile/load:tools:weather, time out with the same error [2]. Interestingly, this case was the sixth memory_mobile story to run. The previous five passed and were loading wprgo recordings with a mix of http and https sites. Currently, filtering go/top-cq-flakes INVALID_TEST_RESULTS for today only (Jan 17) shows chrome_public_test_apk timing out while waiting for a callback from OfflinePageAutoFetchTest.waitForPageAdded. [1]: https://chromium-swarm.appspot.com/task?id=4256c7d5c3bd3310&refresh=10&show_raw=1&wide_logs=true [2]: https://chromium-swarm.appspot.com/task?id=426fcf0e6beb0710&refresh=10&show_raw=1&wide_logs=true
,
Jan 17
(5 days ago)
Thanks for the investigation, Madeleine! I filed issue 923070 to look into that side of this bug, and I think that is important for sure. I don't want this bug to have too many responsibilities, and I think the focus of this one should be to make sure that telemetry_perf_unittests runs never returns INVALID_TEST_RESULTS even if the shard times out. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by estaab@chromium.org
, Jan 14Components: Tests>Telemetry