New issue
Advanced search Search tips

Issue 921662 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 881991



Sign in to add a comment

android telemetry_perf_unittests returning INVALID_TEST_RESULTS

Project Member Reported by erikc...@chromium.org, Jan 14

Issue description

According to go/top-cq-flakes, telemetry_perf_unittests on Android is responsible for the vast majority of INVALID_TEST_RESULTS in the last few days.

Digging into a specific example:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/166261

We see that 'with_patch', 'without_patch' and 'retry with patch' are all failing with the same error, which suggests a problem either with tip of tree, or possibly device/ADB related.

Digging into the failing tasks:
https://chromium-swarm.appspot.com/task?id=4256c7d5c3bd3310&refresh=10&show_raw=1 [ran on bot: build395-m1--device1]
https://chromium-swarm.appspot.com/task?id=4256f69b0ccd9f10&refresh=10&show_raw=1 [ran on bot: build269-m4--device5]
https://chromium-swarm.appspot.com/task?id=4256de1969fbd010&refresh=10&show_raw=1 [ran on bot: build406-m1--device1]

The first error occurs on the same test, on the same shard, in all three cases:
system_health.memory_mobile/browse:media:googleplaystore:2019

This story just landed on 1/10/2019:
https://chromium-review.googlesource.com/c/chromium/src/+/1405268

Seems likely related? 


 
Screen Shot 2019-01-14 at 1.10.58 PM.png
137 KB View Download
Cc: crouleau@chromium.org
Components: Tests>Telemetry
Yes, I just added that story where it's timing out. Seems like it might be having trouble finding the particular selector on the web page to click. I'm looking into it.
Two things are going wrong here:

1. system_health.memory_mobile/browse:media:googleplaystore:2019 is failing

2. The whole test suite is timing out, causing INVALID_TEST_RESULTS

Even in a case where we have failing tests, we really don't want the whole test suite to time out, since INVALID_TEST_RESULTS means that the infrastructure cannot track test results helpfully.

I think that the reason that the whole test suite times out is that we retry system_health.memory_mobile/browse:media:googleplaystore:2019 three times, and for each retry, we get stuck waiting for 60 seconds for some command to come back. I see four possible solutions:

A. Increase the timeout for this job.
B. Reduce the number of story retries. If it fails 3 times in a row, do we really need to try it a fourth time to know it is going to fail?
C. Reduce the wait time for that "Timeout after 60s while waiting for JavaScript:window.__mouseClickActionDone" 60 seconds is probably way longer than we need.
D. When a timeout signal is sent, we should respond to it appropriately by stopping the job and then quickly writing out the test-results.json before we are force terminated.

The only real solution is D since it is the only solution that will stand the test of time.

Thanks for investigating!

> A. Increase the timeout for this job.

Given that we are hitting the timeout when everything is behaving as intended, it does seem like the current timeout is too low.

> B. Reduce the number of story retries. If it fails 3 times in a row, do we really need to try it a fourth time to know it is going to fail?

If possible, I'd like to keep this consistent with other test suites and perform the same number of retries at the test suite layer. We have a lot of other retry layers built in to the CQ, and I'd like to begin to remove those retry layers in favor of test-suite retries. See https://docs.google.com/document/d/1O9nzVMA6rEe2-rhjsni_8wS9Lw4OnFpK2NE0wWB5VaY/edit

> C. Reduce the wait time for that "Timeout after 60s while waiting for JavaScript:window.__mouseClickActionDone" 60 seconds is probably way longer than we need.

Agreed.

> D. When a timeout signal is sent, we should respond to it appropriately by stopping the job and then quickly writing out the test-results.json before we are force terminated.

Agreed.
Labels: -Pri-3 Pri-1
Owner: crouleau@chromium.org
Status: Assigned (was: Untriaged)
Project Member

Comment 6 by bugdroid1@chromium.org, Jan 16

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5138baf72452ce8c040b5ae8991d1ac00da106e9

commit 5138baf72452ce8c040b5ae8991d1ac00da106e9
Author: Madeleine Barowsky <mbarowsky@chromium.org>
Date: Wed Jan 16 00:25:10 2019

Disable browse:media:googleplaystore:2019 story temporarily.

Bug: 921662
Change-Id: Iec3eb9c16a1ba3535dee44dc7ce7b5a23b9df62a
Reviewed-on: https://chromium-review.googlesource.com/c/1410366
Commit-Queue: Madeleine Barowsky <mbarowsky@chromium.org>
Reviewed-by: Caleb Rouleau <crouleau@chromium.org>
Cr-Commit-Position: refs/heads/master@{#622917}
[modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/benchmark.csv
[modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/page_sets/data/system_health_mobile.json
[modify] https://crrev.com/5138baf72452ce8c040b5ae8991d1ac00da106e9/tools/perf/page_sets/system_health/browsing_stories.py

Comment 7 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 8 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Tests>Telemetry

Comment 9 by mbarow...@chromium.org, Jan 16 (6 days ago)

Since disabling my added story yesterday, the test suite has continued to time out:

https://ci.chromium.org/b/8924168612932676112

I also did testing locally to try to reproduce the story failure, but it passed even on tip-of-tree. So it seems like an infrastructure issue?

Comment 10 by erikc...@chromium.org, Jan 16 (6 days ago)

'with patch' run:
https://chromium-swarm.appspot.com/task?id=426fcf0e6beb0710&refresh=10&show_raw=1

'without patch' run:
https://chromium-swarm.appspot.com/task?id=426fe16b84951910&refresh=10&show_raw=1

Both timed out.

Both have the same error: 
"""
  NoAdbError: [Errno 2] No such file or directory
  Failed to delete temporary file /data/local/tmp/temp_file-238de51387ace: [Errno 2] No such file or directory
  Exception raised when cleaning story run: 
"""

Comment 11 by bpastene@chromium.org, Jan 16 (6 days ago)

Cc: bpastene@chromium.org
This test is responsible for a 80+ build backlog on one of the android CQ bots:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel?limit=200

Uploaded https://chromium-review.googlesource.com/c/chromium/src/+/1415436 for the timeout bump.

Comment 12 by bpastene@chromium.org, Jan 16 (6 days ago)

And any "No such file or directory" errors you may see at the end of a TIMED_OUT shard are possibly a consequence of the task not quitting after swarming sends it a SIGTERM. It stays around and continues to run while swarming starts cleaning up the working dir from underneath it. It only quits after an eventual SIGKILL. (We've seen this happen in other places.)

See https://chromium.googlesource.com/infra/luci/luci-py/+/master/appengine/swarming/doc/Bot.md#graceful-termination_aka-the-sigterm-and-sigkill-dance for more details.
Project Member

Comment 13 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72

commit a9f87d1e43bbd275754eea2c55b3f3905d5a2a72
Author: Ben Pastene <bpastene@chromium.org>
Date: Wed Jan 16 21:06:45 2019

Bump timeout of telemetry_perf_unittests shards on android-marshmallow-arm64-rel.

From 16min to 30min.

The suite has some very uneven sharding. Some shard finish in 300s, some
in 900s:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/169047

No try to alleviate the CQ bot.
NOTRY=true

Bug: 921662
Change-Id: Id44f10013604aab164164ff55a154d5dd54be058
Reviewed-on: https://chromium-review.googlesource.com/c/1415436
Reviewed-by: Caleb Rouleau <crouleau@chromium.org>
Commit-Queue: Ben Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#623363}
[modify] https://crrev.com/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72/testing/buildbot/chromium.android.json
[modify] https://crrev.com/a9f87d1e43bbd275754eea2c55b3f3905d5a2a72/testing/buildbot/test_suite_exceptions.pyl

Comment 14 by crouleau@chromium.org, Jan 16 (6 days ago)

Thanks for the dance link and the explanation and the timeout increase! I will teach telemetry_perf_unittests the steps soon.

Comment 15 by crouleau@chromium.org, Jan 16 (6 days ago)

Ben, could we also shard against more devices, which would make increasing the timeout not as important? This test suite is going to grow over time since we are trying to add more of the Telemetry test cases to it.
Project Member

Comment 16 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/a9729f33472aa371f0e16e70dbada1d96ee20427

commit a9729f33472aa371f0e16e70dbada1d96ee20427
Author: Caleb Rouleau <crouleau@chromium.org>
Date: Wed Jan 16 22:11:41 2019

[Telemetry] Control default action timeout in a single constant.

It seems like 60 seconds, which is what everyone seems to be using,
is a bit longer than it should be on modern computers. A high
timeout means that we are more likely to hit suite-level timeouts
and we are more likely to waste time waiting for a broken story
rather than moving on to stories that work correctly. In a
followup CL, I will see if I can reduce this timeout without breaking
a lot of tests. Any tests that break can be edited to not use the
default.

Bug: chromium:921662
Change-Id: Ib9ba75da6b2041200d07467306fa583aeb6e4613
Reviewed-on: https://chromium-review.googlesource.com/c/1413774
Commit-Queue: Caleb Rouleau <crouleau@chromium.org>
Reviewed-by: John Chen <johnchen@chromium.org>

[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/key_event.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/scroll.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/load_media.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/mouse_click.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/swipe.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/tap.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/page_action.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/drag.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/pinch.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/action_runner.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/scroll_bounce.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/seek.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/repeatable_scroll.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/wait.py
[modify] https://crrev.com/a9729f33472aa371f0e16e70dbada1d96ee20427/telemetry/telemetry/internal/actions/navigate.py

Comment 17 by bpastene@chromium.org, Jan 16 (6 days ago)

Looks like the timeout bump helped. Builds are now passing the test, with shard #1 finishing in 990s (just over its previous 960s timeout):
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/170072

As for increasing the shard count, I'd urge against it if we can (at least in the short-term). The device pool it launches tests into is usually pretty overloaded. We're hitting ~18min pending times in that pool today: https://chromium-swarm.appspot.com/task?id=4271948366e88910. We theoretically shoot for 0 pending times :( so we're definitely a bit overloaded as is.

We can start talking about increasing the pool size, but a deployment like that usually takes 1-2 quarters.
Project Member

Comment 18 by bugdroid1@chromium.org, Jan 17 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85

commit 9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85
Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Date: Thu Jan 17 00:48:56 2019

Roll src/third_party/catapult c329a12f2660..a9729f33472a (1 commits)

https://chromium.googlesource.com/catapult.git/+log/c329a12f2660..a9729f33472a


git log c329a12f2660..a9729f33472a --date=short --no-merges --format='%ad %ae %s'
2019-01-16 crouleau@chromium.org [Telemetry] Control default action timeout in a single constant.


Created with:
  gclient setdep -r src/third_party/catapult@a9729f33472a

The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.

CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel

BUG=chromium:921662
TBR=sullivan@chromium.org

Change-Id: I984223f22c89188c95256682e9e5c306d525869c
Reviewed-on: https://chromium-review.googlesource.com/c/1416352
Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#623470}
[modify] https://crrev.com/9a9d60ace5ab70b2f6a0c53e30d5137dc905ff85/DEPS

Comment 19 by mbarow...@chromium.org, Jan 17 (5 days ago)

I don't think the root cause has been addressed yet, and it seems like it has to do with offline pages.

When my story timed out, it was because it encountered DevTools errors trying to load any resources [1]:

 DevTools console [network]: Failed to load resource: the server responded with a status of 404 ()

After I removed my story, we saw a different story, system_health.memory_mobile/load:tools:weather, time out with the same error [2]. Interestingly, this case was the sixth memory_mobile story to run. The previous five passed and 
were loading wprgo recordings with a mix of http and https sites.

Currently, filtering go/top-cq-flakes INVALID_TEST_RESULTS for today only (Jan 17) shows chrome_public_test_apk timing out while waiting for a callback from OfflinePageAutoFetchTest.waitForPageAdded.

[1]: https://chromium-swarm.appspot.com/task?id=4256c7d5c3bd3310&refresh=10&show_raw=1&wide_logs=true
[2]: https://chromium-swarm.appspot.com/task?id=426fcf0e6beb0710&refresh=10&show_raw=1&wide_logs=true

Comment 20 by crouleau@chromium.org, Jan 17 (5 days ago)

Thanks for the investigation, Madeleine!
I filed issue 923070 to look into that side of this bug, and I think that is important for sure. I don't want this bug to have too many responsibilities, and I think the focus of this one should be to make sure that telemetry_perf_unittests runs never returns INVALID_TEST_RESULTS even if the shard times out.

Sign in to add a comment