
Issue 616865

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug




flakiness on swarmed telemetry_perf_unittests

Project Member Reported by stip@chromium.org, Jun 2 2016

Issue description

We're seeing some flakiness on telemetry_perf_unittests now that it's been swarmed on https://build.chromium.org/p/chromium.android/builders/Android%20Swarm%20Builder.

See, for example, https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.android%2FAndroid_Swarm_Builder%2F3800%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests_on_Android%2F0%2Fstdout

Exception: Error from worker 7 (traceback follows):
Traceback (most recent call last):
  File "/tmp/runHFB2o4/third_party/catapult/third_party/typ/typ/pool.py", line 159, in _loop
    context_after_pre = pre_fn(host, worker_num, context)
  File "/tmp/runHFB2o4/third_party/catapult/third_party/typ/typ/runner.py", line 755, in _setup_process
    child.context_after_setup = child.setup_fn(child, child.context)
  File "/tmp/runHFB2o4/third_party/catapult/telemetry/telemetry/testing/run_tests.py", line 267, in _SetUpProcess
    android_devices[child.worker_num-1].guid)
IndexError: list index out of range

Looks like it was trying to contact a phone which went offline or something?
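
For illustration, the traceback shows the worker setup indexing the device list with child.worker_num-1, so the lookup raises IndexError whenever fewer devices are visible than there are workers (for example, if a device drops off adb during discovery). A minimal sketch of a guarded lookup follows; device_for_worker is a hypothetical helper, not something that exists in telemetry, and the serials are placeholders:

```python
def device_for_worker(android_devices, worker_num):
    """Return the device assigned to a 1-based worker number.

    Hypothetical helper for illustration only. The traceback shows
    telemetry's _SetUpProcess indexing the list directly instead.
    """
    index = worker_num - 1
    # Reject both too-large indices and worker_num <= 0, which would
    # otherwise wrap around silently via negative indexing.
    if not 0 <= index < len(android_devices):
        raise RuntimeError(
            'worker %d has no device: only %d device(s) visible'
            % (worker_num, len(android_devices)))
    return android_devices[index]


devices = ['device-a', 'device-b']  # placeholder serials
second = device_for_worker(devices, 2)
```

With only two devices visible, asking for worker 7 (as in the failing run) would raise a RuntimeError that names the mismatch rather than a bare IndexError.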

 

Comment 1 by stip@chromium.org, Jun 2 2016

Note that Android Tests does not seem to have this issue: https://build.chromium.org/p/chromium.linux/builders/Android%20Tests?numbuilds=100
Cc: dpranke@chromium.org
Yup, at least in that log file it looks like device 03848e98003bfc91 fell offline in the middle of the run.
It's rooting while deleting a temporary file on another thread. Rooting briefly drops the adb connection. I'm not immediately sure why that'd affect swarmed tests but not unswarmed ones.
Yup, I noticed that there's a bunch of stuff happening in parallel that ideally would only be done once. We might want to consider reworking how the android initialization is done in telemetry to do more stuff in telemetry/util/run_tests.py before calling typ.runner.run().
I think the parallelization on Android using typ is just a stop-gap measure until we have real Android swarming sharding support. Once we have that, I think each swarming task should run telemetry tests against one Android device.
I believe we remove temporary files in parallel regardless of typ.
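
As a rough sketch of the suggestion above to do more of the Android initialization once before calling typ.runner.run(): device discovery could run a single time in the parent process, with each worker receiving a pre-assigned serial. The names assign_devices and discover_fn are hypothetical and stand in for whatever discovery telemetry already performs; this is not how run_tests.py is actually structured.

```python
def assign_devices(discover_fn, num_workers):
    """Discover devices once, then map 1-based worker numbers to serials.

    discover_fn is a hypothetical callable returning a list of serials.
    """
    serials = discover_fn()  # single discovery call, in the parent only
    # Never hand out more worker slots than there are devices.
    usable = min(num_workers, len(serials))
    return {worker_num: serials[worker_num - 1]
            for worker_num in range(1, usable + 1)}


calls = []


def fake_discover():
    """Stand-in for real discovery; records how often it is invoked."""
    calls.append(1)
    return ['serial-1', 'serial-2']


assignment = assign_devices(fake_discover, num_workers=8)
```

The resulting mapping could then be handed to the workers (for instance through the context object that appears in the traceback), so that per-worker setup would not need to query adb at all.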
Looking a bit more, this is more interesting:

(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:118  ********************************************************************************
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:120  Exception on thread TimeoutThread-1-for-MainThread (attempt 1 of 3)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:121  ********************************************************************************
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124  Traceback (most recent call last):
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/timeout_retry.py", line 167, in Run
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      error_log_func=error_log_func)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      self._JoinAll(watcher, timeout)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 158, in _JoinAll
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      thread.ReraiseIfException()
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      self._ret = self._func(*self._args, **self._kwargs)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/timeout_retry.py", line 160, in <lambda>
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs),
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      return f(*args, **kwargs)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124    File "/tmp/runHFB2o4/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 238, in _RunAdbCmd
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124      args, output, status, device_serial)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124  AdbCommandFailedError: adb devices: failed with exit status 1 and output:
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742  timeout_retry._LogLastException:124  - cannot bind 'tcp:5037': Address already in use
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - ADB server didn't ACK
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - * failed to start daemon *
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - error: cannot connect to daemon: Connection refused
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - error: cannot connect to daemon: Connection refused
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - List of devices attached
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  - * daemon not running. starting it now on port 5037 *
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:124  
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742  timeout_retry._LogLastException:125  ********************************************************************************
... also, telemetry is circumventing devil entirely and doing a raw adb devices call through subprocess :(

https://codesearch.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/android_device.py?rcl=0&l=159
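
One hazard of a raw adb devices call is that its output can be interleaved with daemon status lines like the ones in the log above. A minimal sketch of a tolerant parser follows, assuming the usual "serial, whitespace, state" line format; parse_adb_devices is a hypothetical name and this is neither telemetry's nor devil's implementation:

```python
def parse_adb_devices(output):
    """Return the serials of devices reported in the 'device' state."""
    serials = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header, and daemon status lines such as
        # '* daemon not running. starting it now on port 5037 *'.
        if (not line or line.startswith('*')
                or line.startswith('List of devices')):
            continue
        parts = line.split()
        # Keep only online devices; 'offline' and 'unauthorized' entries
        # and stray error lines are ignored.
        if len(parts) >= 2 and parts[1] == 'device':
            serials.append(parts[0])
    return serials


sample = (
    '* daemon not running. starting it now on port 5037 *\n'
    '* daemon started successfully *\n'
    'List of devices attached\n'
    '0123456789abcdef\tdevice\n'
    '03848e98003bfc91\toffline\n'
    '\n'
)
online = parse_adb_devices(sample)
```

Parsing alone does not address the "cannot bind 'tcp:5037'" failure, which comes from the adb server itself, so this only covers the output-handling half of the problem.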
Labels: -Pri-3 Pri-1
John, since you know most about the devil, can you take over this bug?
Owner: jbudorick@chromium.org
Status: Started (was: Untriaged)
Sure.
Project Member

Comment 13 by bugdroid1@chromium.org, Jun 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/57b5d8e39a21344860d49e54f69aa79081cee613

commit 57b5d8e39a21344860d49e54f69aa79081cee613
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jun 03 15:13:48 2016

Roll src/third_party/catapult/ c502262d9..90899c6d4 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/c502262d988b..90899c6d475e

$ git log c502262d9..90899c6d4 --date=short --no-merges --format='%ad %ae %s'

BUG= 616865 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2035083002
Cr-Commit-Position: refs/heads/master@{#397709}

[modify] https://crrev.com/57b5d8e39a21344860d49e54f69aa79081cee613/DEPS

There has been one failure in telemetry_perf_unittests since my CL rolled, but that was an individual test failure rather than the adb/device failures we were seeing overnight. I'll continue to monitor through the day and close this out if no further failures like this appear.
Saw one similar flake this afternoon in https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/82053. WPR is also doing a raw adb call. I'll look into at least letting it optionally take a path to adb next week.
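
A sketch of what "optionally take a path to adb" might look like for a caller that builds its own command line. The function adb_command and its parameters are hypothetical, not WPR's actual interface; the only adb feature relied on is the standard -s flag for selecting a device by serial.

```python
def adb_command(args, adb_path=None, device_serial=None):
    """Build an adb argv list, using an explicit binary when one is given.

    Falls back to whatever 'adb' resolves to on PATH if adb_path is None.
    """
    cmd = [adb_path or 'adb']
    if device_serial:
        cmd += ['-s', device_serial]
    return cmd + list(args)


default_cmd = adb_command(['devices'])
pinned_cmd = adb_command(['shell', 'ls'],
                         adb_path='/path/to/adb',  # placeholder path
                         device_serial='0123456789abcdef')
```

Passing the same adb_path everywhere would let every tool in a run talk to adb through one binary instead of each resolving its own from PATH.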
Project Member

Comment 16 by bugdroid1@chromium.org, Jun 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/735102b128a6dce721c0e111b045947a8409b2a7

commit 735102b128a6dce721c0e111b045947a8409b2a7
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Wed Jun 22 23:11:34 2016

Roll src/third_party/catapult/ ea4633b67..41f6824c5 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/ea4633b677af..41f6824c5521

$ git log ea4633b67..41f6824c5 --date=short --no-merges --format='%ad %ae %s'

BUG= 616865 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2090563005
Cr-Commit-Position: refs/heads/master@{#401448}

[modify] https://crrev.com/735102b128a6dce721c0e111b045947a8409b2a7/DEPS

Project Member

Comment 17 by bugdroid1@chromium.org, Jun 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/df1ed0994b57c32b45d8c17d98a1f537586f812c

commit df1ed0994b57c32b45d8c17d98a1f537586f812c
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Mon Jun 27 13:13:00 2016

Roll src/third_party/catapult/ 310495788..6c4147ba7 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/310495788cee..6c4147ba7e52

$ git log 310495788..6c4147ba7 --date=short --no-merges --format='%ad %ae %s'

BUG= 616865 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2100123002
Cr-Commit-Position: refs/heads/master@{#402159}

[modify] https://crrev.com/df1ed0994b57c32b45d8c17d98a1f537586f812c/DEPS

Status: Fixed (was: Started)
