flakiness on swarmed telemetry_perf_unittests
Issue description
We're seeing some flakiness on telemetry_perf_unittests now that it's been swarmed on https://build.chromium.org/p/chromium.android/builders/Android%20Swarm%20Builder. See, for example, https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.android%2FAndroid_Swarm_Builder%2F3800%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests_on_Android%2F0%2Fstdout
Exception: Error from worker 7 (traceback follows):
Traceback (most recent call last):
  File "/tmp/runHFB2o4/third_party/catapult/third_party/typ/typ/pool.py", line 159, in _loop
    context_after_pre = pre_fn(host, worker_num, context)
  File "/tmp/runHFB2o4/third_party/catapult/third_party/typ/typ/runner.py", line 755, in _setup_process
    child.context_after_setup = child.setup_fn(child, child.context)
  File "/tmp/runHFB2o4/third_party/catapult/telemetry/telemetry/testing/run_tests.py", line 267, in _SetUpProcess
    android_devices[child.worker_num-1].guid)
IndexError: list index out of range
Looks like it was trying to contact a phone which went offline or something?
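For reference, the traceback shows the device list being indexed with worker_num - 1, so the lookup can only succeed while at least as many devices are visible as there are workers. Below is a minimal sketch of a guarded lookup. The function name and the plain list of serials are hypothetical and are not telemetry's actual code.

```python
def pick_device_for_worker(devices, worker_num):
    """Return the device for a 1-based worker number, or None.

    Hypothetical helper: `devices` stands in for the list that
    run_tests.py indexes with `worker_num - 1` in the traceback.
    """
    index = worker_num - 1
    if 0 <= index < len(devices):
        return devices[index]
    # Fewer devices are visible than there are workers, for example
    # because one dropped offline before this worker was set up.
    return None
```

With two devices visible, worker 7 gets None here instead of raising IndexError, which would let the caller report "no device for this worker" explicitly.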
,
Jun 2 2016
,
Jun 2 2016
Yup, at least in that log file it looks like device 03848e98003bfc91 fell offline in the middle of the run.
,
Jun 2 2016
It's rooting while deleting a temporary file on another thread. Rooting briefly drops the adb connection. I'm not immediately sure why that'd affect swarmed tests but not unswarmed ones.
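If the connection really does drop only briefly, one generic mitigation is to retry the operation that was interrupted. The sketch below is a plain retry helper under that assumption. It is not devil's timeout_retry implementation, and the name call_with_retries and its parameters are made up for illustration.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=0.0,
                      retry_on=(Exception,)):
    """Call func() up to `attempts` times, re-raising the last failure.

    Hypothetical helper for illustration only.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the caller see the real error.
                raise
            # Give the connection a moment to come back.
            time.sleep(delay_seconds)
```

A retry like this only hides a short disconnect. It does not help if two threads keep interfering with each other for the whole run.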
,
Jun 2 2016
Yup, I noticed that a bunch of work is happening in parallel that ideally would be done only once. We might want to consider reworking how Android initialization is done in telemetry so that more of it happens in telemetry/util/run_tests.py before calling typ.runner.run().
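A rough sketch of the "do it once, then fan out" shape described above. Everything here is hypothetical: discover_devices and run_worker are placeholders, and a thread pool stands in for typ's worker processes purely to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def discover_devices():
    """Placeholder for one-time setup (device discovery and so on)."""
    return ['device-a', 'device-b']


def run_worker(worker_num, serial):
    """Placeholder for per-worker test execution on one device."""
    return 'worker %d -> %s' % (worker_num, serial)


def run_all():
    # One-time initialization happens in the parent, before any
    # worker starts, so workers never repeat it concurrently.
    serials = discover_devices()
    with ThreadPoolExecutor(max_workers=len(serials)) as pool:
        futures = [pool.submit(run_worker, i + 1, serial)
                   for i, serial in enumerate(serials)]
        return [future.result() for future in futures]
```

The point is only the ordering: shared setup runs exactly once, and each worker receives its device as an argument instead of looking it up itself.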
,
Jun 2 2016
I think the parallelization on Android using typ is just a stop-gap measure until we have real Android swarming sharding support. Once we have that, I think each swarming task should run telemetry tests against one Android device.
,
Jun 2 2016
I believe we remove temporary files in parallel regardless of typ.
,
Jun 2 2016
and, looking a bit more, this is more interesting:
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:118 ********************************************************************************
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:120 Exception on thread TimeoutThread-1-for-MainThread (attempt 1 of 3)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:121 ********************************************************************************
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 Traceback (most recent call last):
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/timeout_retry.py", line 167, in Run
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 error_log_func=error_log_func)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 self._JoinAll(watcher, timeout)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 158, in _JoinAll
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 thread.ReraiseIfException()
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 self._ret = self._func(*self._args, **self._kwargs)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/utils/timeout_retry.py", line 160, in <lambda>
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs),
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 return f(*args, **kwargs)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 File "/tmp/runHFB2o4/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 238, in _RunAdbCmd
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 args, output, status, device_serial)
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 AdbCommandFailedError: adb devices: failed with exit status 1 and output:
(CRITICAL) 2016-06-02 07:50:31,971 pid=25742 timeout_retry._LogLastException:124 - cannot bind 'tcp:5037': Address already in use
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - ADB server didn't ACK
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - * failed to start daemon *
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - error: cannot connect to daemon: Connection refused
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - error: cannot connect to daemon: Connection refused
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - List of devices attached
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124 - * daemon not running. starting it now on port 5037 *
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:124
(CRITICAL) 2016-06-02 07:50:31,972 pid=25742 timeout_retry._LogLastException:125 ********************************************************************************
,
Jun 3 2016
... also, telemetry is circumventing devil entirely and doing a raw adb devices call through subprocess :( https://codesearch.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/android_device.py?rcl=0&l=159
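To show what a raw call like that has to cope with, here is a small sketch of parsing `adb devices` output. It assumes the usual tab-separated "serial, state" lines and skips everything else, including daemon status lines like the ones in the log above. The function name is hypothetical, and this is neither telemetry's nor devil's actual parsing code.

```python
def parse_adb_devices(output):
    """Return serials of devices that `adb devices` reports as online.

    Hypothetical helper. Assumes each device line is
    '<serial>\\t<state>'; header and daemon status lines contain no
    tab and are ignored.
    """
    serials = []
    for line in output.splitlines():
        parts = line.split('\t')
        if len(parts) == 2 and parts[1].strip() == 'device':
            serials.append(parts[0].strip())
    return serials
```

Note that parsing alone doesn't help when the command itself exits non-zero, as it does in the log above.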
,
Jun 3 2016
,
Jun 3 2016
John, since you know the most about devil, can you take over this bug?
,
Jun 3 2016
Sure.
,
Jun 3 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/57b5d8e39a21344860d49e54f69aa79081cee613
commit 57b5d8e39a21344860d49e54f69aa79081cee613
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jun 03 15:13:48 2016
Roll src/third_party/catapult/ c502262d9..90899c6d4 (1 commit).
https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/c502262d988b..90899c6d475e
$ git log c502262d9..90899c6d4 --date=short --no-merges --format='%ad %ae %s'
BUG= 616865
TBR=catapult-sheriff@chromium.org
Review-Url: https://codereview.chromium.org/2035083002
Cr-Commit-Position: refs/heads/master@{#397709}
[modify] https://crrev.com/57b5d8e39a21344860d49e54f69aa79081cee613/DEPS
,
Jun 3 2016
There has been one failure in telemetry_perf_unittests since my CL rolled, but that was an individual test failure rather than the adb/device failures we were seeing overnight. I'll continue to monitor through the day and close this out if no further failures like this appear.
,
Jun 4 2016
Saw one similar flake this afternoon in https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/82053. WPR is also doing a raw adb call. I'll look into at least letting it optionally take a path to adb next week.
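A sketch of the "optionally take a path to adb" idea, assuming the caller currently hard-codes the bare `adb` command. The function and parameter names are hypothetical and are not WPR's actual interface.

```python
import subprocess


def build_adb_command(args, adb_path=None):
    """Build an adb argv, preferring an explicit binary path if given.

    Hypothetical helper: falls back to whatever 'adb' is on PATH when
    no path is supplied, which matches the assumed current behavior.
    """
    return [adb_path or 'adb'] + list(args)


def run_adb(args, adb_path=None):
    """Run adb and return its stdout; raises if adb exits non-zero."""
    return subprocess.check_output(build_adb_command(args, adb_path))
```

Passing the same adb path everywhere would at least ensure that all callers talk to one adb binary rather than whichever one each tool finds on PATH.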
,
Jun 22 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/735102b128a6dce721c0e111b045947a8409b2a7
commit 735102b128a6dce721c0e111b045947a8409b2a7
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Wed Jun 22 23:11:34 2016
Roll src/third_party/catapult/ ea4633b67..41f6824c5 (1 commit).
https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/ea4633b677af..41f6824c5521
$ git log ea4633b67..41f6824c5 --date=short --no-merges --format='%ad %ae %s'
BUG= 616865
TBR=catapult-sheriff@chromium.org
Review-Url: https://codereview.chromium.org/2090563005
Cr-Commit-Position: refs/heads/master@{#401448}
[modify] https://crrev.com/735102b128a6dce721c0e111b045947a8409b2a7/DEPS
,
Jun 27 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/df1ed0994b57c32b45d8c17d98a1f537586f812c
commit df1ed0994b57c32b45d8c17d98a1f537586f812c
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Mon Jun 27 13:13:00 2016
Roll src/third_party/catapult/ 310495788..6c4147ba7 (1 commit).
https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/310495788cee..6c4147ba7e52
$ git log 310495788..6c4147ba7 --date=short --no-merges --format='%ad %ae %s'
BUG= 616865
TBR=catapult-sheriff@chromium.org
Review-Url: https://codereview.chromium.org/2100123002
Cr-Commit-Position: refs/heads/master@{#402159}
[modify] https://crrev.com/df1ed0994b57c32b45d8c17d98a1f537586f812c/DEPS
,
Jun 27 2016
Comment 1 by stip@chromium.org
, Jun 2 2016