Bisects failing with purple device_status step |
||||||||||||||||||||||||||||||
Issue descriptionWe're seeing over a dozen bisect failures with a purple device_status step. Can someone from clank infra take a look? https://build.chromium.org/p/tryserver.chromium.perf/builders/android_one_perf_bisect/builds/1776 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/821 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/822 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus6_perf_bisect/builds/2718 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus6_perf_bisect/builds/2719 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_webview_arm64_aosp_perf_bisect/builds/414 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/823 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_webview_arm64_aosp_perf_bisect/builds/415 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/824 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_webview_arm64_aosp_perf_bisect/builds/416 http://build.chromium.org/p/tryserver.chromium.perf/builders/android_webview_arm64_aosp_perf_bisect/builds/418 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/825 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5_perf_bisect/builds/4319 https://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5_perf_bisect/builds/4319/steps/device_status%20%283%29/logs/stdio ADB server didn't ACK * failed to start daemon * error: protocol fault (couldn't read status): Connection reset by peer error: protocol fault (couldn't read status): Connection reset by peer List of devices attached adb server is out of date. killing... C 29.625s Main (TimeoutThread-1-for-MainThread) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['devices', '-l'], retries=2, timeout=30), attempt 1 of 3: AdbCommandFailedError("adb devices -l: failed with exit status 1 and output:\n- cannot bind 'tcp:5037': Address already in use\n- ADB server didn't ACK\n- * failed to start daemon *\n- error: protocol fault (couldn't read status): Connection reset by peer\n- error: protocol fault (couldn't read status): Connection reset by peer\n- List of devices attached\n- adb server is out of date. killing...\n",) I 29.625s TimeoutThread-2-for-MainThread [host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices -l D 58.804s TimeoutThread-2-for-MainThread STDOUT+STDERR: cannot bind 'tcp:5037': Address already in use ADB server didn't ACK * failed to start daemon * error: protocol fault (couldn't read status): Connection reset by peer error: protocol fault (couldn't read status): Connection reset by peer List of devices attached adb server is out of date. killing... C 58.813s Main (TimeoutThread-2-for-MainThread) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['devices', '-l'], retries=2, timeout=30), attempt 2 of 3: AdbCommandFailedError("adb devices -l: failed with exit status 1 and output:\n- cannot bind 'tcp:5037': Address already in use\n- ADB server didn't ACK\n- * failed to start daemon *\n- error: protocol fault (couldn't read status): Connection reset by peer\n- error: protocol fault (couldn't read status): Connection reset by peer\n- List of devices attached\n- adb server is out of date. killing...\n",) I 58.813s TimeoutThread-3-for-MainThread [host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices -l C 88.838s Main Timed out. Dumping threads. C 88.838s Main ******************************************************************************** C 88.838s Main Stack dump for thread 'TimeoutThread-3-for-MainThread' C 88.838s Main ******************************************************************************** C 88.839s Main File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap C 88.839s Main self.__bootstrap_inner() C 88.839s Main File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner C 88.839s Main self.run() C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run C 88.839s Main self._ret = self._func(*self._args, **self._kwargs) C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/timeout_retry.py", line 152, in <lambda> C 88.839s Main child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs), C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl C 88.839s Main return f(*args, **kwargs) C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 249, in _RunAdbCmd C 88.839s Main timeout_retry.CurrentTimeoutThreadGroup().GetRemainingTime()) C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/cmd_helper.py", line 272, in GetCmdStatusAndOutputWithTimeout C 88.839s Main for data in _IterProcessStdout(process, timeout=timeout): C 88.839s Main File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/cmd_helper.py", line 229, in _IterProcessStdout C 88.839s Main read_fds, _, _ = select.select([child_fd], [], [], poll_interval) C 88.839s Main ******************************************************************************** Traceback (most recent call last): File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/tools/device_status.py", line 313, in <module> sys.exit(main()) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/tools/device_status.py", line 282, in main statuses = DeviceStatus(devices, blacklist) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/tools/device_status.py", line 104, in DeviceStatus for a in adb_wrapper.AdbWrapper.Devices(desired_state=None, long_list=True) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 360, in Devices retries=retries) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 380, in _RawDevices output = cls._RunAdbCmd(cmd, timeout=timeout, retries=retries) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/decorators.py", line 57, in timeout_retry_wrapper retry_if_func=retry_if_func) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/timeout_retry.py", line 159, in Run error_log_func=error_log_func) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll self._JoinAll(watcher, timeout) File "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 151, in _JoinAll (len(alive_threads), len(self._threads))) devil.android.device_errors.CommandTimeoutError: Timed out waiting for 1 of 1 threads.
,
Nov 4 2016
,
Nov 4 2016
There have been a lot of issues with adb this week. I'm tempted to call this P0 as it's not really getting the attention it deserves. What's the plan? Stip?
,
Nov 4 2016
#3: As mentioned in #1, there's no evidence in the log that this is similar to the other adb issues we've been seeing lately.
,
Nov 4 2016
,
Nov 7 2016
Right now I'm focusing all my effort on crbug.com/660852.
,
Nov 8 2016
,
Nov 8 2016
,
Nov 8 2016
,
Nov 8 2016
,
Nov 8 2016
,
Nov 8 2016
,
Nov 9 2016
,
Nov 9 2016
Ping on this? It's blocking dozens of bisects.
,
Nov 11 2016
,
Nov 15 2016
,
Nov 15 2016
,
Nov 16 2016
Issue 656584 has been merged into this issue.
,
Nov 16 2016
,
Nov 16 2016
,
Nov 16 2016
Running adb as a background process is really destroying us, especially for bisect. I'm hoping we can move to a more isolated process model soon (i.e., swarming). As for this bug, I am punting on https://bugs.chromium.org/p/chromium/issues/detail?id=660852 while John and I hash out the proper way to fix it and work on this bug instead. I really am worried that we have set ourselves up for tons of these adb errors unless we address it systematically in the future.
,
Nov 16 2016
I'm trying to determine if this is something that happens when we cross a particular CL or if it is persistent.
,
Nov 16 2016
I think you could grab the data on which CLs this is happening on programatically. You can look at comments from 425761728072-pa1bs18esuhp2cp2qfa1u9vb6p1v6kfu@developer.gserviceaccount.com on the blocked bug which have a url matching https://chromeperf.appspot.com/buildbucket_job_status/\d+ Then those urls have data from buildbucket API, including: DATA:{ "parameters": { "properties": { "bisect_config": { "bad_revision" "good_revision"
,
Nov 17 2016
I went through all the linked bugs that had descriptions in the titles that made sense (some of them like https://bugs.chromium.org/p/chromium/issues/detail?id=661495 had crazy high revisions, not sure what's up with that). I got: 427319:427336 427319:427336 427328:427364 427365:427416 427659:427742 427813:428224 428221:428276 428608:428609 428688:428697 428688:428709 429246:429299 429278:429344 429300:429396 429360:429542 429465:429511 429481:429534 429936:429982
,
Nov 17 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8995788364347157264
,
Nov 17 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8995788331168950864
,
Nov 17 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8995788309007888288
,
Nov 17 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8995788286767809968
,
Nov 17 2016
I am beginning to suspect this is a TIME_WAIT issue http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
,
Nov 17 2016
===== BISECT JOB RESULTS ===== Status: failed === Bisection aborted === The bisect was aborted because Bisect cannot identify a culprit: Bisect failed to reproduce the regression with enough confidence. Please contact the the team (see below) if you believe this is in error. ===== TESTED REVISIONS ===== Revision Mean Std Dev N Good? chromium@427100 57488208 7275173 27 good chromium@427201 56664457 13685181 27 bad Bisect job ran on: android_nexus5X_perf_bisect Bug ID: 662398 Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.twitter system_health.memory_mobile Test Metric: memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg/browse_social/browse_social_twitter Relative Change: None Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/877 Job details: https://chromeperf.appspot.com/buildbucket_job_status/8995788331168950864 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5022129039343616 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Nov 17 2016
===== BISECT JOB RESULTS ===== Status: failed === Bisection aborted === The bisect was aborted because Bisect cannot identify a culprit: Bisect failed to reproduce the regression with enough confidence. Please contact the the team (see below) if you believe this is in error. ===== TESTED REVISIONS ===== Revision Mean Std Dev N Good? chromium@427200 56664457 13685181 27 good chromium@427301 56664457 13685181 27 bad Bisect job ran on: android_nexus5X_perf_bisect Bug ID: 662398 Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.twitter system_health.memory_mobile Test Metric: memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg/browse_social/browse_social_twitter Relative Change: None Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/878 Job details: https://chromeperf.appspot.com/buildbucket_job_status/8995788309007888288 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5245199171715072 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Nov 17 2016
===== BISECT JOB RESULTS ===== Status: failed === Bisection aborted === The bisect was aborted because Bisect cannot identify a culprit: Bisect failed to reproduce the regression with enough confidence. Please contact the the team (see below) if you believe this is in error. ===== TESTED REVISIONS ===== Revision Mean Std Dev N Good? chromium@427300 55840706 16881874 27 good chromium@427318 57213625 10088849 27 bad Bisect job ran on: android_nexus5X_perf_bisect Bug ID: 662398 Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.twitter system_health.memory_mobile Test Metric: memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg/browse_social/browse_social_twitter Relative Change: None Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/879 Job details: https://chromeperf.appspot.com/buildbucket_job_status/8995788286767809968 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5570333430513664 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Nov 17 2016
===== BISECT JOB RESULTS ===== Status: failed === Bisection aborted === The bisect was aborted because Bisect cannot identify a culprit: Bisect failed to reproduce the regression with enough confidence. Please contact the the team (see below) if you believe this is in error. ===== TESTED REVISIONS ===== Revision Mean Std Dev N Good? chromium@427000 57488208 7275173 27 good chromium@427001 57213625 10088849 27 bad Bisect job ran on: android_nexus5X_perf_bisect Bug ID: 662398 Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.twitter system_health.memory_mobile Test Metric: memory:chrome:all_processes:reported_by_chrome:cc:effective_size_avg/browse_social/browse_social_twitter Relative Change: None Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5X_perf_bisect/builds/876 Job details: https://chromeperf.appspot.com/buildbucket_job_status/8995788364347157264 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5578935176265728 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Nov 17 2016
Re #24: "some of them like https://bugs.chromium.org/p/chromium/issues/detail?id=661495 had crazy high revisions, not sure what's up with that" Those are unix timestamps used as point id's for internal bots (because chromium commit positions are not enough to uniquely identify builds). I do agree though, that it is confusing and not very useful. For some time I've thought to file a bug about this, and so I just did now https://github.com/catapult-project/catapult/issues/3014
,
Nov 17 2016
Just in case it's not clear, all the bisects in #30-33 ran the tests successfully and got values, the only part that "failed" is that they didn't reproduce a regression. device_status was fine on them.
,
Nov 17 2016
,
Nov 17 2016
,
Nov 18 2016
,
Dec 7 2016
,
Dec 15 2016
,
Dec 19 2016
,
Dec 19 2016
,
Jan 18 2017
AFAIU this is still unresolved. I am marking most of my bugs as available. My hope is that these kinds of persistent adb issues can be solved when (one day) we can run bisect tests on swarming.
,
Jan 18 2017
John, can you find an owner? P1 + Available is bad practice.
,
Jan 18 2017
Just a note on priority: this affects our ability to bisect backwards in time. I think we've given up on the bugs from November that this affects. So triage here might be more about figuring out: 1) When it's likely to happen again 2) Whether pinpoint would be a viable solution
,
Jan 19 2017
,
Jan 19 2017
I can take a look at this.
,
Feb 3 2017
,
Feb 10 2017
,
Apr 10 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8982659720558435136
,
Apr 11 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8982640091095230272
,
Apr 11 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8982611928266554512
,
Apr 11 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8982611883209017760
,
Dec 4 2017
Moving Infra>Client>Android -> Infra>Client>Chrome+OS=Android
,
Dec 4 2017
|
||||||||||||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||||||||||||
Comment 1 by jbudorick@chromium.org
, Nov 4 2016I don't think I've seen a failure like this before: @@@STEP_LOG_LINE@Failure Output@INFO:devil.utils.cmd_helper:[host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:Timed out. Dumping threads.@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:********************************************************************************@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:Stack dump for thread 'TimeoutThread-3-for-MainThread'@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:********************************************************************************@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: self.__bootstrap_inner()@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: self.run()@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: self._ret = self._func(*self._args, **self._kwargs)@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/timeout_retry.py", line 152, in <lambda>@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs),@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: return f(*args, **kwargs)@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 249, in _RunAdbCmd@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: timeout_retry.CurrentTimeoutThreadGroup().GetRemainingTime())@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/cmd_helper.py", line 272, in GetCmdStatusAndOutputWithTimeout@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: for data in _IterProcessStdout(process, timeout=timeout):@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:File: "/b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/devil/utils/cmd_helper.py", line 229, in _IterProcessStdout@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root: read_fds, _, _ = select.select([child_fd], [], [], poll_interval)@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:********************************************************************************@@@ @@@STEP_LOG_LINE@Failure Output@INFO:devil.utils.cmd_helper:[host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb kill-server@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:(TimeoutThread-1-for-MainThread) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['kill-server'], retries=2, timeout=30), attempt 1 of 3: AdbCommandFailedError('adb kill-server: failed with exit status 1 and output:\n- * server not running *\n',)@@@ @@@STEP_LOG_LINE@Failure Output@INFO:devil.utils.cmd_helper:[host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb kill-server@@@ @@@STEP_LOG_LINE@Failure Output@CRITICAL:root:(TimeoutThread-2-for-MainThread) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['kill-server'], retries=2, timeout=30), attempt 2 of 3: AdbCommandFailedError('adb kill-server: failed with exit status 1 and output:\n- * server not running *\n',)@@@ @@@STEP_LOG_LINE@Failure Output@INFO:devil.utils.cmd_helper:[host]> /b/c/b/android_nexus5_perf_bisect/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb kill-server@@@ @@@STEP_LOG_LINE@Failure Output@INFO:root:Try printing formatted exception: <class 'devil.android.device_errors.AdbCommandFailedError'> adb kill-server: failed with exit status 1 and output:@@@ @@@STEP_LOG_LINE@Failure Output@- * server not running *@@@ It is: - running `adb devices`, which times out - trying to kill the adb server, but it can't find one (There are also various protocol faults in the log.) I'm wondering if something else is binding port 5037.