Issue metadata
Sign in to add a comment
|
Device goes missing while running Telemetry benchmark |
||||||||||||||||||||||
Issue descriptionRevision range first seen: ?????? Link to failing step log: Example run: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Nexus6%20Perf%20%281%29/builds/4881 Example run 2: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Nexus6%20Perf%20%281%29/builds/4870 Approximately every 1/5 runs, we see a pattern where system_health.memory_mobile fails and then all remaining benchmarks with device affinity of 0 fail with an error along the lines of: """ E 0.003s Main File not found /b/rr/tmpqEnFVi/w/src/out/step_results/v8.top_25_smooth E 0.003s Main Error occurred. Traceback (most recent call last): File "/b/rr/tmpqEnFVi/w/src/build/android/test_runner.py", line 834, in main return RunTestsCommand(args) File "/b/rr/tmpqEnFVi/w/src/build/android/test_runner.py", line 629, in RunTestsCommand return RunTestsInPlatformMode(args) File "/b/rr/tmpqEnFVi/w/src/build/android/test_runner.py", line 729, in RunTestsInPlatformMode raw_results = test_run.RunTests() File "/b/rr/tmpqEnFVi/w/src/build/android/pylib/local/device/local_device_perf_test_run.py", line 519, in RunTests result_type = self._test_instance.PrintTestOutput() File "/b/rr/tmpqEnFVi/w/src/build/android/pylib/perf/perf_test_instance.py", line 128, in PrintTestOutput raise PersistentDataError('No data for test %s found.' % self._print_step) PersistentDataError: No data for test v8.top_25_smooth found. <Thread(Thread-1, started 139987599120128)> ProcessRead: proc.stdout finished. <Thread(Thread-1, started 139987599120128)> ProcessRead: cleaning up. <Thread(Thread-2, started daemon 139987590727424)> TimedFlush: Finished <Thread(Thread-1, started 139987599120128)> ProcessRead: finished. exit code (as seen by runtest.py): 87 170211 09:14:18 telemetry_utils.py:33 ERROR Error reading telemetry results from /tmp/tmpxEDY_m Error reading telemetry results from /tmp/tmpxEDY_m 170211 09:14:18 telemetry_utils.py:34 ERROR This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. 170211 09:14:18 telemetry_utils.py:33 ERROR Error reading telemetry results from /tmp/tmpxEDY_m 170211 09:14:18 telemetry_utils.py:34 ERROR This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. Error reading telemetry results from /tmp/tmpxEDY_m This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. 170211 09:14:18 telemetry_utils.py:33 ERROR Error reading telemetry results from /tmp/tmpxEDY_m 170211 09:14:18 telemetry_utils.py:34 ERROR This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. Error reading telemetry results from /tmp/tmpxEDY_m This usually means that telemetry did not run, so it could not generate the file. Please check the device running the test. """ On some runs, only a single story fails on system_health.memory_mobile, and that failure appears to be a regular old Chrome crash. On these runs, it seems that the subsequent device 0 benchmarks run just fine. On other runs, however, system_health.memory_mobile seems to fail with a less-innocuous looking error, like: """ WARNING:root:Telemetry leaks these processes: adb (28833) - ['/b/rr/tmpjEe8rk/w/src/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb', '-s', 'ZX1G22KCC9', 'shell', '( dumpsys battery );echo %$?'] Error in sys.exitfunc: INFO:root:Try printing formatted exception: <class 'devil.android.device_errors.CommandTimeoutError'> Timed out waiting for 1 of 1 threads. <traceback object at 0x7f9f69ac6d88> Traceback (most recent call last): _run_exitfuncs at /usr/lib/python2.7/atexit.py:24 func(*targs, **kargs) _wrapped_function at /b/rr/tmpjEe8rk/w/src/third_party/catapult/telemetry/telemetry/internal/util/atexit_with_log.py:11 function(*args, **kwargs) _ReenableChargingIfNeeded at /b/rr/tmpjEe8rk/w/src/third_party/catapult/telemetry/telemetry/internal/platform/power_monitor/android_power_monitor_controller.py:11 if not battery.GetCharging(): timeout_retry_wrapper at /b/rr/tmpjEe8rk/w/src/third_party/catapult/devil/devil/android/decorators.py:57 retry_if_func=retry_if_func) Run at /b/rr/tmpjEe8rk/w/src/third_party/catapult/devil/devil/utils/timeout_retry.py:159 error_log_func=error_log_func) JoinAll at /b/rr/tmpjEe8rk/w/src/third_party/catapult/devil/devil/utils/reraiser_thread.py:186 self._JoinAll(watcher, timeout) _JoinAll at /b/rr/tmpjEe8rk/w/src/third_party/catapult/devil/devil/utils/reraiser_thread.py:151 (len(alive_threads), len(self._threads))) CommandTimeoutError: Timed out waiting for 1 of 1 threads. """ It seems like all thread dumps look similar to this, in that the thread's waiting for some command to be run on the Android device but the Android device isn't acking that the command was run. These runs seems to cause all remaining benchmarks on that device to also fail. rnephew@, is there anything that you can see in the test logs that might indicate what's going so badly wrong that it's impacting later tests? Also CCing perezju@ to let him know that the failing memory benchmarks are probably impacting other benchmarks.
,
Feb 13 2017
,
Feb 13 2017
,
Feb 13 2017
Cc-ing mike as well. I don't know much about android test runner failures. I know there was a known weird case which mike or randy had investigated, is this it?
,
Feb 13 2017
,
Feb 13 2017
The device is getting blacklisted: I 9749.306s device_shard_helper(0) Adding ZX1G22KCC9 to blacklist /b/rr/tmpjEe8rk/w/src/out/bad_devices.json for reason: reboot_timeout https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus6_Perf__1_%2F4881%2F%2B%2Frecipes%2Fsteps%2FSharded_Perf_Tests%2F0%2Fstdout
,
Feb 23 2017
Assign this to benchmark owner. perezju@, Can you please take a look at this. system_health.memory_mobile benchamark is failing on multiple android bots. https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Nexus5%20Perf%20(1) https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Nexus6%20Perf%20(1) https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/4762 (system_health.memory_mobile.reference)
,
Feb 27 2017
,
Feb 28 2017
I've spent a couple of days looking at logs now, documenting here what I've found so far on the Nexus6 bot. Quick highlights: - Issue is not specific to system health, happens on other Telemetry benchmarks too. - Probably happens more often on system health, since it's one of the benchmarks that takes longer to run. - No obvious failure reason in logs (on either buildbot or logcat). - In need for ideas on what to try next. Details: The following pattern is indeed pretty common (showing only exit status, step name for steps with device affinity 0): https://luci-milo.appspot.com/buildbot/chromium.perf/Android%20Nexus6%20Perf%20%281%29/49500 smoothness.key_silk_cases 0 smoothness.key_silk_cases.reference 0 speedometer-ignition 2 system_health.memory_mobile <-- FAILS WITH SOME DEVICE ERROR 4 system_health.memory_mobile.reference <-- THESE DO NOT RUN ANYMORE 4 v8.top_25_smooth 4 v8.top_25_smooth.reference The "deadly" errors I've seen have the form either: ERROR:root:Problem when trying to gather stack trace: (device: ZX1G22KCC9) adb shell '( uiautomator dump /data/local/tmp/temp_file-f21826dd3f501 );echo %$?': failed with exit status 255 and output: - error: device offline or: CRITICAL:root:(TimeoutThread-2-for-MainThread) Exception on TakeScreenshot(ZX1G22KCC9, host_path=/tmp/tmpP991aF.png, timeout=30, retries=3), attempt 2 of 4: AdbCommandFailedError("(device: ZX1G22KCC9) adb shell '( /system/bin/screencap -p /data/local/tmp/temp_file-774aea0118794.png );echo %$?': failed with exit status 255 and output:\n- cannot bind 'tcp:5037': Address already in use\n- ADB server didn't ACK\n- * failed to start daemon *\n- error: protocol fault (couldn't read status): Connection reset by peer\n- adb server is out of date. killing...\n",) Devices may sometimes recover from the deadly error, and it is not specific to system health, it may happen on other benchmarks too, e.g.: == 4912 == 2 smoothness.key_silk_cases <-- Falied with "device not found" 0 smoothness.key_silk_cases.reference <-- Device is fine again! 0 speedometer-ignition 0 system_health.memory_mobile 2 system_health.memory_mobile.reference <-- Device dies with adb out of date 4 v8.top_25_smooth <-- Device gone 4 v8.top_25_smooth.reference I've looked at the logcat from devices (attached one here), but nothing suspicious jumps at me either. Furthermore, it seems that the logcat is interrupted a lot earlier than the death of the device. Helpfully, Telemetry injects some lines on the logcat before/after running each story: On a buildbot log [1] I can find 176 system health stories running (the last one failed); but the logcat [attached] stops a few stories earlier and only contains markers for the first 165. [1]: https://luci-milo.appspot.com/buildbot/chromium.perf/Android%20Nexus6%20Perf%20%281%29/4943 I'll try to see next if this is specific to the Nexus6 bot, or if the same symptoms can be found elsewhere.
,
Feb 28 2017
nednguyen@, is it possible this is related to the potential short-lived websocket issues that we talked about the other day?
,
Mar 1 2017
jbudorick, bpastene: would you have any ideas on how to further debug this? Seems like an infra/device issue.
,
Mar 1 2017
I've got a pet theory on this involving the forwarder leaking ports in some circumstances involving --unmap-all that I'd like to investigate a bit more. If that proves to be unrelated, then no, nothing immediately comes to mind beyond adding more logging.
,
Mar 1 2017
Another data point: mikecase@ and I visited the lab the other week to take a look at all the bots that were blacklisted and not running tests due to "reboot_timeout". Most of the phones responded fine when manually tapping the screen, but failed to show up on adb from the host. Using a debug cable and running 'stop adbd ; start adbd' on the phones themselves fixed it. Though, for models that don't support the debug cable, a full power cycle was needed. Perhaps adbd on the device is getting in some stuck state. Not sure if that's related to the forwarder at all, but it would certainly take some dedicated investigation.
,
Mar 2 2017
Perfbot health sheriff checking in: this is responsible for almost all of the flaky failures that I'm seeing. Fixing this would help significantly to green the waterfall.
,
Mar 2 2017
Juan, are you still the right owner for this? Or do we need help from mobile ops?
,
Mar 2 2017
Upon thinking more about this, I'm actually not sure this is the exact same failure that's causing a lot of redness on the waterfall. So far, I've only seen the subsequent failures on the same device manifest on Nexus 7v2 (1) (https://luci-milo.appspot.com/buildbot/chromium.perf/Android%20Nexus7v2%20Perf%20%281%29/). For all other bots, only that benchmark fails (although the Android device also appears unresponsive).
,
Mar 3 2017
I think I'm going to remove myself from owner, and more help from mobile ops would be great. I'm spending most of my efforts now trying to understand and classify the different kinds of issues (like this one) causing failures on system health benchmarks. Re #12: What other things should we log? (The annoying thing is that, by the time we know we're in troubles, the device is gone and unresponsive.) Re #13: What do others think about doing something like 'stop adbd ; start adbd' between stories?
,
Mar 3 2017
John - help us find an owner to do the investigation and fix this. Based on Ben's investigation - it seems like there's a simpler mitigation to just hard-reboot or restart adbd on devices (we can't do that without a debug cable?) Do we assume this is a problem with adbd/adb, USB or something else? I guess we'd need to investigate.
,
Mar 7 2017
Given that my primary theory involves the forwarder, I'm probably the right person to investigate. Re mitigation -- yeah, I'd like to investigate before saying for certain, though I wonder if adbd death is something we could have the watchdog (https://chromium.googlesource.com/infra/infra/+/master/go/src/infra/tools/device_watchdog) look for.
,
Mar 8 2017
Port leak fix is https://codereview.chromium.org/2736053003/ #17: - re logging: the host_forwarder_log and the currently owned host ports are the first things I'd add. - I don't like the idea of restarting adbd in between stories.
,
Mar 16 2017
Any update on this?
,
Mar 16 2017
The patch in #20 is still on review. One of the reviewers has asked for tests, which is reasonable. The forwarder doesn't have *any* tests at the moment, though, so writing tests so will likely involve writing a mock adb binary & some associated harness work.
,
Mar 16 2017
Got it. Definitely looking forward to seeing this fixed, that's all :-)
,
Mar 16 2017
Do you have any estimate of when you might be able to make those changes by? Is the change significant enough that you feel like it needs tests in order to be done safely, and then promise tests at a later date?
,
Mar 28 2017
update: tests written, CL back on review.
,
Mar 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/86c756ccdc4ea1293fe6bea1c09b5a627baa0989 commit 86c756ccdc4ea1293fe6bea1c09b5a627baa0989 Author: jbudorick <jbudorick@chromium.org> Date: Wed Mar 29 17:33:54 2017 [Android] Fix port leak in the forwarder. The forwarder has been using a CommandLine object to run its `adb -s <serial> forward --remove <port>` command since October. CommandLine::AppendSwitchesAndArguments mangles that adb command into `adb -s --remove <serial> forward <port>` in its attempt to keep switches before arguments. That mangled command is not valid, and adb never unmaps the port as a result. The host forwarder recognized this as a failure, though not a fatal one, and removed the port from its serial->port map regardless of whether the unmap command succeeded. This patch adds a variant of base::GetAppOutputAndError that takes a vector<string> s.t. arguments are not reordered. (A similar variant already exists for base::GetAppOutput.) It also switches all uses of base::GetAppOutputWithExitCode to this new variant. BUG= 691654 Review-Url: https://codereview.chromium.org/2736053003 Cr-Commit-Position: refs/heads/master@{#460440} [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/base/process/launch.h [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/base/process/launch_posix.cc [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/BUILD.gn [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/BUILD.gn [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/command.cc [add] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/host_controllers_manager.cc [add] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/host_controllers_manager.h [add] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/host_controllers_manager_unittest.cc [modify] https://crrev.com/86c756ccdc4ea1293fe6bea1c09b5a627baa0989/tools/android/forwarder2/host_forwarder_main.cc
,
Mar 29 2017
,
Mar 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2de5b0418f39c8ae1068e0fedabe6dd2aee312ca commit 2de5b0418f39c8ae1068e0fedabe6dd2aee312ca Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Thu Mar 30 05:25:57 2017 Roll src/third_party/catapult/ d3a91071e..7067caaf8 (9 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/d3a91071e560..7067caaf846f $ git log d3a91071e..7067caaf8 --date=short --no-merges --format='%ad %ae %s' 2017-03-29 jbudorick [devil] Update the forwarder binaries. 2017-03-29 charliea Stop exporting INITIATOR_TYPE from find_input_expectations.html 2017-03-29 dtu [pinpoint] Job state table for testing. 2017-03-29 eakuefner [eslint] Allow template literals 2017-03-29 hjd [tracing] Move a few exportTo users to short-hand object literals 2017-03-29 simonhatch Dashboard - Add cron.yaml to list of files for deploy. 2017-03-29 lpy [ESLint] Turn on curly rule. 2017-03-29 simonhatch Dashboard - Fix the deprecate_tests cron job. 2017-03-29 hjd [tracing] Fix a few uses of tr.b.Range that snuck in Created with: roll-dep src/third_party/catapult BUG= 691654 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2780383002 Cr-Commit-Position: refs/heads/master@{#460671} [modify] https://crrev.com/2de5b0418f39c8ae1068e0fedabe6dd2aee312ca/DEPS
,
Apr 3 2017
perf folks: thoughts? I don't look at the perf bots as much as I used to, but to me, they do appear to be greener and don't seem to be exhibiting the same kind of failure.
,
Apr 3 2017
This is going to be a bit anecdotal but I here is why I think this helped: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%282%29?numbuilds=200 This bot was particularly bad with things we thought were connected to this problem. The binary update landed in this run: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%282%29/builds/4190 If you look at the last 200 runs, smoothness.gpu_rasterization.top_25_smooth had been failing pretty regularly. After you CL lands it hasn't failed on that bot.
,
Apr 3 2017
,
Apr 4 2017
runs failed infra-f missing 2017-03-31 09:55 1,000 16.3% 0.6% 4.0% 2017-04-03 09:17 1,000 9.0% 0.1% 1.7% I'm sure a good chuck of those improvements are attributable to your CL. Thanks John!
,
Apr 4 2017
Out of curiosity, how far back in time does 1000 runs go?
,
Apr 4 2017
That's the last 20 builds in every system health config/benchmark we run. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by charliea@chromium.org
, Feb 13 2017