Device offline or forwarder failure on chromium.perf
Issue description

Link to buildbot status page: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5%20Perf%20%281%29

All the tests on device 0d88c7fd25995e62 attached to build13-b1 are failing. Randy, can you help triage since:
* The tests show as red, not purple
* Error logs show forwarder failures, but also a lot that looks like device failures
* This is only happening on one device.

Sample log: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5%20Perf%20%281%29/builds/4326/steps/v8.browsing_mobile_ignition/logs/stdio

Traceback (most recent call last):
  RunBenchmark at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py:336
    benchmark.ShouldTearDownStateAfterEachStorySetRun())
  Run at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py:222
    test, finder_options.Copy(), story_set)
  traced_function at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52
    return func(*args, **kwargs)
  __init__ at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/page/shared_page_state.py:101
    use_live_traffic=use_live_traffic)
  traced_function at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52
    return func(*args, **kwargs)
  InitializeIfNeeded at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/core/network_controller.py:21
    self._network_controller_backend.InitializeIfNeeded(use_live_traffic)
  InitializeIfNeeded at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py:64
    self._platform_backend.GetPortPairForForwarding(local_port))
  Create at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/internal/forwarders/android_forwarder.py:25
    return AndroidForwarder(self._device, port_pair)
  __init__ at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/telemetry/telemetry/internal/forwarders/android_forwarder.py:60
    [(port_pair.remote_port, port_pair.local_port)], self._device)
  Map at /b/rr/tmpqo6Lrl/w/src/third_party/catapult/devil/devil/android/forwarder.py:150
    exit_code, '\n'.join(output)))
HostForwarderError: /b/rr/tmpqo6Lrl/w/src/third_party/catapult/devil/bin/deps/linux2/x86_64/forwarder_host exited with 1:
,
Sep 27 2016
* The tests show as red, not purple

The test is failing during the run_benchmark invocation. As far as the infra code is concerned, this is not an infra issue. If run_benchmark returned an infra error code, the step could be turned purple, but that would be the only way. Right now it is returning 1:

"exit code (as seen by runtest.py): 1"
"step returned non-zero exit code: 1"

* Error logs show forwarder failures, but also a lot that looks like device failures

An error on the device could be causing the forwarder to freak out.

* This is only happening on one device.

That leads me to believe it might be a device issue, i.e., the device is somehow getting into a bad state (since it is happening on single devices and not entire setups). John was doing something with the forwarder recently, but I am not sure what exactly. Adding him.
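The red-versus-purple distinction described above can be sketched roughly as follows. This is only an illustration of the idea that the step color depends on the exit code; the function name and the infra exit code value are hypothetical, since the thread does not say which value (if any) the real harness reserves.

```python
# Hypothetical value: the thread does not name the actual infra exit code.
HYPOTHETICAL_INFRA_EXIT_CODE = 87


def classify_step(exit_code):
    """Map a step's exit code to the color shown on the build page."""
    if exit_code == 0:
        return "green"
    if exit_code == HYPOTHETICAL_INFRA_EXIT_CODE:
        # Only a dedicated infra error code would mark the step as an
        # infrastructure failure.
        return "purple"
    # Any other non-zero code, including the 1 seen here, is a test failure.
    return "red"


print(classify_step(1))
```

Under this model, a run_benchmark that exits with 1 is always reported as a test failure, whatever the underlying cause on the device.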
,
Sep 27 2016
The N5 (at least) looks almost exactly like issue 634052. Investigating that was the reason for the changes I was making. I haven't solved it yet, though. Marking as blocked on that, as I'm not sure it's a complete dup.
,
Sep 27 2016
... and self-assigning.
,
Sep 28 2016
,
Sep 28 2016
,
Sep 28 2016
One other case that's a bit strange: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf%20%283%29/builds/138 Some of the tests on ZX1G22KZXV are passing, and some are failing with the HostForwarderError.
,
Sep 29 2016
johnw re-flashed and re-connected all the devices except the N5 in bug 651180. Two of them are better, but we're still seeing the problem with 070b074f: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus7v2%20Perf%20%281%29/builds/3830

Should I ask for a device replacement? Any ideas?
,
Sep 29 2016
I think this has something to do with page_cycler_v2.typical_25, which in some cases appears to time out and then leave the device unusable. This is hard to see, though, because we currently print tests on the build page in alphabetical order within each shard rather than in execution order within each shard. (Still investigating why page_cycler_v2.typical_25 times out and why we don't clean up properly after it does.)
,
Sep 29 2016
nvm, I know what's going on here.

We attempt to recover the device after a test failure (in this case, a timeout): https://codesearch.chromium.org/chromium/src/build/android/pylib/local/device/local_device_perf_test_run.py?rcl=0&l=239

This includes rebooting the device, among other things, so when the device comes back, the device forwarder daemon is no longer running. However, when we go to attempt to use the forwarder, we think we've initialized the device already: https://codesearch.chromium.org/chromium/src/third_party/catapult/devil/devil/android/forwarder.py?rcl=0&l=323

so we never restart the device forwarder daemon & the forwarding fails.

Should be able to get a fix in today or tomorrow.
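A minimal sketch of the failure mode hypothesized in this comment: a host-side cache of "already initialized" devices goes stale when the device reboots. It models the hypothesis as stated, not a confirmed root cause. FakeDevice, FakeForwarder, and all of their methods are hypothetical stand-ins, not devil's actual classes.

```python
class FakeDevice:
    """Hypothetical stand-in for a device; not devil's DeviceUtils."""

    def __init__(self, serial):
        self.serial = serial
        self.daemon_running = False

    def reboot(self):
        # A reboot kills every process on the device, including the daemon.
        self.daemon_running = False


class FakeForwarder:
    """Hypothetical host-side forwarder that caches initialized devices."""

    def __init__(self):
        self._initialized_devices = set()

    def map(self, device):
        # The daemon is only started the first time a serial is seen.
        if device.serial not in self._initialized_devices:
            device.daemon_running = True
            self._initialized_devices.add(device.serial)
        if not device.daemon_running:
            raise RuntimeError("forwarding failed: device daemon is not running")
        return "mapped"

    def forget(self, device):
        # Dropping the cached entry forces re-initialization on the next map().
        self._initialized_devices.discard(device.serial)


device = FakeDevice("0d88c7fd25995e62")
forwarder = FakeForwarder()
forwarder.map(device)      # first use starts the daemon
device.reboot()            # recovery after a timeout
try:
    forwarder.map(device)  # stale cache: daemon is never restarted
except RuntimeError as error:
    print(error)
forwarder.forget(device)   # invalidate the cache after the reboot
forwarder.map(device)      # daemon is started again
```

In this toy model, the fix is to invalidate the cached entry whenever recovery reboots the device, so the next mapping attempt restarts the daemon.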
,
Sep 29 2016
(...though this will not include a fix for page_cycler_v2.typical_25)
,
Sep 29 2016
I think I was wrong in #10. Still looking.
,
Sep 30 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3d77e97fb46b4f0a9f9255c30b617abf6721ad33

commit 3d77e97fb46b4f0a9f9255c30b617abf6721ad33
Author: jbudorick <jbudorick@chromium.org>
Date: Fri Sep 30 14:59:17 2016

[Android] Add --unmap-all to forwarder2.

In some scenarios (e.g., single-device restart), we want to unmap all ports forwarded from a given device up to the host and clear the existing cached adb port for that device. We want to be able to do this even if the calling process doesn't know all of those ports.

This change adds the --unmap-all command to forwarder2 to support such use cases.

BUG=634052, 650674
Review-Url: https://codereview.chromium.org/2381063004
Cr-Commit-Position: refs/heads/master@{#422113}

[modify] https://crrev.com/3d77e97fb46b4f0a9f9255c30b617abf6721ad33/tools/android/forwarder2/host_forwarder_main.cc
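The behavior described in this commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not forwarder2's implementation (which is C++): PortMapRegistry and its methods are hypothetical names, and the data layout is invented for illustration.

```python
from collections import defaultdict


class PortMapRegistry:
    """Hypothetical host-side record of forwarded ports, keyed by serial."""

    def __init__(self):
        self._maps = defaultdict(dict)  # serial -> {device_port: host_port}
        self._adb_port_cache = {}       # serial -> cached adb port

    def map(self, serial, device_port, host_port):
        self._maps[serial][device_port] = host_port

    def cache_adb_port(self, serial, adb_port):
        self._adb_port_cache[serial] = adb_port

    def unmap(self, serial, device_port):
        # The caller has to know the exact port; unknown ports raise KeyError.
        del self._maps[serial][device_port]

    def unmap_all(self, serial):
        # The caller only needs the serial: every mapping for the device is
        # dropped and its cached adb port is cleared.
        removed = self._maps.pop(serial, {})
        self._adb_port_cache.pop(serial, None)
        return sorted(removed)

    def ports_for(self, serial):
        return dict(self._maps.get(serial, {}))


registry = PortMapRegistry()
registry.map("070b074f", 8000, 41000)
registry.map("070b074f", 8001, 41001)
registry.cache_adb_port("070b074f", 5037)
print(registry.unmap_all("070b074f"))
```

The point of the per-device operation is that a process recovering a single restarted device can clean up without tracking which ports other processes forwarded.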
,
Oct 1 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/dccd754c3b5cc5be5c809ffd6a9b742053f25c76

commit dccd754c3b5cc5be5c809ffd6a9b742053f25c76
Author: jbudorick <jbudorick@chromium.org>
Date: Sat Oct 01 01:51:20 2016

[Android] Run shell commands from the forwarder without passing fds.

The forwarder daemon was running commands with system(). This would give the newly forked process copies of the same file handles held by the daemon, notably including the unix domain socket. If the adb server wasn't already running and the daemon called an adb command, the adb server would be forked from the adb client process with those same file handles -- including the unix domain socket. This would interfere both with shutting down the host forwarder daemon (as we'd see the unix domain socket still held by the adb server) and with subsequent attempts to bring it up (same reason).

BUG=634052, 650674
Review-Url: https://codereview.chromium.org/2374183008
Cr-Commit-Position: refs/heads/master@{#422263}

[modify] https://crrev.com/dccd754c3b5cc5be5c809ffd6a9b742053f25c76/tools/android/forwarder2/host_forwarder_main.cc
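The descriptor inheritance problem in this commit message can be reproduced in miniature. The sketch below is a Python analogue for POSIX systems, not the C++ change itself: child_sees_fd and CHILD_CHECK are hypothetical names, and a socketpair stands in for the daemon's unix domain socket.

```python
import os
import socket
import subprocess
import sys

# Script run in the child: report whether the given fd number is an open socket.
CHILD_CHECK = """
import os, stat, sys
fd = int(sys.argv[1])
try:
    is_socket = stat.S_ISSOCK(os.fstat(fd).st_mode)
except OSError:
    is_socket = False
print("inherited" if is_socket else "closed")
"""


def child_sees_fd(fd, pass_fds_through):
    """Spawn a child and report whether it received a copy of fd."""
    # Marking the fd inheritable and leaving close_fds off mimics a plain
    # fork/exec, where the child gets copies of the parent's open descriptors.
    os.set_inheritable(fd, pass_fds_through)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CHECK, str(fd)],
        close_fds=not pass_fds_through,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip() == "inherited"


if __name__ == "__main__":
    daemon_socket, peer = socket.socketpair()  # stand-in for the daemon socket
    print("leaky spawn:", child_sees_fd(daemon_socket.fileno(), True))
    print("clean spawn:", child_sees_fd(daemon_socket.fileno(), False))
    daemon_socket.close()
    peer.close()
```

A long-lived child that inherits the socket keeps it open after the parent exits, which matches the shutdown and restart interference the commit message describes.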
,
Oct 1 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/84526ade9b6d246a8834309d0519d2255c0db91d

commit 84526ade9b6d246a8834309d0519d2255c0db91d
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sat Oct 01 08:06:33 2016

Roll src/third_party/catapult/ f00b66029..507bed462 (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f00b66029517..507bed4626dd

$ git log f00b66029..507bed462 --date=short --no-merges --format='%ad %ae %s'
2016-09-30 jbudorick [telemetry] Update {device,host}_forwarder binaries.
2016-09-30 jbudorick [devil] Use --unmap-all in Forwarder.UnmapAllDevicePorts.

BUG=634052, 650674, 634052, 650674
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org
Review-Url: https://codereview.chromium.org/2378773016
Cr-Commit-Position: refs/heads/master@{#422308}

[modify] https://crrev.com/84526ade9b6d246a8834309d0519d2255c0db91d/DEPS
,
Oct 1 2016
From looking at the logs this morning, I think this is fixed now, but a bunch of the bots are failing to upload to the dashboard.
,
Oct 3 2016
The dashboard upload failures are fixed, so you should be able to verify now.
,
Nov 16 2016
Perf bothealth Sheriff Ping: Pri-1 bugs should be pinged daily and checked to make sure someone is following up. John@, can we close this bug as fixed?
,
Nov 16 2016
I think so. We'll open new forwarder bugs as necessary.
Comment 1 by sullivan@chromium.org
, Sep 27 2016