system_health.common_desktop often timing out on Mac Retina |
||||||
Issue descriptionEvery now and then this benchmark times out on: https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/ From latest 20 builds ('T' is timeout, '-' is success): mac-retina common T--T-T--T-TT-TT---T- Example recent build: https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/616 The step says: "shard #0 timed out, took too much time to complete" Interestingly the test only ran for 1 hour 39 minutes, and the log shows that the benchmark run was actually interrupted with an exception: (ERROR) 2017-05-03 18:41:59,280 battor_wrapper._FlashBattOr:158 Git hash returned from BattOr was not as expected: [0503/184159.279148:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 145, in _FlashBattOr device_git_hash = self.GetFirmwareGitHash() File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 373, in GetFirmwareGitHash int(self._git_hash, 16) ValueError: invalid literal for int() with base 16: '[0503/184159.279148:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT' (WARNING) 2017-05-03 18:41:59,281 shared_page_state.DumpStateUponFailure:131 Cannot dump browser state: No browser. Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 80, in _RunStoryAndProcessErrorIfNeeded test.WillRunStory(state.platform) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py", line 285, in WillRunStory platform.tracing_controller.StartTracing(self._tbm_options.config) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py", line 43, in StartTracing self._tracing_controller_backend.StartTracing(tracing_config, timeout) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 91, in StartTracing started = agent.StartAgentTracing(config, timeout) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 73, in StartAgentTracing self._battor.StartTracing() File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 216, in StartTracing self._SendBattOrCommand(self._START_TRACING_CMD) File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 326, in _SendBattOrCommand status = self._SendBattOrCommandImpl(cmd) File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 321, in _SendBattOrCommandImpl self._battor_shell.stdin.write('%s\n' % cmd) IOError: [Errno 32] Broken pipe https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F616%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout What is the timeout for this step? Was it really a shard timing out?
,
May 4 2017
I recall that swarming has some logic to kill stucked tests. Assign to Charlie for triaging because this is battor related.
,
May 4 2017
I think as an immediate fix restarting the BattOr will likely help stop it from timing out.
Its printing out results like it does at the end of telemetry run:
[ PASSED ] 11 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] load:games:alphabetty@{'case': 'load', 'group': 'games'}
1 FAILED TEST
Does telemetry have logic for checking if a subprocess is left open before closing out? Its possible the BattOr binary subprocess is hung and not closing properly. No proof this is what is happening, just brain storming since this appears to only happen on BattOr runs.
Adding pschmidt@, if you get a chance could you unplug the barrel jacks and usb plug from the battor on this device and plug them back in?
,
May 4 2017
To #3: we don't have yet in https://github.com/catapult-project/catapult/blob/master/common/battor/battor/battor_wrapper.py#L171
,
May 4 2017
The log message from that isn't in the run, so that function isn't being called when self._battor_shell is not set to None.
,
May 4 2017
Done. Battor on build7-b1 has been reset.
,
May 8 2017
Intermittent BattOr failures that you are seeing is likely due to a known software issue with the battor_agent. We are addressing this issue in https://crbug.com/699581 . There still could be an issue where there is a stray battor_agent task left running from a failed run though.
,
May 8 2017
I am pretty sure that swarming infra have some mechanism to kill all the stray tasks between test runs. +maruel to confirm this.
,
May 10 2017
,
May 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/eab99009edae854674365dd3c228976b7e32d24b commit eab99009edae854674365dd3c228976b7e32d24b Author: Stephen Martinis <martiniss@chromium.org> Date: Thu May 25 01:41:43 2017 Bump system_health.memory_desktop timeout Changing it from the default of 2 hours to 3 hours. Also corrects a comment. Bug: 718366 Change-Id: I74498d4503a425d307882873ad6a159f57a899c8 Reviewed-on: https://chromium-review.googlesource.com/497288 Commit-Queue: Stephen Martinis <martiniss@chromium.org> Reviewed-by: Ned Nguyen <nednguyen@google.com> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org> Cr-Commit-Position: refs/heads/master@{#474515} [modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/testing/buildbot/chromium.perf.fyi.json [modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/testing/buildbot/chromium.perf.json [modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/tools/perf/core/perf_data_generator.py
,
Jun 22 2017
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by nedngu...@google.com
, May 4 2017