New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 718366 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 699581



Sign in to add a comment

system_health.common_desktop often timing out on Mac Retina

Project Member Reported by perezju@chromium.org, May 4 2017

Issue description

Every now and then this benchmark times out on:
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/

From latest 20 builds ('T' is timeout, '-' is success):
mac-retina common   T--T-T--T-TT-TT---T-

Example recent build:
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/616

The step says:
"shard #0 timed out, took too much time to complete"

Interestingly the test only ran for 1 hour 39 minutes, and the log shows that the benchmark run was actually interrupted with an exception:

(ERROR) 2017-05-03 18:41:59,280 battor_wrapper._FlashBattOr:158  Git hash returned from BattOr was not as expected: [0503/184159.279148:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 145, in _FlashBattOr
    device_git_hash = self.GetFirmwareGitHash()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 373, in GetFirmwareGitHash
    int(self._git_hash, 16)
ValueError: invalid literal for int() with base 16: '[0503/184159.279148:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT'
(WARNING) 2017-05-03 18:41:59,281 shared_page_state.DumpStateUponFailure:131  Cannot dump browser state: No browser.
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 80, in _RunStoryAndProcessErrorIfNeeded
    test.WillRunStory(state.platform)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py", line 285, in WillRunStory
    platform.tracing_controller.StartTracing(self._tbm_options.config)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py", line 43, in StartTracing
    self._tracing_controller_backend.StartTracing(tracing_config, timeout)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 91, in StartTracing
    started = agent.StartAgentTracing(config, timeout)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 73, in StartAgentTracing
    self._battor.StartTracing()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 216, in StartTracing
    self._SendBattOrCommand(self._START_TRACING_CMD)
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 326, in _SendBattOrCommand
    status = self._SendBattOrCommandImpl(cmd)
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 321, in _SendBattOrCommandImpl
    self._battor_shell.stdin.write('%s\n' % cmd)
IOError: [Errno 32] Broken pipe

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F616%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

What is the timeout for this step? Was it really a shard timing out?
 
Cc: charliea@chromium.org
 Issue 718303  has been merged into this issue.
Cc: -charliea@chromium.org rnep...@chromium.org
Owner: charliea@chromium.org
Status: Assigned (was: Untriaged)
I recall that swarming has some logic to kill stucked tests. Assign to Charlie for triaging because this is battor related.
Cc: pschmidt@chromium.org
Components: Infra>Labs
I think as an immediate fix restarting the BattOr will likely help stop it from timing out. 

Its printing out results like it does at the end of telemetry run:
[  PASSED  ] 11 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ]  load:games:alphabetty@{'case': 'load', 'group': 'games'}

1 FAILED TEST

Does telemetry have logic for checking if a subprocess is left open before closing out? Its possible the BattOr binary subprocess is hung and not closing properly. No proof this is what is happening, just brain storming since this appears to only happen on BattOr runs.

Adding pschmidt@, if you get a chance could you unplug the barrel jacks and usb plug from the battor on this device and plug them back in?

The log message from that isn't in the run, so that function isn't being called when self._battor_shell is not set to None.
Done.  Battor on build7-b1 has been reset.
Intermittent BattOr failures that you are seeing is likely due to a known software issue with the battor_agent. We are addressing this issue in  https://crbug.com/699581 .

There still could be an issue where there is a stray battor_agent task left running  from a failed run though.
Cc: mar...@chromium.org
I am pretty sure that swarming infra have some mechanism to kill all the stray tasks between test runs.

+maruel to confirm this.
Blockedon: 699581
Labels: -Pri-3 Pri-2
Project Member

Comment 10 by bugdroid1@chromium.org, May 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/eab99009edae854674365dd3c228976b7e32d24b

commit eab99009edae854674365dd3c228976b7e32d24b
Author: Stephen Martinis <martiniss@chromium.org>
Date: Thu May 25 01:41:43 2017

Bump system_health.memory_desktop timeout

Changing it from the default of 2 hours to 3 hours. 
Also corrects a comment.

Bug:  718366 
Change-Id: I74498d4503a425d307882873ad6a159f57a899c8
Reviewed-on: https://chromium-review.googlesource.com/497288
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#474515}
[modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/testing/buildbot/chromium.perf.fyi.json
[modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/eab99009edae854674365dd3c228976b7e32d24b/tools/perf/core/perf_data_generator.py

Status: Fixed (was: Assigned)

Sign in to add a comment