
Issue 752270 link


Issue metadata

Status: Duplicate
Merged: issue 757484
Owner:
Closed: Aug 2017
Cc:
EstimatedDays: ----
NextAction: 2017-08-07
OS: ----
Pri: 1
Type: ----




media.tough_video_cases_tbmv2 failing on 2 builders

Project Member Reported by martiniss@chromium.org, Aug 3 2017

Issue description


Builders failed on: 
- Mac Air 10.11 Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Mac%20Air%2010.11%20Perf
- Mac Retina Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Mac%20Retina%20Perf


This is failing fairly consistently.

The following error shows up in the logs fairly consistently:
(ERROR) 2017-08-03 00:10:49,030 battor_wrapper._FlashBattOr:175  Git hash returned from BattOr was not as expected: [0803/001049.029353:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 162, in _FlashBattOr
    device_git_hash = self.GetFirmwareGitHash()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 392, in GetFirmwareGitHash
    int(self._git_hash, 16)
ValueError: invalid literal for int() with base 16: '[0803/001049.029353:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'

charliea@, could this be causing the test to fail? Can you take a look?

Will disable this test on mac.
 
Cc: crouleau@chromium.org
Mind pointing to a specific log of a failure?

It's unlikely that a BattOr failure is specific to this test case, since BattOr is used in the same way in every test. Maybe it's just because this is the first test to run, or something like that?
Cc: vhang@chromium.org
Looking at the logs, it looks like the whole suite is failing. That test is just the first test to try to run. We're better off asking +Vince if something is wrong with the BattOr on that device.
Ah ok, I thought there was only one story the benchmark ran. My bad. 

Vince, can someone from labs look at the bot to see if the BattOr is messed up or something?
Components: Infra>Labs
Sorry for the delay: I'm baffled about why the serial logs aren't getting uploaded here. 

Adding Infra>Labs: could you please take a look at the BattOr on this machine and let us know what the blinking pattern is? Could you also please unplug and replace the power cable running to it?

Comment 6 by jo...@google.com, Aug 4 2017

LED pattern was solid red (no blinks) on build127-b1.

Reset the power connections to the BattOr and it's blinking orange now. Do you still need the power cable replaced? If so, are we talking about the adapter side, or the laptop side? (I don't see any spares in the vicinity, at first glance.)

Thanks.
After looking more closely, it looks like this is what happened:

1) Telemetry tried to flash the BattOr, but was unsuccessful for some reason.

2) While trying to recover from the failed flash, it tried to stop the BattOr shell (https://cs.chromium.org/chromium/src/third_party/catapult/common/battor/battor/battor_wrapper.py?type=cs&q=%22git+hash+returned%22&sq=package:chromium&l=176). This first tries to stop the shell gracefully if it's still running, but falls back to a simple kill if the graceful stop times out.

3) The shell was still running, so it tried to stop it gracefully. However, there was a race condition between checking whether the shell was still running and actually requesting the graceful shutdown, and in the meantime, the shell died. Because we only catch TimeoutExceptions when trying to gracefully stop the shell, the "broken pipe" exception that signaled the failed shutdown propagated outwards, killing Telemetry before it ever uploaded the serial logs to cloud storage.

It's hard to say exactly what happened to the BattOr to cause the initial failures without the serial logs, but a clear outcome of this is that we need to catch *all* exceptions during StopShell(), not just TimeoutExceptions.
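
A minimal sketch of that idea, for illustration only. This is not the actual battor_wrapper.py code: `stop_shell`, `request_graceful_exit`, `force_kill`, and the local `TimeoutException` class are all hypothetical names made up for this example.

```python
import errno


class TimeoutException(Exception):
    """Hypothetical stand-in for the timeout raised by a slow graceful stop."""


def stop_shell(request_graceful_exit, force_kill):
    """Stop the shell gracefully, falling back to a kill on ANY failure.

    Both arguments are hypothetical callables supplied by the caller.
    Returns True if the graceful stop worked, False if the kill was used.
    """
    try:
        request_graceful_exit()
        return True
    except Exception:
        # Catching only TimeoutException here would let a "broken pipe"
        # error (the shell died between the liveness check and the request)
        # escape and take down the caller before logs are uploaded.
        force_kill()
        return False


def dead_shell_exit():
    """Simulates writing to a shell that has already died."""
    raise OSError(errno.EPIPE, 'Broken pipe')
```

With this shape, calling `stop_shell(dead_shell_exit, some_kill_function)` runs the kill fallback and returns `False` instead of raising, so whatever cleanup follows (such as uploading the serial logs) still gets a chance to run.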
Thanks! Sorry, my wording was terrible. I meant "unplug and replug"... not "unplug and replace". What you did was exactly what I was hoping for :-)
Components: -Infra>Labs
NextAction: 2017-08-07
Setting NextAction to Monday, when we can check whether this is working yet.
Project Member

Comment 11 by bugdroid1@chromium.org, Aug 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/720c8ddf5c5f442062ec595c02e644e631f6ed33

commit 720c8ddf5c5f442062ec595c02e644e631f6ed33
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Sat Aug 05 06:08:48 2017

Roll src/third_party/catapult/ 0fb50e3f8..33a9271eb (4 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/0fb50e3f84ef..33a9271eb3cf

$ git log 0fb50e3f8..33a9271eb --date=short --no-merges --format='%ad %ae %s'
2017-08-04 achuith Disable testSmokeStartingWebPageReplayGoServer on chromeos.
2017-08-04 nednguyen Add markdown version of run_telemetry_tests documentation
2017-08-04 charliea Catch all BattOr shell graceful shutdown failures
2017-08-04 xunjieli [wpr-go] Update README

Created with:
  roll-dep src/third_party/catapult
BUG=750323, 752270 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: Ia77bd543b748caad646d357f7cedc4724604fe6a
Reviewed-on: https://chromium-review.googlesource.com/602745
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#492231}
[modify] https://crrev.com/720c8ddf5c5f442062ec595c02e644e631f6ed33/DEPS

The NextAction date has arrived: 2017-08-07
Mergedinto: 753759
Status: Duplicate (was: Assigned)
Mergedinto: -753759 757484
