
Issue 737176

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 757508
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: 1
NextAction: 2017-07-20
OS: Mac
Pri: 2
Type: Bug

Blocked on:
issue 728496




Several Mac BattOrs wedged

Project Member Reported by rnep...@chromium.org, Jun 27 2017

Issue description

These BattOrs need their SD cards reinserted to make them work again. Hopefully this will not happen again, since we have a new firmware version.

Mac 10.12 Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%2010.12%20Perf/builds/650
Bot id:build162-m1


Mac Retina Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%20Retina%20Perf/builds/819
Bot id: 'build30-b4'
Bot id: 'build6-b1'


Mac Air 10.11 Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%20Air%2010.11%20Perf/builds/986
Bot id: 'build127-b1'

 
Before reinserting the SD cards in these, can we first just try power cycling them to see if that solves the problem? If that does not work, that lets us know that there is some sort of fault with the connection to the SD card or the SD card itself.
Summary: Power Cycle Mac BattOrs (was: Remove and reinsert SDCards on Mac BattOrs)
Changing the summary to reflect the new request. 
Owner: pschmidt@chromium.org
Hi Peter, can you also do this to the other bots listed here too? Thanks!

Comment 4 by pschm...@google.com, Jun 27 2017

Status: Assigned (was: Untriaged)
build30-m4 was power cycled per  crbug.com/732532   (no flashing orange led)

build6-b1 was power cycled (also had no flashing orange led)

build127-b1 was power cycled (it had a steady red led but now is flashing orange)

build162-m1 is a mini in the golo that has no battor attached?
Owner: rnep...@chromium.org
Oops. That problem was not related to BattOr and I put it in there accidentally. Thanks for power cycling them! I'll take the bug and make sure they are working properly. 
Thanks for doing that Peter, we are now getting closer to a diagnosis of the issue.

The lack of orange flashing LED on build30-m4 and build6-b1 indicates that they can not find their SD cards. This confirms my suspicion that there is something wrong with their SD cards. It seems like they are acting like their SD card is not there at all.

Peter, can you open up those BattOrs and pull out and put back in their SD cards to see if that fixes the problem (results in orange LEDs blinking constantly on power up)?

If it does not, we will need to swap in new SD cards, and Mellow will need those old ones sent to us so we can further investigate the problem.
Peter, do you happen to remember whether the BattOrs at build30-m4 and build6-b1 started flashing orange again once you unplugged/replugged them, before pulling out the SD card?

Aaron, my understanding might be wrong, but if the problem lies in their SD cards, we'd expect them to continue not flashing even if we unplugged/replugged them until we fixed any SD card issues.
Sorry, I think there was some confusion. I realize now that Peter's notes are likely the state of the LEDs before he pulled the plug. It does look like build30-m4 and build6-b1 are now in a happy state again. That likely means that the SD cards got into some sort of wedged state where the BattOr wasn't able to communicate with them at all.

The bug that existed in the BattOr firmware before the recent patch was that the firmware would wedge itself if it did not detect an SD card; this is why we didn't see any LED flashes. This also could be bad because the SD card initialization might fail once but succeed if it were tried again. That's why I changed the BattOr firmware in the recent patch to reset the BattOr if the SD card initialization fails.

This means that the way we will know if this issue is hit again is if we see BattOrs that are constantly resetting (indicated by single 0x00 bytes sent over the wire).
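
For anyone who wants to watch for this in a serial log, here is a minimal Python sketch of the check described above. It assumes the log can be turned into a sequence of raw reads (one `bytes` object per read); the function name `looks_like_reset_loop` and the `threshold` of three consecutive markers are hypothetical choices for illustration, not part of the BattOr tooling.

```python
from typing import Iterable

# From the thread: a resetting BattOr sends a single 0x00 byte over the wire.
RESET_MARKER = b"\x00"


def looks_like_reset_loop(reads: Iterable[bytes], threshold: int = 3) -> bool:
    """Return True if `threshold` consecutive non-empty reads are a lone 0x00.

    `threshold` is an assumed cutoff; one stray 0x00 is not treated as a loop.
    """
    streak = 0
    for chunk in reads:
        if not chunk:
            # Empty read (e.g. a timeout): ignore it, keep the current streak.
            continue
        if chunk == RESET_MARKER:
            streak += 1
            if streak >= threshold:
                return True
        else:
            # Any other traffic means the device did something besides reset.
            streak = 0
    return False
```

For example, `looks_like_reset_loop([b"\x00", b"", b"\x00", b"\x00"])` returns `True`, while a log where real traffic arrives between the 0x00 bytes returns `False`.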

Comment 9 by pschm...@google.com, Jun 28 2017

Yes, state of LEDs was before I reset the power on the battors.  
Status: Fixed (was: Assigned)
Sounds good. I'm going to mark this as fixed, then. Huge thanks to Aaron for debugging the issue and suggesting the fix, Randy for formally requesting the fix, and Peter for carrying it out :-)
Status: Assigned (was: Fixed)
Summary: Several Mac BattOrs wedged (was: Power Cycle Mac BattOrs)
Reopening this bug (and changing the summary) because several of these BattOrs failed again. It turns out that there is another bug causing this. The issue is caused by the retry all commands fix for  crbug.com/699581 . It introduced a new bug in StopTracing that looks like the following:

The battor_agent would try to retry StopTracing when bytes are lost (as they often are on Mac) and this would lead it to send a new control message (EEPROM Read). However, the BattOr didn't care that this new message came and it would continue happily sending the trace over UART. The battor_agent also wouldn't care that the BattOr was not listening and so it would continue retrying commands. Eventually the BattOr would get wedged and/or the FTDI driver on Mac would crash.
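
To make the failure mode above concrete, here is a toy Python model of it. This is not the battor_agent or firmware code: `ToyBattOr`, `send_with_bounded_retries`, and the command strings are hypothetical names. The sketch only assumes what is described above, namely that a device that has started streaming a trace ignores further control messages, and it shows why a retry loop needs an upper bound to surface that state instead of retrying forever.

```python
class ToyBattOr:
    """Toy device: once it starts streaming a trace, it ignores commands."""

    def __init__(self) -> None:
        self.streaming = False
        self.ignored = 0  # number of control messages that were not heard

    def send(self, command: str) -> bool:
        """Return True if the command was acknowledged."""
        if self.streaming:
            self.ignored += 1
            return False
        if command == "STOP_TRACING":
            # The device begins sending the trace and stops listening.
            self.streaming = True
        return True


def send_with_bounded_retries(device: ToyBattOr, command: str,
                              max_attempts: int = 3) -> int:
    """Send `command`, retrying up to `max_attempts` times.

    Returns the attempt number that succeeded, or raises RuntimeError so
    the caller learns the device is not listening.
    """
    for attempt in range(1, max_attempts + 1):
        if device.send(command):
            return attempt
    raise RuntimeError(f"{command} not acknowledged after {max_attempts} attempts")
```

In this model, sending `"STOP_TRACING"` succeeds on the first attempt, and a follow-up `"EEPROM_READ"` raises `RuntimeError` after three ignored attempts rather than looping indefinitely.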

Randy, I am assigning this to you. Can you please update the BattOr firmware to include the latest commit (https://github.com/aschulm/battor/commit/3d63c6998f03333986902e9c1ae17309335d9181)?

After Randy is done with that, I will assign it to Peter so he can power cycle the following BattOrs: [build30-b4, build127-b1].


Project Member

Comment 12 by bugdroid1@chromium.org, Jul 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4a9ba5fb466528856ed90a8d780aa6fcaca865ee

commit 4a9ba5fb466528856ed90a8d780aa6fcaca865ee
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Mon Jul 10 13:21:48 2017

Roll src/third_party/catapult/ f3726edb4..9f7e1bcf9 (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f3726edb45da..9f7e1bcf98c3

$ git log f3726edb4..9f7e1bcf9 --date=short --no-merges --format='%ad %ae %s'
2017-07-10 charliea Upload a new version of the BattOr firmware

Created with:
  roll-dep src/third_party/catapult
BUG= 737176 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I72ba9b1cc5ae80fb18c0380faab5e8cc6642ef9e
Reviewed-on: https://chromium-review.googlesource.com/565061
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#485251}
[modify] https://crrev.com/4a9ba5fb466528856ed90a8d780aa6fcaca865ee/DEPS

EstimatedDays: 1
Owner: pschmidt@chromium.org
pschmidt@. Can you please unplug the BattOrs on build30-b4 and build127-b1 again and report on their LED condition? They are wedged again, but we think that the patch that just landed will fix the problems that we have been having with these two BattOrs.

Thanks for your help!
Owner: pschm...@google.com
Actually, there are four BattOrs that need to be restarted:

build132-b1 (Mac Pro 10.11 Perf)
build130-b1 (Mac Pro 10.11 Perf)
build127-b1 (Mac Retina Perf)
build30-b4 (one of our trybots)
I think 127-b1 is on an Air, and 30-b4 is a normal Retina waterfall battor.

I'm not sure about the Mac pro ones as we probably don't have BattOrs on Mac pros right?

The list I sent I got from looking at failures on the waterfall, there may be try bots that I am missing.
 Issue 740683  has been merged into this issue.
Sorry, the Mac Pros are just really poorly named: they're actually Macbook Pros (I assumed non-Retina), and we do in fact have BattOrs on them.
Components: Speed>Benchmarks>Waterfall
Owner: pschmidt@chromium.org
build132-b1 had a steady red LED. Restarted it.
build130-b1 had a normal flashing orange LED. Restarted it.
build127-b1 had a steady red LED. Restarted it.
build30-b4 had a steady red LED. Restarted it.

Bonus extra: The BattOr on build125-b1 had a solid red LED. Restarted it.
Peter, it turns out the fix didn't apply correctly, so the same BattOrs rewedged themselves. We just landed a fix in https://codereview.chromium.org/2976863002/.

So do you mind doing a quick replug again on these four BattOrs so they can take the new patch?

Sorry for the many unpluggings.
NextAction: 2017-07-13
Actually, sorry, hold off on that request: we need to wait for the current run to end for it to pull in the new patch.

Let's plan to do the power cycle tomorrow after the patch dust has settled.


Argh, too late.  I came straight from the lab to the shuttle and just reset them.

I'll do it again later.

Project Member

Comment 25 by bugdroid1@chromium.org, Jul 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/75ca571e32649553e10c1dfed6b0a4f3a033150b

commit 75ca571e32649553e10c1dfed6b0a4f3a033150b
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Wed Jul 12 21:00:45 2017

Roll src/third_party/catapult/ 08d8c9f08..6c40c273a (4 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/08d8c9f08602..6c40c273a7fe

$ git log 08d8c9f08..6c40c273a --date=short --no-merges --format='%ad %ae %s'
2017-07-12 dproy Fixes catapult vulcanizer inline script ordering
2017-07-12 xunjieli Roll wpr-go forward to 11be1ed696ba1029960ca3b55bb369222dff183a
2017-07-12 xunjieli [wpr-go] Make installroot.go as a separate step.
2017-07-12 charliea Update the version_in_cs for the BattOr firmware

Created with:
  roll-dep src/third_party/catapult
BUG= 737176 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: Iba91255958fd1b494dcf2188e2e20022811cc93c
Reviewed-on: https://chromium-review.googlesource.com/568590
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#486087}
[modify] https://crrev.com/75ca571e32649553e10c1dfed6b0a4f3a033150b/DEPS

The NextAction date has arrived: 2017-07-13
Hi Peter, did you get a chance to restart these BattOrs yet?

Comment 28 by jo...@google.com, Jul 14 2017

Hi Charlie. Peter's OOO today. Not sure if he got to them or not. Are they still showing as wedged? Let us know and we can reset them now.

Thanks.

Comment 29 by jo...@google.com, Jul 14 2017

Cc: jo...@chromium.org

Comment 30 by jo...@google.com, Jul 14 2017

PS: The current status of these (I have not touched anything yet).

build132-b1 flashing orange LED
build130-b1 flashing orange LED
build127-b1 flashing orange LED
build30-b4 flashing orange LED

Perhaps Peter got to them as alluded to in #24.

Either way, let us know if any further action is needed. Thanks.
NextAction: 2017-07-20
It looks to me like battor.steady_state is healthy again on the Macbook Pros, which is a good sign: https://chromeperf.appspot.com/report?sid=05ca5741ce1336a5d0bf982cafec8f65c4594f19440a3a30b30c35af00fc7e63

Further, the Mac Pro 10.11 Perf bots (https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Pro%2010.11%20Perf/) and Mac Retina Perf bots (https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/) haven't had any failures since that initial one.

I'm going to set a NextAction to check back on this in a week, because the only way that we can be sure that this has gone away is to wait.
Components: -Infra>Labs
Owner: charliea@chromium.org
Labels: -Pri-1 Pri-2
The NextAction date has arrived: 2017-07-20
Looking back at this to determine if our fix was effective.

I'm still seeing occasional failures of all BattOr tests.

Here are some examples: 

1) battor.trivial_pages failing yesterday on Mac Air 10.11 Perf: https://goo.gl/VV2t5E
2) battor.steady_state failing yesterday on Mac Air 10.11 Perf: https://goo.gl/x9Yhqo
3) battor.steady_state failing yesterday on Mac Retina Perf: https://goo.gl/z2Wiin

The serial log from failure #1 (battor_serial_log_1.txt) indicates that all reads are timing out. To me, this suggests that the BattOr might be wedged. However, the confusing part is that the BattOr literally *just* collected a trace seconds before. How could it work well one second and be unresponsive the next?

The serial log from failure #2 (battor_serial_log_2.txt) had an interesting problem: when streaming back the results, one of the messages failed with TOO_MANY_START_BYTES (irrecoverable error 4). Then, the next message failed with NO_START_BYTE (irrecoverable error 2). This seems unlikely to be a coincidence. However, we don't log the actual bytes when reading data frames because there are too many of them and doing so would cause us to drop data frames (given Mac's... sensitive serial buffer), so we don't have much insight into what the data actually looks like.
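
As a rough illustration of how those two errors could show up back to back, here is a small Python sketch. Everything about the framing in it is an assumption for the sake of the example: the start byte value (`0x00`), the rule that a frame must begin with exactly one start byte, and the fixed three-byte reads are hypothetical, and `classify_frame_start` is not the battor_agent's parser. Only the two error names come from the log.

```python
from enum import Enum

START_BYTE = 0x00  # assumed value, for illustration only


class FrameStatus(Enum):
    OK = "OK"
    NO_START_BYTE = "NO_START_BYTE"
    TOO_MANY_START_BYTES = "TOO_MANY_START_BYTES"


def classify_frame_start(frame: bytes) -> FrameStatus:
    """Classify a frame by the number of start bytes at its beginning."""
    leading = 0
    for byte in frame:
        if byte != START_BYTE:
            break
        leading += 1
    if leading == 0:
        return FrameStatus.NO_START_BYTE
    if leading > 1:
        return FrameStatus.TOO_MANY_START_BYTES
    return FrameStatus.OK


def split_fixed(stream: bytes, size: int) -> list:
    """Cut a byte stream into consecutive fixed-size reads."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]
```

Under these assumptions, a single stray start byte ahead of two otherwise valid frames, `bytes([0x00, 0x00, 0xAA, 0x01, 0x00, 0xBB, 0x01])`, splits into reads whose first two classify as `TOO_MANY_START_BYTES` and then `NO_START_BYTE`. That is one hypothetical way a one-byte slip could produce the pair of errors seen here, not a diagnosis of this log.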

Looking at the serial log from failure #3 (battor_serial_log_3.txt), it looks like the BattOr reset when it was asked for the sample count in order to record the clock sync marker. Unfortunately, this isn't an error that we can recover from and is probably due to a bug in the firmware. Aaron, I remember that you were looking into this a while ago. Do you have any idea what might be happening here?

What really sucks is that these three errors seem to each be different from each other.


Attachments:
battor_serial_log_3.txt (4.3 KB)
battor_serial_log_1.txt (7.2 KB)
battor_serial_log_2.txt (113 KB)
Blockedon: 728496
Status: Fixed (was: Assigned)
Mergedinto: 757508
Status: Duplicate (was: Fixed)
