Several Mac BattOrs wedged
Issue description
These BattOrs need their SD cards reinserted to make them work again. Hopefully this will not happen again now that we have a new firmware version.

Mac 10.12 Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%2010.12%20Perf/builds/650
Bot id: build162-m1

Mac Retina Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%20Retina%20Perf/builds/819
Bot id: 'build30-b4'
Bot id: 'build6-b1'

Mac Air 10.11 Perf
https://build.chromium.org/p/chromium.perf/builders/Mac%20Air%2010.11%20Perf/builds/986
Bot id: 'build127-b1'
,
Jun 27 2017
Changing the summary to reflect the new request.
,
Jun 27 2017
Hi Peter, can you also do this to the other bots listed here too? Thanks!
,
Jun 27 2017
build30-m4 was power cycled per crbug.com/732532 (no flashing orange LED)
build6-b1 was power cycled (also had no flashing orange LED)
build127-b1 was power cycled (it had a steady red LED but now is flashing orange)
build162-m1 is a mini in the golo that has no BattOr attached?
,
Jun 27 2017
Oops. That problem was not related to BattOr and I put it in there accidentally. Thanks for power cycling them! I'll take the bug and make sure they are working properly.
,
Jun 27 2017
Thanks for doing that Peter, we are now getting closer to a diagnosis of the issue. The lack of orange flashing LED on build30-m4 and build6-b1 indicates that they can not find their SD cards. This confirms my suspicion that there is something wrong with their SD cards. It seems like they are acting like their SD card is not there at all. Peter, can you open up those BattOrs and pull out and put back in their SD cards to see if that fixes the problem (results in orange LEDs blinking constantly on power up)? If it does not, we will need to swap in new SD cards, and Mellow will need those old ones sent to us so we can further investigate the problem.
,
Jun 28 2017
Peter, do you happen to remember whether the BattOrs at build30-m4 and build6-b1 started flashing orange again once you unplugged/replugged them, before pulling out the SD card? Aaron, my understanding might be wrong, but if the problem lies in their SD cards, we'd expect them to continue not flashing even after an unplug/replug, until we fixed any SD card issues.
,
Jun 28 2017
Sorry, I think there was some confusion. I realize now that Peter's notes are likely the state of the LEDs before he pulled the plug. It does look like build30-m4 and build6-b1 are now in a happy state again. That likely means that the SD cards got into some sort of wedged state where the BattOr wasn't able to communicate with them at all.

The bug that existed in the BattOr firmware before the recent patch was that the firmware would wedge itself if it did not detect an SD card; this is why we didn't see any LED flashes. This could also be bad because the SD card initialization might fail once but succeed if it was tried again. That's why I changed the BattOr firmware in the recent patch to reset the BattOr if the SD card initialization fails. This means that the way we will know if this issue is hit again is if we see BattOrs that are constantly resetting (indicated by single 0x00 bytes sent over the wire).
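For anyone watching for the reset symptom described above, here is a minimal sketch of how a host-side check might flag it. It assumes the serial reads are already available as a sequence of byte chunks; the function name, the `threshold` parameter, and the rule "several consecutive reads that are a lone 0x00 byte" are illustrative assumptions, not part of battor_agent or the firmware.

```python
from typing import Iterable

# Per the comment above, a resetting BattOr shows up as single 0x00 bytes.
RESET_MARKER = b"\x00"


def looks_like_reset_loop(reads: Iterable[bytes], threshold: int = 3) -> bool:
    """Return True if `threshold` consecutive non-empty reads are a lone 0x00.

    `reads` is a hypothetical sequence of chunks returned by the serial port.
    Empty chunks (read timeouts) are skipped and do not break the run.
    """
    consecutive = 0
    for chunk in reads:
        if chunk == RESET_MARKER:
            consecutive += 1
            if consecutive >= threshold:
                return True
        elif chunk:
            # Any other non-empty data means the device sent something real.
            consecutive = 0
    return False
```

A threshold above one is used so that a single stray 0x00 is not reported as a reset loop; the right value would have to be tuned against real serial logs.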
,
Jun 28 2017
Yes, state of LEDs was before I reset the power on the battors.
,
Jun 28 2017
Sounds good. I'm going to mark this as fixed, then. Huge thanks to Aaron for debugging the issue and suggesting the fix, Randy for formally requesting the fix, and Peter for carrying it out :-)
,
Jul 6 2017
Reopening this bug (and changing the summary) because several of these BattOrs failed again. It turns out that there is another bug causing this.

The issue is caused by the retry-all-commands fix for crbug.com/699581. It introduced a new bug in StopTracing that looks like the following: the battor_agent would retry StopTracing when bytes were lost (as they often are on Mac), and this would lead it to send a new control message (EEPROM Read). However, the BattOr would ignore this new message and happily continue sending the trace over UART. The battor_agent, in turn, would not notice that the BattOr was not listening, so it would continue retrying commands. Eventually the BattOr would get wedged and/or the FTDI driver on Mac would crash.

Randy, I am assigning this to you. Can you please update the BattOr firmware to include the latest commit (https://github.com/aschulm/battor/commit/3d63c6998f03333986902e9c1ae17309335d9181)? After Randy is done with that, I will assign it to Peter so he can power cycle the following BattOrs: [build30-b4, build127-b1].
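The StopTracing failure mode described above can be illustrated with a toy model. This is only a sketch: `ToyBattOr`, the command strings, and the `retry_mid_stream` switch are hypothetical names invented for illustration, and the code does not reflect the real battor_agent state machine or the actual fix in the linked commit.

```python
class ToyBattOr:
    """Hypothetical device: once it starts streaming, it ignores new commands."""

    def __init__(self) -> None:
        self.streaming = False
        self.ignored = []

    def send(self, command: str) -> None:
        if self.streaming:
            # The device keeps sending the trace and drops the command.
            self.ignored.append(command)
        elif command == "START_STREAM":
            self.streaming = True


def stop_tracing(device: ToyBattOr, bytes_lost: bool,
                 retry_mid_stream: bool, max_retries: int = 3) -> str:
    """Model one StopTracing attempt under a given retry policy."""
    device.send("EEPROM_READ")    # first control message of the sequence
    device.send("START_STREAM")   # device begins sending the trace
    if not bytes_lost:
        return "ok"
    if not retry_mid_stream:
        # Safer policy in this toy: give up instead of talking over the stream.
        return "failed"
    for _ in range(max_retries):
        # Buggy policy: restart the sequence while the device is still streaming.
        device.send("EEPROM_READ")
    return "retries_ignored"
```

With `bytes_lost=True` and `retry_mid_stream=True`, every retried command lands in `device.ignored`, which mirrors the description of the agent retrying against a BattOr that is not listening.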
,
Jul 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4a9ba5fb466528856ed90a8d780aa6fcaca865ee

commit 4a9ba5fb466528856ed90a8d780aa6fcaca865ee
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Mon Jul 10 13:21:48 2017

Roll src/third_party/catapult/ f3726edb4..9f7e1bcf9 (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f3726edb45da..9f7e1bcf98c3

$ git log f3726edb4..9f7e1bcf9 --date=short --no-merges --format='%ad %ae %s'
2017-07-10 charliea Upload a new version of the BattOr firmware

Created with:
roll-dep src/third_party/catapult
BUG= 737176

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I72ba9b1cc5ae80fb18c0380faab5e8cc6642ef9e
Reviewed-on: https://chromium-review.googlesource.com/565061
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#485251}

[modify] https://crrev.com/4a9ba5fb466528856ed90a8d780aa6fcaca865ee/DEPS
,
Jul 10 2017
pschmidt@, can you please unplug the BattOrs on build30-b4 and build127-b1 again and report on their LED condition? They are wedged again, but we think that the patch that just landed will fix the problems that we have been having with these two BattOrs. Thanks for your help!
,
Jul 10 2017
,
Jul 11 2017
Actually, there are four BattOrs that need to be restarted:
build132-b1 (Mac Pro 10.11 Perf)
build130-b1 (Mac Pro 10.11 Perf)
build127-b1 (Mac Retina Perf)
build30-b4 (one of our trybots)
,
Jul 11 2017
I think 127-b1 is on an Air, and 30-b4 is a normal Retina waterfall battor. I'm not sure about the Mac pro ones as we probably don't have BattOrs on Mac pros right? The list I sent I got from looking at failures on the waterfall, there may be try bots that I am missing.
,
Jul 11 2017
Issue 740683 has been merged into this issue.
,
Jul 11 2017
Sorry, the Mac Pros are just really poorly named: they're actually Macbook Pros (I assumed non-Retina), and we do in fact have BattOrs on them.
,
Jul 11 2017
,
Jul 11 2017
,
Jul 11 2017
build132-b1 had a steady red LED. Restarted it.
build130-b1 had a normal flashing orange LED. Restarted it.
build127-b1 had a steady red LED. Restarted it.
build30-b4 had a steady red LED. Restarted it.
Bonus extra: the BattOr on build125-b1 had a solid red LED. Restarted it.
,
Jul 12 2017
Peter, it turns out the fix didn't apply correctly, so the same BattOrs rewedged themselves. We just landed a fix in https://codereview.chromium.org/2976863002/. Do you mind doing a quick replug again on these four BattOrs so they can take the new patch? Sorry for the many unpluggings.
,
Jul 12 2017
Actually, sorry, hold off on that request: we need to wait for the current run to end for it to pull in the new patch. Let's plan to do the power cycle tomorrow after the patch dust has settled.
,
Jul 12 2017
Argh, too late. I came straight from the lab to the shuttle and just reset them. I'll do it again later.
,
Jul 12 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/75ca571e32649553e10c1dfed6b0a4f3a033150b

commit 75ca571e32649553e10c1dfed6b0a4f3a033150b
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Wed Jul 12 21:00:45 2017

Roll src/third_party/catapult/ 08d8c9f08..6c40c273a (4 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/08d8c9f08602..6c40c273a7fe

$ git log 08d8c9f08..6c40c273a --date=short --no-merges --format='%ad %ae %s'
2017-07-12 dproy Fixes catapult vulcanizer inline script ordering
2017-07-12 xunjieli Roll wpr-go forward to 11be1ed696ba1029960ca3b55bb369222dff183a
2017-07-12 xunjieli [wpr-go] Make installroot.go as a separate step.
2017-07-12 charliea Update the version_in_cs for the BattOr firmware

Created with:
roll-dep src/third_party/catapult
BUG= 737176

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: Iba91255958fd1b494dcf2188e2e20022811cc93c
Reviewed-on: https://chromium-review.googlesource.com/568590
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#486087}

[modify] https://crrev.com/75ca571e32649553e10c1dfed6b0a4f3a033150b/DEPS
,
Jul 13 2017
The NextAction date has arrived: 2017-07-13
,
Jul 14 2017
Hi Peter, did you get a chance to restart these BattOrs yet?
,
Jul 14 2017
Hi Charlie. Peter's OOO today. Not sure if he got to them or not. Are they still showing as wedged? Let us know and we can reset them now. Thanks.
,
Jul 14 2017
,
Jul 14 2017
PS: The current status of these (I have not touched anything yet):
build132-b1 flashing orange LED
build130-b1 flashing orange LED
build127-b1 flashing orange LED
build30-b4 flashing orange LED
Perhaps Peter got to them as alluded to in #24. Either way, let us know if any further action is needed. Thanks.
,
Jul 14 2017
It looks to me like battor.steady_state is healthy again on the Macbook Pros, which is a good sign: https://chromeperf.appspot.com/report?sid=05ca5741ce1336a5d0bf982cafec8f65c4594f19440a3a30b30c35af00fc7e63 Further, the Mac Pro 10.11 Perf bots (https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Pro%2010.11%20Perf/) and Mac Retina Perf bots (https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf/) haven't had any failures since that initial one. I'm going to set a NextAction to check back on this in a week, because the only way that we can be sure that this has gone away is to wait.
,
Jul 14 2017
,
Jul 14 2017
,
Jul 20 2017
The NextAction date has arrived: 2017-07-20
,
Jul 24 2017
Looking back at this to determine if our fix was effective. I'm still seeing occasional failures of all BattOr tests. Here are some examples:

1) battor.trivial_pages failing yesterday on Mac Air 10.11 Perf: https://goo.gl/VV2t5E
2) battor.steady_state failing yesterday on Mac Air 10.11 Perf: https://goo.gl/x9Yhqo
3) battor.steady_state failing yesterday on Mac Retina Perf: https://goo.gl/z2Wiin

The serial log from failure #1 (battor_serial_log_1.txt) indicates that all reads are timing out. To me, this suggests that the BattOr might be wedged. However, the confusing part is that the BattOr literally *just* collected a trace seconds before. How could it work well one second and be unresponsive the next?

The serial log from failure #2 (battor_serial_log_2.txt) had an interesting problem: when streaming back the results, one of the messages failed with TOO_MANY_START_BYTES (irrecoverable error 4). Then, the next message failed with NO_START_BYTE (irrecoverable error 2). This seems unlikely to be a coincidence. However, we don't log the actual bytes when reading data frames because there are too many of them and doing so would cause us to drop data frames (given Mac's... sensitive serial buffer), so we don't have much insight into what the data actually looks like.

Looking at the serial log from failure #3 (battor_serial_log_3.txt), it looks like the BattOr reset when it was asked for the sample count in order to record the clock sync marker. Unfortunately, this isn't an error that we can recover from and is probably due to a bug in the firmware.

Aaron, I remember that you were looking into this a while ago. Do you have any idea what might be happening here? What really sucks is that these three errors seem to each be different from each other.
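On failure #2, here is one hypothetical way the two errors could be linked, shown with a toy parser. Everything about the framing is an assumption made for illustration (a 0x00 start byte, fixed 4-byte frames, non-zero payload bytes); the real BattOr wire format is not described in this thread, and only the two error names come from the log above.

```python
from typing import List

START = 0x00   # assumed start-byte value
FRAME_LEN = 4  # assumed fixed frame size: 1 start byte + 3 payload bytes


def check_frame(frame: bytes) -> str:
    """Classify one frame under the toy rule 'exactly one leading start byte'."""
    if not frame or frame[0] != START:
        return "NO_START_BYTE"
    if len(frame) > 1 and frame[1] == START:
        return "TOO_MANY_START_BYTES"
    return "OK"


def classify(stream: bytes) -> List[str]:
    """Split the stream into fixed-size frames and classify each one."""
    return [check_frame(stream[i:i + FRAME_LEN])
            for i in range(0, len(stream), FRAME_LEN)]


# Two well-formed frames, then the same bytes with one stray start byte in front.
clean = bytes([0, 1, 2, 3, 0, 4, 5, 6])
shifted = bytes([0]) + clean[:7]
```

In this toy model, `classify(clean)` reports two good frames, while `classify(shifted)` reports TOO_MANY_START_BYTES followed by NO_START_BYTE: a single extra byte misaligns the first frame and pushes the next frame's start byte out of position. Whether anything like this happened in the real log cannot be confirmed without the raw bytes.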
,
Jul 28 2017
,
Aug 30 2017
,
Aug 30 2017
Comment 1 by aschulman@chromium.org
, Jun 27 2017