power_state:rec fails for samus on servo V3
Reported by
jrbarnette@chromium.org,
Apr 22 2016
|
|||||||||||
Issue description
Looking at the servo logs on servo V3 devices configured
for samus, you see the following error popping up from time
to time:
2016-04-22 22:17:41,732 - Servod - ERROR - Getting usbpd_ec3po_interp_connect: 'NoneType' object has no attribute 'get_interp_connect'
2016-04-22 22:17:41,734 - Servod - ERROR - Problem initializing usbpd_ec3po_interp_connect -> on :: 'NoneType' object has no attribute 'get_interp_connect'
In particular, this problem occurs when trying to force samus into
recovery mode via `dut-control power_state:rec`.
The failures in setting recovery mode are serious, because they
cause servo repair on samus to fail.
,
Apr 22 2016
,
Apr 22 2016
Samus uses this power_state driver:
servo/drv/cros_ec_pd_softrec_power.py
That overrides the parent class _cold_reset to
include this call:
self._interface.set('usbpd_reset', 'on')
The parent class is here:
servo/drv/cros_ec_softrec_power.py
That class implements power_state:rec using call call
to _cold_reset().
If there's no usbpd_reset control on servo V3, that would
explain the failure.
,
Apr 22 2016
There is a usbpd_reset control that servo_v3 should be importing. I should have been more specific. I don't believe that there is support for the USB PD *UART* for servo v3. Taking a step back, is this actually causing a failure though? Those messages would be printed right when servod starts up. It's trying to initialize the USB PD UART to be connected. But, like I mentioned there's no USB PD UART available for servo v3. Currently, I'm attempting to image a servo v3 of my own and will test with a samus.
,
Apr 22 2016
> Taking a step back, is this actually causing a failure
> though? Those messages would be printed right when
> servod starts up. It's trying to initialize the USB PD
> UART to be connected. But, like I mentioned there's no
> USB PD UART available for servo v3.
The error does occur at servod startup:
2016-04-22 22:17:41,725 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2016-04-22 22:17:41,732 - Servod - ERROR - Getting usbpd_ec3po_interp_connect: 'NoneType' object has no attribute 'get_interp_connect'
2016-04-22 22:17:41,734 - Servod - ERROR - Problem initializing usbpd_ec3po_interp_connect -> on :: 'NoneType' object has no attribute 'get_interp_connect'
2016-04-22 22:17:41,735 - Servod - INFO - Initialized usbpd_ec3po_interp_connect to on
2016-04-22 22:17:41,737 - servod - INFO - Listening on 0.0.0.0 port 9999
Additionally, the message re-occurs periodically afterward.
In particular, the message re-occurs after a failure to enter
recovery mode.
,
Apr 22 2016
+nsanders@ There are a couple issues here. I was talking to nsanders@ and he mentioned that there actually is servo v3 PD MCU UART support, but it requires a reworked servo. I was skipping initializing the ec3po consoles for the PD MCU because there wasn't support for them. But, it sounds like that's no longer the case. I can add the init then. But, more importantly. Why does the PD MCU UART need to be involved in the recovery process?
,
Apr 22 2016
,
Apr 22 2016
Some background: https://bugs.chromium.org/p/chromium/issues/detail?id=567842
,
Apr 23 2016
As in comment 3, it looks like servo_samus_overlay.xml depends on pd support for it's power_state interface.
,
Apr 23 2016
Re comment #9: I've been searching the code, and I can't find
the connection (in software) to explain how we get from
the power_state control for samus to the missing usbpd_ec3po_interp_connect
control.
Here's the evidence I've got:
* First, the samus overlay definitely asks for the
usbpd_ec3po_interp_connect control: that control
appears in usbpd.xml, and the samus overlay includes
that file.
* Second, the control definitely fails to initialize;
the log messages are making that clear.
* Third, samus clearly fails to enter recovery mode;
when servo repair tries to run "chromeos-install",
it gets the classic error message:
Error: Source does not look like a removable device: /dev/sda
That error occurs because the DUT powered on in normal
mode, rather than booting from USB in recovery mode.
* Fourth, the error complaint from the missing usbpd_ec3po_interp_connect
control recurs as a consequence of the same process that
that fails to get the DUT into recovery mode.
It's not proven that "missing usbpd_ec3po_interp_connect" and
"fails to get into recovery mode" are actually the same bug, but
I'm having a hard time believing that there are two bugs, both
triggered by repair, but only one of which matters.
,
Apr 23 2016
Following up, here's the full code used for power_state:rec on
samus:
def _power_on_bytype(self, rec_mode, rec_type='rec'):
if rec_mode == self.REC_ON:
# Hold warm reset so the AP doesn't boot when EC reboots.
self._interface.set('warm_reset', 'on')
# Reset the EC to force it back into RO code; this clears
# the EC_IN_RW signal, so the system CPU will trust the
# upcoming recovery mode request.
self._cold_reset()
# Restart the EC, but leave the system CPU off...
self._interface.set('ec_uart_regexp', 'None')
self._interface.set('ec_uart_cmd', 'reboot ap-off')
time.sleep(self._reset_recovery_time)
# Now the EC keeps the AP off. Release warm reset before powering
# on the AP.
self._interface.set('warm_reset', 'off')
# ... and tell the EC to tell the CPU we're in recovery mode.
self._interface.set('ec_uart_cmd',
self._HOSTEVENT_RECOVERY_CMD[rec_type])
time.sleep(self._RECOVERY_DETECTION_DELAY)
self._power_on_ap()
def _cold_reset(self):
self._interface.set('cold_reset', 'on')
self._interface.set('usbpd_reset', 'on')
time.sleep(self._reset_hold_time)
self._interface.set('usbpd_reset', 'off')
self._interface.set('cold_reset', 'off')
# After the reset, give the EC the time it needs to
# re-initialize.
time.sleep(self._reset_recovery_time)
def _power_on_ap(self):
self._interface.power_short_press()
,
Apr 23 2016
I haven't found the connection either. The only thing I can think of is maybe the interface indices are wrong, and we're trying to send EC commands to the PD MCU when we should be sending them to the main EC. But even then, I would expect different error messages. - I downloaded 8212.0.0 beaglebone_servo official test image and loaded it on a servo v3 that I have here at my desk. - started servod by, "servod -b samus" - in another terminal, I entered "dut-control power_state:rec". - I see an INFO message of "Pressing power button for 0.200 seconds." But, no additional error messages like you mentioned. Additionally, the message complains about the usbpd_ec3po_interp_connect control. That should only be used in the initialization of the interface and the flash_ec script. But in the flashing case, it would be trying to *set* the control's value, not *get*. So it seems to me that servod is trying to initialize the interface again?
,
Apr 23 2016
I think I have one hypothesis as to why the "Problem initializing usbpd_ec3po_interp_connect -> on ::" messages pop up. Like I mentioned in Comment #12, I would expect this message to appear when initializing the interface. It seems that FAFT, as part of the FirmwareTest.initialize method calls the FAFTBase.initialize method. That ends up calling a Servo.initialize_dut method, which reinitializes all the controls again as a part of hwinit(). Since the verbose flag is not set, only errors are printed, which yields our error string here. So maybe after the repair job fails, another FAFT test is started that calls the initialize() method. Therefore, I don't think that the error messages are reported as a result of the recovery failure, but instead as a result of starting a new test. Especially because the PD MCU UART shouldn't be involved with going into recovery mode at all. Setting the reset lines would be done via a GPIO driver and not via a console command over the UART. Furthermore, the PD MCU (usbpd) doesn't have the ability to reset the AP itself. So issuing the "reboot ap-off" console command doesn't make sense for that MCU. Along the same lines, the PD MCU cannot issue a host event directly to the AP, so it must pass through the EC. This is done via the "pdevent" console command. For power_state:rec, the command issued is "hostevent set 0x4000", again which doesn't make sense for the PD MCU.
,
Apr 23 2016
FAFT is irrelevant to this failure: No FAFT code ran during the interval in question. However, Servo.initialize_dut() is called every time we create a Host with a servo, which is a necessary pre-condition to the repair operation that fails. So, it sounds like the errors from usbpd_ec3po_interp_connect aren't causing the power_state:rec failures. That said, the message shouldn't be happening. I've filed bug 606107 for that problem. We still need to explain why power_state:rec is failing for samus. AFAIK, the feature was tested to work on samus at some time in the past, so we've had a regression.
,
Apr 23 2016
Ah okay, that makes sense then. I'll try to come up with a way to hide those messages. It's not super straightforward as the way interfaces are defined for the different servos and overlays are shared is a bit messy. I also agree that we should figure out the reason why power_state:rec is failing. When I tried it on my desk, My device just rebooted a few times, but didn't actually enter recovery mode.
,
Apr 25 2016
Setting usbpd_reset to ON/OFF is achieved through a dedicated GPIO pin. It is not through the PD UART. So it should work fine on Servo v3.
The problem happened on setting the initial value of usbpd_ec3po_interp_connect to "on", like the time of calling Servo.initialize_dut(). Check the usbpd.xml, init="on":
<control>
<name>usbpd_ec3po_interp_connect</name>
<doc>State indicating if interpreter is listening to the EC UART.</doc>
<params interface="9" drv="ec3po_driver" map="onoff" init="on"
subtype="interp_connect"></params>
</control>
However, it found the _interface is None and caused the error, like the error messages said:
def _Set_interp_connect(self, state):
"""Set the interpreter's connection state to the UART.
Args:
state: A boolean indicating whether to connect or disconnect the
intepreter from the UART.
"""
self._interface.set_interp_connect(state)
Check the _init_ec3po_uart() method which initializes the _interface:
elif ((vid, pid) in servo_interfaces.SERVO_V2_DEFAULTS or
(vid, pid) in servo_interfaces.SERVO_V3_DEFAULTS):
# Determine if it's a PD interface or just main EC console.
if interface['index'] == servo_interfaces.EC3PO_USBPD_INTERFACE_NUM:
try:
# Obtain the raw EC UART PTY and create the EC-3PO interface.
raw_ec_uart = self.get('raw_usbpd_uart_pty')
except NameError:
# This overlay doesn't have a USB PD MCU, so skip init.
self._logger.info('No PD MCU UART.')
return None
except AttributeError:
# This overlay has no get method for the interface so skip init.
self._logger.warn('No interface for PD MCU UART.')
self._logger.warn('Usually, this happens because the interface is set'
' incorrectly. If you\'re overriding an existing'
' interface, be sure to update the interface lists'
' for your board at the end of '
'servo/servo_interfaces.py')
return None
As getting raw_usbpd_uart_pty fails (no PD UART on Servo v3), it makes the _interface to be a None.
,
Jun 9 2016
I think I found the root cause to this issue. Although, if this issue is about the failure for the DUT to reach the recovery screen at all, it's not the issue I root caused then. See https://bugs.chromium.org/p/chromium/issues/detail?id=617832#c6 for my explanation.
,
Jun 10 2016
> I think I found the root cause to this issue. Ah, that helps. I was just about to poke this issue: This problem has been going on for too long, and I'd really like to see it get some attention.
,
Jun 14 2016
I have a samus device from the lab, and I've confirmed that it fails the simple test case of `dut-control power_state:rec`. We'll arrange delivery offline.
,
Jun 18 2016
Posting an update. Last night, I ran 50 iterations of the Servo_PowerStateController_USBPluggedin test. 46/50 of them passed and all the failures were due to power_state:on not actually turning the DUT on, but really it was the Ethernet not being able to connect. I visually confirmed this on 2 of the failures. The version of Chrome OS that the DUT was running was R53-8444.0.0 samus-cheets dev-channel test image. I wanted to then run many iterations of the firmware_RecoveryButton test, however the XMLRPCServer couldn't start on the image installed. The usual fix for that is to install a different image. I then attempted to install the latest dev-channel test image and reproduced the issue where the device comes up in normal mode instead of recovery. I even tried changing USB flash drives and got the same result. I *did* notice a pattern however: 0) Trigger recovery mode by 3 finger salute (esc + refresh + power) 1) Flash USB flash drive using cros flash 2) Insert flash drive into DUT 3) Device comes up in normal mode 4) Power down DUT and remove USB 5) Trigger recovery mode again via 3 finger salute 6) Insert USB 7) DUT displays message that "Device inserted does not contain Chrome OS" Repeating steps 4-6 at this point always yields step 7. I even tried changing the USB flash drive itself, but achieved the same result. I examined the logs and did not see an EC or PD MCU reset... I then decided to zero out the flash drive using dd and tried cros flash again. The DUT managed to actually boot into recovery mode. I installed the latest samus-cheets dev-channel image, R53-8466.0.0, with which firmware_RecoveryButton could actually run. Over the weekend, I'll be running a stress test of the firmware_RecoveryButton test which uses power_state:rec to enter recovery mode. I'm doing the same test with a chell as well which is exhibiting similar symptoms. Note that in order to log the PD MCU UART output, I am using a servo v2. However, the steps taken in power_state:rec are the same for servo v2 and v3. My "guess" is that now that I have the DUT able to boot off of the recovery media, it will have no trouble doing so as long as the media doesn't change. But, let's see what actually happens.
,
Jun 18 2016
I don't understand the point of the stress test. The behavior I've been seeing is 100% reproducible: If you say "dut-control power_state:rec" with the USB stick unplugged, the DUT boots immediately in normal mode; it never goes to the firmware screen. Are you not reproducing that problem consistently? Are you running a different test case?
,
Jun 18 2016
Oh, I thought the failure was booting into normal mode with the USB stick plugged in. But no, I have not seen the failure of the DUT *not* coming to the firmware screen. It's always gone to the firmware screen.
,
Jun 20 2016
The behavior I saw was this:
Unplug all the USB storage.
Run 'dut-control power_state:rec'
After this sequence, the DUT boots from internal storage to
Chrome OS in normal mode.
The problem is 100% reproducible for the two devices I
delivered. If you're not seeing it, you've got something
different in your environment. The most likely difference
is Servo V2 vs. Servo V3, but I'm willing to entertain other
theories.
,
Jun 20 2016
Okay, I tried with a servo v3 and reproduced the issue for both devices.
,
Jun 20 2016
I've noticed something interesting. If I open the EC console using minicom on the servo v3 itself, the device is significantly more likely to enter recovery mode. Or, if I enable ec_uart_capture:on, it's also more likely to enter recovery mode. Those facts along with the fact that it doesn't occur with servo v2 is leading me to believe it might be a host issue. Servo v3 is pretty slow in comparison. I wonder if the executions of the resets aren't taking place like we imagine it. Still need to continue debug.
,
Jun 21 2016
Another day, another update. I know I've said this before, but I feel like I've now got a solid theory. I can now reliably reproduce the behaviour for both samus and chell on servo v2. I did this by modifying the _cold_reset() method in cros_ec_pd_softrec_power.py by inserting a delay after the release of the cold_reset. Here are the events that happen: 0) Release the cold_reset and usbpd_reset 1) PD MCU comes up in RO 2) EC comes up in RO 3) EC sysjumps to RW 4) EC tells PD MCU to sysjump to RW 5) EC tells PD MCU to don’t sysjump until a reboot 6) EC reboots keeping AP off 7) "hostevent set 0x4000" is entered at the EC console. (requesting recovery mode) 8) Power button is pressed and released 9) EC tells PD MCU to perform a cold reboot 10) EC is then confused that PD MCU is in RO, when the EC saw that the PD MCU was in RW. 11) Assuming the PD MCU has crashed, the EC panics and reboots, clearing the events. 12) DUT boots into normal mode because it has forgotten all about the recovery request. Now, when I say "EC tells PD MCU" it's really just relaying the message from the AP. What seems to be happening is that when the cold_reset is released, the AP boots far enough into normal mode that it sends a message for the PD MCU to sysjump to it's RW image. When the EC reboots keeping the AP held it reset, it's too late. At this point, the PD MCU is already in RW. When the AP begins booting to recovery, it sees that the PD MCU is in RW and tells it to reboot itself to RO. Now that the EC sees the PD MCU in RO when it was in RW, it thinks the PD MCU has crashed and the EC panics and resets the system. At this point, the recovery request is forgotten and the DUT boots into normal mode. Now, I was only able to reproduce this on servo v2 by inserting the delay. I also noticed that asserting warm_reset does not work as the code assumes. For samus and chell, asserting warm_reset only triggers and AP reset instead of *holding* the AP in reset. (I've filed bug 621997 for that). If warm_reset worked as expected, then there would be no issue. Unfortunately, this causes a race from the release of the cold_reset to issuing the command "reboot ap-off". If the AP boots far enough into normal that it tells the PD MCU to jump to its RW image, this failure will occur. Furthermore, I've noticed a delay on servo v3 in opening the PTY to enter the "reboot ap-off" command. I even once observed a delay of 10 seconds! This could be due to a number of reasons, one possibly being the speed of the beaglebone and its current load. Regardless, one thing I would like to try after lunch is opening the PTY before releasing the cold_reset. I'm hoping that if we already have the PTY opened, sending the command to the EC will be quick and more importantly, before the AP boots up far enough.
,
Jun 23 2016
Alright, I've solved the race condition and uploaded a CL to fix the power state driver here: https://chromium-review.googlesource.com/#/c/355580/ I've tested on both samus and chell for both servo v2 and v3 and I can reliably boot into recovery mode. For platforms with PD MCUs and a warm_reset that actually holds the AP in reset, they should use the cros_ec_softrec_power driver instead of the cros_ec_pd_softrec_power driver. The latter is now reserved for platforms where warm_reset doesn't actually hold the AP in reset. I'm also going to merge the chell issue into this issue.
,
Jun 23 2016
Issue 617832 has been merged into this issue.
,
Aug 3 2016
Issue 634032 has been merged into this issue.
,
Aug 9 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/2678a988493de095861e0edd7966a6a38983f6f7 commit 2678a988493de095861e0edd7966a6a38983f6f7 Author: Aseda Aboagye <aaboagye@google.com> Date: Thu Jun 23 01:34:08 2016 cros_ec_pd_softrec_power: Remove race condition. In order to enter recovery mode, _power_on_bytype() relied upon warm_reset keeping the AP in reset. However, it was found that for some platforms (samus, chell), warm_reset did not actually hold the AP in reset. This caused a race condition where if the AP booted up far enough in normal mode, it would send a message to the PD MCU to jump to its RW image. This would happen before the EC had a chance to reboot and hold the AP in reset. Failing to do so, when the request to boot into recovery mode arrived, the AP would notice that the PD MCU was not in its RO image. It would then send a message to the PD MCU to hard reboot so that it would boot in its RO image. However, the EC would notice that the PD MCU was now in its RO image when it was previously in its RW image. This caused the EC to panic because it assumed that the PD MCU had crashed. When the EC panicked, it reset the entire system and the recovery request was forgotten. In order to reliably enter recovery mode, _power_on_bytype() now does the following: - Hold PD MCU in reset. - Assert and deassert cold_reset. - Reboot the EC and have the EC now hold the AP in reset. - Release PD MCU reset. - Enter recovery mode. - Power down AP. - Enter recovery mode again. By holding the PD MCU in reset until we've rebooted the EC keeping the AP in reset, the PD MCU will never get the message to jump to its RW image. A side-effect of this however, is that in our first recovery boot, the recovery reason will be that software sync failed to check the version. That's why we perform two recovery boots such that the recovery reason will be "button press" once the function completes. For platforms, regardless if they have a PD MCU or not, that have a warm_reset control that actually holds the AP in reset, they should use the cros_ec_softrec_power driver instead of this one. BUG= chromium:606062 BUG= chromium:617832 BRANCH=None TEST=platform_ServoPowerStateController passed for chell & samus on both servo v2 and v3. CQ-DEPEND=CL:355505 Change-Id: I4a577ec788a54c8efffa72450586ee95e194251e Signed-off-by: Aseda Aboagye <aaboagye@google.com> Reviewed-on: https://chromium-review.googlesource.com/355580 Commit-Ready: Aseda Aboagye <aaboagye@chromium.org> Tested-by: Aseda Aboagye <aaboagye@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_glados_overlay.xml [modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_samus_overlay.xml [modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/drv/cros_ec_pd_softrec_power.py
,
Aug 9 2016
Will need to verify in the lab whenever an update to the beaglebones is offered.
,
Aug 29 2016
,
Oct 7 2016
,
Oct 31 2016
Inspected a repair job for a samus in the lab with a servo OS image of 8866.0.0. power_state:rec was able to boot the DUT off of USB. |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by aaboagye@chromium.org
, Apr 22 2016