New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 606062 link

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

power_state:rec fails for samus on servo V3

Reported by jrbarnette@chromium.org, Apr 22 2016

Issue description

Looking at the servo logs on servo V3 devices configured
for samus, you see the following error popping up from time
to time:
    2016-04-22 22:17:41,732 - Servod - ERROR - Getting usbpd_ec3po_interp_connect: 'NoneType' object has no attribute 'get_interp_connect'
    2016-04-22 22:17:41,734 - Servod - ERROR - Problem initializing usbpd_ec3po_interp_connect -> on :: 'NoneType' object has no attribute 'get_interp_connect'

In particular, this problem occurs when trying to force samus into
recovery mode via `dut-control power_state:rec`.

The failures in setting recovery mode are serious, because they
cause servo repair on samus to fail.

 
Cc: waihong@chromium.org
These messages are generated because servo v3 does not currently support USB PD.

I'm not sure what recovery is trying to do here with the PD MCU. +waihong@ because I think he was working on servo V3 + usb pd.

Let me take a closer look at what the recovery is trying to do.
Cc: englab-sys-cros@google.com
Samus uses this power_state driver:
    servo/drv/cros_ec_pd_softrec_power.py

That overrides the parent class _cold_reset to
include this call:
    self._interface.set('usbpd_reset', 'on')

The parent class is here:
    servo/drv/cros_ec_softrec_power.py

That class implements power_state:rec using call call
to _cold_reset().

If there's no usbpd_reset control on servo V3, that would
explain the failure.

There is a usbpd_reset control that servo_v3 should be importing. I should have been more specific. I don't believe that there is support for the USB PD *UART* for servo v3.

Taking a step back, is this actually causing a failure though? Those messages would be printed right when servod starts up. It's trying to initialize the USB PD UART to be connected. But, like I mentioned there's no USB PD UART available for servo v3.

Currently, I'm attempting to image a servo v3 of my own and will test with a samus.
> Taking a step back, is this actually causing a failure
> though? Those messages would be printed right when
> servod starts up. It's trying to initialize the USB PD
> UART to be connected. But, like I mentioned there's no
> USB PD UART available for servo v3.

The error does occur at servod startup:
    2016-04-22 22:17:41,725 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
    2016-04-22 22:17:41,732 - Servod - ERROR - Getting usbpd_ec3po_interp_connect: 'NoneType' object has no attribute 'get_interp_connect'
    2016-04-22 22:17:41,734 - Servod - ERROR - Problem initializing usbpd_ec3po_interp_connect -> on :: 'NoneType' object has no attribute 'get_interp_connect'
    2016-04-22 22:17:41,735 - Servod - INFO - Initialized usbpd_ec3po_interp_connect to on
    2016-04-22 22:17:41,737 - servod - INFO - Listening on 0.0.0.0 port 9999


Additionally, the message re-occurs periodically afterward.
In particular, the message re-occurs after a failure to enter
recovery mode.

+nsanders@

There are a couple issues here. I was talking to nsanders@ and he mentioned that there actually is servo v3 PD MCU UART support, but it requires a reworked servo. I was skipping initializing the ec3po consoles for the PD MCU because there wasn't support for them. But, it sounds like that's no longer the case. I can add the init then.

But, more importantly. Why does the PD MCU UART need to be involved in the recovery process?
Cc: nsanders@chromium.org
As in comment 3, it looks like servo_samus_overlay.xml depends on pd support for it's power_state interface.
Re comment #9:  I've been searching the code, and I can't find
the connection (in software) to explain how we get from
the power_state control for samus to the missing usbpd_ec3po_interp_connect
control.

Here's the evidence I've got:
  * First, the samus overlay definitely asks for the
    usbpd_ec3po_interp_connect control:  that control
    appears in usbpd.xml, and the samus overlay includes
    that file.
  * Second, the control definitely fails to initialize;
    the log messages are making that clear.
  * Third, samus clearly fails to enter recovery mode;
    when servo repair tries to run "chromeos-install",
    it gets the classic error message:
        Error: Source does not look like a removable device: /dev/sda
    That error occurs because the DUT powered on in normal
    mode, rather than booting from USB in recovery mode.
  * Fourth, the error complaint from the missing usbpd_ec3po_interp_connect
    control recurs as a consequence of the same process that
    that fails to get the DUT into recovery mode.

It's not proven that "missing usbpd_ec3po_interp_connect" and
"fails to get into recovery mode" are actually the same bug, but
I'm having a hard time believing that there are two bugs, both
triggered by repair, but only one of which matters.

Following up, here's the full code used for power_state:rec on
samus:

  def _power_on_bytype(self, rec_mode, rec_type='rec'):
    if rec_mode == self.REC_ON:
      # Hold warm reset so the AP doesn't boot when EC reboots.
      self._interface.set('warm_reset', 'on')
      # Reset the EC to force it back into RO code; this clears
      # the EC_IN_RW signal, so the system CPU will trust the
      # upcoming recovery mode request.
      self._cold_reset()
      # Restart the EC, but leave the system CPU off...
      self._interface.set('ec_uart_regexp', 'None')
      self._interface.set('ec_uart_cmd', 'reboot ap-off')
      time.sleep(self._reset_recovery_time)
      # Now the EC keeps the AP off. Release warm reset before powering
      # on the AP.
      self._interface.set('warm_reset', 'off')
      # ... and tell the EC to tell the CPU we're in recovery mode.
      self._interface.set('ec_uart_cmd',
                          self._HOSTEVENT_RECOVERY_CMD[rec_type])
      time.sleep(self._RECOVERY_DETECTION_DELAY)
    self._power_on_ap()

  def _cold_reset(self):
    self._interface.set('cold_reset', 'on')
    self._interface.set('usbpd_reset', 'on')
    
    time.sleep(self._reset_hold_time)

    self._interface.set('usbpd_reset', 'off')
    self._interface.set('cold_reset', 'off')
    # After the reset, give the EC the time it needs to
    # re-initialize.
    time.sleep(self._reset_recovery_time)

  def _power_on_ap(self):
    self._interface.power_short_press()

I haven't found the connection either. The only thing I can think of is maybe the interface indices are wrong, and we're trying to send EC commands to the PD MCU when we should be sending them to the main EC. But even then, I would expect different error messages.

- I downloaded 8212.0.0 beaglebone_servo official test image and loaded it on a servo v3 that I have here at my desk.
- started servod by, "servod -b samus"
- in another terminal, I entered "dut-control power_state:rec".
- I see an INFO message of "Pressing power button for 0.200 seconds." But, no additional error messages like you mentioned. 

Additionally, the message complains about the usbpd_ec3po_interp_connect control. That should only be used in the initialization of the interface and the flash_ec script. But in the flashing case, it would be trying to *set* the control's value, not *get*. So it seems to me that servod is trying to initialize the interface again?
I think I have one hypothesis as to why the "Problem initializing usbpd_ec3po_interp_connect -> on ::" messages pop up.

Like I mentioned in Comment #12, I would expect this message to appear when initializing the interface. It seems that FAFT, as part of the FirmwareTest.initialize method calls the FAFTBase.initialize method. That ends up calling a Servo.initialize_dut method, which reinitializes all the controls again as a part of hwinit(). Since the verbose flag is not set, only errors are printed, which yields our error string here.

So maybe after the repair job fails, another FAFT test is started that calls the initialize() method.

Therefore, I don't think that the error messages are reported as a result of the recovery failure, but instead as a result of starting a new test. Especially because the PD MCU UART shouldn't be involved with going into recovery mode at all. Setting the reset lines would be done via a GPIO driver and not via a console command over the UART. Furthermore, the PD MCU (usbpd) doesn't have the ability to reset the AP itself. So issuing the "reboot ap-off" console command doesn't make sense for that MCU. Along the same lines, the PD MCU cannot issue a host event directly to the AP, so it must pass through the EC. This is done via the "pdevent" console command. For power_state:rec, the command issued is "hostevent  set 0x4000", again which doesn't make sense for the PD MCU.
Summary: power_state:rec fails for samus on servo V3 (was: Servo V3 errors on samus)
FAFT is irrelevant to this failure:  No FAFT code ran during
the interval in question.  However, Servo.initialize_dut()
is called every time we create a Host with a servo, which is
a necessary pre-condition to the repair operation that fails.

So, it sounds like the errors from usbpd_ec3po_interp_connect
aren't causing the power_state:rec failures.  That said, the
message shouldn't be happening.  I've filed  bug 606107  for that
problem.

We still need to explain why power_state:rec is failing for samus.
AFAIK, the feature was tested to work on samus at some time in
the past, so we've had a regression.

Ah okay, that makes sense then.

I'll try to come up with a way to hide those messages. It's not super straightforward as the way interfaces are defined for the different servos and overlays are shared is a bit messy.

I also agree that we should figure out the reason why power_state:rec is failing. When I tried it on my desk, My device just rebooted a few times, but didn't actually enter recovery mode. 
Setting usbpd_reset to ON/OFF is achieved through a dedicated GPIO pin. It is not through the PD UART. So it should work fine on Servo v3.

The problem happened on setting the initial value of usbpd_ec3po_interp_connect to "on", like the time of calling Servo.initialize_dut(). Check the usbpd.xml, init="on":

  <control>
    <name>usbpd_ec3po_interp_connect</name>
    <doc>State indicating if interpreter is listening to the EC UART.</doc>
    <params interface="9" drv="ec3po_driver" map="onoff" init="on"
    subtype="interp_connect"></params>
  </control>


However, it found the _interface is None and caused the error, like the error messages said:

  def _Set_interp_connect(self, state):
    """Set the interpreter's connection state to the UART.

    Args:
      state: A boolean indicating whether to connect or disconnect the
        intepreter from the UART.
    """
    self._interface.set_interp_connect(state)


Check the _init_ec3po_uart() method which initializes the _interface:

    elif ((vid, pid) in servo_interfaces.SERVO_V2_DEFAULTS or
        (vid, pid) in servo_interfaces.SERVO_V3_DEFAULTS):
      # Determine if it's a PD interface or just main EC console.
      if interface['index'] == servo_interfaces.EC3PO_USBPD_INTERFACE_NUM:
        try:
          # Obtain the raw EC UART PTY and create the EC-3PO interface.
          raw_ec_uart = self.get('raw_usbpd_uart_pty')
        except NameError:
          # This overlay doesn't have a USB PD MCU, so skip init.
          self._logger.info('No PD MCU UART.')
          return None
        except AttributeError:
          # This overlay has no get method for the interface so skip init.
          self._logger.warn('No interface for PD MCU UART.')
          self._logger.warn('Usually, this happens because the interface is set'
                            ' incorrectly.  If you\'re overriding an existing'
                            ' interface, be sure to update the interface lists'
                            ' for your board at the end of '
                            'servo/servo_interfaces.py')
          return None


As getting raw_usbpd_uart_pty fails (no PD UART on Servo v3), it makes the _interface to be a None.
I think I found the root cause to this issue. Although, if this issue is about the failure for the DUT to reach the recovery screen at all, it's not the issue I root caused then. See https://bugs.chromium.org/p/chromium/issues/detail?id=617832#c6 for my explanation.
> I think I found the root cause to this issue.

Ah, that helps.

I was just about to poke this issue:  This problem has been
going on for too long, and I'd really like to see it get some
attention.

I have a samus device from the lab, and I've confirmed that it
fails the simple test case of `dut-control power_state:rec`.

We'll arrange delivery offline.

Posting an update.

Last night, I ran 50 iterations of the Servo_PowerStateController_USBPluggedin test. 46/50 of them passed and all the failures were due to power_state:on not actually turning the DUT on, but really it was the Ethernet not being able to connect. I visually confirmed this on 2 of the failures.

The version of Chrome OS that the DUT was running was R53-8444.0.0 samus-cheets dev-channel test image. I wanted to then run many iterations of the firmware_RecoveryButton test, however the XMLRPCServer couldn't start on the image installed. The usual fix for that is to install a different image. I then attempted to install the latest dev-channel test image and reproduced the issue where the device comes up in normal mode instead of recovery. I even tried changing USB flash drives and got the same result. I *did* notice a pattern however:

0) Trigger recovery mode by 3 finger salute (esc + refresh + power)
1) Flash USB flash drive using cros flash
2) Insert flash drive into DUT
3) Device comes up in normal mode
4) Power down DUT and remove USB
5) Trigger recovery mode again via 3 finger salute
6) Insert USB
7) DUT displays message that "Device inserted does not contain Chrome OS"

Repeating steps 4-6 at this point always yields step 7. I even tried changing the USB flash drive itself, but achieved the same result.

I examined the logs and did not see an EC or PD MCU reset...

I then decided to zero out the flash drive using dd and tried cros flash again. The DUT managed to actually boot into recovery mode. I installed the latest samus-cheets dev-channel image, R53-8466.0.0, with which firmware_RecoveryButton could actually run.

Over the weekend, I'll be running a stress test of the firmware_RecoveryButton test which uses power_state:rec to enter recovery mode. I'm doing the same test with a chell as well which is exhibiting similar symptoms.

Note that in order to log the PD MCU UART output, I am using a servo v2. However, the steps taken in power_state:rec are the same for servo v2 and v3.

My "guess" is that now that I have the DUT able to boot off of the recovery media, it will have no trouble doing so as long as the media doesn't change. But, let's see what actually happens.
I don't understand the point of the stress test.  The behavior
I've been seeing is 100% reproducible:  If you say "dut-control
power_state:rec" with the USB stick unplugged, the DUT boots
immediately in normal mode; it never goes to the firmware screen.
Are you not reproducing that problem consistently?  Are you running
a different test case?

Oh, I thought the failure was booting into normal mode with the USB stick plugged in. But no, I have not seen the failure of the DUT *not* coming to the firmware screen. It's always gone to the firmware screen.
The behavior I saw was this:
    Unplug all the USB storage.
    Run 'dut-control power_state:rec'
After this sequence, the DUT boots from internal storage to
Chrome OS in normal mode.

The problem is 100% reproducible for the two devices I
delivered.  If you're not seeing it, you've got something
different in your environment.  The most likely difference
is Servo V2 vs. Servo V3, but I'm willing to entertain other
theories.

Okay, I tried with a servo v3 and reproduced the issue for both devices. 
I've noticed something interesting. If I open the EC console using minicom on the servo v3 itself, the device is significantly more likely to enter recovery mode. Or, if I enable ec_uart_capture:on, it's also more likely to enter recovery mode.

Those facts along with the fact that it doesn't occur with servo v2 is leading me to believe it might be a host issue. Servo v3 is pretty slow in comparison. I wonder if the executions of the resets aren't taking place like we imagine it.

Still need to continue debug.

Another day, another update. I know I've said this before, but I feel like I've now got a solid theory.

I can now reliably reproduce the behaviour for both samus and chell on servo v2. I did this by modifying the _cold_reset() method in cros_ec_pd_softrec_power.py by inserting a delay after the release of the cold_reset.

Here are the events that happen:
0)  Release the cold_reset and usbpd_reset
1)  PD MCU comes up in RO
2)  EC comes up in RO
3)  EC sysjumps to RW
4)  EC tells PD MCU to sysjump to RW
5)  EC tells PD MCU to don’t sysjump until a reboot
6)  EC reboots keeping AP off
7)  "hostevent set 0x4000" is entered at the EC console. (requesting recovery mode)
8)  Power button is pressed and released
9)  EC tells PD MCU to perform a cold reboot
10) EC is then confused that PD MCU is in RO, when the EC saw that the PD MCU was in RW.
11) Assuming the PD MCU has crashed, the EC panics and reboots, clearing the events.
12) DUT boots into normal mode because it has forgotten all about the recovery request.

Now, when I say "EC tells PD MCU" it's really just relaying the message from the AP. What seems to be happening is that when the cold_reset is released, the AP boots far enough into normal mode that it sends a message for the PD MCU to sysjump to it's RW image. When the EC reboots keeping the AP held it reset, it's too late. At this point, the PD MCU is already in RW. When the AP begins booting to recovery, it sees that the PD MCU is in RW and tells it to reboot itself to RO. Now that the EC sees the PD MCU in RO when it was in RW, it thinks the PD MCU has crashed and the EC panics and resets the system. At this point, the recovery request is forgotten and the DUT boots into normal mode.

Now, I was only able to reproduce this on servo v2 by inserting the delay. I also noticed that asserting warm_reset does not work as the code assumes. For samus and chell, asserting warm_reset only triggers and AP reset instead of *holding* the AP in reset. (I've filed bug 621997 for that). If warm_reset worked as expected, then there would be no issue.

Unfortunately, this causes a race from the release of the cold_reset to issuing the command "reboot ap-off". If the AP boots far enough into normal that it tells the PD MCU to jump to its RW image, this failure will occur.

Furthermore, I've noticed a delay on servo v3 in opening the PTY to enter the "reboot ap-off" command. I even once observed a delay of 10 seconds! This could be due to a number of reasons, one possibly being the speed of the beaglebone and its current load.

Regardless, one thing I would like to try after lunch is opening the PTY before releasing the cold_reset. I'm hoping that if we already have the PTY opened, sending the command to the EC will be quick and more importantly, before the AP boots up far enough.


Status: Started (was: Assigned)
Alright, I've solved the race condition and uploaded a CL to fix the power state driver here: https://chromium-review.googlesource.com/#/c/355580/

I've tested on both samus and chell for both servo v2 and v3 and I can reliably boot into recovery mode.

For platforms with PD MCUs and a warm_reset that actually holds the AP in reset, they should use the cros_ec_softrec_power driver instead of the cros_ec_pd_softrec_power driver. The latter is now reserved for platforms where warm_reset doesn't actually hold the AP in reset.

I'm also going to merge the chell issue into this issue.
Cc: tbroch@chromium.org aaboagye@chromium.org sha...@chromium.org
 Issue 617832  has been merged into this issue.
Cc: bmahadev@chromium.org tienchang@chromium.org krisr@chromium.org sbasi@chromium.org jongkyuj@chromium.org jrbarnette@chromium.org
 Issue 634032  has been merged into this issue.
Project Member

Comment 30 by bugdroid1@chromium.org, Aug 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/2678a988493de095861e0edd7966a6a38983f6f7

commit 2678a988493de095861e0edd7966a6a38983f6f7
Author: Aseda Aboagye <aaboagye@google.com>
Date: Thu Jun 23 01:34:08 2016

cros_ec_pd_softrec_power: Remove race condition.

In order to enter recovery mode, _power_on_bytype() relied upon
warm_reset keeping the AP in reset.  However, it was found that for some
platforms (samus, chell), warm_reset did not actually hold the AP in
reset.  This caused a race condition where if the AP booted up far
enough in normal mode, it would send a message to the PD MCU to jump to
its RW image.  This would happen before the EC had a chance to reboot
and hold the AP in reset. Failing to do so, when the request to boot
into recovery mode arrived, the AP would notice that the PD MCU was not
in its RO image. It would then send a message to the PD MCU to hard
reboot so that it would boot in its RO image.  However, the EC would
notice that the PD MCU was now in its RO image when it was previously in
its RW image.  This caused the EC to panic because it assumed that the
PD MCU had crashed.  When the EC panicked, it reset the entire system
and the recovery request was forgotten.

In order to reliably enter recovery mode, _power_on_bytype() now does the
following:

 - Hold PD MCU in reset.
 - Assert and deassert cold_reset.
 - Reboot the EC and have the EC now hold the AP in reset.
 - Release PD MCU reset.
 - Enter recovery mode.
 - Power down AP.
 - Enter recovery mode again.

By holding the PD MCU in reset until we've rebooted the EC keeping the
AP in reset, the PD MCU will never get the message to jump to its RW
image.  A side-effect of this however, is that in our first recovery
boot, the recovery reason will be that software sync failed to check the
version.  That's why we perform two recovery boots such that the
recovery reason will be "button press" once the function completes.

For platforms, regardless if they have a PD MCU or not, that have a
warm_reset control that actually holds the AP in reset, they should use
the cros_ec_softrec_power driver instead of this one.

BUG= chromium:606062 
BUG= chromium:617832 
BRANCH=None
TEST=platform_ServoPowerStateController passed for chell & samus on both
servo v2 and v3.

CQ-DEPEND=CL:355505

Change-Id: I4a577ec788a54c8efffa72450586ee95e194251e
Signed-off-by: Aseda Aboagye <aaboagye@google.com>
Reviewed-on: https://chromium-review.googlesource.com/355580
Commit-Ready: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aseda Aboagye <aaboagye@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_glados_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_samus_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/drv/cros_ec_pd_softrec_power.py

Status: Fixed (was: Started)
Will need to verify in the lab whenever an update to the beaglebones is offered.
Labels: VerifyIn-54
Labels: VerifyIn-55
Status: Verified (was: Fixed)
Inspected a repair job for a samus in the lab with a servo OS image of 8866.0.0. power_state:rec was able to boot the DUT off of USB.

Sign in to add a comment