New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 617832 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 606062
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

power_state:rec fails for chell with servo V3

Reported by jrbarnette@chromium.org, Jun 7 2016

Issue description

I recently tried to force re-install from USB for 7 different
chell devices.  They all failed with the symptom below:

2016-06-06 16:48:17 | INFO       | Installing image through chromeos-install.
2016-06-06 16:48:17 | DEBUG      | Running (ssh) 'chromeos-install --yes'
2016-06-06 16:48:18 | ERROR      | [stderr] Error: Source does not look like a removable device: /dev/mmcblk0

This means that when we powered on with 'power_state:rec', the
chell DUT booted off of internal storage, rather than from USB in
recovery mode.

This feature was reported working, at least for EVT (I'm having
trouble seeing that it worked for DVT), so this is a regression.

A fix here is urgent:  This feature is mandatory if we're to support
the DUTs in the lab.  In particular, there's a scheduled deployment
that's likely to break because of this bug.

 
Owner: aaboagye@chromium.org
Status: Started (was: Available)
Okay, I'll start to look into it.
jrbarnette@, do you happen to have firmware versions of the chells that this fails from? I grabbed a chell from a coworker and using a servo v3 on my desk with the 8368.0.0 image, it was able to enter recovery mode. The USB image I have on my stick must be signed with different keys, because I get the error stating that the image inserted does not contain Chrome OS. But, I WAS able to enter recovery mode.

My chell version info:
localhost ~ # cat /etc/lsb-release 
CHROMEOS_RELEASE_APPID={114DBB0F-4507-F7FC-99B6-4BCAB6A39725}
CHROMEOS_BOARD_APPID={114DBB0F-4507-F7FC-99B6-4BCAB6A39725}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_RELEASE_BOARD=chell
CHROMEOS_DEVSERVER=
GOOGLE_RELEASE=8249.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=8249
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=52
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=8249.0.0 (Official Build) dev-channel chell test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_VERSION=8249.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2

It has a dirty EC image on there though. Tomorrow, I can try to get official images on there.
localhost ~ # ectool version
RO version:    chell_v1.1.4624-e9aa7c0-dirty
RW version:    chell_v1.1.4624-e9aa7c0-dirty
Firmware copy: RO
Build info:    chell_v1.1.4624-e9aa7c0-dirty 2016-05-10 14:22:40 shawnn@shawnn.mtv.corp.google.com
localhost ~ #   


Some bios info
localhost ~ # crossystem | grep fw
ecfw_act               = RO                             # Active EC firmware
fwb_tries              = 0                              # Try firmware B count (writable)
fw_vboot2              = 1                              # 1 if firmware was selected by vboot2 or 0 otherwise
fwid                   = Google_Chell.7820.64.0         # Active firmware ID
fwupdate_tries         = 0                              # Times to try OS firmware update (writable, inside kern_nv)
fw_tried               = A                              # Firmware tried this boot (vboot2)
fw_try_count           = 0                              # Number of times to try fw_try_next (writable)
fw_try_next            = A                              # Firmware to try next (vboot2,writable)
fw_result              = success                        # Firmware result this boot (vboot2,writable)
fw_prev_tried          = A                              # Firmware tried on previous boot (vboot2)
fw_prev_result         = unknown                        # Firmware result of previous boot (vboot2)
mainfw_act             = A                              # Active main firmware
mainfw_type            = developer                      # Active main firmware type
ro_fwid                = Google_Chell.7820.64.0         # Read-only firmware ID
tpm_fwver              = 0x00010001                     # Firmware version stored in TPM
tried_fwb              = 0                              # Tried firmware B before A this boot
OK, I went back to one of the 7 chell DUTs that originally failed.
Here are the bits from crossystem of interest:

$ ssh chromeos4-row12-rack10-host1 crossystem | grep fw
ecfw_act               = RW                             # Active EC firmware
fwb_tries              = 0                              # Try firmware B count (writable)
fw_vboot2              = 1                              # 1 if firmware was selected by vboot2 or 0 otherwise
fwid                   = Google_Chell.7820.98.0         # Active firmware ID
fwupdate_tries         = 0                              # Times to try OS firmware update (writable, inside kern_nv)
fw_tried               = A                              # Firmware tried this boot (vboot2)
fw_try_count           = 0                              # Number of times to try fw_try_next (writable)
fw_try_next            = A                              # Firmware to try next (vboot2,writable)
fw_result              = success                        # Firmware result this boot (vboot2,writable)
fw_prev_tried          = A                              # Firmware tried on previous boot (vboot2)
fw_prev_result         = unknown                        # Firmware result of previous boot (vboot2)
mainfw_act             = A                              # Active main firmware
mainfw_type            = normal                         # Active main firmware type
ro_fwid                = Google_Chell.7820.98.0         # Read-only firmware ID
tpm_fwver              = 0x00010001                     # Firmware version stored in TPM
tried_fwb              = 0                              # Tried firmware B before A this boot

Also hwid:
hwid                   = CHELL B25-A2C-34Z              # Hardware ID

okay, I've updated my chell with official images now and I am still able to enter recovery mode and boot off of the USB. I've been told that this chell is a DVT2 unit.

Here's the current version information.

localhost ~ # ectool version
RO version:    chell_v1.9.175-0464cfe
RW version:    chell_v1.9.175-0464cfe
Firmware copy: RW
Build info:    chell_v1.9.175-0464cfe 2016-04-25 02:33:55 @build123-m2

localhost ~ # crossystem | grep fw
ecfw_act               = RW                             # Active EC firmware
fwb_tries              = 0                              # Try firmware B count (writable)
fw_vboot2              = 1                              # 1 if firmware was selected by vboot2 or 0 otherwise
fwid                   = Google_Chell.7820.98.0         # Active firmware ID
fwupdate_tries         = 0                              # Times to try OS firmware update (writable, inside kern_nv)
fw_tried               = A                              # Firmware tried this boot (vboot2)
fw_try_count           = 0                              # Number of times to try fw_try_next (writable)
fw_try_next            = A                              # Firmware to try next (vboot2,writable)
fw_result              = success                        # Firmware result this boot (vboot2,writable)
fw_prev_tried          = A                              # Firmware tried on previous boot (vboot2)
fw_prev_result         = success                        # Firmware result of previous boot (vboot2)
mainfw_act             = A                              # Active main firmware
mainfw_type            = normal                         # Active main firmware type
ro_fwid                = Google_Chell.7820.98.0         # Read-only firmware ID
tpm_fwver              = 0x00010001                     # Firmware version stored in TPM
tried_fwb              = 0                              # Tried firmware B before A this boot
localhost ~ #


I'll try to see if I can poke around for logs on that DUT. 
Couldn't find the logs for that specific failure, but I was running more experiments on my side yesterday. That particular device mentioned was locked for investigating some other issue.

I managed to reproduce it once on a chell in the lab and not since. I'm going to continue to attempt to reproduce it.
I think I found the root cause! I'm also thinking that  bug 606062  is the same issue, but possibly not if the device fails to reach the recovery screens at all. 

Anyways, I found a samus that would consistently fail the test firmware_UserRequestRecovery, this DUT was chromeos1-row1-rack11-host4. Luckily I had access to this lab so I could visually inspect what was going on. I was able to confirm that the device *was* in fact reaching the recovery screen.

Here's what I believed happened:

- User issues request to enter recovery mode (either through power_state:rec, which uses the EC console, or crossystem to set the recovery_request byte in NV storage)
- User the issues a reset to the device (warm reset or rebooting the EC)
- DUT reaches the recovery screen to insert recovery media
- Recovery media, the USB flash drive, is inserted.
- The image on the USB flash drive is actually BAD and immediately causes the DUT to reset which then boots up into normal mode.
- The test checks the mode and see that it's either developer or normal mode and the recovery_reason byte is set to 0, indicating that it's not in recovery mode.

So I believe the reason we were seeing this was due to a bad image on the USB attached to the servo v3. I confirmed this by swapping the USB flash drives and saw proper behaviour with the DUT in the lab, and the same behaviour with the DUT on my desk.

My suggestion would be to reflash the image on the USB flash drives and try again. Also, if I'm correct, I wonder how we got in this state. I think that an action item would be to verify the image on the USB stick once we reimage it. That could be by trying to boot into recovery mode  off of this stick and verifying that we actually do.

A potential confirmation could be the following on each DUT that's already in a good state:

- set the USB mux selects to "servo_sees_usbkey"
- set "power_state:off", then "power_state:rec"
- wait 1 min and ping the DUT. If no answer, the DUT should be at the recovery screen.
This should confirm that the DUT can reach the recovery screen. If the DUT comes up to normal mode however, then that's a separate issue.

- Reset the DUT (power_state:off then power_state:on)
- Wait for the DUT to come up.
- Run the firmware_UserRequestRecovery test.
- If it fails with following signature, then it's the same issue and the image on the USB flash drive may be bad.

16:56:48 INFO | autoserv| -[FAFT]-[ end wait_for_client ]-----
16:56:48 INFO | autoserv| Expected recovery boot, request recovery again.
16:56:48 INFO | autoserv| -[FAFT]-[ start stepstate_checker ]----------
16:56:48 INFO | autoserv| calling <bound method FAFTCheckers.crossystem_checker of <autotest_lib.server.cros.faft.utils.faft_checkers.FAFTCheckers object at 0x7fd4a9bd5e50>> with args ({'recovery_reason': '193', 'mainfw_type': 'recovery'}
,)
16:56:48 INFO | autoserv| Expected 'recovery_reason' value '193' but got '0'
16:56:48 INFO | autoserv| Expected 'mainfw_type' value 'recovery' but got 'normal'
16:56:48 INFO | autoserv| Autotest caught exception when running test:
16:56:48 INFO | autoserv| Traceback (most recent call last):
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 600, in _exec
16:56:48 INFO | autoserv| _call_test_function(self.execute, *p_args, **p_dargs)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 804, in _call_test_function
16:56:48 INFO | autoserv| return func(*args, **dargs)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 461, in execute
16:56:48 INFO | autoserv| dargs)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
16:56:48 INFO | autoserv| postprocess_profiled_run, args, dargs)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 376, in _call_run_once
16:56:48 INFO | autoserv| self.run_once(*args, **dargs)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/server/site_tests/firmware_UserRequestRecovery/firmware_UserRequestRecovery.py", line 58, in run_once
16:56:48 INFO | autoserv| 'recovery_reason' : vboot.RECOVERY_REASON['US_TEST'],
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/server/cros/faft/firmware_test.py", line 1158, in check_state
16:56:48 INFO | autoserv| self._call_action(func, check_status=True)
16:56:48 INFO | autoserv| File "/build/samus/usr/local/build/autotest/server/cros/faft/firmware_test.py", line 1085, in _call_action
16:56:48 INFO | autoserv| (error_msg, info_msg, str(ret)))
16:56:48 INFO | autoserv| TestFail: Not succeed: calling <bound method FAFTCheckers.crossystem_checker of <autotest_lib.server.cros.faft.utils.faft_checkers.FAFTCheckers object at 0x7fd4a9bd5e50>> with args ({'recovery_reason': '193', 'ma
infw_type': 'recovery'},) returning False
> So I believe the reason we were seeing this was due to a
> bad image on the USB attached to the servo v3. I confirmed
> this by swapping the USB flash drives and saw proper
> behaviour with the DUT in the lab, and the same behaviour
> with the DUT on my desk.

In both this case (for chell) and in the other (for samus), the
failed test case included flashing a known good software image
onto the USB stick immediately before booting in recovery mode.

In the various failures, I didn't study in depth what happened during
the flashing process, and I didn't check the validity of the image
afterward, so I can't rule out some sort of silent failure.  However,
there are no known bugs causing such silent failures, and any bug
would be affecting all boards, not just samus and chell.

Yes, the fact that it has only seemed to effect samus and chell is quite interesting... but a silent failure and the result of swapping the USB sticks makes sense.

I'm tempted to add some verification of the image and then update the USB sticks, preemptively close it, then re-open if the issue persists.

At the very least, the image verification would be an improvement to what exists today and would rule out that failure reason.
> Yes, the fact that it has only seemed to effect samus and
> chell is quite interesting... but a silent failure and
> the result of swapping the USB sticks makes sense.

The silent failure theory "makes sense" only in that some of
the observed symptoms are consistent with that event.
However, the overall evidence is against that theory:
  * As noted, if there were a problem with silent corruption,
    it wouldn't be restricted to just two boards, and it wouldn't
    show consistently on every DUT of that board.  Yet, that's
    the symptom I'm seeing.
  * I just logged in to one of the chell servos that previously
    failed with this symptom, and looked at the USB stick.
    The USB stick shows no corruption: there's a valid GPT with
    mountable file systems in all the right places.

*I'm* not finding it consistent on every DUT. I just picked a chell at random (chromeos1-row1-rack10-host1) and ran firmware_UserRequestRecovery which passed.

Do you have a list of these devices? I'll also try to find a DVT samus after lunch to test again.
> *I'm* not finding it consistent on every DUT. I just
> picked a chell at random (chromeos1-row1-rack10-host1)
> and ran firmware_UserRequestRecovery which passed.

I haven't grokked the complete test flow, but I'm reasonably
certain that 'firmware_UserRequestRecovery' doesn't exercise
servo's "power_state:rec" control.

To prove that the problem is or isn't happening, you need to
run a test like platform_ServoPowerStateController:usb.  The
lab tools also include a command called 'repair_test' that can
exercise this feature.

That's correct, firmware_UserRequestRecovery uses crossystem to issue a recovery request.

I went back and ran firmware_RecoveryButton on the chell @ chromeos1-row1-rack10-host1 and it also passed. This test does use the power_state:rec control. I additionally ran that test against the samus on my desk as well, which also passed.


I additionally ran the platform_ServoPowerStateControler_USBPluggedIn test which  passed for the chell in the lab.


OK.  The chell DUTs used for general testing _do_ show the failure.
Here's one:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=66371289

I picked that DUT somewhat at random from available DUTs in
pool:suites.  I'd expect similar results for any other DUT in
that pool.

Components: OS>Firmware>EC
+shawnn@

Thanks for the link jrbarnette@, I've now believe I'm closer to finding the root cause for this specific failure for chell.

As shown in the logs in your test, that test fails with the following message:

TestFail: power_state:rec didn't stay at recovery screen.. DUT turns back on after it is turned off.

To get to this stage, we would have:
 - turned the DUT off (power_state:off)
 - disabled power to the usb port (prtctl4_pwren:off)
 - switch USB mux to servo (usb_mux_sel1:servo_sees_usbkey)
 - boot into recovery mode (power_state:rec)

Since the USB flash drive is "unplugged", the DUT should remain at the recovery screen.

Now, I did those exact steps and found that the DUT did indeed boot up to normal instead of staying in recovery screen. I then repeated my actions and found that it worked as intended, leading me to believe that it's flaky for some reason.

The power_state:rec control for chell uses the EC console to boot to the recovery screen. I enabled more verbose logging in EC-3PO to view the transactions on the EC console.

I found that the commands are indeed going through, but we encounter an EC panic because the PD MCU appears to be in RO when it was previously in RW and it didn't explicitly jump to RO.

Since the EC panics and reboots, the system reboots and that's why we boot into normal mode instead of remaining at the recovery screen.

Here's the output of the servod log mixed with the EC console before the panic.

2016-06-10·21:51:10,404·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'CH·pwrbtn=LOW]^M                                                                                                                                                       
[2.545770·PB·task·10·=·was-off,·wait·188757]^M                                                                                                                                                                                                
[2.546393·power·state·0·=·G3,·in·0x0030]^M                                                                                                                                                                                                    
[2.546764·power·state·5·=·G3->S5,·in·0x0030]^M                                                                                                                                                                                                
[2.558328·power·state·1·=·S5,·in·0x0030]^M                                                                                                                                                                                                    
[2.569338·SW·0x07]^M                                                                                                                                                                                                                          
[2.569676·event·set·0x00000004]^M                                                                                                                                                                                                             
[2.571152·RSMRST:·1]^M                                                                                                                                                                                                                        
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,549·-·Servod·-·DEBUG·-·servo_server.py:797:set·-·name(pwr_button)·wr_val(release)                                                                                                                                         
2016-06-10·21:51:10,553·-·Servod·-·DEBUG·-·servo_server.py:464:_get_param_drv·-·                                                                                                                                                              
2016-06-10·21:51:10,562·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'[2.735184·PB·task·10·=·was-off]^M                                                                                                                                      
[2.735682·PB·task·4·=·held,·wait·-1]^M                                                                                                                                                                                                        
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,567·-·SystemConfig·-·DEBUG·-·system_config.py:315:_lookup·-·lookup·of·map·press                                                                                                                                           
2016-06-10·21:51:10,572·-·Driver·-·DEBUG·-·tca6416.py:103:set·-·                                                                                                                                                                              
2016-06-10·21:51:10,576·-·Driver·-·DEBUG·-·tca6416.py:110:set·-·Set·to·input·because·its·io·type·is·PU                                                                                                                                        
2016-06-10·21:51:10,580·-·I2cReg·-·DEBUG·-·i2c_reg.py:94:_read_reg·-·                                                                                                                                                                         
2016-06-10·21:51:10,583·-·I2cReg·-·DEBUG·-·i2c_reg.py:166:_wr_rd·-·                                                                                                                                                                           
2016-06-10·21:51:10,589·-·I2cReg·-·DEBUG·-·i2c_reg.py:108:_write_reg·-·                                                                                                                                                                       
2016-06-10·21:51:10,593·-·I2cReg·-·DEBUG·-·i2c_reg.py:166:_wr_rd·-·                                                                                                                                                                           
2016-06-10·21:51:10,599·-·Servod·-·DEBUG·-·servo_server.py:744:get·-·name(pwr_button)                                                                                                                                                         
2016-06-10·21:51:10,603·-·Servod·-·DEBUG·-·servo_server.py:464:_get_param_drv·-·                                                                                                                                                              
2016-06-10·21:51:10,607·-·Driver·-·DEBUG·-·tca6416.py:81:get·-·                                                                                                                                                                               
2016-06-10·21:51:10,610·-·I2cReg·-·DEBUG·-·i2c_reg.py:94:_read_reg·-·                                                                                                                                                                         
2016-06-10·21:51:10,614·-·I2cReg·-·DEBUG·-·i2c_reg.py:166:_wr_rd·-·                                                                                                                                                                           
2016-06-10·21:51:10,620·-·SystemConfig·-·DEBUG·-·system_config.py:315:_lookup·-·lookup·of·map·press                                                                                                                                           
2016-06-10·21:51:10,624·-·Servod·-·DEBUG·-·servo_server.py:753:get·-·pwr_button·=·release                                                                                                                                                     
2016-06-10·21:51:10,673·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'[2.831356·KB·disable_scanning_mask·changed:·0x00000000]^M                                                                                                              
[2.831903·power·button·released]^M                                                                                                                                                                                                            
[2.832359·PB·released]^M                                                                                                                                                                                                                      
[2.832605·PB·task·6·=·released]^M                                                                                                                                                                                                             
[2.832903·PB·PCH·pwrbtn=HIGH]^M                                                                                                                                                                                                               
[2.833202·PB·task·0·=·idle,·wait·-1]^M                                                                                                                                                                                                        
[2.834096·pow'                                                                                                                                                                                                                                
2016-06-10·21:51:10,693·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'er·state·6·=·S5->S3,·in·0x003f]^M                                                                                                                                      
[2.834725·chipset·->·S3]^M                                                                                                                                                                                                                    
[2.835122·power·state·2·=·S3,·in·0x003f]^M                                                                                                                                                                                                    
[2.835514·power·state·7·=·S3->S0,·in·0x003f]^M                                                                                                                                                                                                
^M[2.836029·Port·80:·0x1001][2.836457·chipset·->·S0]^M                                                                                                                                                                                        
[2.838548·power·state·3·=·S0,·in·0x003f]^M                                                                                                                                                                                                    
[2.882477·SW·0x05]^M                                                                                                                                                                                                                          
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,810·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'[3.013693·LPC·RESET#·deasserted]^M                                                                                                                                     
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,816·-·EC3PO.Interpreter·-·DEBUG·-·/dev/ttyO1·-·EC·has·data                                                                                                                                                                
2016-06-10·21:51:10,829·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'^M[3.021969·Port·80:·0x10]'                                                                                                                                            
2016-06-10·21:51:10,933·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'^M[3.142585·Port·80:·0x1002][3.142856·LPC·RESET#·asserted]^M                                                                                                           
[3.149076·LPC·RESET#·deasserted]^M                                                                                                                                                                                                            
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,962·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'^M[3.156969·Port·80:·0x10]^M[3.161969·Port·80:·0x24]^M[3.167969·Port·80:·0x90]^M[3.170969·Port·80:·0x2a][3.173284·HC·0x87]^M                                           
[3.174523·HC·0x4002]^M                                                                                                                                                                                                                        
'                                                                                                                                                                                                                                             
2016-06-10·21:51:10,986·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'[3.180892·HC·0x40d2]^M                                                                                                                                                 
[3.185211·HC·0x87]^M                                                                                                                                                                                                                          
[3.186646·HC·0x02]^M                                                                                                                                                                                                                          
[3.187904·PI3USB9281·I2C·write·failed]^M                                                                                                                                                                                                      
'                                                                                                                                                                                                                                             
2016-06-10·21:51:11,035·-·EC3PO.Console·-·DEBUG·-·/dev/pts/0·-·|DBG|->'[3.208764·HC·0x87]^M 
PD·crash                                                                                                                                                                                                                                   
===·PROCESS·EXCEPTION:·00·======·xPSR:·00000000·===                                                                                                                                                                                        
r0·:0011e0a1·r1·:00000007·r2·:ffffffff·r3·:64000000                                                                                                                                                                          
r4·:dead6662·r5·:00000000 r6·:00000000·r7·:00000000                                                 
r8·:00000000·r9·:00000000·r10:00000000·r11:00000000                                                                                                                                                                                         
r12:00016401·sp·:0011e090·lr·:00001800·pc·:00000000                                                                                                                                                               
                                                                                                                                                                                                                                            
mmfs·=·0,·shcsr·=·70000,·hfsr·=·0,·dfsr·=·0                                                                                                                                                                                                 
                                                                                                                                                                                                                                            
===========·Process·Stack·Contents·===========                                                                                                                                                                                             
0011e0b0:·00000000·00108465·00000000·00112f45                                                                                                                                                                                               
0011e0c0:·deadd00d·deadd00d·deadd00d·deadd00d                                                                                                                                                                                              
0011e0d0:·deadd00d·deadd00d·deadd00d·deadd00d                                                                                                                                                                                              
0011e0e0:·deadd00d·deadd00d·deadd00d·deadd00d                                                                                                                                          
                                                                                                                                                                                                                                          
Rebooting...                


So to summarize, the actions leading to the EC panic are the following:

- Issue a cold_reset
- reboot ap-off
- hostevent set 0x4000 (recovery button)
- power button press


This failure is different than the failure I encountered with samus yesterday. For the samus case, the DUT would remain at the recovery screen.
Cc: sha...@chromium.org
(actually +shawnn@)

I guess the next thing to do is to take a look at what was happening on the PD MCU. It seems that the PD MCU got in a bad state while this was happening. Unfortunately, servo v3 doesn't have access to the PD MCU UART.

jrbarnette@, is it possible to get this exact device so I can try and connect a servo v2 to it and look at the PD MCU console output?
I have a chell device from the lab, and I've confirmed that it
fails the simple test case of `dut-control power_state:rec`.

We'll arrange delivery offline.

Just wanted to post an update.

Today, I've been able to reproduce the failure with the chell received. However, the DUT boots up in normal mode without the EC or PD crashing. Essentially, I haven't seen the failure that I mentioned in comment #14. I tried using cros flash to flash the USB flash drive with the image that was already installed on the chell (R52-8350.8.0), but attempting to recover off of that image always causes the DUT to come up in normal mode. I tried re-running cros flash several times and achieved the same result.

I've finally managed to find an image that boots up into recovery mode for the chell (the latest dev-channel test image: R53-8466.0.0) and will be running a stress test over the weekend by running many iterations of the firmware_RecoveryButton test. This uses power_state:rec to enter recovery mode.

Note that in order to log the PD MCU UART output, I am using a servo v2. However, the steps taken in power_state:rec are the same for servo v2 and v3.
Mergedinto: 606062
Status: Duplicate (was: Started)
After examining the PD logs, I realized that  bug 606062  had the same root cause as this.
Project Member

Comment 19 by bugdroid1@chromium.org, Aug 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/2678a988493de095861e0edd7966a6a38983f6f7

commit 2678a988493de095861e0edd7966a6a38983f6f7
Author: Aseda Aboagye <aaboagye@google.com>
Date: Thu Jun 23 01:34:08 2016

cros_ec_pd_softrec_power: Remove race condition.

In order to enter recovery mode, _power_on_bytype() relied upon
warm_reset keeping the AP in reset.  However, it was found that for some
platforms (samus, chell), warm_reset did not actually hold the AP in
reset.  This caused a race condition where if the AP booted up far
enough in normal mode, it would send a message to the PD MCU to jump to
its RW image.  This would happen before the EC had a chance to reboot
and hold the AP in reset. Failing to do so, when the request to boot
into recovery mode arrived, the AP would notice that the PD MCU was not
in its RO image. It would then send a message to the PD MCU to hard
reboot so that it would boot in its RO image.  However, the EC would
notice that the PD MCU was now in its RO image when it was previously in
its RW image.  This caused the EC to panic because it assumed that the
PD MCU had crashed.  When the EC panicked, it reset the entire system
and the recovery request was forgotten.

In order to reliably enter recovery mode, _power_on_bytype() now does the
following:

 - Hold PD MCU in reset.
 - Assert and deassert cold_reset.
 - Reboot the EC and have the EC now hold the AP in reset.
 - Release PD MCU reset.
 - Enter recovery mode.
 - Power down AP.
 - Enter recovery mode again.

By holding the PD MCU in reset until we've rebooted the EC keeping the
AP in reset, the PD MCU will never get the message to jump to its RW
image.  A side-effect of this however, is that in our first recovery
boot, the recovery reason will be that software sync failed to check the
version.  That's why we perform two recovery boots such that the
recovery reason will be "button press" once the function completes.

For platforms, regardless if they have a PD MCU or not, that have a
warm_reset control that actually holds the AP in reset, they should use
the cros_ec_softrec_power driver instead of this one.

BUG= chromium:606062 
BUG= chromium:617832 
BRANCH=None
TEST=platform_ServoPowerStateController passed for chell & samus on both
servo v2 and v3.

CQ-DEPEND=CL:355505

Change-Id: I4a577ec788a54c8efffa72450586ee95e194251e
Signed-off-by: Aseda Aboagye <aaboagye@google.com>
Reviewed-on: https://chromium-review.googlesource.com/355580
Commit-Ready: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aseda Aboagye <aaboagye@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_glados_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_samus_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/drv/cros_ec_pd_softrec_power.py

Sign in to add a comment