power_state:rec fails for chell with servo V3
Reported by jrbarnette@chromium.org, Jun 7 2016
Issue description

I recently tried to force re-install from USB for 7 different chell devices. They all failed with the symptom below:

2016-06-06 16:48:17 | INFO | Installing image through chromeos-install.
2016-06-06 16:48:17 | DEBUG | Running (ssh) 'chromeos-install --yes'
2016-06-06 16:48:18 | ERROR | [stderr] Error: Source does not look like a removable device: /dev/mmcblk0

This means that when we powered on with 'power_state:rec', the chell DUT booted off of internal storage, rather than from USB in recovery mode.

This feature was reported working, at least for EVT (I'm having trouble seeing that it worked for DVT), so this is a regression.

A fix here is urgent: this feature is mandatory if we're to support the DUTs in the lab. In particular, there's a scheduled deployment that's likely to break because of this bug.
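For reference, the failing flow is roughly the following (a sketch, assuming servod is already running for the DUT's servo v3; usb_mux_sel1 and its dut_sees_usbkey value are my recollection of the relevant controls and may differ from the actual repair code):

# Point the USB stick at the DUT and request a recovery boot from it.
dut-control usb_mux_sel1:dut_sees_usbkey
dut-control power_state:rec

# With the DUT (supposedly) booted from USB in recovery mode, install to disk.
ssh ${DUT} chromeos-install --yes
# Fails with: Error: Source does not look like a removable device: /dev/mmcblk0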
Jun 7 2016
jrbarnette@, do you happen to have the firmware versions of the chells that this fails on? I grabbed a chell from a coworker and, using a servo v3 on my desk with the 8368.0.0 image, it was able to enter recovery mode. The USB image I have on my stick must be signed with different keys, because I get the error stating that the image inserted does not contain Chrome OS. But I WAS able to enter recovery mode.
My chell version info:
localhost ~ # cat /etc/lsb-release
CHROMEOS_RELEASE_APPID={114DBB0F-4507-F7FC-99B6-4BCAB6A39725}
CHROMEOS_BOARD_APPID={114DBB0F-4507-F7FC-99B6-4BCAB6A39725}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_RELEASE_BOARD=chell
CHROMEOS_DEVSERVER=
GOOGLE_RELEASE=8249.0.0
CHROMEOS_RELEASE_BUILD_NUMBER=8249
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=52
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=8249.0.0 (Official Build) dev-channel chell test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_VERSION=8249.0.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
It has a dirty EC image on there though. Tomorrow, I can try to get official images on there.
localhost ~ # ectool version
RO version: chell_v1.1.4624-e9aa7c0-dirty
RW version: chell_v1.1.4624-e9aa7c0-dirty
Firmware copy: RO
Build info: chell_v1.1.4624-e9aa7c0-dirty 2016-05-10 14:22:40 shawnn@shawnn.mtv.corp.google.com
localhost ~ #
Some BIOS info:
localhost ~ # crossystem | grep fw
ecfw_act = RO # Active EC firmware
fwb_tries = 0 # Try firmware B count (writable)
fw_vboot2 = 1 # 1 if firmware was selected by vboot2 or 0 otherwise
fwid = Google_Chell.7820.64.0 # Active firmware ID
fwupdate_tries = 0 # Times to try OS firmware update (writable, inside kern_nv)
fw_tried = A # Firmware tried this boot (vboot2)
fw_try_count = 0 # Number of times to try fw_try_next (writable)
fw_try_next = A # Firmware to try next (vboot2,writable)
fw_result = success # Firmware result this boot (vboot2,writable)
fw_prev_tried = A # Firmware tried on previous boot (vboot2)
fw_prev_result = unknown # Firmware result of previous boot (vboot2)
mainfw_act = A # Active main firmware
mainfw_type = developer # Active main firmware type
ro_fwid = Google_Chell.7820.64.0 # Read-only firmware ID
tpm_fwver = 0x00010001 # Firmware version stored in TPM
tried_fwb = 0 # Tried firmware B before A this boot
Jun 7 2016
OK, I went back to one of the 7 chell DUTs that originally failed. Here are the bits from crossystem of interest:

$ ssh chromeos4-row12-rack10-host1 crossystem | grep fw
ecfw_act = RW # Active EC firmware
fwb_tries = 0 # Try firmware B count (writable)
fw_vboot2 = 1 # 1 if firmware was selected by vboot2 or 0 otherwise
fwid = Google_Chell.7820.98.0 # Active firmware ID
fwupdate_tries = 0 # Times to try OS firmware update (writable, inside kern_nv)
fw_tried = A # Firmware tried this boot (vboot2)
fw_try_count = 0 # Number of times to try fw_try_next (writable)
fw_try_next = A # Firmware to try next (vboot2,writable)
fw_result = success # Firmware result this boot (vboot2,writable)
fw_prev_tried = A # Firmware tried on previous boot (vboot2)
fw_prev_result = unknown # Firmware result of previous boot (vboot2)
mainfw_act = A # Active main firmware
mainfw_type = normal # Active main firmware type
ro_fwid = Google_Chell.7820.98.0 # Read-only firmware ID
tpm_fwver = 0x00010001 # Firmware version stored in TPM
tried_fwb = 0 # Tried firmware B before A this boot

Also hwid:
hwid = CHELL B25-A2C-34Z # Hardware ID
Jun 7 2016
Okay, I've updated my chell with official images now and I am still able to enter recovery mode and boot off of the USB. I've been told that this chell is a DVT2 unit. Here's the current version information.

localhost ~ # ectool version
RO version: chell_v1.9.175-0464cfe
RW version: chell_v1.9.175-0464cfe
Firmware copy: RW
Build info: chell_v1.9.175-0464cfe 2016-04-25 02:33:55 @build123-m2

localhost ~ # crossystem | grep fw
ecfw_act = RW # Active EC firmware
fwb_tries = 0 # Try firmware B count (writable)
fw_vboot2 = 1 # 1 if firmware was selected by vboot2 or 0 otherwise
fwid = Google_Chell.7820.98.0 # Active firmware ID
fwupdate_tries = 0 # Times to try OS firmware update (writable, inside kern_nv)
fw_tried = A # Firmware tried this boot (vboot2)
fw_try_count = 0 # Number of times to try fw_try_next (writable)
fw_try_next = A # Firmware to try next (vboot2,writable)
fw_result = success # Firmware result this boot (vboot2,writable)
fw_prev_tried = A # Firmware tried on previous boot (vboot2)
fw_prev_result = success # Firmware result of previous boot (vboot2)
mainfw_act = A # Active main firmware
mainfw_type = normal # Active main firmware type
ro_fwid = Google_Chell.7820.98.0 # Read-only firmware ID
tpm_fwver = 0x00010001 # Firmware version stored in TPM
tried_fwb = 0 # Tried firmware B before A this boot
localhost ~ #

I'll try to see if I can poke around for logs on that DUT.
Jun 8 2016
Couldn't find the logs for that specific failure, but I was running more experiments on my side yesterday. The particular device mentioned was locked for the investigation of some other issue. I managed to reproduce the failure once on a chell in the lab, and not since. I'm going to continue attempting to reproduce it.
Jun 9 2016
I think I found the root cause! I'm also thinking that bug 606062 is the same issue, but possibly not if the device fails to reach the recovery screens at all.

Anyway, I found a samus that would consistently fail the firmware_UserRequestRecovery test; this DUT was chromeos1-row1-rack11-host4. Luckily, I had access to this lab, so I could visually inspect what was going on. I was able to confirm that the device *was* in fact reaching the recovery screen. Here's what I believe happened:
- The user issues a request to enter recovery mode (either through power_state:rec, which uses the EC console, or crossystem to set the recovery_request byte in NV storage).
- The user then issues a reset to the device (warm reset or rebooting the EC).
- The DUT reaches the recovery screen asking to insert recovery media.
- The recovery media, the USB flash drive, is inserted.
- The image on the USB flash drive is actually BAD and immediately causes the DUT to reset, which then boots up into normal mode.
- The test checks the mode and sees that it's either developer or normal mode, and the recovery_reason byte is set to 0, indicating that it's not in recovery mode.

So I believe the reason we were seeing this was due to a bad image on the USB attached to the servo v3. I confirmed this by swapping the USB flash drives and saw proper behaviour with the DUT in the lab, and the same behaviour with the DUT on my desk. My suggestion would be to reflash the image on the USB flash drives and try again.

Also, if I'm correct, I wonder how we got in this state. I think that an action item would be to verify the image on the USB stick once we reimage it. That could be by trying to boot into recovery mode off of this stick and verifying that we actually do.

A potential confirmation could be the following on each DUT that's already in a good state (a rough shell sketch follows the failure log below):
- Set the USB mux selects to "servo_sees_usbkey".
- Set "power_state:off", then "power_state:rec".
- Wait 1 min and ping the DUT. If there is no answer, the DUT should be at the recovery screen. This should confirm that the DUT can reach the recovery screen. If the DUT comes up in normal mode, however, then that's a separate issue.
- Reset the DUT (power_state:off then power_state:on).
- Wait for the DUT to come up.
- Run the firmware_UserRequestRecovery test.
- If it fails with the following signature, then it's the same issue and the image on the USB flash drive may be bad.

16:56:48 INFO | autoserv| -[FAFT]-[ end wait_for_client ]-----
16:56:48 INFO | autoserv| Expected recovery boot, request recovery again.
16:56:48 INFO | autoserv| -[FAFT]-[ start stepstate_checker ]----------
16:56:48 INFO | autoserv| calling <bound method FAFTCheckers.crossystem_checker of <autotest_lib.server.cros.faft.utils.faft_checkers.FAFTCheckers object at 0x7fd4a9bd5e50>> with args ({'recovery_reason': '193', 'mainfw_type': 'recovery'},)
16:56:48 INFO | autoserv| Expected 'recovery_reason' value '193' but got '0'
16:56:48 INFO | autoserv| Expected 'mainfw_type' value 'recovery' but got 'normal'
16:56:48 INFO | autoserv| Autotest caught exception when running test:
16:56:48 INFO | autoserv| Traceback (most recent call last):
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 600, in _exec
16:56:48 INFO | autoserv|     _call_test_function(self.execute, *p_args, **p_dargs)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 804, in _call_test_function
16:56:48 INFO | autoserv|     return func(*args, **dargs)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 461, in execute
16:56:48 INFO | autoserv|     dargs)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
16:56:48 INFO | autoserv|     postprocess_profiled_run, args, dargs)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/client/common_lib/test.py", line 376, in _call_run_once
16:56:48 INFO | autoserv|     self.run_once(*args, **dargs)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/server/site_tests/firmware_UserRequestRecovery/firmware_UserRequestRecovery.py", line 58, in run_once
16:56:48 INFO | autoserv|     'recovery_reason' : vboot.RECOVERY_REASON['US_TEST'],
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/server/cros/faft/firmware_test.py", line 1158, in check_state
16:56:48 INFO | autoserv|     self._call_action(func, check_status=True)
16:56:48 INFO | autoserv|   File "/build/samus/usr/local/build/autotest/server/cros/faft/firmware_test.py", line 1085, in _call_action
16:56:48 INFO | autoserv|     (error_msg, info_msg, str(ret)))
16:56:48 INFO | autoserv| TestFail: Not succeed: calling <bound method FAFTCheckers.crossystem_checker of <autotest_lib.server.cros.faft.utils.faft_checkers.FAFTCheckers object at 0x7fd4a9bd5e50>> with args ({'recovery_reason': '193', 'mainfw_type': 'recovery'},) returning False
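Here's a rough shell sketch of that confirmation flow (assumptions: dut-control is talking to the right servod instance for this DUT, ${DUT} is the DUT hostname, and test_that is run from the chroot, possibly with --board):

# 1. Hide the USB stick from the DUT, then request a recovery boot.
dut-control usb_mux_sel1:servo_sees_usbkey
dut-control power_state:off
dut-control power_state:rec

# 2. After ~1 minute the DUT should be parked at the recovery screen,
#    so it should NOT answer pings.
sleep 60
ping -c 3 ${DUT} && echo "unexpected: DUT answered, not at recovery screen"

# 3. Reset back to a normal boot and wait for the DUT to come up.
dut-control power_state:off
dut-control power_state:on

# 4. Re-run the recovery-request test; the failure signature above would
#    point at a bad image on the USB stick.
test_that ${DUT} firmware_UserRequestRecovery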
Jun 10 2016
> So I believe the reason we were seeing this was due to a
> bad image on the USB attached to the servo v3. I confirmed
> this by swapping the USB flash drives and saw proper
> behaviour with the DUT in the lab, and the same behaviour
> with the DUT on my desk.

In both this case (for chell) and in the other (for samus), the failed test case included flashing a known-good software image onto the USB stick immediately before booting in recovery mode. In the various failures, I didn't study in depth what happened during the flashing process, and I didn't check the validity of the image afterward, so I can't rule out some sort of silent failure. However, there are no known bugs causing such silent failures, and any such bug would affect all boards, not just samus and chell.
Jun 10 2016
Yes, the fact that it has only seemed to affect samus and chell is quite interesting... but a silent failure plus the result of swapping the USB sticks makes sense. I'm tempted to add some verification of the image, update the USB sticks, preemptively close this bug, and then re-open it if the issue persists. At the very least, the image verification would be an improvement over what exists today and would rule out that failure reason.
Jun 10 2016
> Yes, the fact that it has only seemed to effect samus and
> chell is quite interesting... but a silent failure and
> the result of swapping the USB sticks makes sense.
The silent failure theory "makes sense" only in that some of
the observed symptoms are consistent with that event.
However, the overall evidence is against that theory:
* As noted, if there were a problem with silent corruption,
it wouldn't be restricted to just two boards, and it wouldn't
show consistently on every DUT of that board. Yet, that's
the symptom I'm seeing.
* I just logged in to one of the chell servos that previously
failed with this symptom, and looked at the USB stick.
The USB stick shows no corruption: there's a valid GPT with
mountable file systems in all the right places.
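A check along these lines is enough to confirm the layout (a sketch run on the servo v3 host, assuming the stick enumerates as /dev/sda and cgpt is available there; partition 3 is ROOT-A on a Chrome OS image):

cgpt show /dev/sda                 # should list the standard Chrome OS GPT partitions
mkdir -p /tmp/usb
mount -o ro /dev/sda3 /tmp/usb     # mount ROOT-A read-only
cat /tmp/usb/etc/lsb-release       # sanity-check the image version
umount /tmp/usb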
Jun 10 2016
*I'm* not finding it consistent on every DUT. I just picked a chell at random (chromeos1-row1-rack10-host1) and ran firmware_UserRequestRecovery which passed. Do you have a list of these devices? I'll also try to find a DVT samus after lunch to test again.
Jun 10 2016
> *I'm* not finding it consistent on every DUT. I just
> picked a chell at random (chromeos1-row1-rack10-host1)
> and ran firmware_UserRequestRecovery which passed.

I haven't grokked the complete test flow, but I'm reasonably certain that 'firmware_UserRequestRecovery' doesn't exercise servo's "power_state:rec" control. To prove that the problem is or isn't happening, you need to run a test like platform_ServoPowerStateController:usb. The lab tools also include a command called 'repair_test' that can exercise this feature.
Jun 10 2016
That's correct: firmware_UserRequestRecovery uses crossystem to issue a recovery request. I went back and ran firmware_RecoveryButton on the chell @ chromeos1-row1-rack10-host1 and it also passed; this test does use the power_state:rec control. I also ran that test against the samus on my desk, which passed as well. Finally, I ran the platform_ServoPowerStateController_USBPluggedIn test, which passed for the chell in the lab.
Jun 10 2016
OK. The chell DUTs used for general testing _do_ show the failure.
Here's one:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=66371289
I picked that DUT somewhat at random from available DUTs in
pool:suites. I'd expect similar results for any other DUT in
that pool.
Jun 11 2016
+shawnn@
Thanks for the link, jrbarnette@. I now believe I'm closer to finding the root cause of this specific failure for chell.
As shown in the logs from your test run, it fails with the following message:
TestFail: power_state:rec didn't stay at recovery screen.. DUT turns back on after it is turned off.
To get to this stage, we would have:
- turned the DUT off (power_state:off)
- disabled power to the USB port (prtctl4_pwren:off)
- switched the USB mux to servo (usb_mux_sel1:servo_sees_usbkey)
- booted into recovery mode (power_state:rec)
Since the USB flash drive is "unplugged", the DUT should remain at the recovery screen.
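As a concrete sketch, that is just the following dut-control sequence (control names as used elsewhere in this bug; prtctl4_pwren is the servo v3 control for USB port power):

dut-control power_state:off
dut-control prtctl4_pwren:off                 # cut power to the USB port
dut-control usb_mux_sel1:servo_sees_usbkey    # point the stick at servo, away from the DUT
dut-control power_state:rec
# With no USB key visible, the DUT should sit at the recovery screen.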
Now, I did those exact steps and found that the DUT did indeed boot up into normal mode instead of staying at the recovery screen. I then repeated my actions and found that it worked as intended, leading me to believe that it's flaky for some reason.
The power_state:rec control for chell uses the EC console to boot to the recovery screen. I enabled more verbose logging in EC-3PO to view the transactions on the EC console.
I found that the commands are indeed going through, but we encounter an EC panic because the PD MCU appears to be in RO when it was previously in RW and it didn't explicitly jump to RO.
Since the EC panics and reboots, the system reboots and that's why we boot into normal mode instead of remaining at the recovery screen.
Here's the output of the servod log mixed with the EC console before the panic.
2016-06-10 21:51:10,404 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'CH pwrbtn=LOW]^M
[2.545770 PB task 10 = was-off, wait 188757]^M
[2.546393 power state 0 = G3, in 0x0030]^M
[2.546764 power state 5 = G3->S5, in 0x0030]^M
[2.558328 power state 1 = S5, in 0x0030]^M
[2.569338 SW 0x07]^M
[2.569676 event set 0x00000004]^M
[2.571152 RSMRST: 1]^M
'
2016-06-10 21:51:10,549 - Servod - DEBUG - servo_server.py:797:set - name(pwr_button) wr_val(release)
2016-06-10 21:51:10,553 - Servod - DEBUG - servo_server.py:464:_get_param_drv -
2016-06-10 21:51:10,562 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'[2.735184 PB task 10 = was-off]^M
[2.735682 PB task 4 = held, wait -1]^M
'
2016-06-10 21:51:10,567 - SystemConfig - DEBUG - system_config.py:315:_lookup - lookup of map press
2016-06-10 21:51:10,572 - Driver - DEBUG - tca6416.py:103:set -
2016-06-10 21:51:10,576 - Driver - DEBUG - tca6416.py:110:set - Set to input because its io type is PU
2016-06-10 21:51:10,580 - I2cReg - DEBUG - i2c_reg.py:94:_read_reg -
2016-06-10 21:51:10,583 - I2cReg - DEBUG - i2c_reg.py:166:_wr_rd -
2016-06-10 21:51:10,589 - I2cReg - DEBUG - i2c_reg.py:108:_write_reg -
2016-06-10 21:51:10,593 - I2cReg - DEBUG - i2c_reg.py:166:_wr_rd -
2016-06-10 21:51:10,599 - Servod - DEBUG - servo_server.py:744:get - name(pwr_button)
2016-06-10 21:51:10,603 - Servod - DEBUG - servo_server.py:464:_get_param_drv -
2016-06-10 21:51:10,607 - Driver - DEBUG - tca6416.py:81:get -
2016-06-10 21:51:10,610 - I2cReg - DEBUG - i2c_reg.py:94:_read_reg -
2016-06-10 21:51:10,614 - I2cReg - DEBUG - i2c_reg.py:166:_wr_rd -
2016-06-10 21:51:10,620 - SystemConfig - DEBUG - system_config.py:315:_lookup - lookup of map press
2016-06-10 21:51:10,624 - Servod - DEBUG - servo_server.py:753:get - pwr_button = release
2016-06-10 21:51:10,673 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'[2.831356 KB disable_scanning_mask changed: 0x00000000]^M
[2.831903 power button released]^M
[2.832359 PB released]^M
[2.832605 PB task 6 = released]^M
[2.832903 PB PCH pwrbtn=HIGH]^M
[2.833202 PB task 0 = idle, wait -1]^M
[2.834096 pow'
2016-06-10 21:51:10,693 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'er state 6 = S5->S3, in 0x003f]^M
[2.834725 chipset -> S3]^M
[2.835122 power state 2 = S3, in 0x003f]^M
[2.835514 power state 7 = S3->S0, in 0x003f]^M
^M[2.836029 Port 80: 0x1001][2.836457 chipset -> S0]^M
[2.838548 power state 3 = S0, in 0x003f]^M
[2.882477 SW 0x05]^M
'
2016-06-10 21:51:10,810 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'[3.013693 LPC RESET# deasserted]^M
'
2016-06-10 21:51:10,816 - EC3PO.Interpreter - DEBUG - /dev/ttyO1 - EC has data
2016-06-10 21:51:10,829 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'^M[3.021969 Port 80: 0x10]'
2016-06-10 21:51:10,933 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'^M[3.142585 Port 80: 0x1002][3.142856 LPC RESET# asserted]^M
[3.149076 LPC RESET# deasserted]^M
'
2016-06-10 21:51:10,962 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'^M[3.156969 Port 80: 0x10]^M[3.161969 Port 80: 0x24]^M[3.167969 Port 80: 0x90]^M[3.170969 Port 80: 0x2a][3.173284 HC 0x87]^M
[3.174523 HC 0x4002]^M
'
2016-06-10 21:51:10,986 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'[3.180892 HC 0x40d2]^M
[3.185211 HC 0x87]^M
[3.186646 HC 0x02]^M
[3.187904 PI3USB9281 I2C write failed]^M
'
2016-06-10 21:51:11,035 - EC3PO.Console - DEBUG - /dev/pts/0 - |DBG|->'[3.208764 HC 0x87]^M
PD crash
=== PROCESS EXCEPTION: 00 ====== xPSR: 00000000 ===
r0 :0011e0a1 r1 :00000007 r2 :ffffffff r3 :64000000
r4 :dead6662 r5 :00000000 r6 :00000000 r7 :00000000
r8 :00000000 r9 :00000000 r10:00000000 r11:00000000
r12:00016401 sp :0011e090 lr :00001800 pc :00000000
mmfs = 0, shcsr = 70000, hfsr = 0, dfsr = 0
=========== Process Stack Contents ===========
0011e0b0: 00000000 00108465 00000000 00112f45
0011e0c0: deadd00d deadd00d deadd00d deadd00d
0011e0d0: deadd00d deadd00d deadd00d deadd00d
0011e0e0: deadd00d deadd00d deadd00d deadd00d
Rebooting...
So to summarize, the actions leading to the EC panic are the following:
- Issue a cold_reset
- reboot ap-off
- hostevent set 0x4000 (recovery button)
- power button press
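In dut-control terms, that sequence is roughly the following (a sketch; ec_uart_cmd is servod's pass-through to the EC console, and the exact strings and timing used by the driver may differ):

dut-control cold_reset:on cold_reset:off          # cold-reset the system
dut-control "ec_uart_cmd:reboot ap-off"           # reboot the EC, keeping the AP off
dut-control "ec_uart_cmd:hostevent set 0x4000"    # latch the recovery-button host event
dut-control pwr_button:press pwr_button:release   # power on; the AP should honor the recovery request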
This failure is different from the one I encountered with samus yesterday. In the samus case, the DUT would remain at the recovery screen.
Jun 11 2016
(Actually +shawnn@.) I guess the next thing to do is to take a look at what was happening on the PD MCU. It seems that the PD MCU got into a bad state while this was happening. Unfortunately, servo v3 doesn't have access to the PD MCU UART. jrbarnette@, is it possible to get this exact device so I can try to connect a servo v2 to it and look at the PD MCU console output?
Jun 14 2016
I have a chell device from the lab, and I've confirmed that it fails the simple test case of `dut-control power_state:rec`. We'll arrange delivery offline.
Jun 18 2016
Just wanted to post an update. Today, I've been able to reproduce the failure with the chell I received. However, the DUT boots up in normal mode without the EC or PD crashing. Essentially, I haven't seen the failure that I mentioned in comment #14.

I tried using cros flash to flash the USB flash drive with the image that was already installed on the chell (R52-8350.8.0), but attempting to recover off of that image always causes the DUT to come up in normal mode. I tried re-running cros flash several times and got the same result.

I've finally managed to find an image that boots up into recovery mode for the chell (the latest dev-channel test image: R53-8466.0.0) and will be running a stress test over the weekend with many iterations of the firmware_RecoveryButton test, which uses power_state:rec to enter recovery mode.

Note that in order to log the PD MCU UART output, I am using a servo v2. However, the steps taken in power_state:rec are the same for servo v2 and v3.
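For reference, the stick was (re)written with cros flash along these lines (a sketch; the exact xbuddy path syntax here is from memory and may differ from what I actually ran):

# From the chroot, write a chell test image to the USB stick on the workstation.
cros flash usb:// xbuddy://remote/chell/R53-8466.0.0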
Jun 23 2016
After examining the PD logs, I realized that bug 606062 had the same root cause as this.
Aug 9 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/2678a988493de095861e0edd7966a6a38983f6f7

commit 2678a988493de095861e0edd7966a6a38983f6f7
Author: Aseda Aboagye <aaboagye@google.com>
Date: Thu Jun 23 01:34:08 2016

cros_ec_pd_softrec_power: Remove race condition.

In order to enter recovery mode, _power_on_bytype() relied upon warm_reset keeping the AP in reset. However, it was found that for some platforms (samus, chell), warm_reset did not actually hold the AP in reset. This caused a race condition where if the AP booted up far enough in normal mode, it would send a message to the PD MCU to jump to its RW image. This would happen before the EC had a chance to reboot and hold the AP in reset. Failing to do so, when the request to boot into recovery mode arrived, the AP would notice that the PD MCU was not in its RO image. It would then send a message to the PD MCU to hard reboot so that it would boot in its RO image. However, the EC would notice that the PD MCU was now in its RO image when it was previously in its RW image. This caused the EC to panic because it assumed that the PD MCU had crashed. When the EC panicked, it reset the entire system and the recovery request was forgotten.

In order to reliably enter recovery mode, _power_on_bytype() now does the following:
- Hold PD MCU in reset.
- Assert and deassert cold_reset.
- Reboot the EC and have the EC now hold the AP in reset.
- Release PD MCU reset.
- Enter recovery mode.
- Power down AP.
- Enter recovery mode again.

By holding the PD MCU in reset until we've rebooted the EC keeping the AP in reset, the PD MCU will never get the message to jump to its RW image. A side-effect of this however, is that in our first recovery boot, the recovery reason will be that software sync failed to check the version. That's why we perform two recovery boots such that the recovery reason will be "button press" once the function completes.

For platforms, regardless if they have a PD MCU or not, that have a warm_reset control that actually holds the AP in reset, they should use the cros_ec_softrec_power driver instead of this one.

BUG=chromium:606062
BUG=chromium:617832
BRANCH=None
TEST=platform_ServoPowerStateController passed for chell & samus on both servo v2 and v3.
CQ-DEPEND=CL:355505

Change-Id: I4a577ec788a54c8efffa72450586ee95e194251e
Signed-off-by: Aseda Aboagye <aaboagye@google.com>
Reviewed-on: https://chromium-review.googlesource.com/355580
Commit-Ready: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aseda Aboagye <aaboagye@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_glados_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/data/servo_samus_overlay.xml
[modify] https://crrev.com/2678a988493de095861e0edd7966a6a38983f6f7/servo/drv/cros_ec_pd_softrec_power.py
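For readers following along, here is a very rough dut-control approximation of the revised sequence (a sketch only, not the driver's actual code; pd_rst is a placeholder name for whatever control resets the PD MCU on a given board, and the EC console strings mirror the ones observed earlier in this bug):

dut-control pd_rst:on                             # placeholder control: hold the PD MCU in reset
dut-control cold_reset:on cold_reset:off          # cold-reset the system
dut-control "ec_uart_cmd:reboot ap-off"           # reboot the EC; the EC keeps the AP in reset
dut-control pd_rst:off                            # PD MCU comes back up in RO; it never sees the jump-to-RW message
dut-control "ec_uart_cmd:hostevent set 0x4000"    # request recovery
dut-control pwr_button:press pwr_button:release   # first recovery boot (reason: sw-sync version check)
dut-control power_state:off                       # power the AP back down
dut-control "ec_uart_cmd:hostevent set 0x4000"    # request recovery again
dut-control pwr_button:press pwr_button:release   # second recovery boot (reason: "button press")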