Update servo version to ToT
Reported by jrbarnette@chromium.org, Jun 26 2018
Issue description
We need to update the guado_labstation/beaglebone_servo images in the test lab to the latest ToT image. The main reason is to deal with bug 855792. We need to do it relatively soon, to avoid having a crisis update that can't complete smoothly because of the bug.
Because of the bug, the update process will have to be slightly adjusted. Basically, the bug can cause update_engine to crash, so we have to have a manual process to force update_engine to restart for every labstation.
This SQL query will find the labstation host names:
select distinct value from afe_host_attributes
where attribute="servo_host" and value like "chromeos%-labstation%";
The required update procedure change is merely to make sure that any labstation that needs it has the following command run:
start update-engine
As a first cut, it is sufficient to run that command on every labstation, and simply ignore failures. However, some hosts may fail to update, and then need to re-run the command.
Because of bug 735619 it may also be necessary to find and manually reboot labstation hosts that don't update quickly.
,
Jun 26 2018
Per comment #8 in bug 855792, note that it's undesirable to change the ServoHost update code to deal with this problem. It should be relatively easy to write a script that manually deals with the problem on an ad hoc basis until all the hosts have updated.
,
Jun 27 2018
Test passes:
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ./lab-tools/update_servohost -i R69-10820.0.0 chromeos4-row9-rack2-host1
Staging beaglebone_servo-release/R69-10820.0.0 on chromeos4-devserver1.cros.corp.google.com
chromeos4-devserver1.cros.corp.google.com
http://100.115.219.129:8082/stage?archive_url=gs://chromeos-image-archive/beaglebone_servo-release/R69-10820.0.0&artifacts=full_payload
Running this command on chromeos4-row9-rack2-host1-servo.cros:
update_engine_client --update --omaha_url=http://100.115.219.129:8082/update/beaglebone_servo-release/R69-10820.0.0
[0626/150121:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[0626/150121:INFO:update_engine_client.cc(488)] Initiating update check and install.
[0626/150121:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[0626/151322:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed.
Connection to chromeos4-row9-rack2-host1-servo.cros.corp.google.com closed.
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ssh root@chromeos4-row9-rack2-host1-servo.cros reboot && sleep 40
xixuan@xixuan0:~/chromiumos/chromeos-admin$ servo-stat chromeos4-row9-rack2-host1
chromeos4-row9-rack2-host1 ...ABCDEFG is up BOARD=link CHROMEOS_RELEASE_VERSION=10820.0.0

Rollback:
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ./lab-tools/update_servohost chromeos4-row9-rack2-host1
Staging beaglebone_servo-release/R69-10738.0.0 on chromeos4-devserver1.cros.corp.google.com
chromeos4-devserver1.cros.corp.google.com
http://100.115.219.129:8082/stage?archive_url=gs://chromeos-image-archive/beaglebone_servo-release/R69-10738.0.0&artifacts=full_payload
Running this command on chromeos4-row9-rack2-host1-servo.cros:
update_engine_client --update --omaha_url=http://100.115.219.129:8082/update/beaglebone_servo-release/R69-10738.0.0
[0626/165804:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[0626/165804:INFO:update_engine_client.cc(488)] Initiating update check and install.
[0626/165804:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[0626/170845:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed.
Connection to chromeos4-row9-rack2-host1-servo.cros.corp.google.com closed.
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ssh root@chromeos4-row9-rack2-host1-servo.cros reboot && sleep 40
xixuan@xixuan0:~/chromiumos/chromeos-admin$ servo-stat chromeos4-row9-rack2-host1
chromeos4-row9-rack2-host1 ...ABCDEFG is up BOARD=link CHROMEOS_RELEASE_VERSION=10738.0.0

Upgrade servo version to R69-10820.0.0:
xixuan@xixuan0:~/chromiumos/chromeos-admin$ stable_version -t cros beaglebone_servo R69-10820.0.0
Updating Chrome OS beaglebone_servo -> R69-10738.0.0 to R69-10820.0.0
xixuan@xixuan0:~/chromiumos/chromeos-admin$ stable_version -t cros guado_labstation R69-10820.0.0
Updating Chrome OS guado_labstation -> R69-10738.0.0 to R69-10820.0.0
Unchanged Firmware guado_labstation -> Google_Guado.6301.108.2016_10_06_1219
,
Jun 27 2018
Wait. We need to monitor this particular update, because of bug 855792. See the original description.
For now, I'd say wait until the morning, and then run a command like this:
ssh $LABSTATION grep VERSION /etc/lsb-release
Do it against every available labstation. Any host that hasn't updated should be flagged for extra attention.
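The check described above could be scripted roughly as follows. This is only a sketch: TARGET, release_version, flag_if_stale and check_host are names made up for illustration, and the target build is assumed to be 10820.0.0 as set with stable_version earlier in this thread.

```shell
#!/bin/bash
# Sketch only: flag labstations that do not report the target build.
# TARGET and all function names here are illustrative assumptions.
TARGET="${TARGET:-10820.0.0}"

# Read lsb-release text on stdin and print the CHROMEOS_RELEASE_VERSION value.
release_version() {
  sed -n 's/^CHROMEOS_RELEASE_VERSION=//p' | head -n 1
}

# Print a line for a host that needs attention; print nothing if it is at TARGET.
flag_if_stale() {
  local host="$1" version="$2"
  if [ -z "$version" ]; then
    # ssh failed or the file had no version line.
    echo "$host NO_VERSION"
  elif [ "$version" != "$TARGET" ]; then
    echo "$host $version"
  fi
}

# Query one labstation over ssh and flag it if needed.
check_host() {
  local host="$1" version
  version=$(ssh -n "$host" grep VERSION /etc/lsb-release 2>/dev/null | release_version)
  flag_if_stale "$host" "$version"
}

# Example (needs lab access; labstations.txt is a hypothetical host list):
#   while read -r h; do check_host "$h"; done < labstations.txt
```

Hosts that print nothing are up to date; anything else is a candidate for extra attention.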
,
Jun 27 2018
Re #4, what's the command to get 'every available labstation'?
,
Jun 27 2018
> Re #4, what's the command to get 'every available labstation'?
There's an SQL query mentioned in the description:
select distinct value from afe_host_attributes
where attribute="servo_host" and value like "chromeos%-labstation%";
You can do it with `atest`, but it seems to take more than an hour...
,
Jun 27 2018
> Re #4, what's the command to get 'every available labstation'?
This script makes it easier:
#!/bin/bash
autotest-db <<END | sed 1d
select distinct value from afe_host_attributes
where attribute="servo_host" and value like "chromeos%-labstation%";
END
,
Jun 27 2018
Among 113 labstations, only the following have been updated:
('chromeos15-row4-rack12-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack19-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack15-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack19-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack5-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack10-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack12-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack7-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack16-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack18-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack22-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos1-row2-rack1-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack22-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack23-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos15-row1-rack2-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack23-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack19-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack23-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack7-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack21-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack21-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack18-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos1-row2-rack10-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
So the rest will need a lot of extra attention, unless they simply haven't had time to update yet.
Assigning to @Richard to decide whether to close this bug and file a new one about why they don't get updated.
,
Jun 27 2018
> Assign to @Richard to decide whether to close this bug and file
> a new one about why they don't get updated.
The fact that only a minority updated is expected because of known bugs, especially bug 855792.
The first step is to run "start update-engine" across the board; that'll set at least some updates in motion. I've started a script to do that; I'll post the results after it's done.
----
: jrbarnette ~/tmp; cat handle-labstations
#!/bin/bash

# Print one labstation hostname per line; "sed 1d" drops the column header.
get_labstations() {
    autotest-db <<END | sed 1d
select distinct value from afe_host_attributes
where attribute="servo_host" and value like "chromeos%-labstation%";
END
}

# Restart update-engine everywhere; failures are simply ignored.
for h in $(get_labstations)
do
    echo $h $(ssh -n $h start update-engine)
done
,
Jun 27 2018
> The fact that only a minority updated is expected because of known
> bugs, especially bug 855792.
For clarity, I should add: This bug was filed specifically as the task to get the lab past the problems caused by the known bug. So all of the work is meant to go here, not on a new bug.
,
Jun 27 2018
Results from the script are attached.
Lines like this show a labstation where the update-engine job
was down, and had to be restarted:
chromeos6-row2-rack14-labstation2 update-engine start/running, process 13877
Lines with only a hostname on them mean update-engine was already running.
The next step is to find out how many of the servers need to be rebooted...
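To tally the attached results, something like the following might do. It is a sketch: summarize_restarts is a made-up name, and it assumes the output has the two line shapes described above. Since the script captures only stdout, a host that ssh could not reach would presumably also show up as a hostname-only line, so the "already_running" count is an upper bound.

```shell
#!/bin/bash
# Sketch only: count line shapes in the handle-labstations output (stdin).
# summarize_restarts is a hypothetical helper name.
summarize_restarts() {
  awk '
    NF == 0 { next }                                  # skip blank lines
    /update-engine start\/running/ { restarted++; next }  # job was down, now started
    NF == 1 { running++; next }                       # hostname only
    { other++ }                                       # anything unexpected
    END {
      printf "restarted=%d already_running=%d other=%d\n", restarted, running, other
    }
  '
}

# Example (results.txt is a hypothetical copy of the attached output):
#   summarize_restarts < results.txt
```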
,
Jun 28 2018
After some waiting and periodic checking, all online labstations
save one are either up-to-date, or sitting at UPDATED_NEED_REBOOT.
Tomorrow, we can see to rebooting any labstation that's still out of
date.
Wrinkles yet to be encountered/resolved:
* Five labstations are offline; we'll need to sort them out.
* It's possible that some of the labstations will actually update to
R69-10738.0.0, because they'd already downloaded it. So, we may
need to baby-sit a second round of updates.
* It's not known for certain that the last remaining labstation
will download the build; we may have to force the issue.
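For the periodic checking, a sketch along these lines could report which hosts are waiting for a reboot and which build they downloaded. It assumes that "update_engine_client --status" prints key=value lines including CURRENT_OP and NEW_VERSION; that format is an assumption here, not something shown in this thread, and status_field and pending_update are made-up names.

```shell
#!/bin/bash
# Sketch only. Assumes "update_engine_client --status" prints key=value
# lines such as CURRENT_OP=... and NEW_VERSION=... (not confirmed here).

# Read status text on stdin and print the value of the named field.
status_field() {
  local name="$1"
  sed -n "s/^${name}=//p" | head -n 1
}

# Given the status text as $1, print "need_reboot <version>" when an update
# is staged, else the current operation (or "unknown" if none was found).
pending_update() {
  local status="$1" op version
  op=$(printf '%s\n' "$status" | status_field CURRENT_OP)
  version=$(printf '%s\n' "$status" | status_field NEW_VERSION)
  if [ "$op" = "UPDATE_STATUS_UPDATED_NEED_REBOOT" ]; then
    echo "need_reboot $version"
  else
    echo "${op:-unknown}"
  fi
}

# Example (needs lab access):
#   pending_update "$(ssh -n "$h" update_engine_client --status 2>&1)"
```

A staged version other than the target would identify the hosts that need a second round of updates after rebooting.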
,
Jun 28 2018
As of this morning, there are 79 labstations ready for reboot. Because we need to know that every host gets rebooted with this update, and because of bug 735619, I plan to manually reboot all of those hosts sometime today. Rebooting a labstation carries only a low risk of disruption: Servo is unused most of the time for most hosts. The most notable risk is that a repair task that would have succeeded may fail; this would cause the DUT to be unusable for a few hours, but shouldn't cause tests to fail.
,
Jun 28 2018
> Rebooting a labstation carries only a low risk of disruption:
Hmmm... FAFT testing is an exception here. I'm going to see if I can apply a filter to avoid affecting labstations attached to DUTs for FAFT.
,
Jun 28 2018
> Hmmm... FAFT testing is an exception here. I'm going to see if
> I can apply a filter to avoid affecting labstations attached to
> DUTs for FAFT.
I've identified 11 labstations to be excluded for this reason.
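Once the excluded labstations are in a file, applying the filter can be as simple as the sketch below. How the 11 FAFT labstations were identified isn't stated here, so this assumes that list already exists; exclude_hosts and both file names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: drop excluded hosts from a reboot list.
# exclude_hosts and the file names below are illustrative assumptions.

# $1: file of candidate hostnames, $2: file of hostnames to skip.
# Prints the candidates that are not in the skip list (exact, whole-line match).
exclude_hosts() {
  # grep exits non-zero when every candidate is excluded; that is not an error.
  grep -vxF -f "$2" "$1" || true
}

# Example (needs lab access; both files are hypothetical):
#   for h in $(exclude_hosts need-reboot.txt faft-labstations.txt)
#   do
#       echo "$h"; ssh -n "$h" reboot
#   done
```

Whole-line fixed-string matching keeps a name like labstation1 from also excluding labstation12.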
,
Jun 28 2018
I've run the reboots. As expected, some of the hosts were waiting
to reboot to an out-of-date build, and will need to go through
another update cycle:
chromeos6-row1-rack19-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row1-rack24-labstation CHROMEOS_RELEASE_VERSION=10178.0.0
chromeos6-row19-rack13-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row19-rack14-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row19-rack15-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row2-rack20-labstation1 CHROMEOS_RELEASE_VERSION=10526.0.0
chromeos6-row3-rack19-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row4-rack13-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row4-rack17-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row4-rack20-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row4-rack4-labstation1 CHROMEOS_RELEASE_VERSION=10635.0.0
chromeos6-row4-rack6-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row5-rack2-labstation CHROMEOS_RELEASE_VERSION=10635.0.0
chromeos6-row6-rack2-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row6-rack2-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row6-rack3-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
The log of all rebooted labstations is in the attached "reboot.out" file.
,
Jun 29 2018
All 16 of the labstations listed as out-of-date have now downloaded the latest lab image. Manual reboot will be done as needed sometime today.
,
Jun 29 2018
The latest round of reboots is complete.
I re-gathered version info from all labstations; the attached
file has the full summary.
The following 9 hosts are up, but out of date:
chromeos6-row2-rack20-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row1-rack2-labstation CHROMEOS_RELEASE_VERSION=10526.0.0
chromeos6-row2-rack24-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row4-rack2-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row4-rack7-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row4-rack13-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row4-rack17-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
chromeos6-row3-rack20-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
chromeos6-row5-rack4-labstation CHROMEOS_RELEASE_VERSION=10635.0.0
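A per-version tally of a summary like the one above can be produced with a short sketch such as this. It assumes each line has the "hostname CHROMEOS_RELEASE_VERSION=x" shape shown in this comment; tally_versions is a made-up name.

```shell
#!/bin/bash
# Sketch only: count hosts per build from "host CHROMEOS_RELEASE_VERSION=x"
# lines on stdin. tally_versions is a hypothetical helper name.
tally_versions() {
  awk '
    NF >= 2 {
      sub(/^CHROMEOS_RELEASE_VERSION=/, "", $2)   # keep only the version
      count[$2]++
    }
    END { for (v in count) print v, count[v] }
  ' | sort
}

# Example (summary.txt is a hypothetical copy of the attached file):
#   tally_versions < summary.txt
```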
,
Jun 29 2018
> The following 9 hosts are up, but out of date:
It should be added: This should be just the remnant of hosts that were attached to one or more FAFT DUTs. Those were never manually rebooted, so this is largely expected.
,
Jun 29 2018
Of the remaining hosts needing reboot, only three are waiting to
update to an out-of-date build:
chromeos6-row1-rack2-labstation
chromeos6-row4-rack13-labstation2
chromeos6-row4-rack2-labstation1
I'm looking into options for these right now.
,
Jun 29 2018
I checked the FAFT DUTs attached to the remaining problem children.
For two labstations, the DUTs were all idle, so I manually rebooted.
Now only one problem child remains:
chromeos6-row4-rack2-labstation1
,
Sep 11
I'm declaring victory here. There have been at least two updates past version R69-10820.0.0. There are still labstations behind target, but this bug isn't the way to get them fixed.
Comment 1 by jrbarnette@chromium.org
, Jun 26 2018