Issue 856740

Issue metadata

Status: Fixed
Owner:
Closed: Sep 11
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Task

Blocking:
issue 855792



Update servo version to ToT

Reported by jrbarnette@chromium.org, Jun 26 2018

Issue description

We need to update the guado_labstation/beaglebone_servo images in the test
lab to the latest ToT image.  The main reason is to deal with bug 855792.
We need to do it relatively soon, to avoid having a crisis update that can't
complete smoothly because of the bug.

Because of the bug, the update process will have to be slightly adjusted.
Basically, the bug can cause update_engine to crash, so we have to have
a manual process to force update_engine to restart for every lab station.

This SQL query will find the labstation host names:
    select distinct value from afe_host_attributes
        where attribute="servo_host" and value like "chromeos%-labstation%";

The required update procedure change is merely to make sure that
any labstation that needs it has the following command run:
    start update-engine

As a first cut, it is sufficient to run that command on every labstation,
and simply ignore failures.  However, some hosts may fail to update, and
then need to re-run the command.

Because of bug 735619 it may also be necessary to find and manually reboot
labstation hosts that don't update quickly.

 
Blocking: 855792
Per comment #8 in bug 855792, note that it's undesirable to change
the ServoHost update code to deal with this problem.  It should be
relatively easy to write a script that manually deals with the
problem on an ad hoc basis until all the hosts have updated.

Comment 3 by xixuan@chromium.org, Jun 27 2018

Status: Fixed (was: Assigned)
Test passes:

xixuan@xixuan0:~/chromiumos/chromeos-admin$ ./lab-tools/update_servohost -i R69-10820.0.0 chromeos4-row9-rack2-host1
Staging beaglebone_servo-release/R69-10820.0.0 on chromeos4-devserver1.cros.corp.google.com
chromeos4-devserver1.cros.corp.google.com
http://100.115.219.129:8082/stage?archive_url=gs://chromeos-image-archive/beaglebone_servo-release/R69-10820.0.0&artifacts=full_payload
Running this command on chromeos4-row9-rack2-host1-servo.cros:
     update_engine_client --update --omaha_url=http://100.115.219.129:8082/update/beaglebone_servo-release/R69-10820.0.0
[0626/150121:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[0626/150121:INFO:update_engine_client.cc(488)] Initiating update check and install.
[0626/150121:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[0626/151322:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed.
Connection to chromeos4-row9-rack2-host1-servo.cros.corp.google.com closed.
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ssh root@chromeos4-row9-rack2-host1-servo.cros reboot && sleep 40
xixuan@xixuan0:~/chromiumos/chromeos-admin$ servo-stat chromeos4-row9-rack2-host1
chromeos4-row9-rack2-host1 ...ABCDEFG is up BOARD=link CHROMEOS_RELEASE_VERSION=10820.0.0

Rollback:

xixuan@xixuan0:~/chromiumos/chromeos-admin$ ./lab-tools/update_servohost chromeos4-row9-rack2-host1
Staging beaglebone_servo-release/R69-10738.0.0 on chromeos4-devserver1.cros.corp.google.com
chromeos4-devserver1.cros.corp.google.com
http://100.115.219.129:8082/stage?archive_url=gs://chromeos-image-archive/beaglebone_servo-release/R69-10738.0.0&artifacts=full_payload
Running this command on chromeos4-row9-rack2-host1-servo.cros:
     update_engine_client --update --omaha_url=http://100.115.219.129:8082/update/beaglebone_servo-release/R69-10738.0.0
[0626/165804:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[0626/165804:INFO:update_engine_client.cc(488)] Initiating update check and install.
[0626/165804:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[0626/170845:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed.
Connection to chromeos4-row9-rack2-host1-servo.cros.corp.google.com closed.
xixuan@xixuan0:~/chromiumos/chromeos-admin$ ssh root@chromeos4-row9-rack2-host1-servo.cros reboot && sleep 40
xixuan@xixuan0:~/chromiumos/chromeos-admin$ servo-stat chromeos4-row9-rack2-host1
chromeos4-row9-rack2-host1 ...ABCDEFG is up BOARD=link CHROMEOS_RELEASE_VERSION=10738.0.0



Upgrade servo version to R69-10820.0.0:

xixuan@xixuan0:~/chromiumos/chromeos-admin$ stable_version -t cros beaglebone_servo R69-10820.0.0
Updating  Chrome OS  beaglebone_servo -> R69-10738.0.0 to R69-10820.0.0
xixuan@xixuan0:~/chromiumos/chromeos-admin$ stable_version -t cros guado_labstation R69-10820.0.0
Updating  Chrome OS  guado_labstation -> R69-10738.0.0 to R69-10820.0.0
Unchanged Firmware   guado_labstation -> Google_Guado.6301.108.2016_10_06_1219



Status: Assigned (was: Fixed)
Wait.  We need to monitor this particular update, because of bug 855792.
See the original description.

For now, I'd say wait until the morning, and then run a command like this:
    ssh $LABSTATION grep VERSION /etc/lsb-release

Do it against every available labstation.  Any host that hasn't updated
should be flagged for extra attention.
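The check described above could be scripted roughly as follows. This is only a sketch: the function names `classify_version` and `check_labstations` and the 10-second connect timeout are illustrative assumptions, not part of any existing lab tool. Only the `grep VERSION /etc/lsb-release` command comes from this comment.

```shell
#!/bin/bash

# Hypothetical helper: classify one labstation from its lsb-release text.
#   $1 = target version, e.g. 10820.0.0
#   $2 = output of "grep VERSION /etc/lsb-release" from that host
classify_version() {
    local target=$1 output=$2 version
    version=$(printf '%s\n' "$output" |
        sed -n 's/^CHROMEOS_RELEASE_VERSION=//p' | head -n 1)
    if [ -z "$version" ]; then
        echo unknown              # host unreachable or no version line
    elif [ "$version" = "$target" ]; then
        echo current
    else
        echo "stale($version)"    # flag for extra attention
    fi
}

# Hypothetical driver: reads labstation host names on stdin, one per line.
check_labstations() {
    local target=$1 host output
    while read -r host; do
        # -n keeps ssh from consuming the host list on stdin.
        output=$(ssh -n -o ConnectTimeout=10 "$host" \
            grep VERSION /etc/lsb-release 2>/dev/null)
        echo "$host $(classify_version "$target" "$output")"
    done
}
```

Feeding it the host list from the SQL query in the description, for example `check_labstations 10820.0.0 < hosts.txt`, would print one line per host, and anything not marked `current` is a candidate for follow-up.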

Comment 5 by xixuan@chromium.org, Jun 27 2018

Re #4, what's the command to get 'every available labstation'?
> Re #4, what's the command to get 'every available labstation'?

There's an SQL query mentioned in the description:
    select distinct value from afe_host_attributes
        where attribute="servo_host" and value like "chromeos%-labstation%";

You can do it with `atest`, but it seems to take more than an hour...

> Re #4, what's the command to get 'every available labstation'?

This script makes it easier:

#!/bin/bash

autotest-db <<END | sed 1d
select distinct value from afe_host_attributes
        where attribute="servo_host" and value like "chromeos%-labstation%";
END

Comment 8 by xixuan@chromium.org, Jun 27 2018

Cc: xixuan@chromium.org
Owner: jrbarnette@chromium.org
Of 113 labstations, only the following were updated:

('chromeos15-row4-rack12-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack19-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack15-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack19-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack5-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack10-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack12-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack7-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack16-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack18-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack22-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos1-row2-rack1-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack22-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row3-rack23-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos15-row1-rack2-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack23-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack19-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack23-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack7-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row2-rack21-labstation1', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row1-rack21-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos6-row4-rack18-labstation2', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')
('chromeos1-row2-rack10-labstation', 'CHROMEOS_RELEASE_VERSION=10820.0.0\n')


So a lot of hosts need extra attention, unless the issue is simply that they haven't had time to update.

Assigning to @Richard to decide whether to close this bug and file a new one about why they don't get updated.

> Assign to @Richard to decide whether to close this bug and file
> a new one about why they don't get updated.

The fact that only a minority updated is expected because of known
bugs, especially bug 855792.

The first step is to run "start update-engine" across the board; that'll
set at least some updates in motion.  I've started a script to do that;
I'll post the results after it's done.

----
: jrbarnette ~/tmp; cat handle-labstations 
#!/bin/bash

get_labstations() {
    autotest-db <<END | sed 1d
    select distinct value from afe_host_attributes
        where attribute="servo_host" and value like "chromeos%-labstation%";
END
}

for h in $(get_labstations)
do
    echo $h $(ssh -n $h start update-engine)
done

> The fact that only a minority updated is expected because of known
> bugs, especially bug 855792.

For clarity, I should add:  This bug was filed specifically as the
task to get the lab past the problems caused by the known bug.  So
all of the work is meant to go here, not on a new bug.
Results from the script are attached.

Lines like this show a labstation where the update-engine job
was down, and had to be restarted:
    chromeos6-row2-rack14-labstation2 update-engine start/running, process 13877

Lines with only a hostname on them mean update-engine was already running.
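As a rough illustration of reading that output, the two kinds of line can be counted with a small filter. The function name `summarize_restarts` is made up for this sketch, and it assumes the input has the "hostname, then any output from start" shape produced by the script above. Because that script captures only standard output, a host that could not be reached would presumably also show up as a bare hostname.

```shell
#!/bin/bash

# Hypothetical helper: reads "host [output of start update-engine]" lines
# on stdin and counts how many hosts fell into each case.
summarize_restarts() {
    awk '
        /start\/running/ { restarted++; next }   # job was down, now restarted
        NF == 1          { running++; next }     # hostname only
        NF > 0           { other++ }             # anything unexpected
        END {
            printf "restarted=%d already_running=%d other=%d\n",
                restarted, running, other
        }'
}
```

Running `summarize_restarts < labstations.out` would give a one-line summary of the attached results.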

The next step is to find out how many of the servers need to be rebooted...

labstations.out
6.5 KB
After some waiting and periodic checking, all online labstations
save one are either up-to-date, or sitting at UPDATED_NEED_REBOOT.

Tomorrow, we can see to rebooting any labstation that's still out of
date.

Wrinkles yet to be encountered/resolved:
  * Five labstations are offline; we'll need to sort them out.
  * It's possible that some of the labstations will actually update to
    R69-10738.0.0, because they'd already downloaded it.  So, we may
    need to baby-sit a second round of updates.
  * It's not known for certain that the last remaining labstation
    will download the build; we may have to force the issue.
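One possible way to do the periodic checking is sketched below. It assumes that `update_engine_client --status` prints a `CURRENT_OP=` line naming the current state, which should be verified against the version installed on the labstations. The function names `needs_reboot` and `list_need_reboot` are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if the given status text reports that the
# update has been applied and the host is waiting for a reboot.
#   $1 = text printed by "update_engine_client --status" (assumed format)
needs_reboot() {
    printf '%s\n' "$1" | grep -q 'CURRENT_OP=.*UPDATED_NEED_REBOOT'
}

# Hypothetical driver: host names on stdin; prints hosts awaiting reboot.
list_need_reboot() {
    local host st
    while read -r host; do
        # -n keeps ssh from consuming the host list on stdin.
        st=$(ssh -n -o ConnectTimeout=10 "$host" \
            update_engine_client --status 2>&1)
        if needs_reboot "$st"; then
            echo "$host"
        fi
    done
}
```

Hosts that are neither on the target version nor printed by `list_need_reboot` are the ones that may need to have the update forced.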

As of this morning, there are 79 labstations ready for reboot.
Because we need to know that every host gets rebooted with this
update, and because of bug 735619, I plan to manually reboot
all of those hosts sometime today.

Rebooting a labstation carries only a low risk of disruption:
Servo is unused most of the time for most hosts.  The most notable
risk is that a repair task that would have succeeded may fail; this
would cause the DUT to be unusable for a few hours, but shouldn't
cause tests to fail.

> Rebooting a labstation carries only a low risk of disruption:

Hmmm...  FAFT testing is an exception here.  I'm going to see if
I can apply a filter to avoid affecting labstations attached to
DUTs for FAFT.
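A filter of that kind could be as simple as the sketch below. The thread does not say how the FAFT-attached labstations are identified, so the exclusion file is a hypothetical input prepared by hand, and `exclude_hosts` is an illustrative name.

```shell
#!/bin/bash

# Hypothetical helper: reads candidate host names on stdin and drops any
# that appear in the exclusion file.
#   $1 = file listing labstations to leave alone, one per line
exclude_hosts() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -vxF -f "$1"
}
```

For example, `exclude_hosts faft-labstations.txt < ready-for-reboot.txt` would print only the hosts that are safe to reboot. Whole-line matching matters here, because names such as `...-labstation` and `...-labstation1` share a prefix.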

> Hmmm...  FAFT testing is an exception here.  I'm going to see if
> I can apply a filter to avoid affecting labstations attached to
> DUTs for FAFT.

I've identified 11 labstations to be excluded for this reason.
I've run the update.  As expected, some of the hosts were waiting
to reboot to an out-of-date build, and will need to go through
another update cycle:
    chromeos6-row1-rack19-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row1-rack24-labstation CHROMEOS_RELEASE_VERSION=10178.0.0
    chromeos6-row19-rack13-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row19-rack14-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row19-rack15-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row2-rack20-labstation1 CHROMEOS_RELEASE_VERSION=10526.0.0
    chromeos6-row3-rack19-labstation CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row4-rack13-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row4-rack17-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row4-rack20-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row4-rack4-labstation1 CHROMEOS_RELEASE_VERSION=10635.0.0
    chromeos6-row4-rack6-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row5-rack2-labstation CHROMEOS_RELEASE_VERSION=10635.0.0
    chromeos6-row6-rack2-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row6-rack2-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row6-rack3-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0

The log of all rebooted labstations is in the attached "reboot.out" file.

reboot.out
4.5 KB
All 16 of the labstations listed as out-of-date have now downloaded
the latest lab image.  Manual reboot will be done as needed sometime
today.

The latest round of reboots is complete.

I re-gathered version info from all labstations; the attached
file has the full summary.

The following 9 hosts are up, but out of date:
    chromeos6-row2-rack20-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row1-rack2-labstation CHROMEOS_RELEASE_VERSION=10526.0.0
    chromeos6-row2-rack24-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row4-rack2-labstation1 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row4-rack7-labstation1 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row4-rack13-labstation2 CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row4-rack17-labstation2 CHROMEOS_RELEASE_VERSION=10738.0.0
    chromeos6-row3-rack20-labstation CHROMEOS_RELEASE_VERSION=10658.0.0
    chromeos6-row5-rack4-labstation CHROMEOS_RELEASE_VERSION=10635.0.0

labstations-versions.out
11.4 KB
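To see at a glance how far behind the stragglers are, version lines like the ones above can be tallied per build. This is a sketch that assumes each input line has the form `hostname CHROMEOS_RELEASE_VERSION=x.y.z`, as in the list in this comment. The layout of the attached file is not shown here, so it may need adjusting, and `tally_versions` is an illustrative name.

```shell
#!/bin/bash

# Hypothetical helper: reads "host CHROMEOS_RELEASE_VERSION=x.y.z" lines on
# stdin and prints "version count" pairs, sorted by version string.
tally_versions() {
    awk '
        NF >= 2 {
            sub(/^CHROMEOS_RELEASE_VERSION=/, "", $2)   # keep only x.y.z
            count[$2]++
        }
        END { for (v in count) print v, count[v] }' | sort
}
```

Applied to the nine hosts listed above, this reports one host each on 10526.0.0 and 10635.0.0, four on 10658.0.0, and three on 10738.0.0.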
> The following 9 hosts are up, but out of date:

It should be added:  This should be just the remnant of hosts that
were attached to one or more FAFT DUTs.  Those were never manually
rebooted, so this is largely expected.

Of the remaining hosts needing reboot, only three are waiting to
update to an out-of-date build:
    chromeos6-row1-rack2-labstation
    chromeos6-row4-rack13-labstation2
    chromeos6-row4-rack2-labstation1

I'm looking into options for these right now.

I checked the FAFT DUTs attached to the remaining problem children.
For two labstations, the DUTs were all idle, so I manually rebooted.
Now only one problem child remains:
    chromeos6-row4-rack2-labstation1

Status: Fixed (was: Assigned)
I'm declaring victory here.  There have been at least two updates past
version R69-10820.0.0.  There are still labstations behind target,
but this bug isn't the way to get them fixed.
