New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 785065 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

quick provision erroneously checks version before reboot on veyron_rialto

Project Member Reported by davidri...@chromium.org, Nov 15 2017

Issue description

Using devserver quick provision on veyron_rialto, provisions fail because the reboot on the DUT takes too long and the devserver checks the version before the reboot happens.

To reproduce (set ds, dut appropriately)
In chroot, start devserver:
cp ~/trunk/src/platform/dev/quick-provision/quick-provision /var/lib/devserver/static
~/trunk/src/platform/dev/devserver.py --production --port 8080 --static_dir /var/lib/devserver/static

Either in or out of chroot, request provision from devserver:
build=veyron_rialto-paladin/R64-10083.0.0-rc2
ds=okaybye2.mtv.corp.google.com:8080
dut=chromeos2-row1-rack10-host3.cros
curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful"
time curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True"

A simple fix is to add a sleep for 2-3 seconds after the quick provision is done by the devserver before the state is checked.  

dgarrett/xixuan: any suggestions on better ways than a hardcoded sleep?

dianders: any idea why veyron_rialto takes longer than other devices to reboot after a reboot is issued?  I tested this by pinging the device and issuing a reboot and saw that veyron_rialto took 2-3 seconds to stop responding to pings compared to a cyan which was almost instantaneous.
 
Cc: amstan@chromium.org
+amstan

Presumably veyron_rialto is just slow as molasses since it's a rk3288 CPU w/ no heatsink and thus is running in a massively thermally throttled mode almost always.

Comment 2 by amstan@chromium.org, Nov 15 2017

Cc: drustsm...@google.com rialto-eng@google.com
thanks Niranjan already replied to D. Riley about it
re c#1:

veyron_rialto seems to actually provision very quickly (probably because of small image size), and veyron_jaq doesn't have the same lag in responding to reboot.

In the end I think provisioning needs to be more forgiving, but still wanted to understand if we can the lag.
Instead of a sleep, you can check uptime when you check the version. If uptime is too long, you haven't rebooted yet.

A note.... are you running postinst? If not, you probably need a short sleep afterwards anyway to allow disk flushes, even in the face of driver/drive firmware/hardware bugs.
Instead of uptime, you could do something like touch a file in /tmp. If the file still exists, you haven't yet rebooted.
I'm thinking of logic like: 
- record time
- do a check of the version
- if the check fails because version is the same and time elapsed is < N seconds, sleep M seconds and repeat (N is number of seconds to wait for reboot, M is time in between checks)


I have a small delay right before rebooting and there is already a 10 second delay in postinst after the sync so I don't think we need more sleeps.

I'd rather not rely on special files and synchronization/handshakes between two different parts of code if we can just have simple logic on one side to handle things properly.
Have you thought about '/proc/sys/kernel/random/boot_id'?  That's what autotest uses to detect a reboot?

---

> veyron_rialto seems to actually provision very quickly (probably because of small 
> image size), and veyron_jaq doesn't have the same lag in responding to reboot.

veyron_jaq has a massive heatsink, so it won't thermally throttle in the same way as rialto.  rialto (as I understand it) is the only product we support that has no heatsink at all and no fans.

I'm not saying I'm 100% sure that this is why it takes so long to reboot--just that it's my first guess.
Waiting for N seconds is a heuristic. It was my first thought as well, but we can do better.

I like: /proc/sys/kernel/random/boot_id

The /tmp/file didn't require any handshaking, it just depended on /tmp being wiped during a reboot. But boot_id is better.


When I say handshaking/synchronization I mean that you need code that is synchronized between the quick provision script and the check script.

If you check boot_id, you need a baseline to compare it against which we don't have.  And assuming it doesn't match, you still need to delay some amount of time and try again.

I've posted a change at https://chromium-review.googlesource.com/#/c/chromiumos/chromite/+/775658 which attempts, and they does a delay if the failure was detected within a small amount of time.  It doesn't need initial state and shouldn't introduce any delays unless reboot is slow to trigger, at which point it gives it 5 seconds.
I ran into another related failure that isn't handled by my fix at https://chromium-review.googlesource.com/775658 

Necessary conditions:
- device is provisioning to the same version as is already installed
- reboot is slow to occur (ie like on veyron_rialto)

Timeline of events:
t0: version A is running on DUT
provision 1: request to provision A
provision 1: version A is provisioned
tx: provision 1: dut starts reboot (sleep 2; reboot)
provision 1: version A is checked
provision 1: provision rpc returns success
provision 2: request to provision B
tx+4: dut actually reboots
provision 2: fails
Project Member

Comment 12 by bugdroid1@chromium.org, Dec 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/1098c63be5ba332c2863de1b2f047acbc203b6ad

commit 1098c63be5ba332c2863de1b2f047acbc203b6ad
Author: David Riley <davidriley@chromium.org>
Date: Wed Dec 06 02:00:45 2017

auto_updater: Allow delay when checking versions to handle slow rebooting.

BUG= chromium:785065 
TEST=build=veyron_rialto-paladin/R64-10083.0.0-rc2; curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful"; curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True"

Change-Id: Ib7935fd1d190cccf8346935f523f8e6c1b5f3577
Reviewed-on: https://chromium-review.googlesource.com/775658
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/1098c63be5ba332c2863de1b2f047acbc203b6ad/lib/auto_updater.py

Project Member

Comment 13 by bugdroid1@chromium.org, Dec 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/51e417d4cf8d7306ebc8930a3c1a153adf77372b

commit 51e417d4cf8d7306ebc8930a3c1a153adf77372b
Author: David Riley <davidriley@chromium.org>
Date: Fri Dec 08 11:59:08 2017

auto_updater: Add check to ensure device has rebooted.

BUG= chromium:785065 
TEST=repeatedly provision veyron_rialto with quick provision

Change-Id: I08eb869e52b73bcb45c33597cb31742bc775231b
Reviewed-on: https://chromium-review.googlesource.com/804894
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/remote_access.py
[modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/remote_access_unittest.py
[modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/auto_updater.py

Project Member

Comment 14 by bugdroid1@chromium.org, Dec 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/6c46725190160d9fe677852864551564cc9cab93

commit 6c46725190160d9fe677852864551564cc9cab93
Author: David Riley <davidriley@chromium.org>
Date: Tue Dec 12 19:09:09 2017

quick-provision: Ensure that device reboots when checking provision.

If a device is slow to reboot, it was possible for the check to
erroneously check based on the original state before rebooting.

BUG= chromium:785065 
CQ-DEPEND=CL:804894
TEST=build=veyron_rialto-paladin/R64-10083.0.0-rc2; curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful"; curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True"

Change-Id: I8933c8cc3a5aadb342c927c0fe12eafacc8fe899
Reviewed-on: https://chromium-review.googlesource.com/804961
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/6c46725190160d9fe677852864551564cc9cab93/cros_update.py

Project Member

Comment 16 by bugdroid1@chromium.org, Jan 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/08fd13ce6fc9ad727feab6714d80c63fcefdb4f7

commit 08fd13ce6fc9ad727feab6714d80c63fcefdb4f7
Author: David Riley <davidriley@chromium.org>
Date: Wed Jan 10 23:44:45 2018

quick-provision: Ensure no concurrent quick provisioning.

Ensure that there is no concurrent quick provisioning through the use
of a lock file.  Similarly, if a quick provision has successfully
completed on the device, do not make a second attempt and instead wait
for it to reboot.

This logic will need to be revisited if quick provisioning becomes the
only form of provisioning without fallback to the legacy path.

BUG= chromium:785065 
TEST=loadtest.py -t 1 -s 0; ssh dut /tmp/quick-provision

Change-Id: I79e5e6aeb26e4835165fcd510b769efbe7b2a841
Reviewed-on: https://chromium-review.googlesource.com/857592
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/08fd13ce6fc9ad727feab6714d80c63fcefdb4f7/quick-provision/quick-provision

Status: Fixed (was: Untriaged)

Sign in to add a comment