quick provision erroneously checks version before reboot on veyron_rialto |
|||
Issue descriptionUsing devserver quick provision on veyron_rialto, provisions fail because the reboot on the DUT takes too long and the devserver checks the version before the reboot happens. To reproduce (set ds, dut appropriately) In chroot, start devserver: cp ~/trunk/src/platform/dev/quick-provision/quick-provision /var/lib/devserver/static ~/trunk/src/platform/dev/devserver.py --production --port 8080 --static_dir /var/lib/devserver/static Either in or out of chroot, request provision from devserver: build=veyron_rialto-paladin/R64-10083.0.0-rc2 ds=okaybye2.mtv.corp.google.com:8080 dut=chromeos2-row1-rack10-host3.cros curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful" time curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True" A simple fix is to add a sleep for 2-3 seconds after the quick provision is done by the devserver before the state is checked. dgarrett/xixuan: any suggestions on better ways than a hardcoded sleep? dianders: any idea why veyron_rialto takes longer than other devices to reboot after a reboot is issued? I tested this by pinging the device and issuing a reboot and saw that veyron_rialto took 2-3 seconds to stop responding to pings compared to a cyan which was almost instantaneous.
,
Nov 15 2017
,
Nov 15 2017
thanks Niranjan already replied to D. Riley about it
,
Nov 16 2017
re c#1: veyron_rialto seems to actually provision very quickly (probably because of small image size), and veyron_jaq doesn't have the same lag in responding to reboot. In the end I think provisioning needs to be more forgiving, but still wanted to understand if we can the lag.
,
Nov 16 2017
Instead of a sleep, you can check uptime when you check the version. If uptime is too long, you haven't rebooted yet. A note.... are you running postinst? If not, you probably need a short sleep afterwards anyway to allow disk flushes, even in the face of driver/drive firmware/hardware bugs.
,
Nov 16 2017
Instead of uptime, you could do something like touch a file in /tmp. If the file still exists, you haven't yet rebooted.
,
Nov 16 2017
I'm thinking of logic like: - record time - do a check of the version - if the check fails because version is the same and time elapsed is < N seconds, sleep M seconds and repeat (N is number of seconds to wait for reboot, M is time in between checks) I have a small delay right before rebooting and there is already a 10 second delay in postinst after the sync so I don't think we need more sleeps. I'd rather not rely on special files and synchronization/handshakes between two different parts of code if we can just have simple logic on one side to handle things properly.
,
Nov 16 2017
Have you thought about '/proc/sys/kernel/random/boot_id'? That's what autotest uses to detect a reboot? --- > veyron_rialto seems to actually provision very quickly (probably because of small > image size), and veyron_jaq doesn't have the same lag in responding to reboot. veyron_jaq has a massive heatsink, so it won't thermally throttle in the same way as rialto. rialto (as I understand it) is the only product we support that has no heatsink at all and no fans. I'm not saying I'm 100% sure that this is why it takes so long to reboot--just that it's my first guess.
,
Nov 16 2017
Waiting for N seconds is a heuristic. It was my first thought as well, but we can do better. I like: /proc/sys/kernel/random/boot_id The /tmp/file didn't require any handshaking, it just depended on /tmp being wiped during a reboot. But boot_id is better.
,
Nov 16 2017
When I say handshaking/synchronization I mean that you need code that is synchronized between the quick provision script and the check script. If you check boot_id, you need a baseline to compare it against which we don't have. And assuming it doesn't match, you still need to delay some amount of time and try again. I've posted a change at https://chromium-review.googlesource.com/#/c/chromiumos/chromite/+/775658 which attempts, and they does a delay if the failure was detected within a small amount of time. It doesn't need initial state and shouldn't introduce any delays unless reboot is slow to trigger, at which point it gives it 5 seconds.
,
Nov 18 2017
I ran into another related failure that isn't handled by my fix at https://chromium-review.googlesource.com/775658 Necessary conditions: - device is provisioning to the same version as is already installed - reboot is slow to occur (ie like on veyron_rialto) Timeline of events: t0: version A is running on DUT provision 1: request to provision A provision 1: version A is provisioned tx: provision 1: dut starts reboot (sleep 2; reboot) provision 1: version A is checked provision 1: provision rpc returns success provision 2: request to provision B tx+4: dut actually reboots provision 2: fails
,
Dec 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/1098c63be5ba332c2863de1b2f047acbc203b6ad commit 1098c63be5ba332c2863de1b2f047acbc203b6ad Author: David Riley <davidriley@chromium.org> Date: Wed Dec 06 02:00:45 2017 auto_updater: Allow delay when checking versions to handle slow rebooting. BUG= chromium:785065 TEST=build=veyron_rialto-paladin/R64-10083.0.0-rc2; curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful"; curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True" Change-Id: Ib7935fd1d190cccf8346935f523f8e6c1b5f3577 Reviewed-on: https://chromium-review.googlesource.com/775658 Commit-Ready: David Riley <davidriley@chromium.org> Tested-by: David Riley <davidriley@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/1098c63be5ba332c2863de1b2f047acbc203b6ad/lib/auto_updater.py
,
Dec 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/51e417d4cf8d7306ebc8930a3c1a153adf77372b commit 51e417d4cf8d7306ebc8930a3c1a153adf77372b Author: David Riley <davidriley@chromium.org> Date: Fri Dec 08 11:59:08 2017 auto_updater: Add check to ensure device has rebooted. BUG= chromium:785065 TEST=repeatedly provision veyron_rialto with quick provision Change-Id: I08eb869e52b73bcb45c33597cb31742bc775231b Reviewed-on: https://chromium-review.googlesource.com/804894 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: David Riley <davidriley@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/remote_access.py [modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/remote_access_unittest.py [modify] https://crrev.com/51e417d4cf8d7306ebc8930a3c1a153adf77372b/lib/auto_updater.py
,
Dec 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/6c46725190160d9fe677852864551564cc9cab93 commit 6c46725190160d9fe677852864551564cc9cab93 Author: David Riley <davidriley@chromium.org> Date: Tue Dec 12 19:09:09 2017 quick-provision: Ensure that device reboots when checking provision. If a device is slow to reboot, it was possible for the check to erroneously check based on the original state before rebooting. BUG= chromium:785065 CQ-DEPEND=CL:804894 TEST=build=veyron_rialto-paladin/R64-10083.0.0-rc2; curl "http://${ds}/stage?archive_url=gs://chromeos-image-archive/${build}&artifacts=quick_provision,stateful"; curl "http://${ds}/cros_au?full_update=False&force_update=True&build_name=${build}&host_name=${dut}&async=False&clobber_stateful=True&quick_provision=True" Change-Id: I8933c8cc3a5aadb342c927c0fe12eafacc8fe899 Reviewed-on: https://chromium-review.googlesource.com/804961 Commit-Ready: David Riley <davidriley@chromium.org> Tested-by: David Riley <davidriley@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/6c46725190160d9fe677852864551564cc9cab93/cros_update.py
,
Jan 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/08fd13ce6fc9ad727feab6714d80c63fcefdb4f7 commit 08fd13ce6fc9ad727feab6714d80c63fcefdb4f7 Author: David Riley <davidriley@chromium.org> Date: Wed Jan 10 23:44:45 2018 quick-provision: Ensure no concurrent quick provisioning. Ensure that there is no concurrent quick provisioning through the use of a lock file. Similarly, if a quick provision has successfully completed on the device, do not make a second attempt and instead wait for it to reboot. This logic will need to be revisited if quick provisioning becomes the only form of provisioning without fallback to the legacy path. BUG= chromium:785065 TEST=loadtest.py -t 1 -s 0; ssh dut /tmp/quick-provision Change-Id: I79e5e6aeb26e4835165fcd510b769efbe7b2a841 Reviewed-on: https://chromium-review.googlesource.com/857592 Commit-Ready: David Riley <davidriley@chromium.org> Tested-by: David Riley <davidriley@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/08fd13ce6fc9ad727feab6714d80c63fcefdb4f7/quick-provision/quick-provision
,
Mar 6 2018
|
|||
►
Sign in to add a comment |
|||
Comment 1 by diand...@chromium.org
, Nov 15 2017