Provisioning fallback from quick-provision to update_engine should be restricted
Reported by jrbarnette@chromium.org, Jul 19
Issue description
Currently, during provisioning, if the quick-provision script
reports any failure, the provisioning code automatically retries
by using the update_engine to download and install the build.
A spot check of 24 hours worth of provision failures shows that
in all cases where the provision failure was pegged with "Download
and install failed", quick-provisioning simply died in the midst of
downloading valid payload data. Although in such a case retrying
might be needed, it seems more useful to fail, at least until we
can learn more about why these downloads fail in the first place.
So, we should change the provisioning logic to something more like this (a rough sketch follows the list):
* If quick-provisioning payloads are staged on the devserver, use
quick-provisioning but don't retry failures.
* Otherwise, use update_engine.
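For illustration only, here is a minimal sketch of that flow in Python. Every helper name below is a hypothetical placeholder, not the real autoupdater or devserver interface:

# Hypothetical sketch of the proposed flow; every helper below is a
# placeholder, not the actual provisioning code.

def quick_provision_payloads_staged(devserver, build):
    """Placeholder: return True if quick-provision payloads are staged."""
    raise NotImplementedError('placeholder')

def run_quick_provision(host, build, devserver):
    """Placeholder: install `build` on `host` via the quick-provision script."""
    raise NotImplementedError('placeholder')

def run_update_engine(host, build, devserver):
    """Placeholder: install `build` on `host` via update_engine."""
    raise NotImplementedError('placeholder')

def provision(host, build, devserver):
    if quick_provision_payloads_staged(devserver, build):
        # Payloads are staged: use quick-provision, and let any failure
        # propagate instead of falling back to update_engine.
        run_quick_provision(host, build, devserver)
    else:
        # No quick-provision payloads staged: use update_engine.
        run_update_engine(host, build, devserver)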
Comment 2, Jul 21
> The spot check was incomplete; download failure is the most common,
> but not the only source of failure. However, the conclusion is largely
> unchanged: by and large the observed failures did not seem to justify
> retrying with update_engine.

Digging into 24 hours worth of real provision logs, I've come to a different conclusion. For the sample period, a certain number of quick-provision failures became provision successes because of the retry implicit in switching to use update_engine. My initial estimate for the sample is that ~40% of quick-provision failures will succeed when retried with update_engine.

At first blush, the most likely explanation for the successful retries is not that update_engine is more reliable, but that devservers can get under enough load that retries are an effective approach. That suggests that we should retry any failed download operation, possibly on a different devserver, and without switching the algorithm for the download.
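Purely as a sketch of that suggestion (not actual code from the provisioning stack), the retry could look something like the following; `quick_provision_download` and `DownloadError` are hypothetical names:

class DownloadError(Exception):
    """Placeholder for whatever error the download step raises."""

def quick_provision_download(host, build, devserver):
    """Placeholder for the quick-provision download/install step."""
    raise NotImplementedError('placeholder')

def provision_with_retry(host, build, devservers):
    # Assumes at least one candidate devserver is passed in.
    last_error = None
    for devserver in devservers:
        try:
            # Same download algorithm on every attempt; only the devserver changes.
            quick_provision_download(host, build, devserver)
            return
        except DownloadError as error:
            last_error = error
    raise last_error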
Comment 3, Jul 23
c#2 doesn't sound right to me. Do you have details on the failures? Is it failures that happen mid-transfer, or ones that fail to start? I.e., what does individual log analysis say instead of final results? I think we really should understand the failures before just adding another retry, because your analysis in c#2 doesn't match what I've seen in the many provision failures whose logs I've looked over.
Comment 4, Jul 23
> Do you have details on the failures? Is it failures that
> happen mid-transfer, or ones that fail to start? I.e., what
> does individual log analysis say instead of final results?

A substantial number of the logs that I looked at showed failures occurring mid-transfer. Although this was a common reason for failure, it wasn't the only one.

I'm in the process now of downloading the necessary logs to get a more comprehensive view of the failure symptoms. That will also give a better view of the actual rate of retry success.
Comment 5, Jul 24
I've now completed downloading 24 hours of provision task data. The
data covers most (but not quite all) shards. For reference, here's the
Viceroy graph of failure causes in the time interval:
http://shortn/_IQpTyOQAGu
The failure reason of interest is "Image failed to download and install."
In my browser, that reason renders as a thick maroon band in the center
of the graph.
I ran a script across the logs to classify the various failures. Here's
the summary of the failure breakdown as a percentage of all the jobs:
0.1% Quick provision failed, but retry with update-engine passed.
0.5% Quick provision failed, and retry with update-engine failed.
2.1% Provisioning failed for some reason not involving quick provision.
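For illustration, here is a much-simplified sketch of the kind of classification described above. The log directory layout is made up; the "fall back" marker is the message visible in the log excerpt later in this comment, and treating "Image failed to download and install" as a per-job log marker is an assumption:

import collections
import glob

# Much-simplified, hypothetical sketch of a per-job log classifier.
FALLBACK_MSG = 'quick-provision script failed; will fall back to update_engine'
PROVISION_FAILED_MSG = 'Image failed to download and install'

def classify_logs(pattern='provision_logs/*/autoserv.DEBUG'):
    counts = collections.Counter()
    for path in glob.glob(pattern):
        with open(path, errors='replace') as log:
            text = log.read()
        if FALLBACK_MSG in text:
            if PROVISION_FAILED_MSG in text:
                counts['quick-provision failed; update_engine retry failed'] += 1
            else:
                counts['quick-provision failed; update_engine retry passed'] += 1
        elif PROVISION_FAILED_MSG in text:
            counts['provision failed without a quick-provision failure'] += 1
        else:
            counts['provision passed'] += 1
    return counts

if __name__ == '__main__':
    for bucket, count in classify_logs().most_common():
        print('%6d  %s' % (count, bucket))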
I also dug into logs of selected failures. Although failure in the
middle of download is a common symptom, it's not clear that the download
is the cause. Here's an excerpt from a sample failure:
07/19 11:29:49.987 DEBUG| utils:0287| [stdout] 0K ........ ........ ........ ........ 2% 30.0M 36s
07/19 11:29:51.049 DEBUG| utils:0287| [stdout] 32768K ........ ........ ........ ........ 5% 30.3M 35s
07/19 11:29:52.329 DEBUG| utils:0287| [stdout] 65536K ........ ........ ........ ........ 8% 25.0M 37s
07/19 11:34:35.240 DEBUG| utils:0306| [stdout] 98304K ........ ........ ........ .....
07/19 11:34:35.242 ERROR| autoupdater:0902| quick-provision script failed; will fall back to update_engine.
[ ... ]
Exit status: 255
Duration: 287.391553879
The command failed during the download, but the exit status is 255. That exit status is used by ssh to indicate a failure in the ssh connection, rather than a failure in the underlying command. The error code could be misleading: commands other than ssh can return that code for failures, and some have. Nonetheless, we may be looking at failures in the ssh connections as a significant factor.
Not all of the failures were code 255. Here's the breakdown of exit codes
from quick-provision (count of failures, followed by exit status):
13 Exit status: 127
31 Exit status: 1
73 Exit status: 255
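As a side note on interpreting that 255: ssh itself exits with 255 when the connection or transport fails, and otherwise passes through the remote command's exit status. A small illustrative sketch (the host name and command are made-up examples, not the lab's real invocation):

import subprocess

def run_over_ssh(host, command):
    """Run `command` on `host` via ssh and classify the result (illustrative)."""
    result = subprocess.run(['ssh', host, command])
    if result.returncode == 0:
        return 'command succeeded'
    if result.returncode == 255:
        # Most likely an ssh connection failure, though (as noted above)
        # some commands can return 255 themselves, so this is a heuristic.
        return 'probable ssh connection failure'
    return 'remote command failed with status %d' % result.returncode

# Hypothetical invocation; 'chromeos-dut' is a made-up host, 'true' a trivial command.
print(run_over_ssh('chromeos-dut', 'true'))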
Comment 6, Jul 24
That Viceroy graph includes failures which likely should not be included in our stats for other labs. Also, it includes DUTs that are in a repair loop which also throws off stats.

1%, while not ideal, is roughly the threshold I thought would be reasonable for new devices for the provisioning criteria. If we're operating below that, I'm tempted to say that we should just fail the entire auto update and allow an entirely new subsequent auto update to occur instead of adding more retries to a single attempt. Ideally, the scheduler would be able to do that much faster than can be done now, but I think that's overall preferable to making it more and more complicated again which will just get us back to where we were before.

With regards to individual failures: Did the DUT reboot? Did USB ethernet fail and need to be repaired? Did other DUTs on the same rack experience failures at the same time? Did other provisions from the same shard/devserver fail at the same time?

Has this gotten worse since moving to shard instead of devserver for running the ssh process? I could see that being an issue. Connections should not just be going away.

I'd much rather understand and drive down the actual causes of failures than try to just paper over them with a fat retry.
Comment 7, Jul 24
> That Viceroy graph includes failures which likely should not
> be included in our stats for other labs. Also, it includes DUTs
> that are in a repair loop which also throws off stats.
The numbers cited in #c5 are based on the downloaded logs of >19000
provision tasks, and don't suffer from the known flaws in the graph.
Although the graph is imperfect, it's sufficient to demonstrate the
relative frequency of the specific failure w.r.t. other failures.
Specific issues, and why they can be discounted or accounted for:
* The graph includes failures from autoupdate_EndToEnd test.
However, those failures have distinct, easily recognized failure
reasons, such as "(0) Unknown exception". I'm working to separate
the AU test metrics from Provision tasks.
* The graph includes some repair actions, both success and failure.
Those may be significant. However, if install during repair fails
because of "Image failed to download and install", it's still relevant
here. In any event, as of yesterday, install operations during repair
are no longer included in the reported metrics.
* DUTs in repair loops are exceedingly rare. At this moment, there are only
5 in the inventory; that's typical. It would be very hard for that few
DUTs to make a dent in statistics based on ~19000 provision tasks/day.
Moreover, as with repairs, if we're seeing the "Image failed to download
and install" symptom it's likely still relevant.
> 1%, while not ideal, is roughly the threshold I thought would be
> reasonable for new devices for the provisioning criteria. If we're
> operating below that, I'm tempted to say that we should just fail the
> entire auto update and allow an entirely new subsequent auto update to
> occur instead of adding more retries to a single attempt. Ideally, the
> scheduler would be able to do that much faster than can be done now, but
> I think that's overall preferable to making it more and more complicated
> again which will just get us back to where we were before.
Yeah, if we just dispense altogether with retrying, only about 1 in 1000
provision tasks will fail that would otherwise have passed (that's the
0.1% bucket from #c5), and failures in general will get faster. Also,
1 in 1000 is likely the worst case. If you look at the Viceroy graph for
more recent days, the rate of the problem symptom has gone down considerably.
> With regards to individual failures: Did the DUT reboot? Did USB
> ethernet fail and need to be repaired? Did other DUTs on the same
> rack experience failures at the same time? Did other provisions
> from the same shard/devserver fail at the same time?
(I'm assuming these questions are asked w.r.t. the quick-provision
failures under discussion.)
I don't have good data on any of these questions. Based on anecdote,
these are my first guesses:
* DUTs don't suffer any system failure, be it a kernel crash or a
network outage.
* DUT location largely is not correlated with the failures. However,
I do know of one DUT that has an exceptionally high count of this
kind of failure.
* I don't have good data on temporal distribution of failures. I'd
want better data before drawing any conclusions about correlations
with shards or devservers.
> Has this gotten worse since moving to shard instead of devserver for
> running the ssh process? I could see that being an issue.
The short answer is that looking at the graphs from before and after
the switchover, there doesn't seem to be any change that can be attributed
to the switchover per se. There _were_ changes in the rate, but I've
traced them to other causes not related to the switchover (for example,
see bug 855139).
However, by and large, I think this is impossible to know. Distinguishing
a clear failure symptom from the old code is hard; the Monarch data can't
show this specific kind of failure at all for the old strategy (the main
point of the switchover was that the new strategy provides this much more
precise description of the failure).
It would be possible to download logs from before the provisioning
change to see how often quick-provision failed there. However, you'd
need much more than 24 hours of data to know for sure that there was
a trend. The viceroy graph shows that the rate of quick-provision
failure has been varying significantly over time. Also, with the
aggregate failure rate of quick-provision at less than 1% of all provision
tasks, reliably measuring a statistically significant difference is just
plain hard.
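To put a rough number on that last point (this is a back-of-the-envelope illustration, not data from this bug): using the standard normal-approximation sample-size formula for comparing two proportions, distinguishing, say, a 0.6% failure rate from a 0.5% one at a two-sided 5% significance level with 80% power takes on the order of 86,000 tasks per group, i.e. several days of data at ~19,000 provision tasks/day for each period being compared.

import math

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Approximate per-group sample size to detect p1 vs. p2.

    z_alpha and z_beta are hard-coded for a two-sided 5% significance
    level and 80% power.  Illustrative calculation only.
    """
    p_bar = (p1 + p2) / 2.0
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(math.ceil(numerator / (p1 - p2) ** 2))

# Hypothetical rates of roughly the magnitude discussed above.
print(sample_size_two_proportions(0.006, 0.005))  # ~86,000 tasks per group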
Comment 8, Sep 13
It sounds like there aren't any viable action items here, other than a vague "get better data".