
Issue 865467


Issue metadata

Status: Unconfirmed
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Provisioning fallback from quick-provision to update_engine should be restricted

Reported by jrbarnette@chromium.org, Jul 19

Issue description

Currently, during provisioning, if the quick-provision script
reports any failure, the provisioning code automatically retries
by using the update_engine to download and install the build.

A spot check of 24 hours worth of provision failures shows that
in all cases where the provision failure was pegged with "Download
and install failed", quick-provisioning simply died in the midst of
downloading valid payload data.  Although in such a case retrying
might be needed, it seems more useful to fail, at least until we
can learn more about why these downloads fail in the first place.

So, we should change the provisioning logic to something more like this:
  * If quick-provisioning payloads are staged on the devserver, use
    quick-provisioning but don't retry failures.
  * Otherwise, use update_engine.
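
A minimal sketch of that policy, for illustration only:  this is not
the actual autoupdater code, and the three helper functions are
hypothetical stand-ins for the real devserver and autoupdater calls.

    def quick_provision_payloads_staged(build):
        """Stand-in: ask the devserver whether quick-provision payloads exist."""
        raise NotImplementedError

    def run_quick_provision(host, build):
        """Stand-in: run the quick-provision script on the DUT."""
        raise NotImplementedError

    def run_update_engine_update(host, build):
        """Stand-in: install the build via update_engine."""
        raise NotImplementedError

    def provision(host, build):
        """Install `build` on `host` under the proposed policy."""
        if quick_provision_payloads_staged(build):
            # Payloads are staged: use quick-provision, and let any
            # failure surface instead of retrying with update_engine.
            run_quick_provision(host, build)
        else:
            # No staged payloads: use the update_engine path directly.
            run_update_engine_update(host, build)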

 
> A spot check of 24 hours worth of provision failures shows that
> in all cases where the provision failure [ ... ]

The spot check was incomplete; download failure is the most common,
but not the only source of failure.  However, the conclusion is largely
unchanged:  by and large the observed failures did not seem to justify
retrying with update_engine.

> The spot check was incomplete; download failure is the most common,
> but not the only source of failure.  However, the conclusion is largely
> unchanged:  by and large the observed failures did not seem to justify
> retrying with update_engine.

Digging into 24 hours worth of real provision logs, I've come to a
different conclusion.  For the sample period, a certain number of
quick-provision failures became provision successes because of the
retry implicit in switching to update_engine.  My initial estimate
for the sample is that ~40% of quick-provision failures will succeed
when retried with update_engine.

At first blush, the most likely explanation for the successful retries
is not that update_engine is more reliable, but that devservers can come
under enough load that retries are an effective approach.  That suggests
we should retry any failed download operation, possibly against a
different devserver, and without switching the download algorithm.
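
As a sketch of that idea (not a proposed patch; the download helper and
exception type below are hypothetical):

    class DownloadError(Exception):
        """Stand-in for whatever error the download path raises."""

    def download_with_quick_provision(host, build, devserver):
        """Stand-in: run the quick-provision download against one devserver."""
        raise NotImplementedError

    def download_with_retry(host, build, devservers):
        """Retry the same download algorithm against successive devservers."""
        errors = []
        for devserver in devservers:
            try:
                return download_with_quick_provision(host, build, devserver)
            except DownloadError as e:
                # Same algorithm; just a different devserver next time.
                errors.append(e)
        raise DownloadError('all devservers failed: %r' % errors)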

c#2 doesn't sound right to me.

Do you have details on the failures?  Is it failures that happen mid-transfer, or ones that fail to start?  ie, what does individual log analysis say instead of final results.

I think we really should understand the failures before just adding another retry, because your analysis in c#2 doesn't match what I've seen in the logs of the many provision failures I've looked over.

> Do you have details on the failures?  Is it failures that
> happen mid-transfer, or ones that fail to start?  ie, what
> does individual log analysis say instead of final results.

A substantial number of the logs that I looked at showed failures
occurring mid-transfer.  Although this was a common reason for
failure, it wasn't the only one.

I'm in the process now of downloading the necessary logs to get a
more comprehensive view of the failure symptoms.  That will also
give a better view of the actual rate of retry success.

I've now completed downloading 24 hours of provision task data.  The
data covers most (but not quite all) shards.  For reference, here's the
Viceroy graph of failure causes in the time interval:
    http://shortn/_IQpTyOQAGu

The failure reason of interest is "Image failed to download and install."
In my browser, that reason renders as a thick maroon band in the center
of the graph.

I ran a script across the logs to classify the various failures.  Here's
the summary of the failure breakdown as a percentage of all the jobs:
  0.1%    Quick provision failed, but retry with update-engine passed.
  0.5%    Quick provision failed, and retry with update-engine failed.
  2.1%    Provisioning failed for some reason not involving quick provision.
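
The classification script isn't attached here; as a rough sketch of
that kind of pass, assuming a hypothetical log layout and illustrative
marker strings (only "quick-provision script failed" appears in the
real log excerpt further below):

    import collections
    import glob

    # Illustrative markers; the real script keyed off the actual
    # messages in the provision logs.
    QP_FAILED = 'quick-provision script failed'
    FALLBACK_FAILED = 'Update failed'
    PROVISION_FAILED = 'Provision failed'

    counts = collections.Counter()
    logs = glob.glob('provision-logs/*/debug/autoserv.DEBUG')  # hypothetical layout
    for path in logs:
        with open(path, errors='replace') as f:
            text = f.read()
        if QP_FAILED in text:
            key = ('qp failed, update_engine retry failed'
                   if FALLBACK_FAILED in text
                   else 'qp failed, update_engine retry passed')
        elif PROVISION_FAILED in text:
            key = 'failed without involving quick provision'
        else:
            key = 'passed'
        counts[key] += 1

    for key, n in counts.most_common():
        print('%6.1f%%  %s' % (100.0 * n / max(len(logs), 1), key))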

I also dug into logs of selected failures.  Although failure in the
middle of download is a common symptom, it's not clear that the download
is the cause.  Here's an excerpt from a sample failure:
    07/19 11:29:49.987 DEBUG|             utils:0287| [stdout]      0K ........ ........ ........ ........  2% 30.0M 36s
    07/19 11:29:51.049 DEBUG|             utils:0287| [stdout]  32768K ........ ........ ........ ........  5% 30.3M 35s
    07/19 11:29:52.329 DEBUG|             utils:0287| [stdout]  65536K ........ ........ ........ ........  8% 25.0M 37s
    07/19 11:34:35.240 DEBUG|             utils:0306| [stdout]  98304K ........ ........ ........ .....
    07/19 11:34:35.242 ERROR|       autoupdater:0902| quick-provision script failed; will fall back to update_engine.
[ ... ]
    Exit status: 255
    Duration: 287.391553879

The command failed during the download, but the exit status is 255.  That
exit status is used by ssh to indicate a failure in the ssh connection,
rather than a failure in the underlying command.  The error code could be
misleading:  Some commands other than ssh can (and have) returned that
code for failures.  Nonetheless, we may be looking at failures in the
ssh connections as a significant factor.

Not all of the failures were code 255.  Here's the breakdown of exit codes
from quick-provision:
     13 Exit status: 127
     31 Exit status: 1
     73 Exit status: 255
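
A tally like that can be reproduced with a simple pass over the same
logs; this is not the script that produced the numbers above, and the
log layout is again hypothetical:

    import collections
    import glob
    import re

    # Count every "Exit status: N" line in the downloaded provision logs.
    # (A more careful pass would restrict this to the quick-provision command.)
    pattern = re.compile(r'Exit status: (\d+)')
    counts = collections.Counter()
    for path in glob.glob('provision-logs/*/debug/autoserv.DEBUG'):
        with open(path, errors='replace') as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    counts[int(m.group(1))] += 1

    for status, n in sorted(counts.items()):
        print('%7d Exit status: %d' % (n, status))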

That Viceroy graph includes failures which likely should not be included in our stats for other labs.  Also, it includes DUTs that are in a repair loop which also throws off stats.

1%, while not ideal, is roughly the threshold I thought would be reasonable as the provisioning criterion for new devices.  If we're operating below that, I'm tempted to say that we should just fail the entire auto update and allow an entirely new auto update to occur afterward, instead of adding more retries to a single attempt.  Ideally, the scheduler would be able to do that much faster than it can now, but I think that's overall preferable to making this more and more complicated again, which would just get us back to where we were before.

With regards to individual failures:  Did the DUT reboot?  Did USB ethernet fail and need to be repaired?  Did other DUTs on the same rack experience failures at the same time?  Did other provisions from the same shard/devserver fail at the same time?

Has this gotten worse since we moved to running the ssh process on the shard instead of the devserver?  I could see that being an issue.

Connections should not just be going away.  I'd much rather understand and drive down the actual causes of failures than just paper them over with a fat retry.

> That Viceroy graph includes failures which likely should not
> be included in our stats for other labs.  Also, it includes DUTs
> that are in a repair loop which also throws off stats.

The numbers cited in #c5 are based on the downloaded logs of >19000
provision tasks, and don't suffer from the known flaws in the graph.

Although the graph is imperfect, it's sufficient to demonstrate the
relative frequency of the specific failure w.r.t. other failures.
Specific issues, and why they can be discounted or accounted for:
  * The graph includes failures from autoupdate_EndToEnd test.
    However, those failures have distinct, easily recognized failure
    reasons, such as "(0) Unknown exception".  I'm working to separate
    the AU test metrics from Provision tasks.
  * The graph includes some repair actions, both success and failure.
    Those may be significant.  However, if install during repair fails
    because of "Image failed to download and install", it's still relevant
    here.  In any event, as of yesterday, install operations during repair
    are no longer included in the reported metrics.
  * DUTs in repair loops are exceedingly rare.  At this moment, there are only
    5 in the inventory; that's typical.  It would be very hard for that few
    DUTs to make a dent in statistics based on ~19000 provision tasks/day.
    Moreover, as with repairs, if we're seeing the "Image failed to download
    and install" symptom it's likely still relevant.

> 1%, while not ideal, is roughly the threshold I thought would be
> reasonable as the provisioning criterion for new devices.  If we're
> operating below that, I'm tempted to say that we should just fail
> the entire auto update and allow an entirely new auto update to
> occur afterward, instead of adding more retries to a single attempt.
> Ideally, the scheduler would be able to do that much faster than it
> can now, but I think that's overall preferable to making this more
> and more complicated again, which would just get us back to where we
> were before.

Yeah, if we dispense with retrying altogether, only about 1 in 1000
provision tasks (roughly 19 of the ~19000 daily tasks) that might
otherwise have passed will fail, and failures in general will get
faster.  Also, 1 in 1000 is likely the worst case:  if you look at the
Viceroy graph for more recent days, the rate of the problem symptom has
gone down considerably.


> With regards to individual failures:  Did the DUT reboot?  Did USB
> ethernet fail and need to be repaired?  Did other DUTs on the same
> rack experience failures at the same time?  Did other provisions
> from the same shard/devserver fail at the same time?

(I'm assuming these questions are asked w.r.t. the failures described
in #c5.)  I don't have good data on any of these questions.  Based on
anecdote, these are my first guesses:
  * DUTs don't suffer any system failure, be it a kernel crash or a
    network outage.
  * DUT location is largely uncorrelated with the failures.  However,
    I do know of one DUT that has an exceptionally high count of this
    kind of failure.
  * I don't have good data on temporal distribution of failures.  I'd
    want better data before drawing any conclusions about correlations
    with shards or devservers.

> Has this gotten worse since we moved to running the ssh process on
> the shard instead of the devserver?  I could see that being an issue.

The short answer is that looking at the graphs from before and after
the switchover, there doesn't seem to be any change that can be attributed
to the switchover per se.  There _were_ changes in the rate, but I've
traced them to other causes not related to the switchover (for example,
see bug 855139).

However, by and large, I think this is impossible to know.  Distinguishing
a clear failure symptom from the old code is hard; the Monarch data can't
show this specific kind of failure at all for the old strategy (the main
point of the switchover was that the new strategy provides this much more
precise description of the failure).

It would be possible to download logs from before the provisioning
change to see how often quick-provision failed there.  However, you'd
need much more than 24 hours of data to know for sure that there was
a trend.  The viceroy graph shows that the rate of quick-provision
failure has been varying significantly over time.  Also, with the
aggregate failure rate of quick-provision at less than 1% of all provision
tasks, reliably measuring a statistically significant difference is just
plain hard.
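
To put a rough number on that last point, here's a back-of-the-envelope
two-proportion power calculation.  It's not part of the analysis above,
and the 0.6% and 0.8% rates are purely illustrative:

    from math import ceil, sqrt
    from statistics import NormalDist

    def tasks_per_period(p1, p2, alpha=0.05, power=0.8):
        """Tasks needed in each period to detect a change in failure rate
        from p1 to p2, using the normal approximation for two proportions."""
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        p_bar = (p1 + p2) / 2
        n = ((z_a * sqrt(2 * p_bar * (1 - p_bar)) +
              z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
             / (p1 - p2) ** 2)
        return ceil(n)

    # Telling a 0.6% quick-provision failure rate apart from 0.8% takes
    # tens of thousands of tasks per period -- more than the ~19000 tasks
    # captured in 24 hours of logs.
    print(tasks_per_period(0.006, 0.008))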

Status: Unconfirmed (was: Untriaged)

It sounds like there aren't any viable action items here, other than a vague "get better data".
