Provision throttling caused a large spike in provision failures
Reported by
jrbarnette@chromium.org,
Aug 30 2017
Issue description
When provision payload copy throttling rolled out to the lab, there was
a dramatic spike in the provision failure rate:
https://viceroy.corp.google.com/chromeos/provision?duration=6h&utc_end=1504058400#_VG_huYBJmlb
Looking at the throttling workqueue logs, the principal cause
seems to have been a large number of aborted copy requests,
presumably caused by long queue wait times.
The exact cause needs to be identified and fixed so that
we can try (again) to deploy the feature.
,
Sep 20 2017
,
Oct 13 2017
Some issues were discovered by perusing the logs at the time
of the failure; see the two blocking bugs. However, not all
anomalies in the logs were satisfactorily explained. Unfortunately,
the logs that would show the history were wiped out because of bug
774597. So further debugging will depend on reproducing the
failures.
The current strategy for moving this forward:
* Early debugging indicated that devservers with only 2x1000 Ethernet
interfaces are too slow to be useful, so all such servers need
to be upgraded.
* Fix the other known bugs.
* Implement (somehow) the ability to selectively enable the
throttling feature on some devservers, but not all of them.
* Enable throttling on selected servers, and watch for failures.
* Debug and fix the failures as necessary.
* If the system fails to reproduce the problem, gradually increase
the number of enabled devservers, until the problem is reproducible,
or all servers are throttling without failures.
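
The "selectively enable, then gradually increase" steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the mechanism that exists in the lab code: the function names, the hostnames, and the idea of hashing the hostname into a percentage bucket are all hypothetical. The point of the hash is that raising the percentage only ever adds servers, so a server that was already throttling stays enabled while the rollout grows.

```python
import hashlib


def throttling_enabled(hostname, rollout_percent):
    """Return True if this devserver falls inside the rollout percentage.

    Hypothetical helper: the hostname is hashed into a stable bucket
    in the range 0..99, and the server is enabled when its bucket is
    below rollout_percent.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def enabled_servers(hostnames, rollout_percent):
    """Return the subset of hostnames that should throttle, in input order."""
    return [h for h in hostnames if throttling_enabled(h, rollout_percent)]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    servers = ["devserver%d.example" % i for i in range(20)]
    for percent in (0, 10, 25, 50, 100):
        print(percent, len(enabled_servers(servers, percent)))
```

An explicit per-server allowlist would work just as well for the first few servers; the percentage form mainly helps once the set gets large.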
,
Oct 13 2017
,
Oct 13 2017
,
Dec 12 2017
,
Jan 30 2018
,
Jun 27 2018
Comment 1 by jrbarnette@chromium.org
, Sep 20 2017