AFE went down without an alert being fired
Issue description
See Issue 786106, in which AFE went down but no alert was fired (other than a high provision failure rate, which is likely unrelated). Our AFE dashboards don't really show a great signature either. I see a small blip of RPC ping failures on https://viceroy.corp.google.com/chromeos/afe_rpc?duration=6h&host_name=chromeos-server2&refresh=-1&utc_end=1510864970.15 (and a likely unrelated spike in some shard RPC times). I see no 5XX error rate spike in https://viceroy.corp.google.com/chromeos/afe_health?duration=6h&host_name=chromeos-server2&refresh=-1&utc_end=1510864970.15
Dec 4 2017
It looks like the 400 responses are already treated as errors. My guess now is that the routes it was hitting ('get_server_time', 'ping_db') must not have actually been getting 404 responses?
See: https://chromium.git.corp.google.com/chromiumos/third_party/autotest/+/master/site_utils/rpc_flight_recorder.py#253
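For illustration only, here is a minimal sketch of the kind of classification discussed above: a probe result counts as a failure when the status code is 4XX/5XX, when the request raised an error, or when no result was recorded at all. The names `is_probe_failure` and `failure_rate` are hypothetical and are not taken from rpc_flight_recorder.py; treating an empty sample window as failing is an assumption made for this sketch, not a description of the existing code.

```python
def is_probe_failure(status_code=None, error=None):
    """Return True if a single probe result should count as a failure.

    Hypothetical helper: a raised error or a missing status code is a
    failure, as is any 4XX or 5XX HTTP status.
    """
    if error is not None or status_code is None:
        return True
    return status_code >= 400


def failure_rate(results):
    """Return the fraction of failed probes in `results`.

    `results` is an iterable of (status_code, error) tuples. An empty
    window is reported as fully failing (assumption for this sketch), so
    that a probe which stops reporting does not look healthy.
    """
    results = list(results)
    if not results:
        return 1.0
    failures = sum(1 for code, err in results if is_probe_failure(code, err))
    return float(failures) / len(results)
```

With this shape, a route that starts returning 404 and a server that refuses connections would both raise the failure rate, and a window with no samples would not be mistaken for a healthy one.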
Dec 4 2017
Slight mystery, holding open 1 more week.
Dec 4 2017
Dec 11 2017
Likely fixed by other means (improved shard_client alert, others...)
Comment 1 by akes...@chromium.org, Nov 20 2017
Owner: pho...@chromium.org
Status: Assigned (was: Untriaged)