
Issue 786108

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




AFE went down without an alert being fired

Project Member Reported by akes...@chromium.org, Nov 16 2017

Issue description

See Issue 786106, in which AFE went down but no alert was fired (other than a high provision failure rate, which is likely unrelated).

Our AFE dashboards don't show a clear signature either. I see a small blip of RPC ping failures on https://viceroy.corp.google.com/chromeos/afe_rpc?duration=6h&host_name=chromeos-server2&refresh=-1&utc_end=1510864970.15 (and a likely unrelated spike in some shard RPC times). I see no 5XX error rate spike in https://viceroy.corp.google.com/chromeos/afe_health?duration=6h&host_name=chromeos-server2&refresh=-1&utc_end=1510864970.15


 
Labels: -Chase-Pending Chase
Owner: pho...@chromium.org
Status: Assigned (was: Untriaged)
Fix the RPC recorder to:
 - assert that the response is not only non-5XX but also contains the correct content
 - alert on too many incorrect responses
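
The two items above could be sketched roughly as follows. This is a minimal illustration, not the actual rpc_flight_recorder.py code: the names `response_is_correct` and `IncorrectResponseMonitor`, the window size, the threshold, and the assumed JSON body shape (an object with a null "error" field and a "result" field) are all hypothetical.

```python
import json
from collections import deque


def response_is_correct(status_code, body):
    """Return True only if the status is not an error AND the body looks valid.

    Assumption: a correct RPC body is a JSON object with a null "error"
    field and a "result" field. Adjust to the real response format.
    """
    if status_code >= 400:
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON at all, e.g. an HTML error or login page served with 200.
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("error") is None and "result" in payload


class IncorrectResponseMonitor:
    """Tracks the last `window` probe results and decides when to alert."""

    def __init__(self, window=20, max_bad_fraction=0.5):
        self._results = deque(maxlen=window)
        self._max_bad_fraction = max_bad_fraction

    def record(self, ok):
        self._results.append(bool(ok))

    def should_alert(self):
        # Wait for a full window so a single early failure does not page anyone.
        if len(self._results) < self._results.maxlen:
            return False
        bad = sum(1 for ok in self._results if not ok)
        return bad / len(self._results) > self._max_bad_fraction
```

In this sketch a 200 response carrying a non-JSON body counts as incorrect, which is the case a pure status-code check would miss.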
It looks like the 400 responses are already treated as errors. My guess now is that the routes it was hitting ('get_server_time', 'ping_db') must not have actually been getting 404 responses?

See: https://chromium.git.corp.google.com/chromiumos/third_party/autotest/+/master/site_utils/rpc_flight_recorder.py#253
Slight mystery; holding open one more week.
Cc: pprabhu@chromium.org
Status: WontFix (was: Assigned)
Likely fixed by other means (improved shard_client alert, others...).
