It is currently difficult to assess the health of the chromeperf dashboard backend. We often have to search and monitor the logs manually, which makes it easy to miss new problems. The App Engine dashboard provides some useful information, but it is mixed with less useful information and presented mostly in tables that take significant effort and background to understand.
Ideally, the services triage rotation could glance at a single dashboard daily, and, if a chart moves, drill down through breakdown metrics in other charts to identify problems, possibly without needing to search logs at all.
Logs-based metrics are flexible enough to track a wide variety of useful signals, such as errors and latencies. Stackdriver dashboards are easy to set up and display metrics in charts, which are easier to understand than tables.
I'll start a doc to outline some of the metrics that we'd like to monitor.
https://docs.google.com/document/d/1wHWcX9bxCWIxW8LVhmvKF8Ho-hWo7uc-xsdcO8ta94U/edit
This bug will track adding logs to measure them, metrics to parse the logs, and a dashboard to display them.
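As a rough illustration of the "logs to measure, metrics to parse the logs" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the METRIC line format, the field names (handler, status, latency_ms), and the helper functions are hypothetical and are not taken from the dashboard code or from any Stackdriver API. It only shows that a structured log line can be parsed back into the kind of error counts and latencies a chart would display.

```python
import re

# Hypothetical log-line format; the field names are illustrative only.
METRIC_PATTERN = re.compile(
    r"METRIC handler=(?P<handler>\S+) "
    r"status=(?P<status>\d{3}) "
    r"latency_ms=(?P<latency_ms>\d+(?:\.\d+)?)"
)


def format_metric_line(handler, status, latency_ms):
    """Builds one structured log line for a finished request."""
    return "METRIC handler=%s status=%d latency_ms=%.1f" % (
        handler, status, latency_ms)


def parse_metric_line(line):
    """Extracts the fields from a log line, or returns None if it does not match."""
    match = METRIC_PATTERN.search(line)
    if match is None:
        return None
    return {
        "handler": match.group("handler"),
        "status": int(match.group("status")),
        "latency_ms": float(match.group("latency_ms")),
    }


def summarize(lines):
    """Aggregates request count, 5xx error count, and mean latency per handler."""
    totals = {}
    for line in lines:
        fields = parse_metric_line(line)
        if fields is None:
            continue  # Not a metric line; ignore it.
        entry = totals.setdefault(
            fields["handler"], {"count": 0, "errors": 0, "latency_sum": 0.0})
        entry["count"] += 1
        entry["latency_sum"] += fields["latency_ms"]
        if fields["status"] >= 500:
            entry["errors"] += 1
    return {
        handler: {
            "count": entry["count"],
            "errors": entry["errors"],
            "mean_latency_ms": entry["latency_sum"] / entry["count"],
        }
        for handler, entry in totals.items()
    }
```

In practice the parsing and aggregation would be done by the logs-based metric configuration rather than by our own code; the sketch is only meant to make the intended log format and breakdowns (per handler, errors versus latency) concrete.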
Comment 1 by simonhatch@chromium.org, Oct 4