We experienced a system-wide outage between 1AM and 4AM due to heavy load across the lab. The root cause is that a group of weekly tasks was scheduled for last night, and it created 1096 suite jobs to be run within 4 hours, for the following 5 suites running on the suites pool:
control.experimental
control.kernel_per-build_benchmarks
control.kernel_per-build_regression
control.network3g_pseudomodem
control.network_ui
control.regression
The lab (mostly the devservers) was overloaded between 1AM and 4AM, which led to many job failures.
The devserver load can be tracked in this dashboard:
http://104.154.79.237/grafana/#/dashboard/db/autotest-devserver-load
We are working on several approaches to fix this issue:
1. Schedule weekly suite jobs more evenly. (CL 331441)
2. Add more devservers. (b/27047069)
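
To illustrate why approach 1 should help, here is a rough sketch of spreading the same number of jobs round-robin over hourly slots for a whole week. This is not what CL 331441 does; the function name, the placeholder job names, and the choice of 7 x 24 hourly slots are all assumptions made for illustration only.

```python
from collections import Counter


def spread_evenly(job_names, num_slots):
    """Assign each job to a slot round-robin.

    Hypothetical helper: slot sizes end up differing by at most one.
    """
    return {name: i % num_slots for i, name in enumerate(job_names)}


# Placeholder names standing in for the 1096 suite jobs from this report.
jobs = ["suite_job_%d" % i for i in range(1096)]

# Assumption: one slot per hour over a full week.
slots = spread_evenly(jobs, 7 * 24)
load = Counter(slots.values())

# Busiest and quietest hourly slot after spreading.
print(max(load.values()), min(load.values()))  # prints "7 6"
```

Under these assumptions no hour gets more than 7 jobs, compared with an average of 274 jobs per hour when all 1096 jobs land in a 4-hour window.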
Comment 1 by bugdroid1@chromium.org, Mar 8 2016