Add monitoring to GCE provider's various cron jobs.
Issue description
They should report counts of everything they're doing, broken down by whether they were successful or not.
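To make the request concrete, here is a minimal sketch of counting operations broken down by outcome. It uses only the Python standard library; `OutcomeCounter` and `run_and_count` are hypothetical names for illustration, not part of the app or of any monitoring library it uses.

```python
import collections


class OutcomeCounter(object):
    """Hypothetical in-memory counter keyed by (job name, success flag)."""

    def __init__(self):
        self._counts = collections.Counter()

    def record(self, job, success):
        # One increment per operation, split by whether it succeeded.
        self._counts[(job, bool(success))] += 1

    def get(self, job, success):
        return self._counts[(job, bool(success))]


def run_and_count(counter, job, func, *args):
    """Runs func(*args) and counts the outcome under the given job name."""
    try:
        result = func(*args)
    except Exception:
        counter.record(job, False)
        raise  # Re-raise so the caller still sees the failure.
    counter.record(job, True)
    return result
```

A real implementation would presumably report to the service's monitoring backend rather than keep counts in memory; the success flag here stands in for a metric field.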
,
Feb 22 2017
IMHO, monitoring every internal App Engine RPC (like datastore RPCs) isn't useful. We should be relying on GAE's SLAs for that. But higher-level operations that are meaningful to the cron jobs would be useful to monitor, like how many ${things} got processed, how long they took, etc. Whatever would be interesting to know in an outage or for debugging / optimizing the service.
Dave may have other ideas in mind, maybe more concrete (I'm not familiar enough with the service).
,
Feb 22 2017
The app uses cron jobs to enqueue a bunch of task queue tasks for parallel processing of datastore entities. Each task processes just one entity, and the monitoring we already have from instrumenting the app's HTTP endpoints provides the high-level overview suggested in #2. Since each task processes exactly one thing, the number of things processed is equal to the number of requests seen here:
https://viceroy.corp.google.com/chrome_infra/Appengine/gce-backend#_VG_cuLcjDTd
And how long each task took is also monitored already:
https://viceroy.corp.google.com/chrome_infra/Appengine/gce-backend#_VG_NgKXthBf
,
Feb 23 2017
The following revision refers to this bug:
https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792

commit 51abdc79bf08bf66f5cb87e9e4d3c40188b4c792
Author: smut <smut@google.com>
Date: Thu Feb 23 02:13:05 2017

Refactor task enqueuing

BUG= 692433

Review-Url: https://codereview.chromium.org/2713533002

[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/catalog.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/cleanup.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/instance_group_managers.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/instance_group_managers_test.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/instance_templates.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/instance_templates_test.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/instances.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/metadata.py
[modify] https://crrev.com/51abdc79bf08bf66f5cb87e9e4d3c40188b4c792/appengine/gce-backend/utilities.py
,
Feb 23 2017
The following revision refers to this bug:
https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/7811e968699d23b5bfe551a34ee5458047c52213

commit 7811e968699d23b5bfe551a34ee5458047c52213
Author: smut <smut@google.com>
Date: Thu Feb 23 23:24:31 2017

Count configured minimum and maximum numbers of instances

BUG= 692433

Review-Url: https://codereview.chromium.org/2705153007

[modify] https://crrev.com/7811e968699d23b5bfe551a34ee5458047c52213/appengine/gce-backend/config.py
[modify] https://crrev.com/7811e968699d23b5bfe551a34ee5458047c52213/appengine/gce-backend/metrics.py
,
Feb 24 2017
I discussed this with Sergey and it seems like it doesn't quite make sense to monitor each individual cron job/task queue in this app. Instead I've added monitoring for the "expected" and "actual" number of VMs, which should signal whether the app is working properly or not.
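As a rough illustration of how "expected" and "actual" counts can signal health, the sketch below compares configured minimum/maximum VM counts against the number of VMs that exist. It is an assumption about how such a comparison might look, not the code that landed in metrics.py; `Expected` and `compare_counts` are hypothetical names.

```python
import collections

# Hypothetical record of the configured bounds for one instance group.
Expected = collections.namedtuple('Expected', ['minimum', 'maximum'])


def compare_counts(expected, actual):
    """Returns a dict mapping group name to 'ok', 'under', or 'over'.

    Args:
      expected: dict of group name -> Expected (configured min/max VMs).
      actual: dict of group name -> number of VMs that currently exist.
    """
    status = {}
    for name, bounds in expected.items():
        # A group with no reported VMs is treated as having zero.
        count = actual.get(name, 0)
        if count < bounds.minimum:
            status[name] = 'under'
        elif count > bounds.maximum:
            status[name] = 'over'
        else:
            status[name] = 'ok'
    return status
```

A group that stays 'under' or 'over' for long would suggest the app is not converging on its configuration, whichever individual cron job or task is at fault.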
Comment 1 by s...@google.com
, Feb 17 2017