
Issue 692433


Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 690292




Add monitoring to GCE provider's various cron jobs.

Project Member Reported by dsansome@chromium.org, Feb 15 2017

Issue description

They should report counts of everything they're doing, broken down by whether they were successful or not.
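
A minimal sketch of what "counts broken down by success" could look like, assuming a plain in-process counter rather than the app's real monitoring library. The names `OutcomeCounter` and `run_job`, and the job name in the usage note, are hypothetical and do not come from the GCE provider's code.

```python
import collections
import logging


class OutcomeCounter:
    """Counts job runs, keyed by (job name, outcome). Hypothetical helper."""

    def __init__(self):
        self._counts = collections.Counter()

    def record(self, job, success):
        outcome = "success" if success else "failure"
        self._counts[(job, outcome)] += 1

    def get(self, job, outcome):
        # Counter returns 0 for keys that were never recorded.
        return self._counts[(job, outcome)]


def run_job(counter, job, fn):
    """Runs fn() and records whether it raised.

    Exceptions are logged and swallowed here so the caller always gets a
    bool; a real cron handler might prefer to re-raise after recording.
    """
    try:
        fn()
    except Exception:
        logging.exception("job %s failed", job)
        counter.record(job, False)
        return False
    counter.record(job, True)
    return True
```

For example, calling `run_job(counter, "some-cron-job", handler)` once per invocation leaves one success count and one failure count per job name, which can then be exported to whatever metrics backend is in use.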
 

Comment 1 by s...@google.com, Feb 17 2017

What level of detail are we talking about here? Do you want me to report the outcome of every RPC? Of every datastore operation? Do you want the specific reason for the failure included?
IMHO, monitoring every internal App Engine RPC (like datastore RPCs) isn't useful. We should be relying on GAE's SLAs for that. But higher-level operations that are meaningful to the cron jobs would be useful to monitor, like how many ${things} got processed, how long they took, etc. Whatever would be interesting to know in an outage or for debugging / optimizing the service.

Dave may have other ideas in mind, maybe more concrete (I'm not familiar enough with the service).

Comment 3 by s...@google.com, Feb 22 2017

The app uses cron jobs to schedule a bunch of task queues for parallel processing of datastore entities. Each task queue processes just one entity, and the monitoring we already have from instrumenting the app's HTTP endpoints provides the high-level overview suggested in #2.

Since each task queue processes exactly one thing, the number of things processed is equal to the number of requests seen here:
https://viceroy.corp.google.com/chrome_infra/Appengine/gce-backend#_VG_cuLcjDTd

And how long each task queue took is also monitored already:
https://viceroy.corp.google.com/chrome_infra/Appengine/gce-backend#_VG_NgKXthBf

Comment 6 by s...@google.com, Feb 24 2017

Status: Fixed (was: Assigned)
I discussed with Sergey and it seems like it doesn't quite make sense to monitor each individual cron job/task queue in this app. Instead I've added monitoring for the "expected" and "actual" number of VMs, which should signal whether the app is working properly or not.
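
A rough sketch of the "expected" vs. "actual" idea, not the change that actually landed. It assumes each instance group has a configured target size and a current VM count. `InstanceGroup`, `vm_gauges`, and `mismatched_groups` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    """Hypothetical view of one group of VMs managed by the app."""
    name: str
    expected: int  # configured target number of VMs
    actual: int    # number of VMs that currently exist


def vm_gauges(groups):
    """Returns the two gauge values to report for each group."""
    return {g.name: {"expected": g.expected, "actual": g.actual} for g in groups}


def mismatched_groups(groups, tolerance=0):
    """Names of groups whose actual count differs from expected by more
    than `tolerance`. A persistent mismatch suggests the app is not
    converging on its configured state."""
    return [g.name for g in groups if abs(g.expected - g.actual) > tolerance]
```

Reporting both values as gauges lets a dashboard or alert compare them directly, so a single signal covers the health of all the cron jobs and task queues that create or delete VMs.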
