Issue 655863

Issue metadata

Status: Fixed
Closed: Nov 2016
Pri: 1

Sheriff-o-matic complains about stale masters and offline builders and displays stale data

Project Member Reported by kulshin@chromium.org, Oct 14 2016

Issue description

For most of today, SOM has been complaining about stale masters and offline builders, even though the builders appear to be otherwise healthy. It also has not noticed that the problem with the webkit builders has been resolved and that some of the builders have cycled green. A different, unrelated problem caused some further failures, but that has also been fixed. In any case, SOM is still reporting the old, fixed problem and did not notice the new failure at all (I only noticed it because I was checking builder status manually).
 
Cc: dsansome@chromium.org
Owner: hinoka@chromium.org
Status: Assigned (was: Untriaged)
The master data is stale. There's a problem with how we're storing the data in a caching layer we have.

Comment 3 by hinoka@chromium.org, Oct 14 2016

Moving discussion here...

Milo is having trouble storing builds from chromium.fyi and chromium.perf because the raw data is over 6MB and the compressed data is over 1MB, which exceeds the 1MB datastore limit.

Example from chromium.fyi:
Length of json data: 7385135
Length of gzipped data: 1121537
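
For reference, a minimal sketch of how the two lengths above could be measured and compared against the limit. This is not the Milo code: the function names are hypothetical, and treating "1MB" as 1024 * 1024 bytes is an assumption.

```python
import gzip
import json

# Assumption: the thread only says "1MB"; 1 MiB is used here as the limit.
DATASTORE_LIMIT_BYTES = 1024 * 1024


def payload_sizes(master_data):
    """Return (raw_length, gzipped_length) in bytes for a JSON-serializable object."""
    raw = json.dumps(master_data).encode('utf-8')
    return len(raw), len(gzip.compress(raw))


def fits_in_datastore(master_data, limit=DATASTORE_LIMIT_BYTES):
    """True if the gzipped JSON is no larger than `limit` bytes."""
    _, gzipped_length = payload_sizes(master_data)
    return gzipped_length <= limit
```

Under that assumption, the gzipped length reported above (1121537 bytes) is larger than 1048576 bytes, so the write would be rejected.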

My current theory is that there are so many pending builds that they push the data over the limit. Right now the code sends at most 75 pending builds per builder. I'll reduce this to 25 to see if it improves things.
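
A rough sketch of the kind of trimming described here, assuming the master data is a dict of builders that each carry a list of pending builds. The `builders` and `pending` keys and the function name are hypothetical and may not match what pubsub_json_status_push.py actually uses.

```python
MAX_PENDING_PER_BUILDER = 25  # value from this thread (previously 75)


def trim_pending(master, limit=MAX_PENDING_PER_BUILDER):
    """Return a copy of `master` with at most `limit` pending builds per builder.

    The input dict is not modified.
    """
    trimmed = dict(master)
    trimmed['builders'] = {
        # Copy each builder dict, replacing only its (hypothetical) pending list.
        name: dict(builder, pending=builder.get('pending', [])[:limit])
        for name, builder in master.get('builders', {}).items()
    }
    return trimmed
```

Keeping the first `limit` entries is a design guess; the thread does not say which pending builds are dropped.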

Comment 4 by bugdroid1@chromium.org, Oct 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/363fb29ae42f5a475c3b93c857d0eed5ea58588d

commit 363fb29ae42f5a475c3b93c857d0eed5ea58588d
Author: hinoka <hinoka@chromium.org>
Date: Fri Oct 14 01:52:49 2016

Pubsub: Restrict full pending builds states to 25 per builder (from 75)

BUG= 655863 

Review-Url: https://codereview.chromium.org/2422503002

[modify] https://crrev.com/363fb29ae42f5a475c3b93c857d0eed5ea58588d/scripts/master/pubsub_json_status_push.py

Comment 5 by bugdroid1@chromium.org, Oct 14 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager.git/+/1e31d26a87655c0c5deb8ece029f7091c5278fb2

commit 1e31d26a87655c0c5deb8ece029f7091c5278fb2
Author: hinoka <hinoka@google.com>
Date: Fri Oct 14 02:01:05 2016

Comment 6 by hinoka@chromium.org, Oct 14 2016

Should be fixed (for now)

>>> import json, urllib
>>> o = json.load(urllib.urlopen('http://chrome-build-extract.appspot.com/get_master/chromium.perf?json=true'))
>>> o['created']
u'2016-10-14T03:35:54.266699Z'

Comment 7 by hinoka@chromium.org, Oct 14 2016

Status: Fixed (was: Assigned)
This may happen again if we add like 30 more builders to chromium.perf, and each builder has 25 or more pending builds.

But I'd expect the master to topple over way before that happens.
Status: Assigned (was: Fixed)
Can we get some monitoring on those datastore insert failures?

Comment 9 by hinoka@chromium.org, Oct 14 2016

Should it be a ts_mon metric that sends master insertion events tagged with "success"/"failure"?
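
To make the proposal concrete, here is a sketch of the counting logic only. It deliberately uses a plain `collections.Counter` as a stand-in and does not use the ts_mon API; `record_insert` and `put_fn` are hypothetical names.

```python
from collections import Counter

# Stand-in for a real metric: counts are keyed by (master name, status).
insert_events = Counter()


def record_insert(master_name, put_fn, payload):
    """Call put_fn(payload) and count the outcome as 'success' or 'failure'."""
    try:
        put_fn(payload)
    except Exception:
        insert_events[(master_name, 'failure')] += 1
        raise  # still surface the error to the caller
    insert_events[(master_name, 'success')] += 1
```

An alert could then fire when the failure count for a master keeps rising, which is the condition that left SOM showing stale data here.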
Comment 10 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/39c1f5c6da051287a0e84f27c6f611181fecb925

commit 39c1f5c6da051287a0e84f27c6f611181fecb925
Author: hinoka <hinoka@google.com>
Date: Wed Oct 19 22:39:39 2016

Milo: Pubsub - Trim out pending build states if there are more than 25 per builder

BUG= 655863 

Review-Url: https://chromiumcodereview.appspot.com/2421713003

[modify] https://crrev.com/39c1f5c6da051287a0e84f27c6f611181fecb925/milo/appengine/buildbot/pubsub.go

Labels: Milestone-Reliability
Comment 12 by bugdroid1@chromium.org, Oct 20 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/12dcae3b065d4b21435a894be863aee471bc4e1c

commit 12dcae3b065d4b21435a894be863aee471bc4e1c
Author: hinoka <hinoka@google.com>
Date: Thu Oct 20 21:59:26 2016

Status: Fixed (was: Assigned)
Stability patches have landed; this should be fixed.

Comment 14 by bugdroid1@chromium.org, Nov 18 2016
