Issue 639900

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



SOM master stale alert, when master isn't really stale?

Project Member Reported by martiniss@chromium.org, Aug 22 2016

Issue description

There's an alert on SOM prod about stale master data. There are also several on staging.

We need to make sure that the data is actually real. We should also make it much easier for sheriffs to deal with these errors; it's pretty confusing now.

This might be related to the network flakiness happening now?
 
Owner: sergeybe...@chromium.org
Status: Assigned (was: Available)
Actually, CBE looks to be very stale... Assigning to sergeyberezin@.
Cc: sergeybe...@chromium.org hinoka@chromium.org
Owner: ----
Status: Available (was: Assigned)
cc-ing hinoka as well.
I purged the task queue. We'll see if the new tasks are any better at their job.
Status: Fixed (was: Available)
Ok, I think that solved the issue. No more stale data alerts on SOM.

Do you guys want to dig into why this happened?
Status: Available (was: Fixed)
It's still broken :(
Re: #c4: Umm...  Not really :-)

The immediate problem seemed to be that some masters were having a bad day, and timing out on their /json. The tasks kept retrying and piling up.

The real problem is that the task queue is likely used improperly in the app, and the tasks should really never retry. We repeatedly tried to fix that, but somehow tasks continue to retry. Not sure why...
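
A minimal sketch of the "tasks should never retry" idea above, assuming an App Engine push queue, which retries a task whenever its handler responds with a status outside the 200-299 range. The names `fetch_master_json` and `handle_fetch_task` and the timeout value are hypothetical and not taken from the CBE code. The point is only that a timed-out fetch gets logged and acknowledged instead of being reported as a failure, so the next scheduled fetch picks up the data rather than a pile of retries.

```python
import json
import logging
import socket
import urllib.error
import urllib.request

FETCH_TIMEOUT_SECONDS = 10  # assumed value, not from the original app


def fetch_master_json(url, timeout=FETCH_TIMEOUT_SECONDS):
    """Fetch and decode a master's /json endpoint, bounded by a timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def handle_fetch_task(url, fetch=fetch_master_json):
    """Run one fetch task and return (http_status, data).

    The status is always 200, even when the fetch fails, so the queue
    treats the task as done and does not schedule a retry.
    """
    try:
        data = fetch(url)
    except (urllib.error.URLError, socket.timeout, ValueError) as err:
        # Log and drop: a later scheduled task will fetch fresh data.
        logging.warning("fetch of %s failed, not retrying: %s", url, err)
        return 200, None
    return 200, data
```

The `fetch` parameter is injectable only so the handler can be exercised without a network; a real handler would also need to store the data somewhere.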

Comment 7 by hinoka@chromium.org, Aug 23 2016

Waiting for the tasks to finish worked (for some definition of worked).  The alerts are now gone!

Milo now has the master JSON as well, and it is pushed over rather than pulled. On top of that, it supports internal masters too. It might be worth adding a 3xx redirect from CBE for the masters that Milo stores.
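
The redirect idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the set of masters held by Milo, the Milo URL template, and the choice of 302 as the redirect status are hypothetical, and none of it comes from the CBE or Milo code.

```python
# Hypothetical: masters whose JSON is already stored by Milo.
MILO_MASTERS = frozenset({"chromium", "chromium.linux"})

# Hypothetical URL template; the real Milo endpoint is not given in this thread.
MILO_URL_TEMPLATE = "https://milo.example.invalid/masters/{master}/json"


def route_master_request(master, milo_masters=MILO_MASTERS):
    """Return (status, location) for a request for a master's JSON.

    Masters stored by Milo get a redirect. For all others the location
    is None, meaning the caller serves the data itself.
    """
    if master in milo_masters:
        return 302, MILO_URL_TEMPLATE.format(master=master)
    return 200, None
```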
Milo is the new CBE now? E8-O
Can we fix this? The data in SOM is stale again; currently there are alerts for chromium, chromium.chromiumos, chromium.gpu, and chromium.chrome, all for > 30 minutes. 

Is the solution to just flush the task queue once every couple hours? I'd be glad to change data sources if that helps; I don't know where any documentation about any of this stuff lives.

Comment 12 by bugdroid1@chromium.org, Sep 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6

commit 1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6
Author: hinoka <hinoka@google.com>
Date: Sat Sep 03 00:58:14 2016

Milo: pRPC endpoint for getting Buildbot master data

g/prpc endpoints are done via a grpc.go file in each of the modules (in contrast to the html.go file)

In addition:
* Master fetches are done in getMasterEntry(), which performs the ACL check for both html and grpc endpoints
* Added symlinks for rpcexplorer static assets

BUG= 639900 

Review-Url: https://codereview.chromium.org/2275123002

[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/api/proto/buildbot.pb.go
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/api/proto/buildbot.proto
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/api/proto/buildbotserver_dec.go
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/api/proto/generate.go
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/api/proto/pb.discovery.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/buildbot/builder.go
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/buildbot/grpc.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/buildbot/master.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/buildbot/pubsub_test.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/frontend/milo.go
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/frontend/static/common/auth
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/frontend/static/common/bower_components
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/frontend/static/common/rpc
[add] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/frontend/static/common/rpcexplorer
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/settings/acl_test.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/settings/config.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/milo/appengine/settings/config_test.go
[modify] https://crrev.com/1eff02f4e6c4deaaa26c9096eb87dfd47d4c9cd6/server/static/rpcexplorer/rpc-method.html


Comment 14 by bugdroid1@chromium.org, Sep 6 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/a35d78a5c2ef3ffb7bfe3af29a748fab221bf012

commit a35d78a5c2ef3ffb7bfe3af29a748fab221bf012
Author: hinoka <hinoka@google.com>
Date: Tue Sep 06 20:36:09 2016

SOM staging is currently reporting stale masters:

Stale https://build.chromium.org/p/chromium.android master data
Active for: 0h 12m 0s

Stale https://build.chromium.org/p/chromium.linux master data
Active for: 0h 48m 9s
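
For anyone scripting against alerts like the two above, here is a small sketch that turns an "Active for" line into a duration so it can be compared with a threshold. The function names and the 30 minute threshold are assumptions made for illustration; SOM's actual staleness rule is not stated in this thread.

```python
import re
from datetime import timedelta

# Matches lines such as "Active for: 0h 48m 9s".
ACTIVE_FOR = re.compile(r"Active for:\s*(\d+)h\s*(\d+)m\s*(\d+)s")

ASSUMED_THRESHOLD = timedelta(minutes=30)  # assumption, not SOM's real setting


def parse_active_for(line):
    """Parse an 'Active for: Xh Ym Zs' line into a timedelta."""
    match = ACTIVE_FOR.search(line)
    if match is None:
        raise ValueError("not an 'Active for' line: %r" % line)
    hours, minutes, seconds = (int(group) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def exceeds_threshold(line, threshold=ASSUMED_THRESHOLD):
    """Return True if the alert has been active longer than the threshold."""
    return parse_active_for(line) > threshold
```

Under that assumed threshold, the chromium.linux alert above would count as long-running and the chromium.android one would not.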

Owner: hinoka@chromium.org
Status: Assigned (was: Available)
Still have stale data. CBE has very backed up task queues....
Darn heisenbugs, every time I look at SOM...
Once we go pubsub and get CBE out of the picture, SoM should be much more reliable.
The thing is, #15 was dated after I rolled out the new CBE version, which isn't even supposed to pull from itself, so something has gone awfully wrong.
Cc: -megjab...@chromium.org
Status: Fixed (was: Assigned)
Okay, it didn't work because it got reverted. I pushed out a fix and rolled it forward again. Please reopen if this happens again.
