
Issue 728916

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: ----

Blocking:
issue 739537




Milo showing very old pending builds

Project Member Reported by tansell@chromium.org, Jun 2 2017

Issue description

Milo is again showing very old pending builds that completed a long time ago.

See https://luci-milo.appspot.com/buildbot/tryserver.chromium.linux/linux_chromium_rel_ng/ and the attached screenshot.
 
pending-builds.png (410 KB)
Owner: hinoka@chromium.org
Status: Assigned (was: Untriaged)
This looks like the pubsub pusher on the buildbot side not clearing out pending builds that have started.
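
For context, the expected lifecycle (inferred from the "Successfully recorded" / "Successfully removed and unsubscribed" log lines quoted later in this bug) is: record a pending entry when a build request is submitted, and remove it when the corresponding build starts. A minimal sketch of that shape, with hypothetical method names (this is not the actual pubsub_json_status_push.py code):

# Hypothetical sketch of the pending-build bookkeeping the pusher is expected
# to do; method and field names are illustrative, not real buildbot APIs.
class PendingTracker(object):
    def __init__(self):
        self.pending = {}  # (builder_name, brid) -> submitted timestamp

    def request_submitted(self, builder_name, brid, submitted_at):
        # Corresponds to the "Successfully recorded build request" log lines.
        self.pending[(builder_name, brid)] = submitted_at

    def build_started(self, builder_name, brid):
        # Corresponds to "Successfully removed and unsubscribed". If this
        # removal is skipped or raced, the entry lingers and Milo keeps
        # showing a long-finished build as pending.
        self.pending.pop((builder_name, brid), None)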

Comment 2 by hinoka@chromium.org, Jun 21 2017

Cc: hinoka@chromium.org
Issue 735600 has been merged into this issue.
This seems to keep happening.

I *really* like Milo, but this is one of the primary reasons I can't trust it at the moment.

Could we get some testing or alerts or something around this? 


Comment 4 by estaab@chromium.org, Jun 22 2017

Labels: Pri-1

Project Member Comment 5 by bugdroid1@chromium.org, Jun 22 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/master-manager/+/fa1d89b5debb306a33ab325afc6ff8100cd02982

commit fa1d89b5debb306a33ab325afc6ff8100cd02982
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Jun 22 02:12:38 2017

Project Member Comment 6 by bugdroid1@chromium.org, Jun 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/2f54f0dafd824a9aef3be3a153548436ac95cdf6

commit 2f54f0dafd824a9aef3be3a153548436ac95cdf6
Author: Ryan Tseng <hinoka@google.com>
Date: Thu Jun 22 01:51:59 2017

Buildbot Pubsub: Do a full pending builds sync every 10 minutes

Bug:  728916 
Change-Id: I97bd3f726d01a0de141028e534f68c6ae9ba072b
Tests: Ran locally, didn't crash
Reviewed-on: https://chromium-review.googlesource.com/543855
Commit-Queue: Ryan Tseng <hinoka@chromium.org>
Reviewed-by: Daniel Jacques <dnj@chromium.org>

[modify] https://crrev.com/2f54f0dafd824a9aef3be3a153548436ac95cdf6/scripts/master/pubsub_json_status_push.py
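
For illustration, a rough sketch of what such a periodic full sync could look like, assuming Twisted's LoopingCall and the status APIs referenced later in this bug; names like PubSubStatusPush and _full_pending_sync are placeholders, not the code in the CL above:

from twisted.internet import task, defer

class PubSubStatusPush(object):
    # "every 10 minutes" per the commit message above.
    FULL_SYNC_INTERVAL = 10 * 60  # seconds

    def __init__(self, status):
        self.status = status   # buildbot master Status object
        self.pending = {}      # (builder_name, brid) -> build request status
        self._loop = task.LoopingCall(self._full_pending_sync)

    def start(self):
        # Kick off the timer; a missed remove notification can now only leave
        # a stale entry around until the next full sync.
        self._loop.start(self.FULL_SYNC_INTERVAL, now=True)

    @defer.inlineCallbacks
    def _full_pending_sync(self):
        fresh = {}
        for name in self.status.getBuilderNames():
            builder = self.status.getBuilder(name)
            # Authoritative query: only requests that are still unclaimed.
            requests = yield builder.getPendingBuildRequestStatuses()
            for req in requests:
                fresh[(name, req.brid)] = req
        # Replace rather than merge, so anything that already started is dropped.
        self.pending = fresh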

Comment 7 by hinoka@chromium.org, Jun 22 2017

chromium.perf looks fine now, but tryserver.chromium.linux is still exhibiting this behavior.  

Re: Testing and alerts - This is difficult to test since it's currently not reproducible. Alerts would be a good idea, but first I'm going to figure out a root cause or mitigation.

Comment 8 by hinoka@chromium.org, Jun 22 2017

Looking at the logs (which are easier to search through since tryserver.chromium.linux was restarted recently), these pending build requests were added at the time the master was restarted, probably as carry-over from the previous instance.

Comment 9 by hinoka@chromium.org, Jun 22 2017

This looks like a good example of how it is happening:

2017-06-21 18:02:44-0700 [-] added buildset 115659 to database (build requests: {u'Linux ChromiumOS Tests (dbg)(1)': 543787})
2017-06-21 18:02:44-0700 [-] Processing new build request: {'bsid': 115659, 'buildername': u'Linux ChromiumOS Tests (dbg)(1)', 'brid': 543787}
2017-06-21 18:02:44-0700 [-] PubSub: Status update (requestSubmitted): Linux ChromiumOS Tests (dbg)(1)/543787
2017-06-21 18:02:44-0700 [-] PubSub: Successfully recorded build request 543787/Linux ChromiumOS Tests (dbg)(1)
2017-06-21 18:05:54-0700 [-] PubSub: Successfully removed and unsubscribed Linux ChromiumOS Tests (dbg)(1)/543787
2017-06-22 18:15:09-0700 make stop
<Master was restarted here>
2017-06-22 08:16:09-0700 make start
2017-06-21 18:17:13-0700 [-] PubSub: Successfully recorded build request 543787/Linux ChromiumOS Tests (dbg)(1)

There is code to load pending builds when a master restarts here:
https://cs.chromium.org/chromium/build/scripts/master/pubsub_json_status_push.py?q=pubsub_json&sq=package:chromium&dr&l=590

It does so by calling builder.getPendingBuildRequestStatuses(), which in turn calls db.buildrequests.getBuildRequests(claimed=False, buildername=self.name):
https://cs.chromium.org/chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py?type=cs&q=getPendingBuildRequestStatuses&l=347
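
As a rough illustration of that restart path (a sketch only, assuming the pusher records each returned request as pending; record_pending is a hypothetical helper):

from twisted.internet import defer

@defer.inlineCallbacks
def load_pending_on_restart(push, builder_status):
    # getPendingBuildRequestStatuses() wraps
    # db.buildrequests.getBuildRequests(claimed=False, buildername=self.name),
    # so the result reflects the DB at the moment the query was issued, not
    # the moment the result is processed.
    requests = yield builder_status.getPendingBuildRequestStatuses()
    for req in requests:
        # If a build starts between the query and this loop, the entry
        # recorded here is already stale.
        push.record_pending(builder_status.getName(), req.brid)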

Looking at the db, that build request was:
submitted at 18:02:45
claimed at   18:46:25
completed at 18:53:45

And if we look at the actual builds this build request was associated with:

chromiumchromiumos=> select * from builds where brid = 543787;
   id   | number |  brid  | start_time | finish_time 
--------+--------+--------+------------+-------------
 544485 |  27493 | 543787 | 1498094196 |  1498096425
 544445 |  27492 | 543787 | 1498093554 |  1498094138

This single build request is responsible for two builds:
Before restart: https://build.chromium.org/p/chromium.chromiumos/builders/Linux%20ChromiumOS%20Tests%20%28dbg%29%281%29/builds/27492
After restart:  https://build.chromium.org/p/chromium.chromiumos/builders/Linux%20ChromiumOS%20Tests%20%28dbg%29%281%29/builds/27493

2017-06-21 18:16:36-0700 [-] PubSub: Status update (buildStarted): ('Linux ChromiumOS Tests (dbg)(1)', <BuildStatus #27493>)

I see, this is a race condition where things happen in this order:
1. Master starts.
2. DB request is made for pending builds.
3. Build starts; subscribers are notified.
4. Pending build is recorded and subscribed (and never exercised).
More useful timestamps:

2017-06-21 18:16:25-0700 [-] PubSub: Status update (builderAdded): ('Linux ChromiumOS Tests (dbg)(1)', <buildbot.status.builder.BuilderStatus instance at 0x661c3320>) 
2017-06-21 18:16:36-0700 [-] PubSub: Status update (buildStarted): ('Linux ChromiumOS Tests (dbg)(1)', <BuildStatus #27493>) 
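
This is not what the CL in comment 6 does (that CL adds the periodic full sync), but to make the window concrete, one narrow guard would be to re-check the claimed state immediately before recording the entry; record_pending is again a hypothetical helper:

from twisted.internet import defer

@defer.inlineCallbacks
def record_if_still_pending(push, master, builder_name, brid):
    # Ask the DB again for the currently-unclaimed requests for this builder
    # and only record ours if it is still in that set. This narrows the
    # window between steps 2 and 4 above but does not eliminate it, which is
    # why a periodic full sync is the more robust mitigation.
    unclaimed = yield master.db.buildrequests.getBuildRequests(
        claimed=False, buildername=builder_name)
    if any(br['brid'] == brid for br in unclaimed):
        push.record_pending(builder_name, brid)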

This failure mode should've been mitigated by the CL committed yesterday at 6:51PM. This master was restarted at 6:15PM, so it didn't pick up that change.
tryserver.chromium.linux was restarted last evening, and this appears fixed for that master. All that's left is to restart the remaining masters, and then we should be able to close this bug.
Blocking: 739537
Status: Fixed (was: Assigned)
