Milo showing very old pending builds |
||||
Issue descriptionMilo is again showing very old pending builds which have completed a long time ago. See https://luci-milo.appspot.com/buildbot/tryserver.chromium.linux/linux_chromium_rel_ng/ and the attached screenshot.
,
Jun 21 2017
,
Jun 22 2017
This seems to keep happening. I *really* like Milo this is one of the primary reasons I can't trust milo at the moment. Could we get some testing or alerts or something around this?
,
Jun 22 2017
,
Jun 22 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager/+/fa1d89b5debb306a33ab325afc6ff8100cd02982 commit fa1d89b5debb306a33ab325afc6ff8100cd02982 Author: Ryan Tseng <hinoka@google.com> Date: Thu Jun 22 02:12:38 2017
,
Jun 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/2f54f0dafd824a9aef3be3a153548436ac95cdf6 commit 2f54f0dafd824a9aef3be3a153548436ac95cdf6 Author: Ryan Tseng <hinoka@google.com> Date: Thu Jun 22 01:51:59 2017 Buildbot Pubsub: Do a full pending builds sync every 10 minutes Bug: 728916 Change-Id: I97bd3f726d01a0de141028e534f68c6ae9ba072b Tests: Ran locally, didn't crash Reviewed-on: https://chromium-review.googlesource.com/543855 Commit-Queue: Ryan Tseng <hinoka@chromium.org> Reviewed-by: Daniel Jacques <dnj@chromium.org> [modify] https://crrev.com/2f54f0dafd824a9aef3be3a153548436ac95cdf6/scripts/master/pubsub_json_status_push.py
,
Jun 22 2017
chromium.perf looks fine now, but tryserver.chromium.linux is still exhibiting this behavior. Re: Testing and alerts - This is difficult to test since it's current not reproducible. Alerts would be a good idea, but first I'm going to figure out a root cause or mitigation.
,
Jun 22 2017
Looking at the logs (it's easier to search through since tryserver.chromium.linux was restarted recently) these pending build requests were added at the time the master was restarted, probably as carry-over from the previous instance.
,
Jun 22 2017
This looks like a good example of how it is happening:
2017-06-21 18:02:44-0700 [-] added buildset 115659 to database (build requests: {u'Linux ChromiumOS Tests (dbg)(1)': 543787})
2017-06-21 18:02:44-0700 [-] Processing new build request: {'bsid': 115659, 'buildername': u'Linux ChromiumOS Tests (dbg)(1)', 'brid': 543787}
2017-06-21 18:02:44-0700 [-] PubSub: Status update (requestSubmitted): Linux ChromiumOS Tests (dbg)(1)/543787
2017-06-21 18:02:44-0700 [-] PubSub: Successfully recorded build request 543787/Linux ChromiumOS Tests (dbg)(1)
2017-06-21 18:05:54-0700 [-] PubSub: Successfully removed and unsubscribed Linux ChromiumOS Tests (dbg)(1)/543787
2017-06-22 18:15:09-0700 make stop
<Master was restarted here>
2017-06-22 08:16:09-0700 make start
2017-06-21 18:17:13-0700 [-] PubSub: Successfully recorded build request 543787/Linux ChromiumOS Tests (dbg)(1)
There is code to load pending builds when a master restarts here:
https://cs.chromium.org/chromium/build/scripts/master/pubsub_json_status_push.py?q=pubsub_json&sq=package:chromium&dr&l=590
It does so by calling: builder.getPendingBuildRequestStatuses()
Which works by calling db.buildrequests.getBuildRequests(claimed=False, buildername=self.name)
https://cs.chromium.org/chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py?type=cs&q=getPendingBuildRequestStatuses&l=347
Looking at the db, that build request was:
submitted at 18:02:45
claimed at 18:46:25
completed at 18:53:45
And if we look at the actual build this br was associated with:
chromiumchromiumos=> select * from builds where brid = 543787;
id | number | brid | start_time | finish_time
--------+--------+--------+------------+-------------
544485 | 27493 | 543787 | 1498094196 | 1498096425
544445 | 27492 | 543787 | 1498093554 | 1498094138
This is responsible for two builds.
Before restart: https://build.chromium.org/p/chromium.chromiumos/builders/Linux%20ChromiumOS%20Tests%20%28dbg%29%281%29/builds/27492
After restart: https://build.chromium.org/p/chromium.chromiumos/builders/Linux%20ChromiumOS%20Tests%20%28dbg%29%281%29/builds/27493
,
Jun 22 2017
2017-06-21 18:16:36-0700 [-] PubSub: Status update (buildStarted): ('Linux ChromiumOS Tests (dbg)(1)', <BuildStatus #27493>)
I see, this is a race condition where things happen in this order:
1. Master starts
2. DB request made for pending builds
3. Build starts, subscribers notified
4. pending build is recorded and subscribed (and never exercised)
,
Jun 22 2017
More useful timestamps:
2017-06-21 18:16:25-0700 [-] PubSub: Status update (builderAdded): ('Linux ChromiumOS Tests (dbg)(1)', <buildbot.status.builder.BuilderStatus instance at 0x661c3320>)
2017-06-21 18:16:36-0700 [-] PubSub: Status update (buildStarted): ('Linux ChromiumOS Tests (dbg)(1)', <BuildStatus #27493>)
,
Jun 22 2017
This failure mode should've been mitigated by the CL committed yesterday at 6:51PM. This master was restarted at 6:15PM, so it didn't get picked up.
,
Jun 23 2017
tryserver.chromium.linux was restarted last evening and this appears fixed for that master. All thats left is to restart masters and we should be able to close this bug.
,
Jul 6 2017
,
Aug 2 2017
|
||||
►
Sign in to add a comment |
||||
Comment 1 by hinoka@chromium.org
, Jun 2 2017Status: Assigned (was: Untriaged)