pre-cq-launcher is failing to be triggered
Issue description: I forced a run, but it will finish on its own in a few hours and not be replaced. Likely something is wrong with luci scheduler. Filed for deputies to follow up on in the morning.
,
Feb 16 2018
+davidriley noticed this for som dispatcher
,
Feb 16 2018
I'll be looking into this shortly on the luci-scheduler side (I'm both trooper and luci-scheduler owner).
,
Feb 16 2018
I kicked off pre-cq-launcher manually a few more times. Assigning to tandrii for followup.
,
Feb 16 2018
split off issue 813144 for go/som/chromeos dispatcher.
,
Feb 16 2018
Ah, CQ builder got stuck during yesterday's memcache/appengine outage: https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher (https://screenshot.googleplex.com/sW33WTSD1o9)
,
Feb 16 2018
Was the workaround to just pause/unpause? Or something else?
,
Feb 16 2018
Abort, then Pause, then Unpause should be sufficient. I'll look into why it didn't recover on its own.
,
Feb 16 2018
I did "abort" + "run now". Unfortunately, there are a few more jobs in the same state. I haven't touched them while we figure out what's wrong. According to the debug log, the scheduler kept checking buildbucket, which returned build status "STARTED" right up to my manual abort. (UTC)
[18:45:20.946] Scheduling timer "check-buildbucket-build-status" (chromiumos-chromite/pre-cq-launcher:9119933303029759984:1427:0) after 1m0s
[18:46:21.129] Timer tick, asking Buildbucket for the build status
[18:46:21.215] Build 8954503242156173472: status "STARTED", result "", failure_reason "", cancelation_reason ""
[18:46:21.215] Scheduling timer "check-buildbucket-build-status" (chromiumos-chromite/pre-cq-launcher:9119933303029759984:1428:0) after 1m0s
[18:46:42.679] Invocation is manually aborted by user:tandrii@google.com
[18:46:42.679] Invocation finished in 24h52m18.909282802s with status ABORTED
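For readers following the log, here is a minimal sketch of the polling pattern it implies: on each timer tick the status is checked, and the timer is re-armed unless the build has completed. This is not luci-scheduler's code. `poll_build`, `max_ticks`, and the status sequences are hypothetical names and values chosen for illustration; only the one-minute interval and the status strings come from the thread.

```python
from itertools import repeat
from typing import Iterable, Tuple

POLL_INTERVAL_SECONDS = 60  # mirrors the "after 1m0s" in the log above


def poll_build(statuses: Iterable[str], max_ticks: int) -> Tuple[int, str]:
    """Consume one status per timer tick; stop on COMPLETED or after max_ticks.

    Returns (ticks_used, last_status). A real poller would wait
    POLL_INTERVAL_SECONDS between ticks; the wait is omitted here.
    """
    ticks = 0
    last = ""
    for status in statuses:
        if ticks >= max_ticks:
            break  # hypothetical cap so the simulation terminates
        ticks += 1
        last = status
        if status == "COMPLETED":
            break  # terminal status: no further timer is scheduled
    return ticks, last


# A build that never leaves STARTED keeps the loop going until the cap.
stuck = poll_build(repeat("STARTED"), max_ticks=5)
# A healthy build reaches COMPLETED and polling stops on its own.
healthy = poll_build(["SCHEDULED", "STARTED", "COMPLETED"], max_ticks=5)
```

With no cap and a status that never becomes terminal, a loop like this only ends on a manual abort, which matches the 24h52m invocation above.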
,
Feb 16 2018
Which other builders are in the same state? I'd like to confirm it's okay to wait a while before fixing them.
,
Feb 16 2018
+nodir@ PTAL. Here is a job still "running": https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/link-depthcharge-full-firmware/9119927991341039344
Hm, buildbucket actually tells us that the job is running! https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8954497930429868944&_h=1&
{ "build": { "status": "STARTED", "created_ts": "1518722329644670", "canary_preference": "AUTO", ...
But the job URL shows it finished a long time ago: https://uberchromegw.corp.google.com/i/chromeos/builders/link-depthcharge-full-firmware/builds/47451
I think that's because the buildbot master gave up on retrying when reporting job completion to buildbucket during the AppEngine outage.
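To see how long that build has claimed to be "STARTED", the `created_ts` value can be converted to a readable time. A small sketch, assuming the field is microseconds since the Unix epoch (its magnitude suggests so); `parse_created_ts` is a hypothetical helper, not part of any buildbucket client.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_created_ts(created_ts: str) -> datetime:
    """Convert a microseconds-since-epoch string to an aware UTC datetime.

    Integer arithmetic is used so microsecond precision is not lost.
    """
    return EPOCH + timedelta(microseconds=int(created_ts))


# Value copied from the buildbucket.get response quoted above.
created = parse_created_ts("1518722329644670")
print(created.isoformat())
```

Under that assumption the build was created on Feb 15 2018 at about 19:18 UTC, the day of the outage.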
,
Feb 16 2018
https://uberchromegw.corp.google.com/i/chromeos/builders/som-dispatcher is still in this state. Can we unblock this one?
,
Feb 16 2018
Good point, I didn't realize that's how SoM dispatcher worked for you. And done: https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/som-dispatcher
,
Feb 16 2018
,
Feb 16 2018
I've unblocked similarly stuck jobs in https://luci-scheduler.appspot.com/jobs/chromiumos-chromite
Remaining items:
(1) figure out if buildbucket can notice buildbot jobs that finished long ago
(2) urge you (ChromeOS) to move these builders off buildbot :)
,
Feb 16 2018
,
Feb 16 2018
We are working on 2, but with no end of distractions.
,
Feb 16 2018
Yesterday, because of the AppEngine outage, I disabled a cron job that resets builds with expired leases. I've just restored the cron job. It should reset that build.
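As a rough illustration of what such a cron job does, here is a sketch under stated assumptions. The `Build` fields, the `reset_expired` function, and the choice to reset to "SCHEDULED" are all hypothetical; this is not buildbucket's implementation or data model, and the build IDs are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Build:
    build_id: int
    status: str  # "SCHEDULED", "STARTED", or "COMPLETED"
    lease_expiration_ts: Optional[int]  # microseconds since epoch; None if unleased


def reset_expired(builds: List[Build], now_ts: int) -> List[int]:
    """Reset STARTED builds whose lease has expired; return the IDs that were reset."""
    reset_ids = []
    for build in builds:
        expired = (
            build.lease_expiration_ts is not None
            and build.lease_expiration_ts <= now_ts
        )
        if build.status == "STARTED" and expired:
            # Assumption: a reset returns the build to the queue unleased.
            build.status = "SCHEDULED"
            build.lease_expiration_ts = None
            reset_ids.append(build.build_id)
    return reset_ids


builds = [
    Build(1, "STARTED", lease_expiration_ts=100),    # lease expired, gets reset
    Build(2, "STARTED", lease_expiration_ts=900),    # lease still valid
    Build(3, "COMPLETED", lease_expiration_ts=None)  # finished, left alone
]
reset_ids = reset_expired(builds, now_ts=500)
```

While a job like this is disabled, a build whose holder stopped reporting stays "STARTED" indefinitely, which is consistent with the stuck jobs described in this thread.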
,
Feb 16 2018
So, (1) is handled, and ChromeOS, as I very well knew, is working towards (2).
,
Feb 16 2018
Actually, this is still broken. Pre-cq-launcher wasn't running, so I kicked it off again.
,
Feb 17 2018
Hm, https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher/9119838041485698496 says the job that the scheduler triggered through buildbucket is still pending.
,
Feb 17 2018
I think the buildbucket agent inside the buildbot master is not working well.
,
Feb 17 2018
There is one easy way out: a master restart. Or we can spend hours trying to see what's wrong in the buildbucket agent and try to fix it w/o a restart, but I think the chances of success are slim. How do you feel about a master restart?
,
Feb 17 2018
master restart sounds good to me, if you think it will fix it. I don't have a good enough mental model of what's going wrong.
,
Feb 17 2018
I aborted the build through the scheduler again, then killed the in-progress build. The scheduler did create a new build. I killed it through buildbot shortly after that, and the scheduler created another build.
,
Feb 17 2018
A restart right now would kill a canary run, so it's kinda heavy, but it's better than being down all weekend.
,
Feb 17 2018
Hm, so it appears buildbucket is working correctly now. Thanks, Don! I'm still puzzled why buildbucket reported back build status "SCHEDULED" for 4 hours until you aborted it through luci-scheduler.
,
Feb 17 2018
OK, should we close this or is there room for investigation that's worth doing?
,
Feb 17 2018
Let's keep this bug open. If this buildbucket build finishes and the scheduler triggers another one, then this is resolved. Otherwise, let's restart the master. When is the canary run going to finish?
,
Feb 17 2018
About 5:30.
,
Feb 17 2018
https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher has been running since yesterday.
Comment 1 by akes...@chromium.org
, Feb 16 2018