Issue 812978

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




pre-cq-launcher is failing to be triggered

Project Member Reported by akes...@chromium.org, Feb 16 2018

Issue description

I forced a run, but this will finish on its own in a few hours and not be replaced. Likely something wrong with luci scheduler. Filed for deputies to follow up on in the morning.
 
Labels: -Pri-0 Pri-1
Actually, by clicking Force Build a few times, I built up a backlog of requests that should last throughout the night. https://uberchromegw.corp.google.com/i/chromeos/builders/pre-cq-launcher

Downgrading to P1.
Cc: davidri...@chromium.org
+davidriley noticed this for som dispatcher

Cc: tandrii@chromium.org
I'll be looking into this shortly on the luci-scheduler side (I'm both trooper and luci-scheduler owner).

Comment 4 by pho...@chromium.org, Feb 16 2018

Owner: tandrii@chromium.org
I kicked off pre-cq-launcher manually a few more times. Assigning to tandrii for followup.

Comment 5 by pho...@chromium.org, Feb 16 2018

Split off issue 813144 for the go/som/chromeos dispatcher.
Components: Infra>Platform>Scheduler
Status: Started (was: Assigned)
Ah, CQ builder got stuck during yesterday's memcache/appengine outage: https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher (https://screenshot.googleplex.com/sW33WTSD1o9)

Was the workaround to just pause/unpause? Or something else?
Abort, then Pause, then Unpause should be sufficient.

I'll look into why it didn't recover on its own.
I did "abort" + "run now". Unfortunately, there are a few more jobs in the same state. I haven't touched them while we figure out what's wrong. According to the debug log, the scheduler kept checking buildbucket, which returned status "STARTED" right up to my manual abort.

(UTC)
[18:45:20.946] Scheduling timer "check-buildbucket-build-status" (chromiumos-chromite/pre-cq-launcher:9119933303029759984:1427:0) after 1m0s
[18:46:21.129] Timer tick, asking Buildbucket for the build status
[18:46:21.215] Build 8954503242156173472: status "STARTED", result "", failure_reason "", cancelation_reason ""
[18:46:21.215] Scheduling timer "check-buildbucket-build-status" (chromiumos-chromite/pre-cq-launcher:9119933303029759984:1428:0) after 1m0s
[18:46:42.679] Invocation is manually aborted by user:tandrii@google.com
[18:46:42.679] Invocation finished in 24h52m18.909282802s with status ABORTED
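
For readers following the log, the behavior it shows can be sketched roughly as below. This is an illustrative Python sketch, not the luci-scheduler implementation: `poll_build`, `get_status`, `aborted`, and `max_ticks` are hypothetical names, and treating anything other than "SCHEDULED" or "STARTED" as terminal is an assumption. The point is that if buildbucket keeps answering "STARTED", nothing ends the invocation except a manual abort.

```python
from typing import Callable, Tuple

# Statuses seen in this thread that keep the scheduler polling.
# Assumption: any other status is treated as terminal.
NON_TERMINAL = {"SCHEDULED", "STARTED"}


def poll_build(get_status: Callable[[], str],
               aborted: Callable[[], bool],
               max_ticks: int) -> Tuple[str, int]:
    """Poll until the build leaves a non-terminal status or a user aborts.

    Returns (outcome, ticks_used). max_ticks only bounds this demo; the
    invocation in the log had no such bound and ran for over 24 hours.
    """
    for tick in range(1, max_ticks + 1):
        if aborted():
            return "ABORTED", tick
        status = get_status()
        if status not in NON_TERMINAL:
            return status, tick
        # The real scheduler re-arms a 1m0s timer here; the demo does not sleep.
    return "STILL_POLLING", max_ticks


# A build that buildbucket always reports as STARTED never finishes by itself.
stuck = poll_build(lambda: "STARTED", lambda: False, max_ticks=5)

# The same build ends as soon as a user aborts (here, on the third tick).
abort_flags = iter([False, False, True])
manual = poll_build(lambda: "STARTED", lambda: next(abort_flags), max_ticks=5)
```

Under these assumptions, `stuck` never reaches a terminal outcome, while `manual` ends as ABORTED, matching the last two log lines.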


Which other builders are in the same state? I'd like to confirm it's okay to wait a while before fixing them.
+nodir@ PTAL.

Here is a job still "running": https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/link-depthcharge-full-firmware/9119927991341039344 

Hm, buildbucket actually tells us that job is running!
https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8954497930429868944&_h=1&

{
 "build": {
  "status": "STARTED",
  "created_ts": "1518722329644670",
  "canary_preference": "AUTO",
...

but the job URL shows it finished a long time ago:
https://uberchromegw.corp.google.com/i/chromeos/builders/link-depthcharge-full-firmware/builds/47451

I think that's because the buildbot master gave up on retrying when reporting job completion to buildbucket during the AppEngine outage.
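
As a side note on reading the response above: assuming `created_ts` is microseconds since the Unix epoch (an assumption about the buildbucket v1 response format, not something stated in this thread), it can be converted to a UTC time with a few lines of Python. `ts_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ts_to_utc(created_ts: str) -> datetime:
    """Convert a microsecond Unix timestamp string to an aware UTC datetime.

    Assumption: the value is microseconds since 1970-01-01 00:00:00 UTC.
    """
    return EPOCH + timedelta(microseconds=int(created_ts))


# Value copied from the buildbucket response quoted above.
created = ts_to_utc("1518722329644670")
print(created.isoformat())
```

Under that assumption the build was created on 2018-02-15 at 19:18:49 UTC, the day before this issue was filed.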
https://uberchromegw.corp.google.com/i/chromeos/builders/som-dispatcher is still in this state.  Can we unblock this one?
Good point, I didn't realize that's how the SoM dispatcher worked for you. Done: https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/som-dispatcher

Cc: snanda@chromium.org
 Issue 813144  has been merged into this issue.
I've unblocked similarly stuck jobs in https://luci-scheduler.appspot.com/jobs/chromiumos-chromite 

Remaining items:
(1) figure out if buildbucket can notice buildbot jobs that finished long ago
(2) urge you (ChromeOS) to move these builders off buildbot :)
Labels: Infra-Troopers
We are working on 2, but with no end of distractions.

Comment 18 by no...@chromium.org, Feb 16 2018

Yesterday, because of the AppEngine outage, I disabled a cron job that resets builds with expired leases. I've just restored the cron job. It should reset that build.
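
A rough sketch of what such a lease-expiry reset might look like, for readers unfamiliar with the idea. This is not the buildbucket code: the `Build` fields, the `reset_expired_leases` name, and the choice to put a reset build back to "SCHEDULED" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Build:
    id: int
    status: str
    # Seconds since the epoch at which the lease lapses; None means not leased.
    lease_expires_at: Optional[float]


def reset_expired_leases(builds: List[Build], now: float) -> List[int]:
    """Reset STARTED builds whose lease has lapsed; return the ids that were reset.

    Assumption: a reset build goes back to SCHEDULED with no lease, so that
    it can be picked up again.
    """
    reset_ids = []
    for build in builds:
        expired = (build.lease_expires_at is not None
                   and build.lease_expires_at <= now)
        if build.status == "STARTED" and expired:
            build.status = "SCHEDULED"
            build.lease_expires_at = None
            reset_ids.append(build.id)
    return reset_ids


builds = [
    Build(id=1, status="STARTED", lease_expires_at=100.0),   # lease lapsed
    Build(id=2, status="STARTED", lease_expires_at=300.0),   # lease still valid
    Build(id=3, status="COMPLETED", lease_expires_at=50.0),  # already finished
]
reset = reset_expired_leases(builds, now=200.0)
```

With the cron job disabled, nothing performs this sweep, so a build whose master stopped reporting would stay "STARTED" indefinitely.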
Status: Fixed (was: Started)
so, (1) is handled, and ChromeOS, as I very well knew, is working towards (2).
Status: Assigned (was: Fixed)
Actually, this is still broken. Pre-cq-launcher wasn't running, so I kicked it off again.
Hm, https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher/9119838041485698496 says a job the scheduler triggered through buildbucket is still pending.
I think the buildbucket agent inside the buildbot master is not working well.
Components: -Infra>Platform>Scheduler Infra>Platform>Buildbot Infra>Platform>Buildbucket
There is one easy way out: a master restart. Or we can spend hours trying to see what's wrong in the buildbucket agent and try to fix it without a restart, but I think the chances of success are slim.

How do you feel about a master restart?
A master restart sounds good to me, if you think it will fix it. I don't have a good enough mental model of what's going wrong.
I aborted the build through the scheduler again, then killed the in-progress build.

The scheduler did create a new build. I killed it through buildbot shortly after that, and the scheduler created another build.
A restart right now would kill a canary run, so it's kinda heavy, but it's better than being down all weekend.
Hm, so it appears buildbucket is working correctly now. Thanks, Don!

I'm still puzzled why buildbucket reported back build status "SCHEDULED" for 4 hours until you aborted it through luci-scheduler.
OK, should we close this or is there room for investigation that's worth doing?
Let's keep this bug open. If this buildbucket build finishes and the scheduler triggers another one, then this is resolved. Otherwise, let's restart the master.
When is the canary run going to finish?
About 5:30.
Status: Fixed (was: Assigned)
https://luci-scheduler.appspot.com/jobs/chromiumos-chromite/pre-cq-launcher has been running since yesterday.
