PFQ builders sometimes start late and time out (e.g. arm-generic-chromium-pfq, x86-generic-chromium-pfq)
Issue description

The last 3 arm-generic-chromium-pfq builds started more than 1 hour later than other builders. I'll keep an eye on the next set of builds, but there were 15 hours between the last two, so it is not at all clear what might be going on. I will try digging through the logs also.
Sep 19 2016
Sep 20 2016
The builder started on time the last couple of runs so I am lowering the priority, but it would be nice to understand why / how this happened.
Sep 20 2016
If you look at the builds, you notice some happened on different build slaves. These are the "floating" build slaves, and they have the job of filling in if the primary build slave is offline for an extended period of time so the whole CI system doesn't shut down. It looks like the main builder for that PFQ was offline for a bit. It seems to be online now. I suspect the reason for the delay is that a floating build slave was employed and either had to do a longer full checkout (since it's not dedicated) or had to wait for other builds also using the floating build slaves to finish.
Sep 20 2016
Thanks for the explanation. My guess would be that the builder had to wait for other builds to complete since I would expect the build to at least show up as started before the code is checked out. Would it be possible to add logic to the master to not abort still running slaves if there is no other pending build?
Sep 20 2016
> Would it be possible to add logic to the master to not abort still-running slaves if there is no other pending build?

The timeouts are generally set around the master/slave build pattern. If a slave doesn't time out and consistently takes longer than the build window to build, that's a larger problem. Realistically we should not have to share floating builders, because they only kick in when a primary builder is offline. If multiple primary builders die, that's also a larger problem.
Sep 20 2016
If "sharing floating builders" (let's pretend I understand what that means; I think I kind of do) is potentially causing this problem, I would just as soon have the build not start at all (hopefully calling attention to the primary builder failure) than have it start late and be doomed to fail anyway. dnj@, if you can think of anything we can do to make this type of failure more predictable, I think that would be better. If that seems like more work than it is worth, go ahead and resolve this WontFix.
Sep 21 2016
Each set of builders has two floating builders that pick up enqueued builds if their dedicated builders are offline. This problem can only exist when 3+ dedicated builders are offline, which is a serious failure state. What we can do is not let three dedicated builders simultaneously be broken.
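The capacity rule described above can be sketched as a small model. This is a hypothetical illustration of the described behavior (one dedicated slave per builder, a shared pool of two floating slaves), not the real buildbot scheduling code; all names here are made up.

```python
# Sketch of the floating-builder capacity model described above.
# Assumption: each builder has one dedicated slave, and a shared pool of
# FLOATING_POOL_SIZE floating slaves covers builders whose dedicated
# slave is offline. Builders beyond the pool's capacity are starved and
# start late (or time out), which matches the observed symptom.

FLOATING_POOL_SIZE = 2

def assign_builds(offline_dedicated):
    """Given builders whose dedicated slave is offline, return which
    builds the floating pool can cover and which are left waiting."""
    covered = offline_dedicated[:FLOATING_POOL_SIZE]
    starved = offline_dedicated[FLOATING_POOL_SIZE:]
    return covered, starved

# Two dedicated slaves offline: the pool absorbs both, nothing waits.
covered, starved = assign_builds(["arm-generic", "x86-generic"])
assert starved == []

# Three offline: one build must wait for a floating slave to free up,
# showing up as a late start and an eventual master-side timeout.
covered, starved = assign_builds(["arm-generic", "x86-generic", "amd64-generic"])
assert starved == ["amd64-generic"]
```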
Sep 21 2016
FTR, confirmed that #4 was the problem. Filed https://bugs.chromium.org/p/chromium/issues/detail?id=648987 to resolve the offline slaves.
Sep 21 2016
So, x86-generic-chromium-pfq is currently having a similar problem: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-generic-chromium-pfq

It was also "timed out" (by the master) after about an hour, twice, and started nearly 3 hours late both times. That builder appears to have 2 connected and 1 offline buildslave. The offline slave (build149-m2) is on the list in issue 648987. The other two:

https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build200-m2
https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build140-m2

do not appear especially over-subscribed, but it looks like both were maybe busy when the master started? Is it just that the number of offline primaries reached a tipping point, so that the shared backups are over-subscribed? This is causing a cascading failure state, which is bad. Any thoughts on how we could improve this? Maybe improve our method for detecting the offline slaves?
Sep 21 2016
> Do not appear especially over-subscribed, but it looks like both were maybe busy when the master started?

That'd do it.

> Is it just that the number of offline primaries reached a tipping point so that the shared backups are over-subscribed?

Yep.

> This is causing a cascading failure state which is bad.

Yep. In a world without floating PFQ builders (aka last month and backwards), an offline PFQ buildslave would have immediately broken that PFQ run. This was, is, and will always be a huge critical failure. An offline PFQ buildslave is P0 and needs to be resolved ASAP.

Now, an offline PFQ buildslave will not break a PFQ run, since a floating builder will jump in and help. This works until at least two other buildslaves are also offline. Maybe an offline PFQ slave is ~P1 now instead of P0, but it is still a big deal and needs to be resolved quickly.

In this case, it looks like 2+ PFQ buildslaves failed and went offline for days without any resolution. P0/P1 issues not being reported/resolved for days is why there is a problem right now.

> Any thoughts on how we could improve this? Maybe improve our method for detecting the offline slaves?

We should fire trooper/labs reports when a buildslave goes offline. Unfortunately, this is not as trivial as it sounds. It's a known problem to the monitoring team, and I have a specific bug tracking CrOS's need for this: https://bugs.chromium.org/p/chromium/issues/detail?id=638266

In the meantime, we are stuck with manual monitoring. The way to do this is:

1) Click the buildslaves tab: https://screenshot.googleplex.com/uRje21OFSdA.png
2) Search for the phrase "Not Connected". Identify buildslaves that are "Not Connected" and also have a 30+ minute "Last Heard From" time: https://screenshot.googleplex.com/AQ2QTAiBj7A.png https://screenshot.googleplex.com/6OyXhXWmDXk.png
3) File a bug for them at go/bug-a-trooper and assign it to the "Infra>Labs" component as P1 or P0, depending on immediate impact.
I have proposed building a system that scans for this and reports it automatically, but was told not to bother, since the monitoring team intends to solve the problem via the monitoring pipeline (see the aforementioned bug).
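The manual triage steps above (find slaves that are "Not Connected" with a stale "Last Heard From" time) could be sketched as a small filter. This is a hypothetical illustration only; the record shape and threshold are assumptions, not the real buildbot status API.

```python
# Sketch of the automated scan-and-report idea: flag buildslaves that are
# not connected and have not been heard from in 30+ minutes, mirroring
# the manual "Not Connected" / "Last Heard From" check described above.
# The dict shape is a made-up stand-in for the buildslaves status page.

OFFLINE_THRESHOLD_MINUTES = 30

def find_offline_slaves(slaves):
    """slaves: iterable of dicts with 'name', 'connected', 'last_heard_min'.
    Returns the names of slaves that should be reported as offline."""
    return [
        s["name"]
        for s in slaves
        if not s["connected"] and s["last_heard_min"] >= OFFLINE_THRESHOLD_MINUTES
    ]

# Example using the slaves discussed in this thread (status values invented).
slaves = [
    {"name": "build200-m2", "connected": True, "last_heard_min": 1},
    {"name": "build140-m2", "connected": True, "last_heard_min": 2},
    {"name": "build149-m2", "connected": False, "last_heard_min": 120},
]
assert find_offline_slaves(slaves) == ["build149-m2"]
```

Anything this scan flags would then be filed at go/bug-a-trooper against Infra>Labs, per step 3 above.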
Sep 21 2016
Thanks for the detailed info, I will put it into YAQS. I will go ahead and mark this as a duplicate of issue 638266, since addressing that should address the root cause here (offline builders causing failures).
Comment 1 by steve...@chromium.org, Sep 19 2016