New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 711778 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Chrome OS canary schedule is slipping

Reported by jrbarnette@chromium.org, Apr 14 2017

Issue description

Over the past day or two, the Chrome OS canaries have been starting
progressively later.  We've now reached the point where the finish
times of the late starts is past the next start time.  This can't
be good.

Below is the info from the buildbot page:
Apr 14 03:12	??	success	#3046	Build successful
Apr 13 19:08	??	success	#3045	Build successful
Apr 13 11:05	??	success	#3044	Build successful
Apr 13 03:00	??	success	#3043	Build successful
Apr 12 18:40	??	success	#3042	Build successful
Apr 12 10:36	??	success	#3041	Build successful
Apr 12 02:29	??	success	#3040	Build successful
Apr 11 18:12	??	success	#3039	Build successful
Apr 11 10:10	??	success	#3038	Build successful
Apr 11 02:09	??	success	#3037	Build successful
Apr 10 18:09	??	success	#3036	Build successful
Apr 10 10:08	??	success	#3035	Build successful
Apr 10 02:06	??	success	#3034	Build successful
Apr 09 18:05	??	success	#3033	Build successful
Apr 09 10:04	??	success	#3032	Build successful
Apr 09 02:03	??	success	#3031	Build successful
Apr 08 18:03	??	success	#3030	Build successful
Apr 08 10:03	??	success	#3029	Build successful
Apr 08 02:03	??	success	#3028	Build successful
Apr 07 18:01	??	success	#3027	Build successful

 
Cc: nxia@chromium.org dgarr...@chromium.org

Comment 2 by nxia@chromium.org, Apr 14 2017

/builds/3045

Start	Thu Apr 13 19:08:45 2017
End	Fri Apr 14 03:06:54 2017
Elapsed	7 hrs, 58 mins, 8 secs

/builds/3046

Start	Fri Apr 14 03:12:29 2017
End	Fri Apr 14 11:03:26 2017
Elapsed	7 hrs, 50 mins, 56 secs

The elapsed time of the master builds is expected (< 8 hours). The problem should be the next master starts 5 ~ 10 mins after the previous master completes. 
Owner: pprabhu@chromium.org
Status: Assigned (was: Available)
This is a cbuildbot issue.

http://shortn/_XJZN8W79pa

Basically:
- Master sets its own deadline (in this case 7:50 (http://shortn/_N98fDIxcem))
- Slaves, when they start, read the deadline set for master. They compute the time remaining and set their own deadline accordingly.
  - [BUG] To not get weird failures when deadline is too soon, we add 20 minutes to the slave deadline as compared the deadline set by master.
- Slaves abort themselves on deadline.
- Master waits for all slaves and then finishes.

Notice how the canary schedule seems to shift by 10 minutes (ie, it takes 8:10 to finish), that's (7:50 + 20) for the longest running slaves.

What confuses me is that this will always happen as long as there is even a single slave that times out. Why are we only seeing it now? Are we getting more likely to get slave timeouts lately?

Anyway, suggested fix is to only add 20 minutes. if the deadline is too small. This will still take care of corner cases where the slave is given a very small time to finish while not adding the time artificially each time.
We sometimes start the master late after waterfall restarts (coordination of restarts is hard).

Also, we reboot the master between builds, which might take a couple of minutes.

So... if the master is taking 7:58 minutes to run, and we get behind once, we may never catch up.
Re #4: Without the bug I mentioned, master is intended to take 7:50 precisely to catch up in these situations.

Comment 6 Deleted

We restarted the canary on 4/14, but over the weekend, we started slipping
again:

Apr 17 10:37	??	retry		#3056	Build successful exception slave lost exception canarycompletion
Apr 17 02:36	??	success		#3055	Build successful
Apr 16 18:30	??	success		#3054	Build successful
Apr 16 10:27	??	success		#3053	Build successful
Apr 16 02:23	??	success		#3052	Build successful
Apr 15 18:19	??	success		#3051	Build successful
Apr 15 10:13	??	success		#3050	Build successful
Apr 15 02:08	??	success		#3049	Build successful
Apr 14 18:00	??	success		#3048	Build successful
Apr 14 11:09	??	exception	#3047	Exception steps failed branchutiltest failed canarycompletion

The canary schedule is now two hours later than targeted schedule.

Looking at c#3, c#4, and c#5:  It seems that there's a bug that
we don't restart fast enough, right?  But if I understand properly,
that bug is triggered by a build that's taking too close to the
8 hour time limit.  So:  Why are the canaries suddenly taking so
much longer?

Status: Started (was: Assigned)
Posted https://chromium-review.googlesource.com/c/487101/ to fix the bug mentioned in c#3

I'll hand it over to the deputy afterwards to figure out why slaves are taking too long.

Comment 10 by nxia@chromium.org, Apr 25 2017

#2 isn't strictly correct. I do find a build which took 8h 14m

https://uberchromegw.corp.google.com/i/chromeos/builders/master-release/builds/3042

Comment 11 by nxia@chromium.org, Apr 25 2017

Re #3:

"
by master.
- Slaves abort themselves on deadline.
- Master waits for all slaves and then finishes.
"

1) As I saw from the slave builds, all the slaves finished/aborted on deadline.
2) the master doesn't wait for all slaves to finish, instead it has its own timeout at waiting. I think the problem is the masters used up their time at Completion stage to wait for the slaves, which means the earlier stages + completion stage used 8h, report stage used ~ 5 mins, gap between the finishing time of the old master and the starting time of the new master took ~5 mins.

When every CQ master takes 8h + ~10 mins, we have the slipping canary builds.


 
https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/stages/completion_stages.py?type=cs&q=GetTimeToDeadline&sq=package:%5Echromeos_(internal%7Cpublic)$&l=195

Cc: -dgarr...@chromium.org
Owner: dgarr...@chromium.org
nxia@ and I looked at this together. The root cause is something entirely different:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-release/builds/3042

For this build, the master took 8:14.

- 18:40 (cbuildbot starts)
- 18:59 (BuildStart stage starts, sets deadline to 7:50 for master)
- 02:50 (CanaryCompletion bails after 7:50 as expected)
- 02:54 (Build finishes after ~4 minutes in ReportStage as expected)

The problem is that 19 minutes are spent _before_ BuildStart, and before we've had a chance to set our deadline.

This is because the new bootstrap mechanism added by dgarrett@ syncs all of chromeos sources before cbuildbot even runs. This takes 19 minutes.

We need some way to account for this time spent before cbuildbot starts in the deadline computation for cbuildbot.
Project Member

Comment 13 by bugdroid1@chromium.org, Apr 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/e49f2aa32465d274fc145b1ba3c5df53be68a506

commit e49f2aa32465d274fc145b1ba3c5df53be68a506
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Apr 26 01:40:07 2017

cbuildbot: Set a minimum slave timeout.

Before this CL, we added 20 seconds to the timeout requested for a slave
build by the master build. This is only necessary if the requested
timeout is too small, and causes problems when the timeout is large and
the slave actually times out.
Instead, we now only cap the minimum.

BUG= chromium:711778 
TEST=unittests

Change-Id: Ibccee72c17a3aa4a79181560f188cf0dffa5724a
Reviewed-on: https://chromium-review.googlesource.com/487101
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/e49f2aa32465d274fc145b1ba3c5df53be68a506/scripts/cbuildbot.py

Since the change above, the canary builds have begun to trend earlier,
meaning they're headed towards "back on schedule."  At current rates,
we'd get back on track somewhere around the weekend, I think.

Apr 25 20:48	??	success	#3081	Build successful
Apr 25 13:22	??	success	#3080	Build successful
Apr 25 05:19	??	success	#3079	Build successful
Apr 24 21:12	??	success	#3078	Build successful
Apr 24 13:09	??	success	#3077	Build successful
Apr 24 05:03	??	success	#3076	Build successful
Apr 23 21:00	??	success	#3075	Build successful
Apr 23 12:54	??	success	#3074	Build successful
Apr 23 04:49	??	success	#3073	Build successful
Project Member

Comment 15 by bugdroid1@chromium.org, Apr 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/125d4dc1d7a51e30f51fc9175a66d61629a54dfa

commit 125d4dc1d7a51e30f51fc9175a66d61629a54dfa
Author: Don Garrett <dgarrett@google.com>
Date: Fri Apr 28 01:40:28 2017

cbuildbot_launch: Add buildbot stage tags.

Wrap major tasks done by cbuildbot_launch with stage tags, so they
will appear on the waterfall.

Also, treat not specifying a branch as an explicit 'master' for
cleanup / logging purposes. We have a few builders that explicitly set
the branch 'master', and some that get master by not specifying a
branch at all. I was hesitant to treat them identically, but it should
be okay.

BUG= chromium:711778 
TEST=Unittests

Change-Id: Ib068893fcd4211068a6448119aefbe72e5f06893
Reviewed-on: https://chromium-review.googlesource.com/487030
Tested-by: Don Garrett <dgarrett@chromium.org>
Trybot-Ready: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/125d4dc1d7a51e30f51fc9175a66d61629a54dfa/scripts/cbuildbot_launch.py
[modify] https://crrev.com/125d4dc1d7a51e30f51fc9175a66d61629a54dfa/scripts/cbuildbot_launch_unittest.py

There are now waterfall stages breaking out performance for the major launcher tasks. However, those performance numbers are NOT gathered anywhere outside of the waterfall.
Status: Fixed (was: Started)
... aaaand we're back on schedule:

May 01 02:00	??	success	#3097	Build successful
Apr 30 18:00	??	success	#3096	Build successful
Apr 30 10:02	??	success	#3095	Build successful
Apr 30 02:00	??	success	#3094	Build successful
Apr 29 18:00	??	success	#3093	Build successful
Apr 29 10:18	??	success	#3092	Build successful
Apr 29 02:24	??	success	#3091	Build successful
Apr 28 18:18	??	success	#3090	Build successful
Apr 28 11:01	??	success	#3089	Build successful
Apr 28 02:55	??	success	#3088	Build successful
Apr 27 19:28	??	success	#3087	Build successful
Apr 27 12:16	??	success	#3086	Build successful
Apr 27 04:54	??	success	#3085	Build successful
Apr 26 20:51	??	success	#3084	Build successful
Apr 26 12:50	??	success	#3083	Build successful
Apr 26 04:49	??	success	#3082	Build successful
Apr 25 20:48	??	success	#3081	Build successful

I don't see any build timeouts in recent runs either, so I'm
declaring victory here.

Sign in to add a comment