Chrome OS canary schedule is slipping
Reported by
jrbarnette@chromium.org,
Apr 14 2017
|
|||||
Issue descriptionOver the past day or two, the Chrome OS canaries have been starting progressively later. We've now reached the point where the finish times of the late starts is past the next start time. This can't be good. Below is the info from the buildbot page: Apr 14 03:12 ?? success #3046 Build successful Apr 13 19:08 ?? success #3045 Build successful Apr 13 11:05 ?? success #3044 Build successful Apr 13 03:00 ?? success #3043 Build successful Apr 12 18:40 ?? success #3042 Build successful Apr 12 10:36 ?? success #3041 Build successful Apr 12 02:29 ?? success #3040 Build successful Apr 11 18:12 ?? success #3039 Build successful Apr 11 10:10 ?? success #3038 Build successful Apr 11 02:09 ?? success #3037 Build successful Apr 10 18:09 ?? success #3036 Build successful Apr 10 10:08 ?? success #3035 Build successful Apr 10 02:06 ?? success #3034 Build successful Apr 09 18:05 ?? success #3033 Build successful Apr 09 10:04 ?? success #3032 Build successful Apr 09 02:03 ?? success #3031 Build successful Apr 08 18:03 ?? success #3030 Build successful Apr 08 10:03 ?? success #3029 Build successful Apr 08 02:03 ?? success #3028 Build successful Apr 07 18:01 ?? success #3027 Build successful
,
Apr 14 2017
/builds/3045 Start Thu Apr 13 19:08:45 2017 End Fri Apr 14 03:06:54 2017 Elapsed 7 hrs, 58 mins, 8 secs /builds/3046 Start Fri Apr 14 03:12:29 2017 End Fri Apr 14 11:03:26 2017 Elapsed 7 hrs, 50 mins, 56 secs The elapsed time of the master builds is expected (< 8 hours). The problem should be the next master starts 5 ~ 10 mins after the previous master completes.
,
Apr 14 2017
This is a cbuildbot issue. http://shortn/_XJZN8W79pa Basically: - Master sets its own deadline (in this case 7:50 (http://shortn/_N98fDIxcem)) - Slaves, when they start, read the deadline set for master. They compute the time remaining and set their own deadline accordingly. - [BUG] To not get weird failures when deadline is too soon, we add 20 minutes to the slave deadline as compared the deadline set by master. - Slaves abort themselves on deadline. - Master waits for all slaves and then finishes. Notice how the canary schedule seems to shift by 10 minutes (ie, it takes 8:10 to finish), that's (7:50 + 20) for the longest running slaves. What confuses me is that this will always happen as long as there is even a single slave that times out. Why are we only seeing it now? Are we getting more likely to get slave timeouts lately? Anyway, suggested fix is to only add 20 minutes. if the deadline is too small. This will still take care of corner cases where the slave is given a very small time to finish while not adding the time artificially each time.
,
Apr 14 2017
We sometimes start the master late after waterfall restarts (coordination of restarts is hard). Also, we reboot the master between builds, which might take a couple of minutes. So... if the master is taking 7:58 minutes to run, and we get behind once, we may never catch up.
,
Apr 14 2017
Re #4: Without the bug I mentioned, master is intended to take 7:50 precisely to catch up in these situations.
,
Apr 18 2017
We restarted the canary on 4/14, but over the weekend, we started slipping again: Apr 17 10:37 ?? retry #3056 Build successful exception slave lost exception canarycompletion Apr 17 02:36 ?? success #3055 Build successful Apr 16 18:30 ?? success #3054 Build successful Apr 16 10:27 ?? success #3053 Build successful Apr 16 02:23 ?? success #3052 Build successful Apr 15 18:19 ?? success #3051 Build successful Apr 15 10:13 ?? success #3050 Build successful Apr 15 02:08 ?? success #3049 Build successful Apr 14 18:00 ?? success #3048 Build successful Apr 14 11:09 ?? exception #3047 Exception steps failed branchutiltest failed canarycompletion
,
Apr 21 2017
The canary schedule is now two hours later than targeted schedule. Looking at c#3, c#4, and c#5: It seems that there's a bug that we don't restart fast enough, right? But if I understand properly, that bug is triggered by a build that's taking too close to the 8 hour time limit. So: Why are the canaries suddenly taking so much longer?
,
Apr 25 2017
Posted https://chromium-review.googlesource.com/c/487101/ to fix the bug mentioned in c#3 I'll hand it over to the deputy afterwards to figure out why slaves are taking too long.
,
Apr 25 2017
#2 isn't strictly correct. I do find a build which took 8h 14m https://uberchromegw.corp.google.com/i/chromeos/builders/master-release/builds/3042
,
Apr 25 2017
Re #3: " by master. - Slaves abort themselves on deadline. - Master waits for all slaves and then finishes. " 1) As I saw from the slave builds, all the slaves finished/aborted on deadline. 2) the master doesn't wait for all slaves to finish, instead it has its own timeout at waiting. I think the problem is the masters used up their time at Completion stage to wait for the slaves, which means the earlier stages + completion stage used 8h, report stage used ~ 5 mins, gap between the finishing time of the old master and the starting time of the new master took ~5 mins. When every CQ master takes 8h + ~10 mins, we have the slipping canary builds. https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/stages/completion_stages.py?type=cs&q=GetTimeToDeadline&sq=package:%5Echromeos_(internal%7Cpublic)$&l=195
,
Apr 25 2017
nxia@ and I looked at this together. The root cause is something entirely different: https://uberchromegw.corp.google.com/i/chromeos/builders/master-release/builds/3042 For this build, the master took 8:14. - 18:40 (cbuildbot starts) - 18:59 (BuildStart stage starts, sets deadline to 7:50 for master) - 02:50 (CanaryCompletion bails after 7:50 as expected) - 02:54 (Build finishes after ~4 minutes in ReportStage as expected) The problem is that 19 minutes are spent _before_ BuildStart, and before we've had a chance to set our deadline. This is because the new bootstrap mechanism added by dgarrett@ syncs all of chromeos sources before cbuildbot even runs. This takes 19 minutes. We need some way to account for this time spent before cbuildbot starts in the deadline computation for cbuildbot.
,
Apr 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/e49f2aa32465d274fc145b1ba3c5df53be68a506 commit e49f2aa32465d274fc145b1ba3c5df53be68a506 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Apr 26 01:40:07 2017 cbuildbot: Set a minimum slave timeout. Before this CL, we added 20 seconds to the timeout requested for a slave build by the master build. This is only necessary if the requested timeout is too small, and causes problems when the timeout is large and the slave actually times out. Instead, we now only cap the minimum. BUG= chromium:711778 TEST=unittests Change-Id: Ibccee72c17a3aa4a79181560f188cf0dffa5724a Reviewed-on: https://chromium-review.googlesource.com/487101 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> [modify] https://crrev.com/e49f2aa32465d274fc145b1ba3c5df53be68a506/scripts/cbuildbot.py
,
Apr 26 2017
Since the change above, the canary builds have begun to trend earlier, meaning they're headed towards "back on schedule." At current rates, we'd get back on track somewhere around the weekend, I think. Apr 25 20:48 ?? success #3081 Build successful Apr 25 13:22 ?? success #3080 Build successful Apr 25 05:19 ?? success #3079 Build successful Apr 24 21:12 ?? success #3078 Build successful Apr 24 13:09 ?? success #3077 Build successful Apr 24 05:03 ?? success #3076 Build successful Apr 23 21:00 ?? success #3075 Build successful Apr 23 12:54 ?? success #3074 Build successful Apr 23 04:49 ?? success #3073 Build successful
,
Apr 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/125d4dc1d7a51e30f51fc9175a66d61629a54dfa commit 125d4dc1d7a51e30f51fc9175a66d61629a54dfa Author: Don Garrett <dgarrett@google.com> Date: Fri Apr 28 01:40:28 2017 cbuildbot_launch: Add buildbot stage tags. Wrap major tasks done by cbuildbot_launch with stage tags, so they will appear on the waterfall. Also, treat not specifying a branch as an explicit 'master' for cleanup / logging purposes. We have a few builders that explicitly set the branch 'master', and some that get master by not specifying a branch at all. I was hesitant to treat them identically, but it should be okay. BUG= chromium:711778 TEST=Unittests Change-Id: Ib068893fcd4211068a6448119aefbe72e5f06893 Reviewed-on: https://chromium-review.googlesource.com/487030 Tested-by: Don Garrett <dgarrett@chromium.org> Trybot-Ready: Don Garrett <dgarrett@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Commit-Queue: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/125d4dc1d7a51e30f51fc9175a66d61629a54dfa/scripts/cbuildbot_launch.py [modify] https://crrev.com/125d4dc1d7a51e30f51fc9175a66d61629a54dfa/scripts/cbuildbot_launch_unittest.py
,
Apr 28 2017
There are now waterfall stages breaking out performance for the major launcher tasks. However, those performance numbers are NOT gathered anywhere outside of the waterfall.
,
May 1 2017
... aaaand we're back on schedule: May 01 02:00 ?? success #3097 Build successful Apr 30 18:00 ?? success #3096 Build successful Apr 30 10:02 ?? success #3095 Build successful Apr 30 02:00 ?? success #3094 Build successful Apr 29 18:00 ?? success #3093 Build successful Apr 29 10:18 ?? success #3092 Build successful Apr 29 02:24 ?? success #3091 Build successful Apr 28 18:18 ?? success #3090 Build successful Apr 28 11:01 ?? success #3089 Build successful Apr 28 02:55 ?? success #3088 Build successful Apr 27 19:28 ?? success #3087 Build successful Apr 27 12:16 ?? success #3086 Build successful Apr 27 04:54 ?? success #3085 Build successful Apr 26 20:51 ?? success #3084 Build successful Apr 26 12:50 ?? success #3083 Build successful Apr 26 04:49 ?? success #3082 Build successful Apr 25 20:48 ?? success #3081 Build successful I don't see any build timeouts in recent runs either, so I'm declaring victory here. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by jrbarnette@chromium.org
, Apr 14 2017