Swarming does not respect build_timeout.
Issue description
https://ci.chromium.org/p/chromeos/builds/b8949387457722752368 was interrupted after 12 hrs. I have seen the same on other tryjobs as well. The build_timeout is specified as 18 hours in the config, and Swarming should use that as the timeout value. https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/chromeos_config.py?type=cs&q=%22chromiumos-sdk%22+package:%5Echromeos_public$&l=3965
,
Apr 13 2018
what is the impact? is this really only a P2?
,
Apr 13 2018
Some tryjobs complete in 12 hours, but some do not. Since it only impacts tryjobs, I am putting it as a P2 for now (I am assuming we are not blocked by this issue right now).
,
Apr 13 2018
Actually, the new tryjob system is killing the builds after a 24 hour timeout (which is a hard limit), but the UI is reporting the run time (http://b/77643541). One of the links in the top left of the build details jumps to the Milo build details page which has the correct times.
,
Apr 13 2018
So... your bug is correct about not respecting the timeouts, but not in the way you think. All builds are currently running with 24 hour timeouts. I should probably keep this bug and decide how to deal with that issue.
,
Apr 13 2018
First thought for a solution is this: remove build_timeout from chromeos_config, but add timeouts to the LUCI Builders when they are created. This means that timeouts will be controlled at the same level as build priority and pooling, which seems reasonable. That would leave all tryjobs with a 24 hour timeout. NOTE: I think that timeout is from scheduling of the build, not start of the build.
,
Apr 13 2018
PS: If I'm wrong and there are any builds that Milo shows as being killed sooner PLEASE point out the examples.
,
Apr 13 2018
https://ci.chromium.org/p/chromeos/builds/b8949387457722752368 shows a timeout after 12 hours. IIUC, this is the Milo link you are referring to?
Timing:
Create 2018-04-12 10:07 PM (PDT)
Start 2018-04-12 10:07 PM (PDT)
End 2018-04-13 10:07 AM (PDT)
Pending 723 ms
Execution 12 hrs
Same story with the job here (I started it last night, so it is definitely not 24 hours from job creation): https://ci.chromium.org/p/chromeos/builds/b8949384961709083280
Timing:
Create 2018-04-12 10:47 PM (PDT)
Start 2018-04-12 10:47 PM (PDT)
End 2018-04-13 10:47 AM (PDT)
Pending 1 secs
Execution 12 hrs
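The 12-hour figure can be double-checked from the Start and End timestamps quoted in this comment. This is a minimal sketch, assuming the timestamps are copied as strings in the format Milo displays (with the "(PDT)" suffix dropped); execution_duration is a hypothetical helper name, not part of any Chromium tool.

```python
from datetime import datetime, timedelta

# Format assumed from the timestamps quoted in this comment,
# with the "(PDT)" suffix removed before parsing.
MILO_FORMAT = "%Y-%m-%d %I:%M %p"


def execution_duration(start: str, end: str) -> timedelta:
    """Return the wall-clock time between two Milo-style timestamps."""
    started = datetime.strptime(start, MILO_FORMAT)
    ended = datetime.strptime(end, MILO_FORMAT)
    return ended - started


# Start/End values of the two builds quoted in this comment.
first = execution_duration("2018-04-12 10:07 PM", "2018-04-13 10:07 AM")
second = execution_duration("2018-04-12 10:47 PM", "2018-04-13 10:47 AM")
print(first, second)
```

Both builds ran in the same time zone within a single day, so naive (zone-less) datetimes are enough for this check.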
,
Apr 13 2018
Oh... interesting. I should have checked more carefully. I've launched chromiumos-sdk-tryjob recently and observed the 24 hour timeout, because of problems seen by Mike. I just assumed this was the same.
,
Apr 16 2018
Issue 833191 has been merged into this issue.
,
Apr 18 2018
I was able to reproduce a hang during an emerge with:
cros tryjob --local chromiumos-sdk-tryjob
Unfortunately, I did not record anything else about the hang.
,
Apr 19 2018
https://chrome-swarming.appspot.com/task?id=3cd6e50722e45b10&refresh=10&request_detail=true&show_raw=1&wide_logs=true Shows 12 hours "Execution timeout". So this is definitely a builder timeout setting issue.
,
Apr 19 2018
Note regarding #12, expand the more details field to see the "Execution timeout" value. Another currently running related task (https://chrome-swarming.appspot.com/task?id=3cf7aac760428510&refresh=10) has the same "Execution timeout" set to 12 hours.
,
Apr 19 2018
I believe that the timeout field is this one (execution_timeout_secs) and needs to be set to a higher value: https://cs.chromium.org/chromium/infra/luci/client/swarming.py?type=cs&q=%22execution_timeout_secs%22&sq=package:chromium&l=117
# See ../appengine/swarming/swarming_rpcs.py.
TaskProperties = collections.namedtuple(
    'TaskProperties', [
        'caches',
        'cipd_input',
        'command',
        'relative_cwd',
        'dimensions',
        'env',
        'env_prefixes',
        'execution_timeout_secs',
        'extra_args',
        'grace_period_secs',
        'idempotent',
        'inputs_ref',
        'io_timeout_secs',
        'outputs',
        'secret_bytes',
    ])
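To illustrate the field in question, here is a self-contained sketch that rebuilds the same namedtuple locally (it does not import the real swarming.py) and raises execution_timeout_secs from 12 to 18 hours. Every field value below is a placeholder invented for illustration, not taken from a real task request.

```python
import collections

# Local copy of the field list quoted above; NOT imported from swarming.py.
TaskProperties = collections.namedtuple(
    'TaskProperties', [
        'caches', 'cipd_input', 'command', 'relative_cwd', 'dimensions',
        'env', 'env_prefixes', 'execution_timeout_secs', 'extra_args',
        'grace_period_secs', 'idempotent', 'inputs_ref', 'io_timeout_secs',
        'outputs', 'secret_bytes',
    ])

HOUR = 60 * 60

# Placeholder values for illustration only.
props = TaskProperties(
    caches=[], cipd_input=None, command=['cbuildbot'], relative_cwd=None,
    dimensions={}, env={}, env_prefixes={},
    execution_timeout_secs=12 * HOUR,
    extra_args=[], grace_period_secs=30, idempotent=False, inputs_ref=None,
    io_timeout_secs=None, outputs=[], secret_bytes=None)

# namedtuples are immutable, so _replace returns a modified copy.
longer = props._replace(execution_timeout_secs=18 * HOUR)
print(props.execution_timeout_secs, longer.execution_timeout_secs)
```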
,
Apr 19 2018
Have to bump to P1 since we haven't had a fully working chromiumos-sdk tryjob for a week now.
,
Apr 23 2018
Are we certain that chromiumos-sdk-tryjob was working until a week or two ago? I did reproduce the hang on my workstation with a local tryjob.
,
Apr 23 2018
Passing tryjob before move to swarming (April 4): https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/etc/1551
Start 2018-04-04 4:00 PM (PDT)
End 2018-04-05 6:46 AM (PDT)
Pending N/A
Execution 14 hrs 46 mins
After move to swarming (April 5): https://ci.chromium.org/p/chromeos/builds/b8950050300777031680
Killed after 12 hours:
Create 2018-04-05 2:31 PM (PDT)
Start 2018-04-05 2:32 PM (PDT)
End 2018-04-06 2:32 AM (PDT)
Pending 395 ms
Execution 12 hrs
All other jobs are also getting killed at 12 hours at different locations. So it can't be a hang.
,
Apr 23 2018
Thanks!
,
Apr 24 2018
Any progress on this? We need it in the incoming toolchain upgrade, which is supposed to happen this week.
,
Apr 25 2018
Is there any possible way to work around this? I.e., any hack that would get me a successful build of chromiumos-sdk? For example, is it possible to use the old builder? IIUC this has been broken for 3 weeks and our tests rely on this. We really need help.
,
Apr 25 2018
Should it be a P0 now?
,
Apr 25 2018
I still believe this is at least partially misdiagnosed, but the 12h timeout is also something I can't yet explain. I'm running a local tryjob against chromiumos-sdk-tryjob (which has no external timeout) to see if it reproduces the hang I've seen before, and will record the details here if it does. I'll also investigate previous builds to try to get a better understanding of what timeouts are happening.
,
Apr 25 2018
Looking at this build in detail: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8949589205302240816
Its cbuildbot timeout is 18 hours, as expected:
"build_timeout": 64800,
Milo UI shows 12 hours + timeout error: https://ci.chromium.org/p/chromeos/builds/b8949589205302240816
Buildbucket details are here: https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8949589205302240816&_h=1&
Swarming Task details are here: https://chrome-swarming.appspot.com/task?id=3ccac08fe3793a10&refresh=10&request_detail=true&show_raw=1&wide_logs=true
I don't yet see where the 12 hours came from.
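The mismatch can be stated as simple arithmetic: when several layers each impose a limit, the shortest one fires first. This is a minimal sketch using the two values observed in this thread (64800 seconds from build_timeout, 12 hours from the Swarming task); the function and dictionary names are hypothetical.

```python
def first_limit_to_fire(limits_secs):
    """Return (name, seconds) of the shortest timeout among several layers."""
    name = min(limits_secs, key=limits_secs.get)
    return name, limits_secs[name]


limits = {
    # From the build config quoted above.
    "cbuildbot build_timeout": 64800,
    # As reported for the Swarming task in this thread.
    "swarming execution timeout": 12 * 60 * 60,
}

name, secs = first_limit_to_fire(limits)
print(name, secs / 3600)
```

With these inputs the Swarming limit wins, which matches the builds being killed at 12 hours even though cbuildbot was configured for 18.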
,
Apr 25 2018
Don, what is the timeout value you are setting for swarming tryjobs? As I stated in #13, the jobs have an execution timeout set to 12 hours. E.g., open https://chrome-swarming.appspot.com/task?id=3cf7aac760428510&refresh=10 and expand the more details field to see the 12 hours "Execution timeout" value. If you are setting 24 hours, I suspect swarming is enforcing an internal hard limit of 12 hours irrespective of the value passed to it.
,
Apr 25 2018
I found where it comes from. It's the Generic builder definition. I'm going to readjust the builder configurations, which I've meant to do for a while. Afterwards, I'll update the cros tryjob command to use the proper configs, which should help with this.
,
Apr 25 2018
I've put up this CL: https://crrev.com/i/614798 After it lands, I'll tweak cros tryjob to do the right thing. That should fix this problem.
,
Apr 25 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/manifest-internal/+/80ae6336d832f37181b6874fd8cb51e4223b4f8c commit 80ae6336d832f37181b6874fd8cb51e4223b4f8c Author: Don Garrett <dgarrett@google.com> Date: Wed Apr 25 21:12:21 2018
,
Apr 25 2018
,
Apr 25 2018
Running this chromiumos-sdk-tryjob that includes the above changes in its scheduling: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8948238308408745776
,
Apr 25 2018
thanks!
,
Apr 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9
commit ad0b2d2757e84c16bca4413f506dd8d9f8a123f9
Author: Don Garrett <dgarrett@google.com>
Date: Wed Apr 25 23:24:49 2018
config_lib: Add luci_builder build config value.
Define a new build config property that defines which LUCI Builder should be used for a given build config.
BUG= chromium:832747 chromium:824550
TEST=run_tests
Change-Id: Iee146c1338d94dce0856c1a36a4a31415bb42784
Reviewed-on: https://chromium-review.googlesource.com/1028849
Tested-by: Don Garrett <dgarrett@chromium.org>
Trybot-Ready: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/cbuildbot/config_dump.json
[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/lib/config_lib.py
[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/cbuildbot/chromeos_config.py
,
Apr 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/48a13aacf8809f30db9ad92c463d500df235ab77
commit 48a13aacf8809f30db9ad92c463d500df235ab77
Author: Don Garrett <dgarrett@google.com>
Date: Wed Apr 25 23:26:32 2018
remote_try: Use luci_builder build config value.
When scheduling a build via remote_try, use the luci_builder value from it's build config instead of "Generic". This allows timeouts and build priority to be correctly applied to the builds.
Use "Try" as the default for all unknown builds (ie: some branch builds).
Currently, this distinguishes between 'Try', 'PreCQ', and 'Prod'.
BUG= chromium:832747 chromium:824550
TEST=run_tests && lib/remote_try_unittest.py --network
Change-Id: I0a9029e1fb78057d2b91d41d75f87b36b2b1a833
Reviewed-on: https://chromium-review.googlesource.com/1028850
Tested-by: Don Garrett <dgarrett@chromium.org>
Trybot-Ready: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>
[modify] https://crrev.com/48a13aacf8809f30db9ad92c463d500df235ab77/lib/remote_try.py
[modify] https://crrev.com/48a13aacf8809f30db9ad92c463d500df235ab77/lib/remote_try_unittest.py
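The selection logic described in this commit message can be sketched roughly as follows. This is not the code from remote_try.py; the function name and the plain-dict build config are assumptions made for illustration. Only the luci_builder key, the "Try" default, and the builder names 'Try', 'PreCQ', and 'Prod' come from the commit messages.

```python
# Default named in the commit message for builds without a known config.
DEFAULT_LUCI_BUILDER = 'Try'


def pick_luci_builder(build_config):
    """Return the LUCI builder name for a build config.

    build_config is a plain dict in this sketch; None or an empty dict
    stands in for an unknown build (e.g. some branch builds).
    """
    if not build_config:
        return DEFAULT_LUCI_BUILDER
    return build_config.get('luci_builder', DEFAULT_LUCI_BUILDER)


# Hypothetical configs for illustration.
print(pick_luci_builder({'luci_builder': 'PreCQ'}))
print(pick_luci_builder({'name': 'some-branch-build'}))
print(pick_luci_builder(None))
```

Choosing the builder per config, rather than sending everything to one shared builder, is what lets each class of build carry its own timeout and priority.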
,
Apr 25 2018
This should now be fixed. Just do a "repo sync" before submitting any more tryjobs.
,
Apr 25 2018
,
Apr 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/60798588c4ed780d47333594ff5218bd93fc3516 commit 60798588c4ed780d47333594ff5218bd93fc3516 Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Date: Thu Apr 26 00:47:46 2018 Roll src/third_party/chromite/ 47b24cc1c..ad0b2d275 (1 commit) https://chromium.googlesource.com/chromiumos/chromite.git/+log/47b24cc1ce35..ad0b2d2757e8 $ git log 47b24cc1c..ad0b2d275 --date=short --no-merges --format='%ad %ae %s' 2018-04-25 dgarrett config_lib: Add luci_builder build config value. Created with: roll-dep src/third_party/chromite BUG= chromium:832747 The AutoRoll server is located here: https://chromite-chromium-roll.skia.org Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. TBR=chrome-os-gardeners@chromium.org Change-Id: Ida61a24a67a28c35a7f9faa2ab4262fafe948a40 Reviewed-on: https://chromium-review.googlesource.com/1029178 Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#553856} [modify] https://crrev.com/60798588c4ed780d47333594ff5218bd93fc3516/DEPS
,
Apr 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/14c132c20712673ffbe58d98d409b32d00b734c1 commit 14c132c20712673ffbe58d98d409b32d00b734c1 Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Date: Thu Apr 26 04:03:15 2018 Roll src/third_party/chromite/ ad0b2d275..48a13aacf (1 commit) https://chromium.googlesource.com/chromiumos/chromite.git/+log/ad0b2d2757e8..48a13aacf880 $ git log ad0b2d275..48a13aacf --date=short --no-merges --format='%ad %ae %s' 2018-04-25 dgarrett remote_try: Use luci_builder build config value. Created with: roll-dep src/third_party/chromite BUG= chromium:832747 The AutoRoll server is located here: https://chromite-chromium-roll.skia.org Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. TBR=chrome-os-gardeners@chromium.org Change-Id: I100607416aef078b66dbe44a837db19069ca5ecb Reviewed-on: https://chromium-review.googlesource.com/1029352 Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#553907} [modify] https://crrev.com/14c132c20712673ffbe58d98d409b32d00b734c1/DEPS
,
Apr 26 2018
I see a timeout of 23h:50m for new jobs. (Also started a repo sync on chrotomation2 so that new jobs see the higher timeout)
,
Apr 26 2018
I filed this after reproducing the local hang again. https://crbug.com/837330
,
May 1 2018
Comment 1 by manojgupta@chromium.org
, Apr 13 2018