Issue 832747

Issue metadata

Status: Verified
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Chrome
Pri: 1
Type: Bug
Build-Toolchain

Blocking:
issue 834078



Swarming does not respect build_timeout.

Project Member Reported by manojgupta@chromium.org, Apr 13 2018

Issue description

https://ci.chromium.org/p/chromeos/builds/b8949387457722752368
was interrupted after 12 hrs. I have seen the same on other tryjobs as well.

The build_timeout is specified as 18 hours in the config. Swarming should use that for the timeout value.

https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/chromeos_config.py?type=cs&q=%22chromiumos-sdk%22+package:%5Echromeos_public$&l=3965
 
Components: Infra>Client>ChromeOS
What is the impact? Is this really only a P2?
Some tryjobs complete within 12 hours, but some do not. Since it only impacts tryjobs, I am putting it as a P2 for now (I am assuming we are not blocked by this issue right now).
Actually, the new tryjob system is killing the builds after a 24 hour timeout (which is a hard limit), but the UI is reporting the run time (http://b/77643541).

One of the links in the top left of the build details jumps to the Milo build details page which has the correct times.
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Labels: Swarming OS-Linux
Status: Assigned (was: Untriaged)
Summary: Swarming does not respect build_timeout. (was: Swarming does not respect build_timeout for chromiumos-sdk tryjobs)
So... your bug is correct about not respecting the timeouts, but not in the way you think. All builds are currently running with 24 hour timeouts.

I should probably keep this bug and decide how to deal with that issue.

First thought for a solution is this....

Remove build_timeout from chromeos_config, but add timeouts to the LUCI Builders when they are created. This means that timeouts will be controlled at the same level as build priority and pooling, which seems reasonable.

That would leave all tryjobs with a 24 hour timeout.

NOTE: I think that timeout is from scheduling of the build, not start of the build.
PS: If I'm wrong and there are any builds that Milo shows as being killed sooner PLEASE point out the examples.
https://ci.chromium.org/p/chromeos/builds/b8949387457722752368 shows a timeout after 12 hours. IIUC, this is the milo link you are referring to?

Timing:
Create	2018-04-12 10:07 PM (PDT)
Start	2018-04-12 10:07 PM (PDT)
End	2018-04-13 10:07 AM (PDT)
Pending	723 ms
Execution	12 hrs

Same story with the job here (I started it last night, so it is definitely not 24 hours from job creation):
https://ci.chromium.org/p/chromeos/builds/b8949384961709083280

Timing:
Create	2018-04-12 10:47 PM (PDT)
Start	2018-04-12 10:47 PM (PDT)
End	2018-04-13 10:47 AM (PDT)
Pending	1 secs
Execution	12 hrs
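
As a sanity check on the two Timing tables above, the execution time can be recomputed from the Start and End stamps. This is only a minimal sketch: the helper name execution_hours is made up for illustration, and the timestamps are copied from the tables (both are PDT, so no timezone handling is attempted).

```python
from datetime import datetime

# Timestamp layout used in the Milo "Timing" tables, e.g. "2018-04-12 10:07 PM".
FMT = "%Y-%m-%d %I:%M %p"


def execution_hours(start, end):
    """Return the elapsed time between two Milo-style timestamps, in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


# (Start, End) pairs copied from the two builds above.
builds = {
    "b8949387457722752368": ("2018-04-12 10:07 PM", "2018-04-13 10:07 AM"),
    "b8949384961709083280": ("2018-04-12 10:47 PM", "2018-04-13 10:47 AM"),
}

for build_id, (start, end) in builds.items():
    print(build_id, execution_hours(start, end))
```

Both builds come out at 12 hours from Start to End, which matches the "Execution 12 hrs" rows and is well short of 24 hours.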
Cc: mikenichols@chromium.org
Oh... interesting. I should have checked more carefully.

I've launched chromiumos-sdk-tryjob recently and observed the 24 hour timeout, because of problems seen by Mike. I just assumed this was the same.

Cc: akes...@chromium.org dgarr...@chromium.org manojgupta@chromium.org
Issue 833191 has been merged into this issue.
I was able to reproduce a hang during an emerge with:

  cros tryjob --local chromiumos-sdk-tryjob

Unfortunately, I did not record anything else about the hang.

https://chrome-swarming.appspot.com/task?id=3cd6e50722e45b10&refresh=10&request_detail=true&show_raw=1&wide_logs=true

Shows 12 hours "Execution timeout". So this is definitely a builder timeout setting issue.
Note regarding #12, expand the more details field to see the "Execution timeout" value.

Another currently running related task (https://chrome-swarming.appspot.com/task?id=3cf7aac760428510&refresh=10) has the same "Execution timeout" set to 12 hours.


I believe that the timeout field is this one (execution_timeout_secs) and needs to be set to a higher value.
https://cs.chromium.org/chromium/infra/luci/client/swarming.py?type=cs&q=%22execution_timeout_secs%22&sq=package:chromium&l=117:

# See ../appengine/swarming/swarming_rpcs.py.
TaskProperties = collections.namedtuple(
    'TaskProperties',
    [
      'caches',
      'cipd_input',
      'command',
      'relative_cwd',
      'dimensions',
      'env',
      'env_prefixes',
      'execution_timeout_secs',
      'extra_args',
      'grace_period_secs',
      'idempotent',
      'inputs_ref',
      'io_timeout_secs',
      'outputs',
      'secret_bytes',
    ])
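
To make the numbers concrete, here is a small sketch of the difference between the observed and the requested timeout. It uses a trimmed-down stand-in for the tuple quoted above (TrimmedTaskProperties is hypothetical, not the real swarming client type, and keeps only two of the listed fields). It assumes, as the field name suggests, that execution_timeout_secs is expressed in seconds; the command value is a placeholder.

```python
import collections

# Hypothetical, trimmed-down stand-in for the TaskProperties tuple quoted above.
TrimmedTaskProperties = collections.namedtuple(
    'TrimmedTaskProperties',
    ['command', 'execution_timeout_secs'])

HOUR_SECS = 60 * 60

# What the swarming task pages in this bug show: a 12 hour execution timeout.
observed = TrimmedTaskProperties(
    command=['placeholder-command'],
    execution_timeout_secs=12 * HOUR_SECS)

# What build_timeout in chromeos_config asks for: 18 hours.
wanted = observed._replace(execution_timeout_secs=18 * HOUR_SECS)

print(observed.execution_timeout_secs)  # 43200
print(wanted.execution_timeout_secs)    # 64800
```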
Labels: -Pri-2 Pri-1
Have to bump to P1 since we haven't had a fully working chromiumos-sdk tryjob for a week now.
Are we certain that chromiumos-sdk-tryjob was working before a week or two ago?

I did reproduce the hang on my workstation with a local tryjob.
Passing tryjob before move to swarming (April 4) : https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/etc/1551

Start	2018-04-04 4:00 PM (PDT)
End	2018-04-05 6:46 AM (PDT)
Pending	N/A
Execution	14 hrs 46 mins

After move to swarming (April 5):
https://ci.chromium.org/p/chromeos/builds/b8950050300777031680
Killed after 12 hours:

Create	2018-04-05 2:31 PM (PDT)
Start	2018-04-05 2:32 PM (PDT)
End	2018-04-06 2:32 AM (PDT)
Pending	395 ms
Execution	12 hrs

All other jobs are also getting killed at 12 hours, each at a different point in the build, so it can't be a hang.
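
The comparison above can be spelled out as a quick check of the passing run's duration against the timeout values mentioned in this bug. This is an illustrative sketch only; fits_within is a made-up helper, and the durations are taken from the comments in this thread.

```python
from datetime import timedelta


def fits_within(duration, timeout):
    """True if a build of the given duration finishes before the timeout."""
    return duration <= timeout


# Execution time of the passing pre-swarming run (April 4), from the table above.
passing_run = timedelta(hours=14, minutes=46)

# Timeout values mentioned in this bug.
timeouts = {
    "12h (observed swarming Execution timeout)": timedelta(hours=12),
    "18h (build_timeout in chromeos_config)": timedelta(hours=18),
    "24h (hard limit mentioned earlier)": timedelta(hours=24),
}

for label, timeout in timeouts.items():
    verdict = "finishes" if fits_within(passing_run, timeout) else "killed early"
    print(label, "->", verdict)
```

A run of that length only gets cut short under the 12 hour value, which is consistent with a timeout setting rather than a hang.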
Thanks!
Any progress on this? We need it in the incoming toolchain upgrade, which is supposed to happen this week. 
Is there any way to work around this, i.e., any hack that would get me a successful build of chromiumos-sdk? For example, is it possible to use the old builder?

IIUC this has been broken for 3 weeks and our tests rely on this. We really need help.
Labels: -Pri-1 Pri-0
Should it be a P0 now?
I still believe this is at least partially misdiagnosed, but the 12h timeout is also something I can't yet explain.

I'm running a local tryjob against chromiumos-sdk-tryjob (which has no external timeout) to see if it reproduces the hang I've seen before, and will record the details here if it does.

I'll also investigate previous builds to try to get a better understanding of what timeouts are happening.
Don, what is the timeout value you are setting for swarming tryjobs?

As I stated in #13, the jobs have an execution timeout set to 12 hours.
e.g. Open https://chrome-swarming.appspot.com/task?id=3cf7aac760428510&refresh=10
and expand the more details field to see the 12 hours "Execution timeout" value.

If you are setting 24 hours, I suspect swarming is enforcing an internal hard limit of 12 hours irrespective of value passed to it.
I found where it comes from. It's the Generic builder definition.

I'm going to readjust the builder configurations, which I've meant to do for a while. Afterwards, I'll update the cros tryjob command to use the proper configs, which should help with this.
Status: Started (was: Assigned)
I've put up this CL:

https://crrev.com/i/614798

After it lands, I'll tweak cros tryjob to do the right thing. That should fix this problem.
Project Member

Comment 27 by bugdroid1@chromium.org, Apr 25 2018

Labels: merge-merged-config
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/manifest-internal/+/80ae6336d832f37181b6874fd8cb51e4223b4f8c

commit 80ae6336d832f37181b6874fd8cb51e4223b4f8c
Author: Don Garrett <dgarrett@google.com>
Date: Wed Apr 25 21:12:21 2018

Labels: -Pri-0 Pri-1
Tryjob tweaks:
  https://crrev.com/c/1028849/1
  https://crrev.com/c/1028850/1

Running this chromiumos-sdk-tryjob that includes the above changes in its scheduling:

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8948238308408745776
thanks!
Project Member

Comment 31 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9

commit ad0b2d2757e84c16bca4413f506dd8d9f8a123f9
Author: Don Garrett <dgarrett@google.com>
Date: Wed Apr 25 23:24:49 2018

config_lib: Add luci_builder build config value.

Define a new build config property that defines which LUCI Builder
should be used for a given build config.

BUG= chromium:832747 
    chromium:824550
TEST=run_tests

Change-Id: Iee146c1338d94dce0856c1a36a4a31415bb42784
Reviewed-on: https://chromium-review.googlesource.com/1028849
Tested-by: Don Garrett <dgarrett@chromium.org>
Trybot-Ready: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>

[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/cbuildbot/config_dump.json
[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/lib/config_lib.py
[modify] https://crrev.com/ad0b2d2757e84c16bca4413f506dd8d9f8a123f9/cbuildbot/chromeos_config.py

Project Member

Comment 32 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/48a13aacf8809f30db9ad92c463d500df235ab77

commit 48a13aacf8809f30db9ad92c463d500df235ab77
Author: Don Garrett <dgarrett@google.com>
Date: Wed Apr 25 23:26:32 2018

remote_try: Use luci_builder build config value.

When scheduling a build via remote_try, use the luci_builder value
from it's build config instead of "Generic". This allows timeouts and
build priority to be correctly applied to the builds.

Use "Try" as the default for all unknown builds (ie: some branch
builds).

Currently, this distinguishes between 'Try', 'PreCQ', and 'Prod'.

BUG= chromium:832747 
    chromium:824550
TEST=run_tests  && lib/remote_try_unittest.py --network

Change-Id: I0a9029e1fb78057d2b91d41d75f87b36b2b1a833
Reviewed-on: https://chromium-review.googlesource.com/1028850
Tested-by: Don Garrett <dgarrett@chromium.org>
Trybot-Ready: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/48a13aacf8809f30db9ad92c463d500df235ab77/lib/remote_try.py
[modify] https://crrev.com/48a13aacf8809f30db9ad92c463d500df235ab77/lib/remote_try_unittest.py
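
The behaviour described in the remote_try commit message above can be sketched roughly as follows. This is not the code from lib/remote_try.py: the function name pick_luci_builder, the dict-shaped build configs, and the example config contents are assumptions made for illustration. Only the luci_builder key, the 'Try' default, and the builder names 'Try', 'PreCQ', and 'Prod' come from the commit message.

```python
# Default named in the commit message for unknown builds.
DEFAULT_LUCI_BUILDER = 'Try'


def pick_luci_builder(build_config):
    """Return the LUCI Builder to schedule a build on.

    build_config is a plain dict here (an assumption); None stands for an
    unknown build, e.g. some branch builds.
    """
    if not build_config:
        return DEFAULT_LUCI_BUILDER
    return build_config.get('luci_builder', DEFAULT_LUCI_BUILDER)


# Hypothetical configs, for illustration only.
print(pick_luci_builder({'name': 'example-pre-cq', 'luci_builder': 'PreCQ'}))
print(pick_luci_builder({'name': 'example-without-builder'}))
print(pick_luci_builder(None))
```

The point of routing through a per-config builder instead of the single "Generic" one is that each LUCI Builder carries its own timeout and priority settings, so the 12 hour value is no longer applied to every tryjob.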

This should now be fixed.

Just do a "repo sync" before submitting any more tryjobs.
Status: Fixed (was: Started)
Project Member

Comment 35 by bugdroid1@chromium.org, Apr 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/60798588c4ed780d47333594ff5218bd93fc3516

commit 60798588c4ed780d47333594ff5218bd93fc3516
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Thu Apr 26 00:47:46 2018

Roll src/third_party/chromite/ 47b24cc1c..ad0b2d275 (1 commit)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/47b24cc1ce35..ad0b2d2757e8

$ git log 47b24cc1c..ad0b2d275 --date=short --no-merges --format='%ad %ae %s'
2018-04-25 dgarrett config_lib: Add luci_builder build config value.

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:832747 


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: Ida61a24a67a28c35a7f9faa2ab4262fafe948a40
Reviewed-on: https://chromium-review.googlesource.com/1029178
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#553856}
[modify] https://crrev.com/60798588c4ed780d47333594ff5218bd93fc3516/DEPS

Project Member

Comment 36 by bugdroid1@chromium.org, Apr 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/14c132c20712673ffbe58d98d409b32d00b734c1

commit 14c132c20712673ffbe58d98d409b32d00b734c1
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Thu Apr 26 04:03:15 2018

Roll src/third_party/chromite/ ad0b2d275..48a13aacf (1 commit)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/ad0b2d2757e8..48a13aacf880

$ git log ad0b2d275..48a13aacf --date=short --no-merges --format='%ad %ae %s'
2018-04-25 dgarrett remote_try: Use luci_builder build config value.

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:832747 



TBR=chrome-os-gardeners@chromium.org

Change-Id: I100607416aef078b66dbe44a837db19069ca5ecb
Reviewed-on: https://chromium-review.googlesource.com/1029352
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#553907}
[modify] https://crrev.com/14c132c20712673ffbe58d98d409b32d00b734c1/DEPS

Status: Verified (was: Fixed)
I see a timeout of 23h:50m for new jobs.

(Also started a repo sync on chrotomation2 so that new jobs see the higher timeout)
I filed this after reproducing the local hang again.

 https://crbug.com/837330 
Blocking: 834078
