
Issue 876579


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----




The expiration seconds for a fallback job

Reported by xixuan@chromium.org (Project Member), Aug 22

Issue description

We now schedule the CQ job as a fallback job:
https://chrome-swarming.appspot.com/task?id=3f76f4db054ca210&refresh=10

We set its expiration_secs to 90 minutes, which is the timeout for the suite.

However, this job waited until 3:36:35 to start instead of expiring; the pending time was 2h 18m 17s.

I guess this is because I pass expiration_secs as 5400s for each slice individually.

""" The command to schedule this job:

2018-08-21 13:18:17,719 INFO | Scheduling test logging_CrashSender
2018-08-21 13:18:17,721 INFO | RunCommand: /usr/local/google/home/chromeos-test/chromiumos/chromite/third_party/swarming.client/swarming.py post --auth-service-account-json /creds/skylab_swarming_bot/skylab_bot_service_account.json --swarming https://chrome-swarming.appspot.com tasks/new
2018-08-21 13:18:18,341 INFO | Input: {'priority': 50, 'parent_task_id': '3f76f45ee1eac911', 'task_slices': [{'expiration_secs': 5400, 'properties': {'execution_timeout_secs': 3600, 'io_timeout_secs': 3600, 'grace_period_secs': 3600, 'command': ['/opt/infra-tools/usr/bin/skylab_swarming_worker', '-client-test', '-keyvals', '{"suite": "bvt-inline", "parent_job_id": "3f76f45ee1eac911", "build": "reef-paladin/R70-10987.0.0-rc2", "experimental": "False", "builds": "{\\"cros-version\\": \\"reef-paladin/R70-10987.0.0-rc2\\"}"}', '-task-name', 'logging_CrashSender'], 'dimensions': [{'value': 'reef', 'key': 'label-board'}, {'value': 'DUT_POOL_CQ', 'key': 'label-pool'}, {'value': 'ChromeOSSkylab', 'key': 'pool'}, {'value': 'reef-paladin/R70-10987.0.0-rc2', 'key': 'provisionable-cros-version'}]}}, {'expiration_secs': 5400, 'properties': {'execution_timeout_secs': 3600, 'io_timeout_secs': 3600, 'grace_period_secs': 3600, 'command': ['/opt/infra-tools/usr/bin/skylab_swarming_worker', '-client-test', '-keyvals', '{"suite": "bvt-inline", "parent_job_id": "3f76f45ee1eac911", "build": "reef-paladin/R70-10987.0.0-rc2", "experimental": "False", "builds": "{\\"cros-version\\": \\"reef-paladin/R70-10987.0.0-rc2\\"}"}', '-task-name', 'logging_CrashSender', '-provision-labels', 'cros-version:reef-paladin/R70-10987.0.0-rc2'], 'dimensions': [{'value': 'reef', 'key': 'label-board'}, {'value': 'DUT_POOL_CQ', 'key': 'label-pool'}, {'value': 'ChromeOSSkylab', 'key': 'pool'}]}}], 'name': 'logging_CrashSender', 'tags': ['luci_project:chromiumos', 'build:reef-paladin/R70-10987.0.0-rc2', 'parent_task_id:3f76f45ee1eac911'], 'user': 'skylab_suite_runner'}
"""

Q for @maruel:
1) To make the expiration_secs for the whole task 5400s, should I pass "'expiration_secs': 5400" at the task level?
2) Can I set a different expiration_secs for each slice?
3) What happens if the sum of the slices' expiration_secs is greater than the task's expiration_secs?

Q for @pprabhu: For a normal job (including CQ jobs), how long do we want it to wait on the first slice (with provisionable-cros-version) and on the second slice?
 
For now, I believe we don't want to wait on the first slice for too long: if no DUTs are available (every DUT with the correctly provisioned build is busy), we want to grab other DUTs to run first. So I'm going to set the first slice's expiration_secs to under 5 minutes to relieve reef-paladin and see whether it goes better.

Comment 2 by bugdroid1@chromium.org (Project Member), Aug 22

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/651694146aed03326b994abc2789f61283fbd241

commit 651694146aed03326b994abc2789f61283fbd241
Author: Xixuan Wu <xixuan@chromium.org>
Date: Wed Aug 22 18:36:37 2018

autotest: Set expiration_secs as the whole task's expiration_secs.

BUG=chromium:876579
TEST=Test it on staging.

Change-Id: I44094203dc514d959f1f56be31cd07fd2efc55e6
Reviewed-on: https://chromium-review.googlesource.com/1185374
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/swarming_lib_unittest.py
[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/suite_runner.py
[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/swarming_lib.py

After testing, I can't pass "'expiration_secs': 5400" at the task level, as it reports:

69879 2018-08-22 18:08:16.411 E: Request to https://chromium-swarm-dev.appspot.com/_ah/api/swarming/v1/tasks/new failed with HTTP status code 400: 400 Client Error: Bad Request for url: https://chromium-swarm-dev.appspot.com/_ah/api/swarming/v1/tasks/new - When using task_slices, do not specify a global expiration_secs

So for now I just pass in a different expiration_secs for each slice, as done in #2.
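The 400 error above encodes a hard rule: when task_slices is present, the request must not also carry a global expiration_secs, so the total pending budget has to be expressed per slice. A hypothetical client-side validator sketch (not part of swarming.py) that captures the rule:

```python
def validate_request(request):
    """Reject the combination Swarming's 400 error above complains about.

    Hypothetical helper, not part of swarming.py; it just encodes the rule:
    when using task_slices, do not specify a global expiration_secs.
    """
    if 'task_slices' in request and 'expiration_secs' in request:
        raise ValueError(
            'When using task_slices, do not specify a global expiration_secs')
    for s in request.get('task_slices', []):
        if 'expiration_secs' not in s:
            raise ValueError('each task slice needs its own expiration_secs')

# A request shaped like the one in this bug passes:
ok = {'task_slices': [{'expiration_secs': 5400}, {'expiration_secs': 5400}]}
validate_request(ok)

# Adding a task-level expiration_secs alongside slices is what triggered
# the HTTP 400 above:
bad = dict(ok, expiration_secs=5400)
try:
    validate_request(bad)
except ValueError as e:
    print(e)
```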
Cc: ayatane@chromium.org
Issue 877097 has been merged into this issue.

Comment 5 by bugdroid1@chromium.org (Project Member), Aug 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/26f2ace0ee533e10d5d695af1e7bd37d98ad8984

commit 26f2ace0ee533e10d5d695af1e7bd37d98ad8984
Author: Xixuan Wu <xixuan@chromium.org>
Date: Thu Aug 23 21:08:03 2018

autotest: Set the provision slice expiration as 85mins (longer).

BUG=chromium:876579
TEST=None

Change-Id: Ie9bf9067e73229abd2eb5b0bc763c948815e1ace
Reviewed-on: https://chromium-review.googlesource.com/1187137
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/26f2ace0ee533e10d5d695af1e7bd37d98ad8984/venv/skylab_suite/suite_runner.py

Based on the observation in Issue 877097, I should set the expiration seconds of the first slice much longer.

1) If I set the first slice's expiration seconds short (5 mins vs. 85 mins):

    Imagine a task is pending pickup; while it is pending, the first slice soon expires.
    So when the task is finally matched to a bot, the first slice has already expired, the second slice is matched instead, and the matched bot gets provisioned again.

2) If I set the first slice's expiration seconds long (85 mins vs. 5 mins):

    Imagine a task is pending pickup; while it is pending, the first slice won't expire.
    So when the task is matched to a bot, the first slice is used. Since 'wait_for_capacity' is False by default, if no bots are available the second slice will be used for matching. This sounds more reasonable.
The current solution: for CQ jobs, the expiration seconds are split 95% / 5% between the two slices.

The reason is in #6.

For non-CQ jobs, the expiration seconds are split 5% / 95%, as the first slice may expire very quickly given 'wait_for_capacity'=False, so the task quickly jumps to the second slice. If the second slice's expiration seconds were short, these non-CQ jobs would soon expire.
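The split described above can be sketched as a small helper (hypothetical function name; the real logic lives in venv/skylab_suite/suite_runner.py):

```python
def slice_expirations(total_secs, is_cq_job):
    """Split a suite's total expiration budget across the two task slices.

    Hypothetical sketch of the scheme described above: CQ jobs give the
    provisioned-DUT slice 95% of the budget; non-CQ jobs give it only 5%,
    because with wait_for_capacity=False it expires quickly anyway.
    """
    first_fraction = 0.95 if is_cq_job else 0.05
    first = round(total_secs * first_fraction)
    # The remainder goes to the fallback slice, so the two always sum
    # to the suite's total expiration budget.
    return [first, total_secs - first]

# For the 90-minute (5400s) suite timeout in this bug:
print(slice_expirations(5400, is_cq_job=True))   # CQ: [5130, 270]
print(slice_expirations(5400, is_cq_job=False))  # non-CQ: [270, 5130]
```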
As you correctly figured out. I don't think I have any action items here?
Owner: ayatane@chromium.org
I'll take it for now. The quota scheduler or a similar project should take care of the shortcomings here; under the current suites model I don't think we can do much better.
