The expiration seconds for a fallback job
Issue description

We now schedule a CQ job as a fallback job: https://chrome-swarming.appspot.com/task?id=3f76f4db054ca210&refresh=10

We set its expiration_secs to 90 minutes, since that is the timeout for the suite. However, this job waited until 3:36:35 to start instead of expiring; its pending time was 2h 18m 17s. I guess this is because I pass expiration_secs=5400 for each slice, so each slice gets its own 90-minute window.

The command to schedule this job:

"""
2018-08-21 13:18:17,719 INFO | Scheduling test logging_CrashSender
2018-08-21 13:18:17,721 INFO | RunCommand: /usr/local/google/home/chromeos-test/chromiumos/chromite/third_party/swarming.client/swarming.py post --auth-service-account-json /creds/skylab_swarming_bot/skylab_bot_service_account.json --swarming https://chrome-swarming.appspot.com tasks/new
2018-08-21 13:18:18,341 INFO | Input: {'priority': 50, 'parent_task_id': '3f76f45ee1eac911', 'task_slices': [{'expiration_secs': 5400, 'properties': {'execution_timeout_secs': 3600, 'io_timeout_secs': 3600, 'grace_period_secs': 3600, 'command': ['/opt/infra-tools/usr/bin/skylab_swarming_worker', '-client-test', '-keyvals', '{"suite": "bvt-inline", "parent_job_id": "3f76f45ee1eac911", "build": "reef-paladin/R70-10987.0.0-rc2", "experimental": "False", "builds": "{\\"cros-version\\": \\"reef-paladin/R70-10987.0.0-rc2\\"}"}', '-task-name', 'logging_CrashSender'], 'dimensions': [{'value': 'reef', 'key': 'label-board'}, {'value': 'DUT_POOL_CQ', 'key': 'label-pool'}, {'value': 'ChromeOSSkylab', 'key': 'pool'}, {'value': 'reef-paladin/R70-10987.0.0-rc2', 'key': 'provisionable-cros-version'}]}}, {'expiration_secs': 5400, 'properties': {'execution_timeout_secs': 3600, 'io_timeout_secs': 3600, 'grace_period_secs': 3600, 'command': ['/opt/infra-tools/usr/bin/skylab_swarming_worker', '-client-test', '-keyvals', '{"suite": "bvt-inline", "parent_job_id": "3f76f45ee1eac911", "build": "reef-paladin/R70-10987.0.0-rc2", "experimental": "False", "builds": "{\\"cros-version\\": \\"reef-paladin/R70-10987.0.0-rc2\\"}"}', '-task-name', 'logging_CrashSender', '-provision-labels', 'cros-version:reef-paladin/R70-10987.0.0-rc2'], 'dimensions': [{'value': 'reef', 'key': 'label-board'}, {'value': 'DUT_POOL_CQ', 'key': 'label-pool'}, {'value': 'ChromeOSSkylab', 'key': 'pool'}]}}], 'name': 'logging_CrashSender', 'tags': ['luci_project:chromiumos', 'build:reef-paladin/R70-10987.0.0-rc2', 'parent_task_id:3f76f45ee1eac911'], 'user': 'skylab_suite_runner'}
"""

Questions for @maruel:
1) To make the expiration_secs for the whole task 5400s, should I pass 'expiration_secs': 5400 at the task level?
2) Can I set a different expiration_secs for each slice?
3) What happens if the sum of the slices' expiration_secs is greater than the task's expiration_secs?

Question for pprabhu: For a normal job (including a CQ job), how long should it wait on the first slice (with provisionable-cros-version) versus the second slice?
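For readability, here is a trimmed sketch of the logged request body in Python (my own reconstruction, not the actual scheduling code): two task slices, each carrying its own expiration_secs, where only the first slice requires an already-provisioned DUT. Values are copied from the log; fields not relevant to expiration are elided. Note that with 5400s on each slice, the task can stay pending for up to 10800s (3h) in total, which is consistent with the observed 2h 18m 17s pending time.

"""
import json

# Shared slice properties, trimmed from the logged request; keyvals,
# grace_period_secs, tags, etc. are elided here.
base_properties = {
    'execution_timeout_secs': 3600,
    'io_timeout_secs': 3600,
    'command': ['/opt/infra-tools/usr/bin/skylab_swarming_worker',
                '-client-test', '-task-name', 'logging_CrashSender'],
    'dimensions': [
        {'key': 'label-board', 'value': 'reef'},
        {'key': 'label-pool', 'value': 'DUT_POOL_CQ'},
        {'key': 'pool', 'value': 'ChromeOSSkylab'},
    ],
}

# Slice 0 additionally requires a DUT that already has the right build;
# slice 1 accepts any matching DUT and lets the worker provision it.
provisioned_properties = dict(base_properties)
provisioned_properties['dimensions'] = base_properties['dimensions'] + [
    {'key': 'provisionable-cros-version',
     'value': 'reef-paladin/R70-10987.0.0-rc2'},
]

request = {
    'name': 'logging_CrashSender',
    'priority': 50,
    'task_slices': [
        # Each slice carries its own expiration_secs, so the waits add up.
        {'expiration_secs': 5400, 'properties': provisioned_properties},
        {'expiration_secs': 5400, 'properties': base_properties},
    ],
}
print(json.dumps(request, indent=2))
"""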
Aug 22
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/651694146aed03326b994abc2789f61283fbd241

commit 651694146aed03326b994abc2789f61283fbd241
Author: Xixuan Wu <xixuan@chromium.org>
Date: Wed Aug 22 18:36:37 2018

autotest: Set expiration_secs as the whole task's expiration_secs.

BUG=chromium:876579
TEST=Test it on staging.

Change-Id: I44094203dc514d959f1f56be31cd07fd2efc55e6
Reviewed-on: https://chromium-review.googlesource.com/1185374
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/swarming_lib_unittest.py
[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/suite_runner.py
[modify] https://crrev.com/651694146aed03326b994abc2789f61283fbd241/venv/skylab_suite/swarming_lib.py
Aug 22
After testing: I can't pass 'expiration_secs': 5400 at the task level, because the request is rejected:

"""
69879 2018-08-22 18:08:16.411 E: Request to https://chromium-swarm-dev.appspot.com/_ah/api/swarming/v1/tasks/new failed with HTTP status code 400: 400 Client Error: Bad Request for url: https://chromium-swarm-dev.appspot.com/_ah/api/swarming/v1/tasks/new - When using task_slices, do not specify a global expiration_secs
"""

So for now I just pass in a separate expiration_secs for each slice, as described in #2.
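To make the constraint concrete, a minimal sketch (assumed payload shapes, not the actual client code) of a rejected versus an accepted request body:

"""
# Rejected with HTTP 400: a global expiration_secs cannot be combined with
# task_slices ("When using task_slices, do not specify a global
# expiration_secs").
bad_request = {
    'name': 'logging_CrashSender',
    'expiration_secs': 5400,  # global value: rejected
    'task_slices': [],        # placeholder slices
}

# Accepted: the wait budget is expressed per slice, so the slices'
# expirations together form the task's overall expiration.
good_request = {
    'name': 'logging_CrashSender',
    'task_slices': [
        {'expiration_secs': 2700, 'properties': {}},  # placeholder properties
        {'expiration_secs': 2700, 'properties': {}},
    ],
}
"""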
Aug 23
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/26f2ace0ee533e10d5d695af1e7bd37d98ad8984

commit 26f2ace0ee533e10d5d695af1e7bd37d98ad8984
Author: Xixuan Wu <xixuan@chromium.org>
Date: Thu Aug 23 21:08:03 2018

autotest: Set the provision slice expiration as 85mins (longer).

BUG=chromium:876579
TEST=None

Change-Id: Ie9bf9067e73229abd2eb5b0bc763c948815e1ace
Reviewed-on: https://chromium-review.googlesource.com/1187137
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/26f2ace0ee533e10d5d695af1e7bd37d98ad8984/venv/skylab_suite/suite_runner.py
Aug 23
Based on the observation in Issue 877097, I should set the expiration seconds of the first slice much longer.

1) If I set the first slice's expiration short (5 mins vs 85 mins): imagine a task pending to be picked up. While it is pending, the first slice soon expires, so by the time the task is matched to a bot, only the second slice is left to match, and the matched bot will get provisioned again.

2) If I set the first slice's expiration long (85 mins vs 5 mins): while the task is pending, the first slice does not expire, so when the task is matched to a bot, the first slice is used. Since 'wait_for_capacity' defaults to False, if there are no available bots the second slice is used for matching. This sounds more reasonable.
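As a toy model of the behavior described above (my assumption about the matching order, not Swarming's actual scheduler code): slices are considered in order, each for its own expiration window, so a long pending time burns through the early slices first.

"""
def active_slice(pending_secs, slice_expirations):
    # Return the index of the slice a bot would be matched against after
    # pending_secs of waiting, or None if every slice already expired.
    deadline = 0
    for i, expiration_secs in enumerate(slice_expirations):
        deadline += expiration_secs
        if pending_secs < deadline:
            return i
    return None

# Short first slice (5 min): after 10 min of pending, only the fallback
# slice is left, so the matched bot gets provisioned again.
print(active_slice(600, [300, 5100]))   # -> 1
# Long first slice (85 min): the provision-matching slice is still active.
print(active_slice(600, [5100, 300]))   # -> 0
"""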
Aug 29
The current solution: for CQ jobs, the expiration seconds are split 95% / 5% between the first and second slice, for the reason given in #6. For non-CQ jobs, the split is 5% / 95%, because the first slice may expire very quickly when 'wait_for_capacity'=False, so the task quickly falls through to the second slice; if the second slice's expiration were short, these non-CQ jobs would expire soon after.
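A minimal sketch of that split (the helper name is mine; only the fractions come from this comment):

"""
def slice_expirations(total_timeout_secs, is_cq_job):
    # CQ jobs favor the provision-matching first slice (95%/5%);
    # non-CQ jobs favor the fallback second slice (5%/95%).
    first_fraction = 0.95 if is_cq_job else 0.05
    first = int(total_timeout_secs * first_fraction)
    return [first, total_timeout_secs - first]

print(slice_expirations(5400, is_cq_job=True))   # [5130, 270]
print(slice_expirations(5400, is_cq_job=False))  # [270, 5130]
"""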
Aug 29
As you correctly figured out. I don't think I have any AI (action item) here?
Aug 29
I'll take it for now. The quota scheduler or a similar project should take care of the shortcomings here; under the current suites model, I don't think we can do much better.