swarming timeouts are confusing
Issue description
The following snippet exists in cbuildbot/commands.py:
# pylint: disable=docstring-missing-args
def _CreateSwarmingArgs(build, suite, timeout_mins=None):
  """Create args for swarming client.

  Args:
    build: Name of the build, will be part of the swarming task name.
    suite: Name of the suite, will be part of the swarming task name.
    timeout_mins: run_suite timeout mins, will be used to figure out
      timeouts for swarming task.

  Returns:
    A dictionary of args for swarming client.
  """
  # Convert the requested timeout to seconds and add a fixed buffer; the
  # same padded value is then used for timeout_secs, io_timeout_secs and
  # hard_timeout_secs below.
  swarming_timeout = timeout_mins or _DEFAULT_HWTEST_TIMEOUT_MINS
  swarming_timeout = swarming_timeout * 60 + _SWARMING_ADDITIONAL_TIMEOUT
  swarming_args = {
      'swarming_server': topology.topology.get(
          topology.SWARMING_PROXY_HOST_KEY),
      'task_name': '-'.join([build, suite]),
      'dimensions': [('os', 'Ubuntu-14.04'),
                     ('pool', 'default')],
      'print_status_updates': True,
      'timeout_secs': swarming_timeout,
      'io_timeout_secs': swarming_timeout,
      'hard_timeout_secs': swarming_timeout,
      'expiration_secs': _SWARMING_EXPIRATION}
  return swarming_args
What seems strange to me is "swarming_timeout = swarming_timeout * 60 + _SWARMING_ADDITIONAL_TIMEOUT". _SWARMING_ADDITIONAL_TIMEOUT looks like an extra hour that we are adding to all of timeout_secs, io_timeout_secs, and hard_timeout_secs. This seems wrong -- shouldn't we be matching the timeout more aggressively to the timeout requested by the caller?
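For illustration, here is a minimal sketch of what I mean (it reuses the constants from the snippet above; the helper name _CreateSwarmingTimeouts is made up and this is not the change in the draft CL): the soft timeouts track the caller's request directly, and only the hard timeout keeps the extra buffer.

def _CreateSwarmingTimeouts(timeout_mins=None):
  """Sketch only: derive swarming timeouts from the caller's request.

  timeout_secs / io_timeout_secs follow the requested timeout directly,
  and only hard_timeout_secs is padded with the extra buffer.
  """
  requested_secs = (timeout_mins or _DEFAULT_HWTEST_TIMEOUT_MINS) * 60
  return {
      'timeout_secs': requested_secs,
      'io_timeout_secs': requested_secs,
      'hard_timeout_secs': requested_secs + _SWARMING_ADDITIONAL_TIMEOUT,
  }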
May 27 2016
May 31 2016
Draft CL for changing the timeout logic: https://chromium-review.googlesource.com/#/c/348070/
May 31 2016
Jun 2 2016
Jul 14 2016
Jan 7 2017
Jan 9 2017
I don't think this can wait for the next scheduled FixIt. Many lab failures appear to surface as swarming errors, which is confusing both the sheriffs and me.
Jan 9 2017
falco_li canaries have been failing regularly with "not enough DUT" messages for many weeks now, but just recently they have started showing "swarming timeouts". Maybe it's a good board to use to identify where the regression occurred.
Jan 9 2017
With Fang's help, I found the chromeos-proxy page for the failing swarming command in https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fwhirlwind-paladin%2F6716%2F%2B%2Frecipes%2Fsteps%2FHWTest__jetstream_cq_%2F0%2Fstdout: https://chromeos-proxy.appspot.com/task?id=339c82b6d4477410&refresh=10&show_raw=1. I checked the DUT status of several of the DUTs this swarming command wants to schedule jobs on, such as chromeos4-row10-jetstream-host6/7/8: all of them are in the Ready state but have jobs queued against them, and the last job they completed was on Jan 7. Is there something wrong with their shard (chromeos-server82.cbf) that is preventing any jobs from actually being executed?
Jan 9 2017
I *THINK* there are two different issues here. One is that the shard/DUTs aren't working; the other is that the way we report errors seems to be broken. This bug is about the reporting. crbug.com/679410 is about the shard/DUTs.
Jan 9 2017
So is this bug about the swarming timeout not being controlled by chromite and also being too large (and needing to be reduced), or is it about "don't report errors as swarming proxy errors when the real failure is an infra failure"?
Jan 10 2017
The bug is that we see several canaries and some paladin runs reporting the swarming timeout instead of a more precise error message. This obscures the root cause of the failure and makes it harder to fix.
Feb 1 2017