
Issue 615569

Starred by 3 users

Issue metadata

Status: Duplicate
Owner: ----
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




swarming timeouts are confusing

Project Member Reported by akes...@chromium.org, May 27 2016

Issue description

The following snippet exists in cbuildbot/commands.py:

# pylint: disable=docstring-missing-args
def _CreateSwarmingArgs(build, suite, timeout_mins=None):
  """Create args for swarming client.

  Args:
    build: Name of the build, will be part of the swarming task name.
    suite: Name of the suite, will be part of the swarming task name.
    timeout_mins: run_suite timeout mins, will be used to figure out
                  timeouts for swarming task.

  Returns:
    A dictionary of args for swarming client.
  """

  swarming_timeout = timeout_mins or _DEFAULT_HWTEST_TIMEOUT_MINS
  swarming_timeout = swarming_timeout * 60 + _SWARMING_ADDITIONAL_TIMEOUT

  swarming_args = {
      'swarming_server': topology.topology.get(
          topology.SWARMING_PROXY_HOST_KEY),
      'task_name': '-'.join([build, suite]),
      'dimensions': [('os', 'Ubuntu-14.04'),
                     ('pool', 'default')],
      'print_status_updates': True,
      'timeout_secs': swarming_timeout,
      'io_timeout_secs': swarming_timeout,
      'hard_timeout_secs': swarming_timeout,
      'expiration_secs': _SWARMING_EXPIRATION}
  return swarming_args


What seems strange to me is "swarming_timeout = swarming_timeout * 60 + _SWARMING_ADDITIONAL_TIMEOUT". _SWARMING_ADDITIONAL_TIMEOUT looks like an extra hour, which we are adding to all three of timeout_secs, io_timeout_secs, and hard_timeout_secs. This seems wrong: shouldn't we be matching the timeout more aggressively to the timeout requested by the caller?
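
To make the arithmetic concrete, here is a minimal sketch that mirrors the two lines from the snippet. The constant values are assumptions for illustration only: the extra hour is my reading of _SWARMING_ADDITIONAL_TIMEOUT, and the default below is a hypothetical placeholder, not the real value from commands.py.

```python
# Assumed values for illustration; the real constants live in
# cbuildbot/commands.py and may differ.
_SWARMING_ADDITIONAL_TIMEOUT = 60 * 60  # assumed: one extra hour, in seconds
_DEFAULT_HWTEST_TIMEOUT_MINS = 120  # hypothetical placeholder


def effective_swarming_timeout_secs(timeout_mins=None):
  """Mirror the timeout arithmetic from _CreateSwarmingArgs."""
  timeout = timeout_mins or _DEFAULT_HWTEST_TIMEOUT_MINS
  return timeout * 60 + _SWARMING_ADDITIONAL_TIMEOUT


requested_mins = 30
effective = effective_swarming_timeout_secs(requested_mins)
print(effective)                        # 5400
print(effective - requested_mins * 60)  # 3600 seconds beyond the request
```

Under these assumed values, a caller asking for 30 minutes gets a 90 minute limit for all three swarming timeouts. Note also that because of the "or", a timeout_mins of 0 falls back to the default rather than meaning "no time at all".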
 
Also, I see in swarming_lib.py that we run the swarming command in a bare RunCommand. It looks like we might be leaving the timeout behavior up to swarming_client by passing timeout-related command line args to it.

Shouldn't we wrap that call inside a timeout_util.Timeout(...) context, so that we can give cbuildbot some control over killing it if it runs too long?
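
As a rough illustration of the idea (the caller enforcing its own deadline instead of trusting the child's flags), here is a minimal sketch. It uses the standard library's subprocess timeout as a stand-in, not chromite's RunCommand or timeout_util; run_with_deadline is a hypothetical helper name, and the sleeping child only simulates a hung swarming client.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_secs):
  """Run cmd, killing it if it outlives timeout_secs.

  Returns a (returncode, timed_out) tuple; returncode is None on timeout.
  """
  try:
    # subprocess.run kills the child and reaps it before raising.
    result = subprocess.run(cmd, timeout=timeout_secs, check=False)
  except subprocess.TimeoutExpired:
    return None, True
  return result.returncode, False


# A child that sleeps past the deadline stands in for a hung client.
hung_cmd = [sys.executable, '-c', 'import time; time.sleep(30)']
print(run_with_deadline(hung_cmd, timeout_secs=1))  # (None, True)
```

With something along these lines, the caller learns that the deadline was hit and can report a timeout of its own, rather than being killed by the build or stage timeout with no explanation in the logs.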

It looks to me like hangs here result in confusing logs when suites run too long and cause the swarming client to overrun the build or stage timeout (and get forcibly killed by its parent). See for instance https://uberchromegw.corp.google.com/i/chromeos_release/builders/strago-release-group%20release-R51-8172.B/builds/39/steps/HWTest%20%5Bceles%5D%20%5Bbvt-inline%5D/logs/stdio
Cc: dgarr...@chromium.org
Draft CL for changing timeout logic.
https://chromium-review.googlesource.com/#/c/348070/
Status: Assigned (was: Untriaged)
Labels: -current-issue
Cc: semenzato@chromium.org snanda@chromium.org
Cc: akes...@chromium.org
Labels: -Pri-2 Hotlist-Fixit Pri-1
Owner: ----
Status: Available (was: Assigned)
Summary: swarming timeouts are confusing (was: swarming_client timeout logic seems wrong)
I don't think this can wait for the next scheduled FixIt.

Many lab failures appear to surface as swarming errors, and that is confusing both the sheriffs and me.
falco_li canaries have been failing regularly with "not enough DUT" messages for many weeks now, but just recently they have started showing "swarming timeouts". Maybe it's a good board to use to identify where the regression occurred.
With Fang's help, I found the chromeos-proxy page for the failing swarming command in https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fwhirlwind-paladin%2F6716%2F%2B%2Frecipes%2Fsteps%2FHWTest__jetstream_cq_%2F0%2Fstdout:

https://chromeos-proxy.appspot.com/task?id=339c82b6d4477410&refresh=10&show_raw=1

I checked the status of several DUTs that this swarming command wants to schedule jobs on, such as chromeos4-row10-jetstream-host6/7/8. All of them have status Ready, but each has jobs queued. The last completed job was on 1.7. Is there something wrong with their shard (chromeos-server82.cbf), so that no job can actually be executed?
I *THINK* there are two different issues here.

One is that the shard/DUTs aren't working; the other is that the way we report errors seems to be broken.

This bug is about the reporting.  crbug.com/679410  is about the shard/duts.
So is this bug about the swarming timeout not being controlled by chromite and being too large (so it should be reduced), or is it about "don't report errors as swarming proxy errors when they are infra failures"?
The bug is that we see several canaries and some paladin runs reporting the swarming timeout instead of a more precise error message. This obscures the root cause of the failure and makes it harder to fix.
Mergedinto: 645259
Status: Duplicate (was: Available)
