Issue 730700

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 730729
Owner: shuqianz@chromium.org
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: ----
Type: ----

Blocked on:
issue 730729




swarming timeout : _ShutDownException: (15, 'Received signal 15; shutting down')

Project Member Reported by bleung@google.com, Jun 7 2017

Issue description

Cc: mcchou@chromium.org dgarr...@chromium.org bleung@chromium.org
Components: Infra>Client>ChromeOS
It seems that we are killing these swarming requests because they produced no output for long enough to hit the silent-output timeout.

BackgroundFailure: <class 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds

I.e., the request hung with no output for 2.4 hours (8640 seconds).

I don't yet know whether this was a swarming service hang, or whether it happened on the swarm bot (and/or whatever the swarm bot was talking to).

Cc: xixuan@chromium.org
This is the command that hung, and so far it looks like it generated no output.

06:21:25: INFO: RunCommand: /b/cbuild/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmp_jyTdQ/tmp7DrBWH/temp_summary.json --raw-cmd --task-name veyron_minnie-paladin/R61-9626.0.0-rc2-arc-bvt-cq --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:arc-bvt-cq' '--tags=build:veyron_minnie-paladin/R61-9626.0.0-rc2' '--tags=task_name:veyron_minnie-paladin/R61-9626.0.0-rc2-arc-bvt-cq' '--tags=board:veyron_minnie' -- /usr/local/autotest/site_utils/run_suite.py --build veyron_minnie-paladin/R61-9626.0.0-rc2 --board veyron_minnie --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority Build --timeout_mins 180 --retry True --max_retries 5 --offload_failures_only False --suite_args "{'attr_filter': u'(suite:arc-bvt-cq) and (subsystem:default)'}" --job_keyvals "{'datastore_parent_key': ('Build', 1577657, 'BuildStage', 46846659L)}" -m 122017797

Follow-up work to mitigate future issues:

We should put a hard timeout (5 minutes?) around the swarming request so that we can at least report these errors in a semi-understandable way.
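A minimal Python 3 sketch of what such a wrapper could look like (purely illustrative, not chromite's actual code; the function name, timeout value, and error message are assumptions):

# Illustrative only: wrap the swarming.py invocation in an explicit timeout so a
# hang surfaces as a descriptive error instead of a silent ProcessSilentTimeout kill.
import subprocess

SWARMING_TIMEOUT_SECS = 5 * 60  # the "5 minutes?" floated above

def run_swarming(cmd):
    """Run the swarming.py command line, raising a clear error on timeout."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=SWARMING_TIMEOUT_SECS, check=True)
    except subprocess.TimeoutExpired:
        raise RuntimeError('swarming.py produced no result within %d seconds; '
                           'the request probably hung on the proxy or the swarm bot'
                           % SWARMING_TIMEOUT_SECS)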
The examples above all made requests between about 6:30 and 7:30 AM.

Looking at current jobs, nothing has been running for more than 20 minutes, so I don't think this is happening right now.
Cc: shuqianz@chromium.org
Owner: shuqianz@chromium.org
Issue 1)

In the example that xixuan investigated, one of the sub-jobs was aborted for unknown reasons, which may have prevented the suite run from completing before the suite timeout.

Issue 2)

We are no longer running run_suite with --no-wait, which means that if the suite takes a long time to complete, we can hang the builder without generating any logs. This causes buildbot to decide that the build is hung and kill it with logging that's very hard to understand.
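One possible mitigation, sketched below purely as an illustration (the helper names and the 300-second interval are assumptions, not existing chromite/autotest code; proc is assumed to be the Popen handle of the run_suite child): emit a periodic heartbeat line while blocked on run_suite so the builder's no-output watchdog never sees a completely quiet build.

# Illustrative heartbeat: keep emitting output while waiting on a long-running
# run_suite subprocess so the 8640-second silent-output watchdog never fires.
import threading

def _heartbeat(stop_event, interval_secs=300):
    waited = 0
    while not stop_event.wait(interval_secs):
        waited += interval_secs
        print('run_suite still in progress after %d seconds...' % waited, flush=True)

def wait_with_heartbeat(proc):
    """Wait for a subprocess while a background thread keeps the log alive."""
    stop = threading.Event()
    thread = threading.Thread(target=_heartbeat, args=(stop,), daemon=True)
    thread.start()
    try:
        return proc.wait()
    finally:
        stop.set()
        thread.join()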

Charlene made the change to remove --no-wait, so we're hoping she can explain why it was removed.
Blockedon: 730729
We may have a bug at ToT (tip of tree):

chromeos-test@chromeos-server14:~$ nice -n 10 /usr/local/autotest/server/autoserv -p -r /tmp/122035835-chromeos-test/chromeos4-row6-rack10-host5 -m chromeos4-row6-rack10-host5 -u chromeos-test -l veyron_mighty-paladin/R61-9626.0.0-rc3/bvt-inline/provision_AutoUpdate.double -s --lab True -P 122035835-chromeos-test/chromeos4-row6-rack10-host5 -n /usr/local/autotest/results/drone_tmp/attach.16637 --require-ssp --verify_job_repo_url
DEBUG:root:Refreshing HostInfo using store <autotest_lib.server.hosts.afe_store.AfeStore object at 0x7f68ef57a4d0>
DEBUG:root:Old host_info: None
DEBUG:root:New host_info: HostInfo [Labels: ['bluetooth', 'ec:cros', 'veyron_mighty', 'gpu_family:mali', 'board_freq_mem:veyron_mighty_1.8GHz_2GB', 'graphics:gles', 'hw_video_acc_vp8', 'hw_video_acc_enc_vp8', 'audio_loopback_dongle', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_h264', 'storage:mmc', 'servo', 'board:veyron_mighty', 'arc', 'webcam', 'sku:mighty_rk3288_2Gb', 'variant:mighty', 'phase:PVT2', 'touchpad', 'cros-version:veyron_mighty-paladin/R61-9626.0.0-rc3', 'pool:suites'], Attributes: {'powerunit_hostname': 'chromeos4-row5_6-rack10-rpm1', 'powerunit_outlet': '.A5'}]
DEBUG:root:Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
INFO:root:Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_z8D9j2ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack10-host5'
DEBUG:root:Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_z8D9j2ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack10-host5'
INFO:root:Getting devservers for host: None
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 782, in <module>
    main()
  File "/usr/local/autotest/server/autoserv", line 628, in main
    ssp_url, ssp_error_msg = _stage_ssp(parser)
  File "/usr/local/autotest/server/autoserv", line 134, in _stage_ssp
    execfile(STAGE_SERVER_SIDE_PACKAGE_CONTROL_FILE, namespace, script_locals)
  File "/usr/local/autotest/server/control_segments/stage_server_side_package", line 21, in <module>
    ssp_url = host.stage_server_side_package(image)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 463, in stage_server_side_package
    match = re.match('.*/R\d+-(\d+)\.', image_name)
UnboundLocalError: local variable 'image_name' referenced before assignment
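For illustration only (a hypothetical reduction of the failure mode, not the actual cros_host.py code): image_name is only bound on some branches of stage_server_side_package, so the re.match() call raises UnboundLocalError when none of those branches ran.

# Reduced, hypothetical illustration of the UnboundLocalError above:
import re

def stage_server_side_package(image=None, job_repo_url=None):
    if image:
        image_name = image
    elif job_repo_url:
        image_name = job_repo_url.rsplit('/', 1)[-1]
    # If neither branch ran, image_name was never assigned...
    return re.match(r'.*/R\d+-(\d+)\.', image_name)  # ...so this line raises.

# One fix direction (also hypothetical): bail out with a descriptive error first.
#     if not image and not job_repo_url:
#         raise ValueError('No image name available to stage the SSP from.')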
Our current belief is that we have a problem in which shards sometimes fail to run the actual test command.

crbug.com/730729 covers fixing the reporting chain so these failures are reported properly.
crbug.com/730431 covers the actual failure.
Or... this bug can cover the actual failure. ;>
Mergedinto: 730729
Status: Duplicate (was: Available)
