Issue metadata
Sign in to add a comment
|
swarming timeout : _ShutDownException: (15, 'Received signal 15; shutting down') |
||||||||||||||||||||||||
Issue descriptionswarming timeout Builders failed on: - caroline-release: https://luci-milo.appspot.com/buildbot/chromeos/caroline-release/734 - expresso-release: https://luci-milo.appspot.com/buildbot/chromeos/expresso-release/1356 - falco_li-release: https://luci-milo.appspot.com/buildbot/chromeos/falco_li-release/1296 - kip-release: https://luci-milo.appspot.com/buildbot/chromeos/kip-release/1272 - lulu-release: https://luci-milo.appspot.com/buildbot/chromeos/lulu-release/1193 - setzer-release: https://luci-milo.appspot.com/buildbot/chromeos/setzer-release/1186 - veyron_minnie-paladin: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/2820 - veyron_rialto-release: https://luci-milo.appspot.com/buildbot/chromeos/veyron_rialto-release/1191
,
Jun 7 2017
It seems that we are killing the swarming requests because no logs were generated for long enough to hit the timeout. BackgroundFailure: <class 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds IE: The request hung for 2.4 hours. I don't yet know if this was a swarming service hang, or if it happened on the swarm bot (and/or whatever swarm bot was talking too).
,
Jun 7 2017
This is the command that hung, and so far it looks like it generated no output. 06:21:25: INFO: RunCommand: /b/cbuild/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmp_jyTdQ/tmp7DrBWH/temp_summary.json --raw-cmd --task-name veyron_minnie-paladin/R61-9626.0.0-rc2-arc-bvt-cq --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:arc-bvt-cq' '--tags=build:veyron_minnie-paladin/R61-9626.0.0-rc2' '--tags=task_name:veyron_minnie-paladin/R61-9626.0.0-rc2-arc-bvt-cq' '--tags=board:veyron_minnie' -- /usr/local/autotest/site_utils/run_suite.py --build veyron_minnie-paladin/R61-9626.0.0-rc2 --board veyron_minnie --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority Build --timeout_mins 180 --retry True --max_retries 5 --offload_failures_only False --suite_args "{'attr_filter': u'(suite:arc-bvt-cq) and (subsystem:default)'}" --job_keyvals "{'datastore_parent_key': ('Build', 1577657, 'BuildStage', 46846659L)}" -m 122017797
,
Jun 7 2017
Follow up work to mitigate future issues: We should put a hard timeout around the swarming request with a short timeout (5 minutes?) so that we can at least report these errors in a semi-understandable way.
,
Jun 7 2017
The examples above all made requests between about 6:30 and 7:30 AM. Looking at current jobs, nothing currently running has been running for more than 20 minutes. I don't think this is happening currently.
,
Jun 7 2017
Issue 1) In the example that xixuan has investigated, one of the sub-jobs was aborted for unknown reasons, which might have caused the suite run to not complete before the suite timeout. Issue 2 ) We are no longer running run_suite with --no-wait, which means that if the suite takes a long time to complete, we can hang the builder without generating any logs. This causes buildbot to decide that the build is hung and kill it with logging that's very hard to understand. Charlene made the change to remove --no-wait, so we're hoping she can give feedback about why.
,
Jun 7 2017
,
Jun 7 2017
We may have a bug in tot:
chromeos-test@chromeos-server14:~$ nice -n 10 /usr/local/autotest/server/autoserv -p -r /tmp/122035835-chromeos-test/chromeos4-row6-rack10-host5 -m chromeos4-r
ow6-rack10-host5 -u chromeos-test -l veyron_mighty-paladin/R61-9626.0.0-rc3/bvt-inline/provision_AutoUpdate.double -s --lab True -P 122035835-chromeos-test/chromeos4-row6-rack10-host5 -n /usr/local/autotest/results/drone_tmp/attach.16637 --require-ssp --verify_job_repo_url
DEBUG:root:Refreshing HostInfo using store <autotest_lib.server.hosts.afe_store.AfeStore object at 0x7f68ef57a4d0>
DEBUG:root:Old host_info: None
DEBUG:root:New host_info: HostInfo [Labels: ['bluetooth', 'ec:cros', 'veyron_mighty', 'gpu_family:mali', 'board_freq_mem:veyron_mighty_1.8GHz_2GB', 'graphics:gles', 'hw_video_acc_vp8', 'hw_video_acc_enc_vp8', 'audio_loopback_dongle', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_h264', 'storage:mmc', 'servo', 'board:veyron_mighty', 'arc', 'webcam', 'sku:mighty_rk3288_2Gb', 'variant:mighty', 'phase:PVT2', 'touchpad', 'cros-version:veyron_mighty-paladin/R61-9626.0.0-rc3', 'pool:suites'], Attributes: {'powerunit_hostname': 'chromeos4-row5_6-rack10-rpm1', 'powerunit_outlet': '.A5'}]
DEBUG:root:Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
INFO:root:Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_z8D9j2ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack10-host5'
DEBUG:root:Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_z8D9j2ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack10-host5'
INFO:root:Getting devservers for host: None
Traceback (most recent call last):
File "/usr/local/autotest/server/autoserv", line 782, in <module>
main()
File "/usr/local/autotest/server/autoserv", line 628, in main
ssp_url, ssp_error_msg = _stage_ssp(parser)
File "/usr/local/autotest/server/autoserv", line 134, in _stage_ssp
execfile(STAGE_SERVER_SIDE_PACKAGE_CONTROL_FILE, namespace, script_locals)
File "/usr/local/autotest/server/control_segments/stage_server_side_package", line 21, in <module>
ssp_url = host.stage_server_side_package(image)
File "/usr/local/autotest/server/hosts/cros_host.py", line 463, in stage_server_side_package
match = re.match('.*/R\d+-(\d+)\.', image_name)
UnboundLocalError: local variable 'image_name' referenced before assignment
,
Jun 7 2017
Our current belief is that we have a problem in which shards sometimes fail to run the actual test command. crbug.com/730729 covers fixing the reporting chain to report properly. crbug.com/730431 covers the actual failure.
,
Jun 7 2017
Or... this bug can cover the actual failure. ;>
,
Jun 7 2017
|
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by bleung@chromium.org
, Jun 7 2017Components: Infra>Client>ChromeOS