
Issue 651236

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature




Have RunHWTestSuite return provisional statuses if all DUTs are down.

Project Member Reported by aaboagye@chromium.org, Sep 28 2016

Issue description

Recently, there have been failures in the HWTest and Paygen stages where the trace points to swarming.

For example:

06:18:29: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpdUQiY_/tmptOXi5t/temp_summary.json --raw-cmd --task-name buddy-release/R55-8844.0.0-paygen_au_canary --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:paygen_au_canary' '--tags=build:buddy-release/R55-8844.0.0' '--tags=task_name:buddy-release/R55-8844.0.0-paygen_au_canary' '--tags=board:buddy' -- /usr/local/autotest/site_utils/run_suite.py --build buddy-release/R55-8844.0.0 --board buddy --suite_name paygen_au_canary --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -c
Autotest instance: cautotest
09-28-2016 [06:18:39] Submitted create_suite_job rpc
09-28-2016 [06:18:52] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=78427412
@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=78427412@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
06:18:54: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpdUQiY_/tmpkjVjlj/temp_summary.json --raw-cmd --task-name buddy-release/R55-8844.0.0-paygen_au_canary --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:paygen_au_canary' '--tags=build:buddy-release/R55-8844.0.0' '--tags=task_name:buddy-release/R55-8844.0.0-paygen_au_canary' '--tags=board:buddy' -- /usr/local/autotest/site_utils/run_suite.py --build buddy-release/R55-8844.0.0 --board buddy --suite_name paygen_au_canary --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -m 78427412
08:42:25: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-7:6:7:3, started)>]
08:42:25: WARNING: Killing 7104 (sig=24 SIGXCPU)
08:42:25: WARNING: RunCommand: pstree -Apals 7104
08:42:25: WARNING: (stdout):
init,1
  `-chromebuild-sta,1058 -u /opt/chromebuild/chromebuild-startup.py --log /var/log/messages/chromebuild/startup.log
      `-python,1436 /opt/infra-bot-setup/infra-python/run.py infra.tools.bot_setup.start --root_dir /b --password_file /home/chrome-bot/.password_json --slave_name cros-beefy16-c2
          `-python,3110 /b/build/slave/run_slave.py --no_save --no-gclient-sync --python buildbot.tac --nodaemon --logfile twistd.log
              `-python,11350 -u ../../../scripts/slave/annotated_run.py --use-factory-properties-from-disk --build-properties-gz=eNqdUcFOhDAQ/Zeely10QXY56d2TiSdjSFsGqEKL08JmY/x3BxY0Rk9emsl7nTdv3rwz1ckeOuMDK56YbtH1ECkXbpfSjP3eYcOed0yhtLplBeulD4CMkNF0FX19fLgnuA1h8AXnowK8yjTnvXY47Bvnmg6o7rnhV8p5vgkAWjJAAmqsqkuE0IH0sLF27EmPFekh3TGtVPnbxoxqZ2vT/CEyk5NEIy3tx7S0Ei8ENyaUCJPxxlnCoRYqVrmIq/iYxgLyOEm1yOhRp0zUyTE/1ac0jqnzOrZczJWmYkVCLYdUbMy6zLYmdQzoXkCHxbI1NfgQTYDz5JlFGJw3wZGt7wzXMxi7CHZrgt6NqK85fqX4t+TbSBBUdzQ0SfMszg65yGfiPxt73UI1dvMV1gBL38kJ5lFLsa2MzkcKoL4kN5EWxJ4dvlZm7uOKL4nxpYH/uFK0XvLjE/YB3EQ=
                  `-python,12104 ../../../scripts/slave/annotated_run.py --use-factory-properties-from-disk --build-properties-gz=eNqdUcFOhDAQ/Zeely10QXY56d2TiSdjSFsGqEKL08JmY/x3BxY0Rk9emsl7nTdv3rwz1ckeOuMDK56YbtH1ECkXbpfSjP3eYcOed0yhtLplBeulD4CMkNF0FX19fLgnuA1h8AXnowK8yjTnvXY47Bvnmg6o7rnhV8p5vgkAWjJAAmqsqkuE0IH0sLF27EmPFekh3TGtVPnbxoxqZ2vT/CEyk5NEIy3tx7S0Ei8ENyaUCJPxxlnCoRYqVrmIq/iYxgLyOEm1yOhRp0zUyTE/1ac0jqnzOrZczJWmYkVCLYdUbMy6zLYmdQzoXkCHxbI1NfgQTYDz5JlFGJw3wZGt7wzXMxi7CHZrgt6NqK85fqX4t+TbSBBUdzQ0SfMszg65yGfiPxt73UI1dvMV1gBL38kJ5lFLsa2MzkcKoL4kN5EWxJ4dvlZm7uOKL4nxpYH/uFK0XvLjE/YB3EQ=
                      `-logdog_butler,12126 -log-level warning -project chromeos -prefix bb/chromeos/buddy-release/434 -output logdog,host="services-dot-luci-logdog.appspot.com" -output-max-buffer-age 30s run -stdout tee=stdout -stderr tee=stderr -streamserver-uri unix:/b/build/.recipe_runtime/tmp9Fr4rz/butler.sock -- /b/build/slave/buddy-release-master/.recipe_cipd/logdog_annotee -log-level warning -project chromeos -butler-stream-server unix:/b/build/.recipe_runtime/tmp9Fr4rz/butler.sock -logdog-host luci-logdog.appspot.com -annotate tee -name-base recipes -print-summary -tee -json-args-path /b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/logdog_annotee_cmd.json -result-path /b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/bootstrap_result.json
                          `-logdog_annotee,12139 -log-level warning -project chromeos -butler-stream-server unix:/b/build/.recipe_runtime/tmp9Fr4rz/butler.sock -logdog-host luci-logdog.appspot.com -annotate tee -name-base recipes -print-summary -tee -json-args-path /b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/logdog_annotee_cmd.json -result-path /b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/bootstrap_result.json
                              `-python,12143 -u /b/build_internal/scripts/slave/recipes.py --verbose run --workdir=/b/build/slave/buddy-release-master/build --properties-file=/b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/recipe_properties.json cros/cbuildbot_internal
                                  `-python,12154 -u /b/build_internal/scripts/slave/.recipe_deps/recipe_engine/recipes.py --package /b/build_internal/scripts/slave/infra/config/recipes.cfg --bootstrap-script /b/build_internal/scripts/slave/recipes.py --verbose run --workdir=/b/build/slave/buddy-release-master/build --properties-file=/b/build/slave/buddy-release-master/.recipe_runtime/tmp9tm0_Z/recipe_properties.json cros/cbuildbot_internal
                                      `-python2,12508 /b/build/slave/buddy-release-master/build/chromite/bin/cbuildbot --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 buddy-release
                                          `-python2,32503 chromite/bin/cbuildbot buddy-release --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 --resume --timeout 0 --notee --nocgroups --buildroot /b/cbuild/internal_master --version 8844.0.0 --metadata_dump /tmp/cbuildbot-tmpdUQiY_/metadataRhKfxg
                                              `-python2,19189 chromite/bin/cbuildbot buddy-release --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 --resume --timeout 0 --notee --nocgroups --buildroot /b/cbuild/internal_master --version 8844.0.0 --metadata_dump /tmp/cbuildbot-tmpdUQiY_/metadataRhKfxg
                                                  `-python2,22405 chromite/bin/cbuildbot buddy-release --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 --resume --timeout 0 --notee --nocgroups --buildroot /b/cbuild/internal_master --version 8844.0.0 --metadata_dump /tmp/cbuildbot-tmpdUQiY_/metadataRhKfxg
                                                      `-python2,22590 chromite/bin/cbuildbot buddy-release --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 --resume --timeout 0 --notee --nocgroups --buildroot /b/cbuild/internal_master --version 8844.0.0 --metadata_dump /tmp/cbuildbot-tmpdUQiY_/metadataRhKfxg
                                                          `-python2,7104 chromite/bin/cbuildbot buddy-release --buildroot /b/cbuild/internal_master --buildbot --branch master --buildnumber 434 --git-cache-dir /b/cros_git_cache --master-build-id 1084342 --resume --timeout 0 --notee --nocgroups --buildroot /b/cbuild/internal_master --version 8844.0.0 --metadata_dump /tmp/cbuildbot-tmpdUQiY_/metadataRhKfxg
                                                              `-python,22530 /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpdUQiY_/tmpkjVjlj/temp_summary.json --raw-cmd --task-name buddy-release/R55-8844.0.0-paygen_au_canary --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 --tags=priority:Build --tags=suite:paygen_au_canary --tags=build:buddy-release/R55-8844.0.0 --tags=task_name:buddy-release/R55-8844.0.0-paygen_au_canary --tags=board:buddy -- /usr/local/autotest/site_utils/run_suite.py --build buddy-release/R55-8844.0.0 --board buddy --suite_name paygen_au_canary --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -m 78427412
                                                                  `-{python},22643

08:42:25: WARNING: RunCommand: lsof -p 7104
08:42:25: WARNING: (stdout):
COMMAND  PID       USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
python2 7104 chrome-bot  cwd    DIR                8,1     4096  9568489 /b/cbuild/internal_master
python2 7104 chrome-bot  rtd    DIR                8,1     4096        2 /

[..snip..]

My theory is that RunHWTestSuite isn't getting a response back regarding the status of the tests. It turns out that all of the DUTs were in the "Repair Failed" state. Repeated attempts to repair the DUTs and find a good one to run the tests were unsuccessful.

Attempting to display pool info: bvt
host: chromeos4-row13-rack6-host1, status: Repair Failed, locked: False diagnosis: Failed repair
labels: ['board:buddy', 'storage:ssd', 'ec:cros', 'buddy', 'audio_loopback_dongle', 'bluetooth', 'pool:bvt', 'cts_abi_x86', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_enc_h264', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_video_acc_h264', 'webcam', 'arc']
Last 10 jobs within 3:18:00:
146460 Repair started on: 2016-09-28 09:09:10 status FAIL
146448 Verify started on: 2016-09-28 09:08:46 status FAIL
146320 Repair started on: 2016-09-28 08:39:03 status FAIL
146314 Verify started on: 2016-09-28 08:38:39 status FAIL
146231 Repair started on: 2016-09-28 08:08:50 status FAIL
146225 Verify started on: 2016-09-28 08:08:26 status FAIL
146138 Repair started on: 2016-09-28 07:38:39 status FAIL
146128 Verify started on: 2016-09-28 07:38:15 status FAIL
146074 Repair started on: 2016-09-28 07:08:29 status FAIL
146068 Verify started on: 2016-09-28 07:08:05 status FAIL

host: chromeos4-row13-rack6-host9, status: Repair Failed, locked: False diagnosis: Failed repair
labels: ['board:buddy', 'bluetooth', 'storage:ssd', 'ec:cros', 'buddy', 'pool:bvt', 'audio_loopback_dongle', 'cts_abi_x86', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_video_acc_h264', 'webcam', 'cros-version:buddy-release/R55-8841.0.0', 'arc']
Last 10 jobs within 3:18:00:
146458 Repair started on: 2016-09-28 09:09:10 status FAIL
146445 Verify started on: 2016-09-28 09:08:46 status FAIL
146321 Repair started on: 2016-09-28 08:39:03 status FAIL
146315 Verify started on: 2016-09-28 08:38:40 status FAIL
146232 Repair started on: 2016-09-28 08:08:50 status FAIL
146228 Verify started on: 2016-09-28 08:08:26 status FAIL
146137 Repair started on: 2016-09-28 07:38:39 status FAIL
146126 Verify started on: 2016-09-28 07:38:15 status FAIL
146072 Repair started on: 2016-09-28 07:08:29 status FAIL
146064 Verify started on: 2016-09-28 07:08:04 status FAIL

host: chromeos4-row13-rack6-host11, status: Repair Failed, locked: False diagnosis: Failed repair
labels: ['board:buddy', 'storage:ssd', 'ec:cros', 'buddy', 'audio_loopback_dongle', 'hw_video_acc_enc_h264', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'bluetooth', 'cts_abi_x86', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_h264', 'webcam', 'pool:bvt', 'arc', 'cros-version:buddy-release/R55-8841.0.0']
Last 10 jobs within 3:18:00:
146459 Repair started on: 2016-09-28 09:09:10 status FAIL
146446 Verify started on: 2016-09-28 09:08:46 status FAIL
146319 Repair started on: 2016-09-28 08:39:03 status FAIL
146305 Verify started on: 2016-09-28 08:38:38 status FAIL
146230 Repair started on: 2016-09-28 08:08:50 status FAIL
146222 Verify started on: 2016-09-28 08:08:26 status FAIL
146136 Repair started on: 2016-09-28 07:38:39 status FAIL
146124 Verify started on: 2016-09-28 07:38:15 status FAIL
146073 Repair started on: 2016-09-28 07:08:29 status FAIL
146065 Verify started on: 2016-09-28 07:08:04 status FAIL

host: chromeos4-row13-rack7-host1, status: Repair Failed, locked: False diagnosis: Failed repair
labels: ['board:buddy', 'bluetooth', 'storage:ssd', 'ec:cros', 'buddy', 'pool:bvt', 'audio_loopback_dongle', 'cts_abi_x86', 'cts_abi_arm', 'internal_display', 'os:cros', 'power:battery', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_video_acc_h264', 'webcam', 'servo', 'arc']
Last 10 jobs within 3:18:00:
146461 Repair started on: 2016-09-28 09:09:18 status FAIL
146450 Verify started on: 2016-09-28 09:08:47 status FAIL
146323 Repair started on: 2016-09-28 08:39:12 status FAIL
146307 Verify started on: 2016-09-28 08:38:39 status FAIL
146233 Repair started on: 2016-09-28 08:09:02 status FAIL
146217 Verify started on: 2016-09-28 08:08:25 status FAIL
146139 Repair started on: 2016-09-28 07:38:46 status FAIL
146125 Verify started on: 2016-09-28 07:38:15 status FAIL
146075 Repair started on: 2016-09-28 07:08:37 status FAIL
146067 Verify started on: 2016-09-28 07:08:05 status FAIL


Because it kept trying to repair the DUTs, it never sent back a status to the Paygen stage, which was waiting. Eventually, the Paygen stage got killed because of a timeout. That is what causes the really gross swarming output.
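To illustrate how the "all DUTs are dead" condition could be spotted from the pool info dump above, here is a minimal sketch. It assumes the "host: ..., status: ..., locked: ..." line format shown in that dump; the function names parse_pool_info and all_repair_failed are hypothetical and are not part of autotest or chromite.

```python
import re

# Matches lines such as:
#   host: chromeos4-row13-rack6-host1, status: Repair Failed, locked: False ...
# The format is taken from the pool info dump in this issue.
HOST_LINE = re.compile(
    r"^host: (?P<host>\S+), status: (?P<status>[^,]+), locked: (?P<locked>True|False)"
)


def parse_pool_info(text):
    """Return a {hostname: status} dict from a pool info dump (hypothetical helper)."""
    statuses = {}
    for line in text.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            statuses[match.group("host")] = match.group("status").strip()
    return statuses


def all_repair_failed(statuses):
    """True only if there is at least one host and every host is 'Repair Failed'."""
    return bool(statuses) and all(s == "Repair Failed" for s in statuses.values())


SAMPLE = """\
host: chromeos4-row13-rack6-host1, status: Repair Failed, locked: False diagnosis: Failed repair
host: chromeos4-row13-rack6-host9, status: Repair Failed, locked: False diagnosis: Failed repair
"""

if __name__ == "__main__":
    pool = parse_pool_info(SAMPLE)
    print(pool)
    print("all DUTs down:", all_repair_failed(pool))
```

An empty pool is deliberately not reported as "all repair failed", so a parsing problem is not mistaken for a lab failure.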

I have a proposal for the HWTest framework. I think that a test suite should return _something_ before the stage that launched it gets killed: a sort of provisional status. As it stands, it's non-obvious from the logs that this is what occurred. For example, normally we don't fail the Paygen stage for lab failures, and this seems like it would be a lab failure.

The test suite could return an early status indicating that all available DUTs are dead (or something similar), which would give the stage something to report. Perhaps the stage should also have a timeout of its own while waiting for the results.
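A rough sketch of what this proposal could look like follows. Everything in it is an assumption made for illustration: the SuiteStatus values, provisional_status, and wait_for_suite are hypothetical names, not existing run_suite.py return codes or chromite APIs, and the two poll callables stand in for whatever RPCs the stage would actually use.

```python
import enum
import time


class SuiteStatus(enum.Enum):
    # Hypothetical values; not the real run_suite return codes.
    OK = "OK"
    TESTS_FAILED = "TESTS_FAILED"
    PROVISIONAL_ALL_DUTS_DOWN = "PROVISIONAL_ALL_DUTS_DOWN"
    STAGE_TIMEOUT = "STAGE_TIMEOUT"


# Host states treated as "dead" for this sketch (assumption).
DEAD_STATES = frozenset({"Repair Failed"})


def provisional_status(host_statuses):
    """Return an early status if every DUT in the pool is dead, else None."""
    if host_statuses and all(s in DEAD_STATES for s in host_statuses.values()):
        return SuiteStatus.PROVISIONAL_ALL_DUTS_DOWN
    return None


def wait_for_suite(poll_result, poll_hosts, timeout_secs, interval_secs=60,
                   clock=time.monotonic, sleep=time.sleep):
    """Wait for a suite result, but give up early with something reportable.

    poll_result: callable returning a SuiteStatus, or None while still running.
    poll_hosts:  callable returning a {hostname: status} dict for the pool.
    """
    deadline = clock() + timeout_secs
    while clock() < deadline:
        result = poll_result()
        if result is not None:
            return result
        early = provisional_status(poll_hosts())
        if early is not None:
            return early
        sleep(interval_secs)
    # The stage's own timeout fired before the outer kill would have.
    return SuiteStatus.STAGE_TIMEOUT
```

With something along these lines, the stage could map PROVISIONAL_ALL_DUTS_DOWN to a lab failure rather than being killed with SIGXCPU and leaving only the swarming trace behind.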

Feel free to cc folks that could own this/might be interested.
 
Labels: -Type-Bug Type-Feature
Status: Untriaged (was: Unconfirmed)
Cc: akes...@chromium.org pprabhu@chromium.org
Components: Infra>Client>ChromeOS
Labels: current-issue
I agree that swarming timeouts are sometimes hiding the lab failures.
Updating components, CC'ing people to put it in the queue.

Comment 3 by dshi@chromium.org, Nov 29 2016

Labels: -current-issue Hotlist-Fixit
Status: Available (was: Untriaged)

Comment 4 by sheriffbot@chromium.org, Dec 11 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Archived (was: Untriaged)
We have swarming alerts now. Archiving.
