HWTest [sanity] failing with _ShutDownException on many Chrome PFQ builders
Issue description

Many Chrome PFQ builders are failing in the "HWTest [sanity]" stage with "_ShutDownException":

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893050929475664
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893056631816224
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893045621321904
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893047135868368
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893051795879888

The stage looks like it's taking a long time and getting killed. I don't see any information beyond that. When I follow the "stdout" link on http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893051795879888, I don't see any useful information:

************************************************************
** Start Stage HWTest [sanity] - Sun, 25 Nov 2018 05:07:54 -0800 (PST) **
** Stage that runs tests in the Autotest lab.
************************************************************
05:07:54: INFO: Created cidb engine bot@130.211.191.11 for pid 10465
05:07:54: INFO: Running cidb query on pid 10465, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f3ec0463810>
05:07:54: INFO: Waiting up to forever for payloads and test artifacts ...
Preconditions for the stage successfully met. Beginning to execute stage...
05:12:46: INFO: Running cidb query on pid 10465, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f3ebad73650>
05:12:46: INFO: Re-run swarming_cmd to avoid buildbot salency check.
05:12:46: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpjP7Bu0/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -c
Autotest instance created: cautotest-prod
11-25-2018 [05:12:53] Submitted create_suite_job rpc
11-25-2018 [05:13:10] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=261016597
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
05:13:12: INFO: Re-run swarming_cmd to avoid buildbot salency check.
05:13:12: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpltUFWV/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -m 261016597
05:56:21: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
05:56:21: INFO: Refreshing access_token
06:09:21: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:09:21: INFO: Refreshing access_token
06:11:45: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:11:45: INFO: Refreshing access_token
06:50:13: INFO: Re-run swarming_cmd to avoid buildbot salency check.
06:50:13: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpltUFWV/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -m 261016597
06:56:24: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:56:24: INFO: Refreshing access_token
07:09:24: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:09:24: INFO: Refreshing access_token
07:11:50: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:11:51: INFO: Refreshing access_token
07:56:27: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:56:27: INFO: Refreshing access_token
08:09:27: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
08:09:27: INFO: Refreshing access_token
08:10:44: ERROR: Timeout occurred- waited 18094 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
08:10:44: ERROR: Timeout occurred- waited 17953 seconds, failing.
Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
08:10:45: INFO: Running cidb query on pid 25733, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f3ec056eb10>
08:10:45: INFO: Get service account /creds/service_accounts/service-account-chromeos.json
08:10:45: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/.cache/cipd/packages/infra/tools/luci-auth/linux-amd64/luci-auth token '-service-account-json=/creds/service_accounts/service-account-chromeos.json'

When I follow "Link to suite" (which I'd prefer not to need to do, because cautotest takes an unreasonable amount of time to load), I can get to the logs dir at http://stainless/browse/chromeos-autotest-results/261016597-chromeos-test/, where autoserv.DEBUG contains some warnings and stack traces but nothing pointing at an actual problem. It ends with this:

11/25 05:18:52.204 DEBUG| dynamic_suite:0608| Waiting on suite.
11/25 08:18:17.574 ERROR| logging_manager:0626| Current thread 0x00007f065f3a0740:
11/25 08:18:17.575 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 144 in _sleep
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 136 in wait_for_results
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1170 in wait
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 609 in _run_suite
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 573 in _run_suite_with_spec
11/25 08:18:17.576 DEBUG| autoserv:0376| Received SIGTERM
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 559 in _perform_reimage_and_run
11/25 08:18:17.576 DEBUG| autoserv:0379| Finished writing to pid_file. Killing process.
11/25 08:18:17.576 ERROR| logging_manager:0626| File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 513 in reimage_and_run
11/25 08:18:17.577 ERROR| logging_manager:0626| File "/usr/local/autotest/results/261016597-chromeos-test/hostless/control.srv", line 55 in <module>
11/25 08:18:17.577 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 1340 in _execute_code
11/25 08:18:17.577 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 817 in run
11/25 08:18:17.577 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 592 in run_autoserv
11/25 08:18:17.577 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 806 in main
11/25 08:18:17.578 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 823 in <module>

That looks like it's just saying that it received SIGTERM while it was waiting for results, though. The child job is at http://cautotest-prod/afe/#tab_id=view_job&object_id=261017028, but I don't see any logs for it.
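For what it's worth, that stack dump is consistent with autoserv simply being killed while the suite-wait loop was sleeping between polls. The sketch below is illustrative only (not the actual job_status.py code; the function names and 30-second poll interval are made up) and shows why a SIGTERM delivered to a loop of this shape always surfaces with _sleep and wait_for_results at the top of the dumped stack:

# Hypothetical sketch, not the real job_status.py: a polling wait loop whose
# idle time is spent sleeping, plus a SIGTERM handler that dumps the stack
# before exiting, roughly what the autoserv.DEBUG excerpt shows.
import signal
import sys
import time
import traceback

POLL_INTERVAL_SECONDS = 30  # assumed value, for illustration only


def _dump_stack_and_exit(signum, frame):
    # Log the current stack (which will point at _sleep), then shut down.
    traceback.print_stack(frame)
    sys.exit(1)


def _sleep():
    time.sleep(POLL_INTERVAL_SECONDS)


def wait_for_results(is_suite_finished):
    """Poll until the suite reports completion (simplified)."""
    while not is_suite_finished():
        _sleep()  # a SIGTERM arriving during this call matches the log above


if __name__ == '__main__':
    signal.signal(signal.SIGTERM, _dump_stack_and_exit)
    deadline = time.time() + 60
    wait_for_results(lambda: time.time() > deadline)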
Comment 1 by derat@chromium.org, Nov 26
The same stage is failing on other Chrome PFQ builders with ABORT:

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893065309882944
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893061382277984
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893054555277056

When I look at autoserv.DEBUG there, I see similar stack traces with wait_for_results interrupted by SIGTERM. If it's the same problem, I'm not sure why it's being reported differently.
Nov 26
We're seeing this across the board with all builders; something in Autotest is not happy.

-- Mike
Nov 26
Looks like a timeout again. The child provision jobs are starting more than an hour after suite creation. The delay seems to be in getting through the shard client.
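A sketch of how one might quantify that creation-to-start delay for the suite's child jobs, assuming the stock AFE MySQL schema (afe_jobs.created_on, afe_jobs.parent_job_id, afe_host_queue_entries.started_on) and a chromeos_autotest_db database name; all of those are assumptions to verify before relying on the numbers:

# Illustrative only: report how long each child job of a suite waited between
# creation and actually starting on a host. Table/column names and connection
# parameters are assumptions about the AFE deployment; adjust as needed.
import pymysql

SUITE_JOB_ID = 261016597  # the sanity suite job from the description

QUERY = """
SELECT j.id, j.name, j.created_on, hqe.started_on,
       TIMESTAMPDIFF(MINUTE, j.created_on, hqe.started_on) AS minutes_waiting
FROM afe_jobs j
JOIN afe_host_queue_entries hqe ON hqe.job_id = j.id
WHERE j.parent_job_id = %s
ORDER BY minutes_waiting DESC
"""


def main():
    conn = pymysql.connect(host='localhost', user='readonly', password='',
                           database='chromeos_autotest_db')
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (SUITE_JOB_ID,))
            for job_id, name, created, started, waited in cur.fetchall():
                print('%s %-60s created=%s started=%s waited=%s min'
                      % (job_id, name, created, started, waited))
    finally:
        conn.close()


if __name__ == '__main__':
    main()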
Nov 26
There are two or three significant symptoms:

2018-11-12 13:10:00 - shard scheduler tick time dominated by schedule_running_host_queue_entries
2018-11-13 14:50:00 - master scheduler tick: schedule_new_jobs and run_cleanup increasing
2018-11-22 22:30:00 - run_cleanup time spikes (due to passing some threshold?)

After pushing, run_cleanup time recovered, but everything is still shot. The initial symptom was also accompanied by a sharp master scheduler tick drop shortly after.
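For anyone unfamiliar with what "tick time dominated by schedule_running_host_queue_entries" means: each scheduler tick runs several phases, and the per-phase durations are what those symptoms refer to. Below is a minimal, purely illustrative sketch of timing tick phases; the real scheduler has its own instrumentation and these phase functions are placeholders, not the actual scheduler code:

# Illustrative sketch of per-phase tick timing; not the scheduler's real
# instrumentation. The phase functions below are stand-ins for real work.
import contextlib
import time
from collections import defaultdict

phase_totals = defaultdict(float)


@contextlib.contextmanager
def timed_phase(name):
    start = time.time()
    try:
        yield
    finally:
        phase_totals[name] += time.time() - start


def schedule_new_jobs():
    time.sleep(0.01)  # placeholder for real work


def schedule_running_host_queue_entries():
    time.sleep(0.05)  # placeholder for real work


def run_cleanup():
    time.sleep(0.02)  # placeholder for real work


def tick():
    with timed_phase('schedule_new_jobs'):
        schedule_new_jobs()
    with timed_phase('schedule_running_host_queue_entries'):
        schedule_running_host_queue_entries()
    with timed_phase('run_cleanup'):
        run_cleanup()


if __name__ == '__main__':
    for _ in range(10):
        tick()
    for name, total in sorted(phase_totals.items(), key=lambda kv: -kv[1]):
        print('%-45s %.2fs' % (name, total))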
Nov 26
Switching shard queries back to replicas helped significantly.
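For context on what "back to replicas" means in practice: read traffic is routed to a read replica while writes stay on the primary. A minimal sketch of that technique with a Django database router follows; it assumes a 'replica' alias in settings.DATABASES and is illustrative only, not the actual Autotest change:

# Illustrative Django database router: send reads to a replica, writes to the
# primary. Assumes settings.DATABASES defines 'default' and 'replica' aliases.
class ReplicaReadRouter(object):
    def db_for_read(self, model, **hints):
        return 'replica'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, **hints):
        return db == 'default'

# In settings.py (assumed):
# DATABASE_ROUTERS = ['path.to.ReplicaReadRouter']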
Dec 11
Does the error below fall under this bug, or should I file a separate bug?

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927538827488826320
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927537531719115776
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927537531719115776/+/steps/HWTest__provision_/0/stdout

cmd=['/b/swarming/wi3Ta5A/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/b/swarming/wi3Ta5A/ir/tmp/t/cbuildbot-tmpBA7jmC/tmpM10CO3/temp_summary.json', '--print-status-updates', '--timeout', '9000', '--raw-cmd', '--task-name', u'veyron_mighty-paladin/R73-11379.0.0-rc2-provision', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:provision', u'--tags=build:veyron_mighty-paladin/R73-11379.0.0-rc2', u'--tags=task_name:veyron_mighty-paladin/R73-11379.0.0-rc2-provision', u'--tags=board:veyron_mighty', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'veyron_mighty-paladin/R73-11379.0.0-rc2', '--board', u'veyron_mighty', '--suite_name', u'provision', '--pool', u'cq', '--file_bugs', 'False', '--priority', 'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--suite_args', "{u'num_required': 1}", '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 102047152L, 'cidb_build_id': 3211368, 'datastore_parent_key': ('Build', 3211368, 'BuildStage', 102047152L)}", '--test_args', "{'fast': 'True'}", '-c']
Autotest instance created: cautotest-prod
12-10-2018 [02:33:31] Submitted create_suite_job rpc
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 2076, in <module>
    sys.exit(main())
  File "/usr/local/autotest/site_utils/run_suite.py", line 2065, in main
    result = _run_task(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 2000, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1739, in _run_suite
    job_id = create_suite(afe, options)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 252, in func_retry
    raise error.TimeoutException(exception_message)
autotest_lib.client.common_lib.error.TimeoutException: retry exception (function="create_suite()"), timeout = 600s
02:43:32: INFO: No json dump found, no HWTest results to report
02:43:32: INFO: Running cidb query on pid 29044, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f398f83be50>
02:43:32: ERROR: ** HWTest failed (code 1) **
02:43:32: INFO: Translating result ** HWTest failed (code 1) ** to fail.
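The traceback above means the create_suite_job RPC never completed within the retry wrapper's 600-second budget. Roughly, a timeout-bounded retry wrapper in the spirit of autotest's retry.retry behaves like the sketch below; it is simplified, and the argument names, backoff, and the ConnectionError stand-in are assumptions rather than the real retry.py:

# Simplified sketch of a timeout-bounded retry decorator; not the actual
# autotest retry.py implementation (which also bounds the in-flight call).
import functools
import time


class TimeoutException(Exception):
    pass


def retry(handled_exc, timeout_min=10, delay_sec=5):
    """Retry the wrapped call on handled_exc until timeout_min elapses."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.time() + timeout_min * 60
            while True:
                try:
                    return func(*args, **kwargs)
                except handled_exc:
                    if time.time() + delay_sec >= deadline:
                        raise TimeoutException(
                            'retry exception (function="%s()"), timeout = %ds'
                            % (func.__name__, timeout_min * 60))
                    time.sleep(delay_sec)
        return wrapper
    return decorator


@retry(ConnectionError, timeout_min=10)
def create_suite():
    # Stand-in for the AFE create_suite_job RPC; always "fails" here so the
    # decorator eventually raises TimeoutException after the 600s budget.
    raise ConnectionError('RPC did not respond')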
Today (11 hours ago)
Something similar has flared up again today, example: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923641353615131552

ayatane: how do I check the symptoms you listed to see if this is the same issue?
Today (11 hours ago)
Whether or not it is the same issue, the deputy should take a look.
Today (11 hours ago)
Informational PFQ builders are also failing with similar logs, e.g. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923706148236319824
Today (10 hours ago)
The failures from today and the last couple of days are due to issue 924181 and are therefore unrelated to the earlier failures. The underlying issue has been fixed.
Today (9 hours ago)
Looks like the latest caroline-tot-chrome-pfq-informational run is passing provision and failing at a later stage for an unrelated reason: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923614706063606928

I declare the issue resolved.