
Issue 908300

Starred by 6 users

Issue metadata

Status: Fixed
Owner:
Closed: Today
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression




HWTest [sanity] failing with _ShutDownException on many Chrome PFQ builders

Project Member Reported by derat@chromium.org, Nov 26

Issue description

Many Chrome PFQ builders are failing in the "HWTest [sanity]" stage with "_ShutDownException":

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893050929475664
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893056631816224
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893045621321904
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893047135868368
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893051795879888

The stage looks like it's taking a long time and getting killed. I don't see any information beyond that.

When I follow the "stdout" link on http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893051795879888, I don't see any useful information:

************************************************************
** Start Stage HWTest [sanity] - Sun, 25 Nov 2018 05:07:54 -0800 (PST)
** 
** Stage that runs tests in the Autotest lab.
************************************************************
05:07:54: INFO: Created cidb engine bot@130.211.191.11 for pid 10465
05:07:54: INFO: Running cidb query on pid 10465, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f3ec0463810>
05:07:54: INFO: Waiting up to forever for payloads and test artifacts ...
Preconditions for the stage successfully met. Beginning to execute stage...
05:12:46: INFO: Running cidb query on pid 10465, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f3ebad73650>
05:12:46: INFO: Re-run swarming_cmd to avoid buildbot salency check.
05:12:46: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpjP7Bu0/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -c
Autotest instance created: cautotest-prod
11-25-2018 [05:12:53] Submitted create_suite_job rpc
11-25-2018 [05:13:10] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=261016597
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
05:13:12: INFO: Re-run swarming_cmd to avoid buildbot salency check.
05:13:12: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpltUFWV/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -m 261016597
05:56:21: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
05:56:21: INFO: Refreshing access_token
06:09:21: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:09:21: INFO: Refreshing access_token
06:11:45: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:11:45: INFO: Refreshing access_token
06:50:13: INFO: Re-run swarming_cmd to avoid buildbot salency check.
06:50:13: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmppMQrAW/tmpltUFWV/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:sanity' '--tags=build:peach_pit-chrome-pfq/R72-11294.0.0-rc1' '--tags=task_name:peach_pit-chrome-pfq/R72-11294.0.0-rc1-sanity' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11294.0.0-rc1 --board peach_pit --suite_name sanity --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 1 --suite_min_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 100223231L, 'cidb_build_id': 3161896, 'datastore_parent_key': ('Build', 3161896, 'BuildStage', 100223231L)}" -m 261016597
06:56:24: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
06:56:24: INFO: Refreshing access_token
07:09:24: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:09:24: INFO: Refreshing access_token
07:11:50: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:11:51: INFO: Refreshing access_token
07:56:27: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
07:56:27: INFO: Refreshing access_token
08:09:27: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
08:09:27: INFO: Refreshing access_token

08:10:44: ERROR: Timeout occurred- waited 18094 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.

08:10:44: ERROR: Timeout occurred- waited 17953 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
08:10:45: INFO: Running cidb query on pid 25733, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f3ec056eb10>
08:10:45: INFO: Get service account /creds/service_accounts/service-account-chromeos.json
08:10:45: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/.cache/cipd/packages/infra/tools/luci-auth/linux-amd64/luci-auth token '-service-account-json=/creds/service_accounts/service-account-chromeos.json'
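
The "timeout deadline set by the master" message above points at a build-level watchdog rather than anything specific to the tests. As a rough illustration of the mechanism, here is a minimal sketch assuming a SIGALRM-based approach; this is not chromite's actual implementation, and the names are made up:

import contextlib
import signal


class StageTimeoutError(Exception):
    """Raised when a stage outlives the deadline set for the build."""


@contextlib.contextmanager
def deadline(seconds, reason):
    """Fail with StageTimeoutError if the body runs longer than `seconds`."""
    def _handler(signum, frame):
        raise StageTimeoutError(
            'Timeout occurred- waited %d seconds, failing. '
            'Timeout reason: %s' % (seconds, reason))
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # Cancel the pending alarm.
        signal.signal(signal.SIGALRM, old_handler)


# Hypothetical usage: run the stage under whatever time the master left us.
# with deadline(18094, 'This build has reached the timeout deadline ...'):
#     run_hwtest_stage()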

---

When I follow "Link to suite" (which I'd prefer to not need to do, because cautotest takes an unreasonable amount of time to load), I can get to the logs dir at http://stainless/browse/chromeos-autotest-results/261016597-chromeos-test/, where autoserv.DEBUG contains some warnings and stack traces but nothing pointing at an actual problem. It ends with this:

11/25 05:18:52.204 DEBUG|     dynamic_suite:0608| Waiting on suite.
11/25 08:18:17.574 ERROR|   logging_manager:0626| Current thread 0x00007f065f3a0740:
11/25 08:18:17.575 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 144 in _sleep
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 136 in wait_for_results
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1170 in wait
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 609 in _run_suite
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 573 in _run_suite_with_spec
11/25 08:18:17.576 DEBUG|          autoserv:0376| Received SIGTERM
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 559 in _perform_reimage_and_run
11/25 08:18:17.576 DEBUG|          autoserv:0379| Finished writing to pid_file. Killing process.
11/25 08:18:17.576 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 513 in reimage_and_run
11/25 08:18:17.577 ERROR|   logging_manager:0626|   File "/usr/local/autotest/results/261016597-chromeos-test/hostless/control.srv", line 55 in <module>
11/25 08:18:17.577 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 1340 in _execute_code
11/25 08:18:17.577 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 817 in run
11/25 08:18:17.577 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 592 in run_autoserv
11/25 08:18:17.577 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 806 in main
11/25 08:18:17.578 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 823 in <module>

That looks like it's just saying that it received SIGTERM while it was waiting for results, though.
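
For what it's worth, the interleaved "Received SIGTERM" and stack-dump lines above are consistent with a signal handler that logs the current stack before killing the process. A minimal sketch of that pattern (illustrative only, not autoserv's actual code):

import logging
import signal
import sys
import traceback


def _sigterm_handler(signum, frame):
    logging.debug('Received SIGTERM')
    # Emit one ERROR line per frame, mirroring the logging_manager output.
    for filename, lineno, func, _ in traceback.extract_stack(frame):
        logging.error('  File "%s", line %d in %s', filename, lineno, func)
    sys.exit(1)  # Raises SystemExit so cleanup handlers still run.


signal.signal(signal.SIGTERM, _sigterm_handler)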

The child job is at http://cautotest-prod/afe/#tab_id=view_job&object_id=261017028, but I don't see any logs for it.
 
There's a mention of _ShutDownException in issue 895247, but I'm not sure whether it describes the same problem. If it does, and I'm missing the logs for some reason, please rename the exception to something more descriptive like "_NoSubmittableChangesException".

The same stage is failing on other Chrome PFQ builders with ABORT:

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893065309882944
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893061382277984
http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928893054555277056

When I look at autoserv.DEBUG there, I see similar stack traces with wait_for_results interrupted by SIGTERM. If it's the same problem, I'm not sure why it's being reported differently.
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
We're seeing this across the board with all builders; something in Autotest is not happy.


-- Mike
Owner: ayatane@chromium.org
Looks like a timeout again. The child provision jobs are starting more than an hour after suite creation. The delay seems to come from getting through the shard client.
There are three significant symptoms:

2018-11-12 13:10:00 shard scheduler tick time dominated by schedule_running_host_queue_entries 

2018-11-13 14:50:00 master scheduler tick schedule_new_jobs, run_cleanup increasing

2018-11-22 22:30:00 run_cleanup time spikes (due to passing some threshold?)

After pushing, run_cleanup time recovered, but everything is still shot.

The initial symptom was also accompanied by a sharp master scheduler tick drop shortly after.
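
For anyone else trying to reproduce this kind of analysis: per-phase tick times like the ones above can be collected with an accumulating timer around each phase of the scheduler tick. A minimal sketch with made-up plumbing; the phase names match the symptoms above, but nothing else here is the autotest scheduler's real code:

import collections
import functools
import logging
import time

_phase_times = collections.defaultdict(float)


def timed_phase(name):
    """Accumulate wall-clock time spent in one phase of a scheduler tick."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                _phase_times[name] += time.time() - start
        return wrapper
    return decorator


@timed_phase('schedule_running_host_queue_entries')
def schedule_running_host_queue_entries():
    pass  # placeholder for the real phase


@timed_phase('run_cleanup')
def run_cleanup():
    pass  # placeholder for the real phase


def report_tick_times():
    """Log phases sorted by time spent, biggest contributor first."""
    for name, secs in sorted(_phase_times.items(), key=lambda kv: -kv[1]):
        logging.info('tick phase %s: %.1fs', name, secs)
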
Switching shard queries back to replicas helped significantly
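
To make the replica change concrete: the idea is that read-only shard heartbeat queries can be served from a replica, while mutations stay on the master. A minimal sketch, written against SQLAlchemy for self-containedness (autotest's own DB layer is different, and the URLs, table, and column names here are hypothetical):

from sqlalchemy import create_engine, text

master = create_engine('mysql://master-db/chromeos_autotest_db')
replica = create_engine('mysql://replica-db/chromeos_autotest_db')


def fetch_shard_job_ids(shard_hostname):
    """Read-only shard query; safe to serve from a replica."""
    with replica.connect() as conn:
        rows = conn.execute(
            text('SELECT id FROM afe_jobs WHERE shard_hostname = :shard'),
            {'shard': shard_hostname})
        return [row.id for row in rows]


def assign_job_to_shard(job_id, shard_hostname):
    """Mutation; must go to the master to avoid replication-lag races."""
    with master.begin() as conn:
        conn.execute(
            text('UPDATE afe_jobs SET shard_hostname = :shard '
                 'WHERE id = :job_id'),
            {'shard': shard_hostname, 'job_id': job_id})

The trade-off is the usual one: replicas soak up the read load at the cost of possibly stale results if replication lags.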
Labels: -Pri-0 Hotlist-Deputy Pri-2
Cc: mathewb@chromium.org
Does the error below fall under this bug, or should I file a separate bug?

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927538827488826320

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927537531719115776

https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927537531719115776/+/steps/HWTest__provision_/0/stdout

cmd=['/b/swarming/wi3Ta5A/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/b/swarming/wi3Ta5A/ir/tmp/t/cbuildbot-tmpBA7jmC/tmpM10CO3/temp_summary.json', '--print-status-updates', '--timeout', '9000', '--raw-cmd', '--task-name', u'veyron_mighty-paladin/R73-11379.0.0-rc2-provision', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:provision', u'--tags=build:veyron_mighty-paladin/R73-11379.0.0-rc2', u'--tags=task_name:veyron_mighty-paladin/R73-11379.0.0-rc2-provision', u'--tags=board:veyron_mighty', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'veyron_mighty-paladin/R73-11379.0.0-rc2', '--board', u'veyron_mighty', '--suite_name', u'provision', '--pool', u'cq', '--file_bugs', 'False', '--priority', 'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--suite_args', "{u'num_required': 1}", '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 102047152L, 'cidb_build_id': 3211368, 'datastore_parent_key': ('Build', 3211368, 'BuildStage', 102047152L)}", '--test_args', "{'fast': 'True'}", '-c']
Autotest instance created: cautotest-prod
12-10-2018 [02:33:31] Submitted create_suite_job rpc
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 2076, in <module>
    sys.exit(main())
  File "/usr/local/autotest/site_utils/run_suite.py", line 2065, in main
    result = _run_task(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 2000, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1739, in _run_suite
    job_id = create_suite(afe, options)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 252, in func_retry
    raise error.TimeoutException(exception_message)
autotest_lib.client.common_lib.error.TimeoutException: retry exception (function="create_suite()"), timeout = 600s
02:43:32: INFO: No json dump found, no HWTest results to report
02:43:32: INFO: Running cidb query on pid 29044, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f398f83be50>

02:43:32: ERROR: ** HWTest failed (code 1) **
02:43:32: INFO: Translating result ** HWTest failed (code 1) ** to fail.
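
Judging by the traceback, the TimeoutException appears to come out of a retry decorator around create_suite(): the RPC is retried until a 600-second budget is exhausted, then the decorator gives up with the message shown. A minimal sketch in the same spirit (the details are assumptions, not autotest's actual retry.py):

import functools
import time


class TimeoutException(Exception):
    pass


def retry(exc_type, timeout_min=10, delay_sec=5):
    """Retry the wrapped call on `exc_type` until `timeout_min` elapses."""
    def decorator(func):
        @functools.wraps(func)
        def func_retry(*args, **kwargs):
            deadline = time.time() + timeout_min * 60
            while True:
                try:
                    return func(*args, **kwargs)
                except exc_type as e:
                    if time.time() >= deadline:
                        raise TimeoutException(
                            'retry exception (function="%s()"), timeout = %ds'
                            % (func.__name__, timeout_min * 60)) from e
                    time.sleep(delay_sec)
        return func_retry
    return decorator


# Hypothetical usage, matching the 600s budget in the log above:
# @retry(RPCTimeoutError, timeout_min=10)
# def create_suite(afe, options):
#     ...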

Comment 11 by michaelpg@google.com, Today (11 hours ago)

Cc: michae...@chromium.org
Something similar has flared up again today, example: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923641353615131552

ayatane: how do I check the symptoms you listed to see if this is the same issue?

Comment 12 by ayatane@chromium.org, Today (11 hours ago)

Labels: -Pri-2 Pri-1
Owner: zamorzaev@chromium.org
Whether or not it is the same issue, the deputy should take a look.

Comment 13 by michaelpg@google.com, Today (11 hours ago)

Informational PFQ builders are also failing with similar logs, e.g. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923706148236319824

Comment 14 by zamorzaev@chromium.org, Today (10 hours ago)

The failures from today and the last couple of days are due to issue 924181 and hence unrelated to earlier failures. The underlying issue has been fixed.

Comment 15 by zamorzaev@chromium.org, Today (9 hours ago)

Status: Fixed (was: Assigned)
Looks like the latest caroline-tot-chrome-pfq-informational run is passing provision and failing at a later stage for an unrelated reason: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923614706063606928

I declare the issue resolved.
