stout is repeatedly failing the AU tests.
Issue description

stout is repeatedly failing the AU tests:
https://stainless.corp.google.com/search?exclude_retried=true&first_date=2018-04-13&master_builder_name=&builder_name_number=&shard=&exclude_acts=true&builder_name=&master_builder_name_number=&owner=&retry=&exclude_cts=true&exclude_non_production=true&hostname=&board=%5Estout%24&test=&exclude_not_run=false&build=%5ER66%5C-10452%5C.74%5C.0%24&status=FAIL&status=ERROR&status=ABORT&reason=&waterfall=&suite=%5Epaygen_au_stable%24&last_date=2018-04-26&exclude_non_release=true&exclude_au=false&model=&view=list

GE: https://cros-goldeneye.corp.google.com/chromeos/console/qaRelease?releaseName=M66-STABLE-CHROMEOS-1

Log info
------------------------------------------------------------------------
04/25 06:32:22.351 DEBUG|autoupdate_EndToEn:0366| The test configuration supplied: {'source_payload_uri': 'gs://chromeos-releases/stable-channel/stout/8530.96.0/payloads/chromeos_8530.96.0_stout_stable-channel_full_test.bin-1cdca09989bb3e3866dbdef992e8c194', 'name': 'paygen_au_stable', 'target_archive_uri': None, 'target_release': '10452.74.0', 'source_release': '8530.96.0', 'update_type': 'delta', 'source_archive_uri': 'gs://chromeos-releases/stable-channel/stout/8530.96.0', 'target_payload_uri': 'gs://chromeos-releases/stable-channel/stout/10452.74.0/payloads/chromeos_8530.96.0-10452.74.0_stout_stable-channel_delta_test.bin-7e8ce17288f12defd0b840c7cb4c885d'}
04/25 06:32:22.351 DEBUG| dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.ImageServer'>
04/25 06:32:22.352 INFO | dev_server:0798| Getting devservers for host: chromeos6-row2-rack4-host20
04/25 06:32:22.357 DEBUG| dev_server:0819| The host chromeos6-row2-rack4-host20 (100.115.128.218) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
04/25 06:32:22.358 DEBUG| dev_server:0770| Pick one healthy devserver from ['http://100.115.219.129:8082', 'http://100.115.219.130:8082', 'http://100.115.245.197:8082', 'http://100.115.219.132:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.136:8082', 'http://100.115.185.228:8082', 'http://100.115.185.227:8082', 'http://100.115.185.226:8082', 'http://100.115.219.135:8082', 'http://100.115.245.201:8082', 'http://100.115.245.200:8082', 'http://100.115.219.139:8082', 'http://100.115.219.137:8082']
04/25 06:32:22.358 DEBUG| dev_server:0774| Check health for http://100.115.219.129:8082
04/25 06:32:22.358 DEBUG| utils:0214| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
04/25 06:32:24.240 DEBUG| dev_server:0779| Pick http://100.115.219.129:8082
04/25 06:32:24.253 DEBUG| utils:0214| Running 'ssh 100.115.219.130 'curl "http://100.115.219.130:8082/check_health?"''
04/25 06:32:24.254 DEBUG| utils:0214| Running 'ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"''
04/25 06:32:24.264 DEBUG| utils:0214| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/check_health?"''
04/25 06:32:24.266 DEBUG| utils:0214| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
04/25 06:32:24.269 DEBUG| utils:0214| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
04/25 06:32:24.270 DEBUG| utils:0214| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
04/25 06:32:24.284 DEBUG| utils:0214| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
04/25 06:32:24.292 DEBUG| utils:0214| Running 'ssh 100.115.185.227 'curl "http://100.115.185.227:8082/check_health?"''
04/25 06:32:24.299 DEBUG| utils:0214| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
04/25 06:32:24.301 DEBUG| utils:0214| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
04/25 06:32:24.303 DEBUG| utils:0214| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
04/25 06:32:24.303 DEBUG| utils:0214| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
04/25 06:32:24.307 DEBUG| utils:0214| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
04/25 06:32:24.319 DEBUG| utils:0214| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
04/25 06:32:26.249 DEBUG| dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.130.
04/25 06:32:26.251 ERROR| dev_server:0520| Devserver call failed: "http://100.115.219.130:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
04/25 06:32:26.264 DEBUG| dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.132.
04/25 06:32:26.265 ERROR| dev_server:0520| Devserver call failed: "http://100.115.219.132:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
Comment 1 by abod...@chromium.org, Apr 26 2018
All failures are on chromeos6* DUTs. When I try to run this command from my desk, it also fails: curl http://100.115.245.132:8082/check_health? Adding infra deputies.
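For reference, the probe that dev_server keeps retrying in the log above can be reproduced standalone; below is a minimal sketch (illustrative only, assuming Python 3 -- the check_health helper and the DEVSERVERS list are hypothetical names, not the dev_server.py implementation) that hits /check_health with the same 2-second deadline:

# Illustrative reproduction helper -- not the autotest dev_server.py code.
# Probes each devserver's /check_health endpoint with the same 2-second
# deadline seen in the log and reports which ones answer at all.
import urllib.error
import urllib.request

DEVSERVERS = [                      # subset of the list from the log above
    'http://100.115.219.129:8082',
    'http://100.115.219.130:8082',
    'http://100.115.219.132:8082',
]

def check_health(base_url, timeout=2.0):
    """Return True if /check_health responds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(base_url + '/check_health?',
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for url in DEVSERVERS:
    print(url, 'OK' if check_health(url) else 'NO RESPONSE')

Note that in the lab the same probe runs over ssh to the devserver host (the 'Running ssh ... curl ...' lines in the log) because the DUT sits in a restricted subnet.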
Apr 26 2018
This shouldn't be caused by the RPC call timeout.
04/25 05:52:50.923 ERROR| dev_server:0520| Devserver call failed: "http://100.115.219.139:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
04/25 06:55:58.603 ERROR| logging_manager:0626| Current thread 0x00007f15bfac2740:
04/25 06:55:58.604 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/forking.py", line 135 in poll
04/25 06:55:58.604 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/forking.py", line 154 in wait
04/25 06:55:58.605 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/process.py", line 145 in join
04/25 06:55:58.605 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2834 in get_least_loaded_devserver
04/25 06:55:58.605 DEBUG| autoserv:0402| Received SIGTERM
Something went wrong when joining a process.
Apr 26 2018
All the tests were aborted after hitting the suite deadline, and something is blocking the test run. I don't think it's caused simply by some devserver RPC failures, since eventually a healthy devserver in this restricted network should serve the request; otherwise it wouldn't hurt only this test. Assigning to @dhaddock. Do you have any idea how to find the logs for the one-hour gap between the 'dev_server failed RPC' messages and the SIGTERM abort? Or why there is no logging during that one-hour gap?
04/25 06:32:26.264 DEBUG| dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.132.
04/25 06:32:26.265 ERROR| dev_server:0520| Devserver call failed: "http://100.115.219.132:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
----- One hour gap, then get aborted -----
04/25 07:36:21.211 DEBUG| autoserv:0402| Received SIGTERM
04/25 07:36:21.212 DEBUG| autoserv:0405| Finished writing to pid_file. Killing process.
Apr 26 2018
I don't think there are any _test_ logs for that hour. I think the test was blocked waiting for get_least_loaded_devserver to return. This could be similar to issue 824919, where we were hit by the newly kicked-off process being blocked.
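For what it's worth, here is a minimal sketch of that failure mode, assuming get_least_loaded_devserver hands the health checks to a multiprocessing child and then joins it (the code below is illustrative, not the autotest implementation): if the child never returns, the parent's join() blocks indefinitely and nothing further is logged until the suite deadline delivers SIGTERM, which matches the one-hour silence and the forking.py/process.py frames in the traceback above.

# Illustrative only -- not the autotest code. Shows how an unbounded
# Process.join() hangs the caller if the child never finishes.
import multiprocessing
import time

def slow_health_check(result_queue):
    # Stand-in for a health-check RPC that never completes.
    time.sleep(10 ** 6)
    result_queue.put('healthy')

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=slow_health_check, args=(queue,))
    child.start()
    child.join()             # Blocks forever; nothing after this line runs,
    print('never reached')   # which matches the silent hour in the log.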
Apr 26 2018
Re #7, thanks for the pointer. Assigning back to the deputy to decide whether to restart this shard, add more logging, or add timeouts.
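If the 'add timeouts' option is taken, one possible shape for it (a sketch only; join_with_timeout is a hypothetical helper, not an actual patch to dev_server.py) is to bound the join and reap a stuck child instead of waiting until the suite deadline:

# Hypothetical guard, not an actual patch. Bounds the wait on a
# health-check child process and cleans it up on timeout.
import multiprocessing

def join_with_timeout(process, timeout_seconds):
    """Wait up to timeout_seconds for the child, then terminate it."""
    process.join(timeout_seconds)
    if process.is_alive():
        process.terminate()   # Give up on the stuck health check...
        process.join()        # ...and reap it so it does not linger.
        raise RuntimeError('devserver health check timed out')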
Apr 26 2018
Rebooting the shard chromeos-server125. Will file another bug to track the deadlock issue.
Apr 27 2018
Shard restarted. Marking this ticket as fixed. Filed crbug.com/837448 to track the timeout issue.