
Issue 837336

Starred by 1 user

Issue metadata

Status: Duplicate
Owner: gu...@chromium.org (OOO)
Closed: Apr 2018
Cc: dgarr...@chromium.org, xixuan@chromium.org
Components: Infra>Client>ChromeOS
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




stout is repeatedly failing the AU tests.

Project Member Reported by abod...@chromium.org, Apr 26 2018

Issue description


stout is repeatedly failing the AU tests:

https://stainless.corp.google.com/search?exclude_retried=true&first_date=2018-04-13&master_builder_name=&builder_name_number=&shard=&exclude_acts=true&builder_name=&master_builder_name_number=&owner=&retry=&exclude_cts=true&exclude_non_production=true&hostname=&board=%5Estout%24&test=&exclude_not_run=false&build=%5ER66%5C-10452%5C.74%5C.0%24&status=FAIL&status=ERROR&status=ABORT&reason=&waterfall=&suite=%5Epaygen_au_stable%24&last_date=2018-04-26&exclude_non_release=true&exclude_au=false&model=&view=list


GE: https://cros-goldeneye.corp.google.com/chromeos/console/qaRelease?releaseName=M66-STABLE-CHROMEOS-1


Log info
------------------------------------------------------------------------


4/25 06:32:22.351 DEBUG|autoupdate_EndToEn:0366| The test configuration supplied: {'source_payload_uri': 'gs://chromeos-releases/stable-channel/stout/8530.96.0/payloads/chromeos_8530.96.0_stout_stable-channel_full_test.bin-1cdca09989bb3e3866dbdef992e8c194', 'name': 'paygen_au_stable', 'target_archive_uri': None, 'target_release': '10452.74.0', 'source_release': '8530.96.0', 'update_type': 'delta', 'source_archive_uri': 'gs://chromeos-releases/stable-channel/stout/8530.96.0', 'target_payload_uri': 'gs://chromeos-releases/stable-channel/stout/10452.74.0/payloads/chromeos_8530.96.0-10452.74.0_stout_stable-channel_delta_test.bin-7e8ce17288f12defd0b840c7cb4c885d'}
04/25 06:32:22.351 DEBUG|        dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.ImageServer'>
04/25 06:32:22.352 INFO |        dev_server:0798| Getting devservers for host: chromeos6-row2-rack4-host20
04/25 06:32:22.357 DEBUG|        dev_server:0819| The host chromeos6-row2-rack4-host20 (100.115.128.218) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
04/25 06:32:22.358 DEBUG|        dev_server:0770| Pick one healthy devserver from ['http://100.115.219.129:8082', 'http://100.115.219.130:8082', 'http://100.115.245.197:8082', 'http://100.115.219.132:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.136:8082', 'http://100.115.185.228:8082', 'http://100.115.185.227:8082', 'http://100.115.185.226:8082', 'http://100.115.219.135:8082', 'http://100.115.245.201:8082', 'http://100.115.245.200:8082', 'http://100.115.219.139:8082', 'http://100.115.219.137:8082']
04/25 06:32:22.358 DEBUG|        dev_server:0774| Check health for http://100.115.219.129:8082
04/25 06:32:22.358 DEBUG|             utils:0214| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
04/25 06:32:24.240 DEBUG|        dev_server:0779| Pick http://100.115.219.129:8082
04/25 06:32:24.253 DEBUG|             utils:0214| Running 'ssh 100.115.219.130 'curl "http://100.115.219.130:8082/check_health?"''
04/25 06:32:24.254 DEBUG|             utils:0214| Running 'ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"''
04/25 06:32:24.264 DEBUG|             utils:0214| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/check_health?"''
04/25 06:32:24.266 DEBUG|             utils:0214| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
04/25 06:32:24.269 DEBUG|             utils:0214| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
04/25 06:32:24.270 DEBUG|             utils:0214| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
04/25 06:32:24.284 DEBUG|             utils:0214| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
04/25 06:32:24.292 DEBUG|             utils:0214| Running 'ssh 100.115.185.227 'curl "http://100.115.185.227:8082/check_health?"''
04/25 06:32:24.299 DEBUG|             utils:0214| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
04/25 06:32:24.301 DEBUG|             utils:0214| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
04/25 06:32:24.303 DEBUG|             utils:0214| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
04/25 06:32:24.303 DEBUG|             utils:0214| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
04/25 06:32:24.307 DEBUG|             utils:0214| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
04/25 06:32:24.319 DEBUG|             utils:0214| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
04/25 06:32:26.249 DEBUG|        dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.130.
04/25 06:32:26.251 ERROR|        dev_server:0520| Devserver call failed: "http://100.115.219.130:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
04/25 06:32:26.264 DEBUG|        dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.132.
04/25 06:32:26.265 ERROR|        dev_server:0520| Devserver call failed: "http://100.115.219.132:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
 
Labels: OS-Chrome
Cc: dgarr...@chromium.org
Components: Infra>Client>ChromeOS
Owner: gu...@chromium.org
Status: Assigned (was: Untriaged)
All failures are on chromeos6* DUTs.

When I try to run this command from my desk, it also fails:
curl http://100.115.245.132:8082/check_health?

Adding infra deputies. 

Comment 3 by gu...@chromium.org, Apr 26 2018

Status: Started (was: Assigned)

Comment 4 by gu...@chromium.org, Apr 26 2018

This shouldn't be caused by the RPC call timeout.

04/25 05:52:50.923 ERROR|        dev_server:0520| Devserver call failed: "http://100.115.219.139:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
04/25 06:55:58.603 ERROR|   logging_manager:0626| Current thread 0x00007f15bfac2740:
04/25 06:55:58.604 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/forking.py", line 135 in poll
04/25 06:55:58.604 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/forking.py", line 154 in wait
04/25 06:55:58.605 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/process.py", line 145 in join
04/25 06:55:58.605 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2834 in get_least_loaded_devserver
04/25 06:55:58.605 DEBUG|          autoserv:0402| Received SIGTERM

Something went wrong when joining a process.
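
To make the failure mode concrete, here is a minimal Python sketch of the pattern that stack trace suggests: a parent starts one health-check worker per devserver and then does an unconditional join(). This is purely illustrative and not the real dev_server.py code; check_health, the fake "load" values, and the hard-coded address check are placeholders.

import multiprocessing
import time


def check_health(url, results):
    # Hypothetical stand-in for the devserver check_health RPC. Pretend one
    # devserver is wedged and the call never returns.
    if url.endswith(".130:8082"):
        time.sleep(3600)            # simulate a hung RPC
    results[url] = 0.5              # pretend "load" value


def get_least_loaded_devserver(urls):
    # Sketch only: start one child per devserver, then join them all.
    manager = multiprocessing.Manager()
    results = manager.dict()
    procs = [multiprocessing.Process(target=check_health, args=(u, results))
             for u in urls]
    for p in procs:
        p.start()
    for p in procs:
        p.join()                    # blocks forever on the wedged child,
                                    # matching process.py line 145 above
    loads = dict(results)           # copy out of the manager proxy
    return min(loads, key=loads.get) if loads else None

With fifteen devservers checked in parallel as in the log, fourteen children finish quickly, the fifteenth never does, and the parent sits in join() producing no further logging until the suite deadline delivers SIGTERM, which matches the one-hour gap discussed below.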

Comment 5 by gu...@chromium.org, Apr 26 2018

Cc: xixuan@chromium.org

Comment 6 by xixuan@chromium.org, Apr 26 2018

Owner: dhadd...@chromium.org
All the tests are aborted after hitting the suite deadline, and something is blocking the test run. I don't think it's simply caused by some devserver RPC failures, since eventually a healthy devserver will be found to serve this restricted network; otherwise it wouldn't only hurt this test.

Assigning to @dhaddock. Do you have any idea how to find the logs for the one-hour gap between the failed dev_server RPC and the SIGTERM abort, or why there's no logging during that gap?


04/25 06:32:26.264 DEBUG|        dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.132.
04/25 06:32:26.265 ERROR|        dev_server:0520| Devserver call failed: "http://100.115.219.132:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
----- One hour gap, then get aborted -----
04/25 07:36:21.211 DEBUG|          autoserv:0402| Received SIGTERM
04/25 07:36:21.212 DEBUG|          autoserv:0405| Finished writing to pid_file. Killing process.
Comment 7 by dhadd...@chromium.org, Apr 26 2018

I don't think there are any _test_ logs for that hour. I think the test was blocked waiting for get_least_loaded_devserver to return.

This could be similar to issue 824919, where we were hit by the newly kicked-off process being blocked.
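
One possible mitigation for that diagnosis, sketched here as an assumption rather than the actual autotest fix, is to bound the join with an overall deadline and terminate stragglers, so one blocked child cannot hold the caller until the suite timeout. join_with_deadline and the deadline value are hypothetical names for illustration.

import time


def join_with_deadline(procs, deadline_secs):
    # Join each child, but never wait longer than deadline_secs overall; any
    # child still alive when the budget runs out is terminated so the caller
    # can fall back to whichever devservers did answer.
    end = time.time() + deadline_secs
    for p in procs:
        remaining = max(0.0, end - time.time())
        p.join(remaining)           # join(timeout) returns even if p is stuck
        if p.is_alive():
            p.terminate()           # give up on the straggler
            p.join()

In the earlier sketch, the unconditional join loop would be replaced by a single join_with_deadline(procs, 60) call (the 60-second budget is an arbitrary illustrative value).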

Comment 8 by xixuan@chromium.org, Apr 26 2018

Owner: gu...@chromium.org
Re #7, thanks for the pointer. Assigning back to the deputy to decide whether to restart this shard, add more logging, or add timeouts.

Comment 9 by gu...@chromium.org, Apr 26 2018

Rebooting the shard chromeos-server125.
Will file another bug to track the deadlock issue.

Comment 10 by gu...@chromium.org, Apr 27 2018

Status: Fixed (was: Started)
Shard restarted. Marking this ticket as fixed.
Filed crbug.com/837448 to track the timeout issue.

Comment 11 by gu...@chromium.org, Apr 27 2018

Mergedinto: 837448
Status: Duplicate (was: Fixed)
