Issue 913695

Starred by 1 user

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Failing AU tests a lot: Due to failing to find devserver (in a restricted subnet)

Project Member Reported by dhaddock@google.com, Dec 10

Issue description

The AU tests for coral/astronaut on the dev channel look bad:

https://stainless.corp.google.com/search?view=matrix&row=board_model&col=build&first_date=2018-12-06&last_date=2018-12-10&suite=%5Epaygen_au_dev&board=coral&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=false&exclude_acts=true&exclude_retried=true&exclude_non_production=false

They are trying to find a devserver to use and failing:

12/09 17:20:46.132 INFO |        dev_server:0809| Getting devservers for host: chromeos6-row6-rack3-host14
12/09 17:20:46.193 DEBUG|        dev_server:0830| The host chromeos6-row6-rack3-host14 (100.115.136.190) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
12/09 17:20:46.194 DEBUG|        dev_server:0781| Pick one healthy devserver from ['http://100.115.245.193:8082', 'http://100.115.245.194:8082', 'http://100.115.245.200:8082', 'http://100.115.245.201:8082', 'http://100.115.219.129:8082', 'http://100.115.219.139:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.135:8082', 'http://100.115.219.136:8082', 'http://100.115.219.137:8082', 'http://100.115.185.226:8082', 'http://100.115.185.227:8082']
12/09 17:20:46.194 DEBUG|        dev_server:0785| Check health for http://100.115.245.193:8082
12/09 17:20:46.195 DEBUG|             utils:0219| Running 'ssh 100.115.245.193 'curl "http://100.115.245.193:8082/check_health?"''
12/09 17:20:47.493 DEBUG|        dev_server:0790| Pick http://100.115.245.193:8082
12/09 17:20:47.590 DEBUG|             utils:0219| Running 'ssh 100.115.245.194 'curl "http://100.115.245.194:8082/check_health?"''
12/09 17:20:47.624 DEBUG|             utils:0219| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
12/09 17:20:47.628 DEBUG|             utils:0219| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
12/09 17:20:47.633 DEBUG|             utils:0219| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
12/09 17:20:47.640 DEBUG|             utils:0219| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
12/09 17:20:47.647 DEBUG|             utils:0219| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
12/09 17:20:47.655 DEBUG|             utils:0219| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
12/09 17:20:47.664 DEBUG|             utils:0219| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
12/09 17:20:47.669 DEBUG|             utils:0219| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
12/09 17:20:47.682 DEBUG|             utils:0219| Running 'ssh 100.115.185.227 'curl "http://100.115.185.227:8082/check_health?"''
12/09 17:20:47.691 DEBUG|             utils:0219| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
12/09 17:20:47.706 DEBUG|             utils:0219| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
12/09 18:29:04.053 ERROR|   logging_manager:0626| Current thread 0x00007ffac360d740:

This looks similar to bugs we had previously: issue 824919 and issue 837448.

 
This is happening intermittently 
Summary: coral (astronaut/robo) failing AU tests a lot: Due to failing to find devserver (in a restricted subnet) (was: Astronaut failing AU tests a lot: Due to failing to find devserver (in a restricted subnet) )
coral/robo is also failing. Updating title.
This line in the logs means it found a devserver that could work:
Pick http://100.115.245.193:8082

But then it queries all the other devservers in the subnet to find the least loaded one, and gets stuck somewhere.

This check on the devserver times out for me too.
curl "http://100.115.219.136:8082/check_health?"

So I guess there could be a problem when several devservers are in this state.
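
For context, a minimal sketch of the pattern these logs suggest (not the actual dev_server.py code; every name below is illustrative): one child process per devserver runs the check_health RPC, and the parent then join()s every child with no timeout, so a single child whose HTTP call never returns hangs the whole test.

import multiprocessing
import urllib.request


def check_health(url, results):
    # The RPC itself is given a 2-second timeout, but a misbehaving socket
    # read can still outlive it (see the commit referenced later in this bug).
    with urllib.request.urlopen(url + '/check_health?', timeout=2) as resp:
        results[url] = resp.read()


def get_least_loaded(devservers):
    manager = multiprocessing.Manager()
    results = manager.dict()
    procs = [multiprocessing.Process(target=check_health, args=(url, results))
             for url in devservers]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # no timeout: one stuck child blocks the parent indefinitely
    # (picking the least-loaded entry out of `results` is omitted here)
    return dict(results)


if __name__ == '__main__':
    print(get_least_loaded(['http://100.115.245.193:8082']))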
Cc: akes...@chromium.org
Owner: xixuan@chromium.org
+new infra deputy 

Another example from today:
https://stainless.corp.google.com/browse/chromeos-autotest-results/268364816-chromeos-test/
Following Issue 824919, I will reboot chromeos-skunk-5.mtv.corp.google.com first.
Bad, it doesn't return from reboot...
b/121222129, tracking the machine reboot.
Only coral belongs to chromeos-skunk-5. You can use lab-tools to check it:

xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep coral
212  chromeos-skunk-5.mtv.corp.google.com         board:coral

Seems the other 2 are not coral. Maybe I need to restart more shards.
xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep caroline
227  cros-full-0005.mtv.corp.google.com           board:caroline, board:veyron_speedy, board:stumpy

xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep nami
213  chromeos-skunk-1.mtv.corp.google.com         board:nami, board:atlas
I see this failure also on octopus and grunt

https://stainless.corp.google.com/browse/chromeos-autotest-results/268315865-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/268691576-chromeos-test/


dhaddock@dhaddock:~/cros/src$ atest shard list | grep grunt
248  cros-full-0039.mtv.corp.google.com           board:samus, board:grunt
dhaddock@dhaddock:~/cros/src$ atest shard list | grep octopus
240  cros-full-0018.mtv.corp.google.com           board:gale, board:terra, board:octopus
That's... I will restart all of them this morning.
skunk-1.mtv is down after reboot, tracking bug: https://buganizer.corp.google.com/issues/121325013
Soraka:
https://stainless.corp.google.com/browse/chromeos-autotest-results/269074715-chromeos-test/

Sand:
https://stainless.corp.google.com/browse/chromeos-autotest-results/269064578-chromeos-test/

Both share the same shard 

dhaddock@dhaddock:~/cros/src/third_party/autotest/files$ atest shard list | grep soraka
241  cros-full-0019.mtv.corp.google.com           board:soraka, board:panther, board:sand, board:wizpig


Cc: xixuan@chromium.org
Owner: akes...@chromium.org
Another soraka failure:
https://stainless.corp.google.com/browse/chromeos-autotest-results/270533139-chromeos-test/

And one on snappy
https://stainless.corp.google.com/browse/chromeos-autotest-results/270226202-chromeos-test/

dhaddock@dhaddock:~/cros/src$ atest shard list | grep snappy
229  cros-full-0007.mtv.corp.google.com           board:kevin, board:snappy, board:daisy_skate
Owner: dhadd...@chromium.org
Sorry, I was OOO. Is this still an issue?
We managed to get the first stable pushed but it required regenerating payloads, repeated testing, etc.  I wouldn't call our current system robust yet.  I have a list of related bugs for the postmortem...
Owner: akes...@chromium.org
Restarting shards is only kicking this problem down the road. It keeps reappearing. 

Can anybody spend some time figuring it out?
Summary: Failing AU tests a lot: Due to failing to find devserver (in a restricted subnet) (was: coral (astronaut/robo) failing AU tests a lot: Due to failing to find devserver (in a restricted subnet) )
Another one from atlas:
https://stainless.corp.google.com/browse/chromeos-autotest-results/274678294-chromeos-test/

Wizpig:
https://stainless.corp.google.com/browse/chromeos-autotest-results/274864652-chromeos-test/
Owner: ayatane@chromium.org
Bumping to new deputy 
Looking at the atlas one in #22

01/07 14:57:46.232 DEBUG|          ssh_host:0310| Running (ssh) '/tmp/sysinfo/autoserv-hQa6qu/bin/autotestd_monitor /tmp/autoserv-JILux2 0 0' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
01/07 14:57:46.779 DEBUG|          autotest:1283| Entered autotestd_monitor.
01/07 14:57:46.779 INFO |          autotest:1342| Entered autotestd_monitor.
01/07 14:57:46.779 DEBUG|          autotest:1283| Finished launching tail subprocesses.
01/07 14:57:46.780 INFO |          autotest:1342| Finished launching tail subprocesses.
01/07 14:57:46.780 DEBUG|          autotest:1283| Finished waiting on autotestd to start.
01/07 14:57:46.780 INFO |          autotest:1342| Finished waiting on autotestd to start.
01/07 15:15:42.137 DEBUG|          autotest:0958| Result exit status is 255.

14:57:46 DEBUG| Running 'mkdir -p /tmp/sysinfo/autoserv-hQa6qu/results/default/sysinfo/mnt/stateful_partition/unencrypted/preserve'
14:57:46 DEBUG| Running 'rsync --no-perms --chmod=ugo+r -a --safe-links --exclude=/log/**autoserv* /mnt/stateful_partition/unencrypted/preserve/log /tmp/sysinfo/autoserv-hQa6qu/results/default/sysinfo/mnt/stateful_partition/unencrypted/preserve'
14:59:41 ERROR| [stderr] rsync: read errors mapping "/mnt/stateful_partition/unencrypted/preserve/log/secure": Input/output error (5)

The test is hanging
The wizpig in #22 is the devserver issue.

Ah, I hate timeouts in Python.

01/08 05:21:25.296 DEBUG|        dev_server:2820| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.ImageServer'>
01/08 05:21:25.296 INFO |        dev_server:0809| Getting devservers for host: chromeos6-row2-rack20-host18
01/08 05:21:25.357 DEBUG|        dev_server:0830| The host chromeos6-row2-rack20-host18 (100.115.131.48) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
01/08 05:21:25.360 DEBUG|        dev_server:0781| Pick one healthy devserver from ['http://100.115.245.193:8082', 'http://100.115.245.194:8082', 'http://100.115.245.200:8082', 'http://100.115.245.201:8082', 'http://100.115.219.129:8082', 'http://100.115.219.139:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.135:8082', 'http://100.115.219.136:8082', 'http://100.115.219.137:8082', 'http://100.115.185.226:8082', 'http://100.115.185.228:8082']
01/08 05:21:25.360 DEBUG|        dev_server:0785| Check health for http://100.115.245.193:8082
01/08 05:21:25.361 DEBUG|             utils:0219| Running 'ssh 100.115.245.193 'curl "http://100.115.245.193:8082/check_health?"''
01/08 05:21:26.696 DEBUG|        dev_server:0790| Pick http://100.115.245.193:8082
01/08 05:21:26.705 DEBUG|             utils:0219| Running 'ssh 100.115.245.194 'curl "http://100.115.245.194:8082/check_health?"''
01/08 05:21:26.706 DEBUG|             utils:0219| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
01/08 05:21:26.708 DEBUG|             utils:0219| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
01/08 05:21:26.711 DEBUG|             utils:0219| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
01/08 05:21:26.714 DEBUG|             utils:0219| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
01/08 05:21:26.714 DEBUG|             utils:0219| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
01/08 05:21:26.715 DEBUG|             utils:0219| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
01/08 05:21:26.718 DEBUG|             utils:0219| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
01/08 05:21:26.718 DEBUG|             utils:0219| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
01/08 05:21:26.722 DEBUG|             utils:0219| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
01/08 05:21:26.725 DEBUG|             utils:0219| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
01/08 05:21:26.745 DEBUG|             utils:0219| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
01/08 05:21:28.711 DEBUG|        dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.133.
01/08 05:21:28.712 ERROR|        dev_server:0531| Devserver call failed: "http://100.115.219.133:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
01/08 05:21:28.721 DEBUG|        dev_server:0386| RPC call get_load has timed out on devserver 100.115.185.226.
01/08 05:21:28.722 ERROR|        dev_server:0531| Devserver call failed: "http://100.115.185.226:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
01/08 06:28:55.085 ERROR|   logging_manager:0626| Current thread 0x00007f2dee9f3740:
01/08 06:28:55.086 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/forking.py", line 135 in poll
01/08 06:28:55.087 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/forking.py", line 154 in wait
01/08 06:28:55.087 ERROR|   logging_manager:0626|   File "/usr/lib/python2.7/multiprocessing/process.py", line 145 in join
01/08 06:28:55.087 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2844 in get_least_loaded_devserver
01/08 06:28:55.087 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 117 in _get_least_loaded_devserver
01/08 06:28:55.088 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 368 in run_once
01/08 06:28:55.088 DEBUG|          autoserv:0376| Received SIGTERM
01/08 06:28:55.088 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/test.py", line 371 in _call_run_once
01/08 06:28:55.088 DEBUG|          autoserv:0379| Finished writing to pid_file. Killing process.
01/08 06:28:55.088 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/test.py", line 464 in execute
01/08 06:28:55.089 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/test.py", line 800 in _call_test_function
01/08 06:28:55.089 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/test.py", line 600 in _exec
01/08 06:28:55.089 ERROR|   logging_manager:0626|   File "/usr/local/autotest/client/common_lib/test.py", line 888 in runtest
01/08 06:28:55.089 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/test.py", line 309 in runtest
01/08 06:28:55.090 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 886 in group_func
01/08 06:28:55.090 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 910 in _run_group
01/08 06:28:55.090 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 898 in run_test
01/08 06:28:55.091 ERROR|   logging_manager:0626|   File "/usr/local/autotest/results/274864652-chromeos-test/chromeos6-row2-rack20-host18/control.srv", line 194 in run_test
01/08 06:28:55.091 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/subcommand.py", line 98 in parallel_simple
01/08 06:28:55.091 ERROR|   logging_manager:0626|   File "/usr/local/autotest/results/274864652-chromeos-test/chromeos6-row2-rack20-host18/control.srv", line 205 in <module>
01/08 06:28:55.091 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 1340 in _execute_code
01/08 06:28:55.092 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 817 in run
01/08 06:28:55.092 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 592 in run_autoserv
01/08 06:28:55.092 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 806 in main
01/08 06:28:55.092 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 823 in <module>
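
For what it's worth, the timeout passed to urlopen() ends up in socket.settimeout(), which bounds each individual socket operation (connect, recv), not the whole request. That is one plausible way a child's check_health call can outlive its nominal 2-second timeout and leave the parent stuck in join(), as in the traceback above. A hypothetical, self-contained illustration (server, port, and payload are all made up):

import socket
import threading
import time
import urllib.request


def slow_server(port):
    # Accept one connection, send valid HTTP headers, then drip the body
    # one byte per second. No single recv() on the client side waits
    # longer than about a second, so its 2-second timeout never fires.
    srv = socket.socket()
    srv.bind(('127.0.0.1', port))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.recv(1024)  # read (and ignore) the HTTP request
    conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 60\r\n\r\n')
    for _ in range(60):
        conn.sendall(b'x')
        time.sleep(1)
    conn.close()


if __name__ == '__main__':
    threading.Thread(target=slow_server, args=(18082,), daemon=True).start()
    time.sleep(0.5)  # give the server a moment to start listening
    start = time.time()
    body = urllib.request.urlopen('http://127.0.0.1:18082/', timeout=2).read()
    # Prints roughly 60 seconds despite timeout=2.
    print(len(body), 'bytes in', round(time.time() - start), 'seconds')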
Labels: Hotlist-Deputy
Status: Started (was: Assigned)
Labels: BlockingAuSignoff
Thanks for looking into this 

Comment 32 by teravest@chromium.org, Jan 17 (6 days ago)

Cc: teravest@chromium.org
Can we get some more context on this bug? It impacts some potential FSI release candidates on a new platform, and we're trying to understand whether this is scoped as a real issue with auto-update functionality, or whether it's constrained to autotest configuration.

Unsurprisingly, there are probably a lot of teams interested in a bug that would block signing off an image for autoupdate issues.

Thanks!
Project Member

Comment 33 by bugdroid1@chromium.org, Jan 18 (5 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0bb60d52acc1b9640ce6d30280fa12e4e0036f4c

commit 0bb60d52acc1b9640ce6d30280fa12e4e0036f4c
Author: Allen Li <ayatane@chromium.org>
Date: Fri Jan 18 04:20:50 2019

autotest: Don't block on child processes in get_least_loaded_devserver

This spawns subprocesses that ultimately pass a timeout to
socket.settimeout() and calls urlopen.

This appears to not be reliable:

https://stackoverflow.com/questions/27327787/python-urllib2-does-not-respect-timeout
(See also bug)

To avoid blocking a test for an hour, we add a timeout to the join.

BUG=chromium:913695
TEST=None

Change-Id: I8a5e429a905614447116ce595d69e4c6d6e5b0a3
Reviewed-on: https://chromium-review.googlesource.com/1407807
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0bb60d52acc1b9640ce6d30280fa12e4e0036f4c/client/common_lib/cros/dev_server.py
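
For reference, a minimal sketch of the approach the commit describes, i.e. bounding the join so one unresponsive child can no longer block the test for an hour. The timeout value and the terminate() fallback here are illustrative, not taken from the actual CL:

import multiprocessing

JOIN_TIMEOUT_SECS = 30  # hypothetical bound, not the value used in the CL


def join_with_timeout(procs, timeout=JOIN_TIMEOUT_SECS):
    # Wait a bounded amount of time for each health-check child, then give
    # up on (and clean up) any child that is still stuck in a socket read.
    for p in procs:
        p.join(timeout)
        if p.is_alive():
            p.terminate()
            p.join()

Note that joining each child with its own timeout still allows a worst case of len(procs) * timeout; tracking a single overall deadline would tighten that, but either way the hour-long block goes away.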
