AU tests failing frequently: unable to find a devserver (in a restricted subnet)
Issue description

The AU tests for coral/astronaut on the dev channel look bad: https://stainless.corp.google.com/search?view=matrix&row=board_model&col=build&first_date=2018-12-06&last_date=2018-12-10&suite=%5Epaygen_au_dev&board=coral&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=false&exclude_acts=true&exclude_retried=true&exclude_non_production=false

They are trying to find a devserver to use and failing:

12/09 17:20:46.132 INFO | dev_server:0809| Getting devservers for host: chromeos6-row6-rack3-host14
12/09 17:20:46.193 DEBUG| dev_server:0830| The host chromeos6-row6-rack3-host14 (100.115.136.190) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
12/09 17:20:46.194 DEBUG| dev_server:0781| Pick one healthy devserver from ['http://100.115.245.193:8082', 'http://100.115.245.194:8082', 'http://100.115.245.200:8082', 'http://100.115.245.201:8082', 'http://100.115.219.129:8082', 'http://100.115.219.139:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.135:8082', 'http://100.115.219.136:8082', 'http://100.115.219.137:8082', 'http://100.115.185.226:8082', 'http://100.115.185.227:8082']
12/09 17:20:46.194 DEBUG| dev_server:0785| Check health for http://100.115.245.193:8082
12/09 17:20:46.195 DEBUG| utils:0219| Running 'ssh 100.115.245.193 'curl "http://100.115.245.193:8082/check_health?"''
12/09 17:20:47.493 DEBUG| dev_server:0790| Pick http://100.115.245.193:8082
12/09 17:20:47.590 DEBUG| utils:0219| Running 'ssh 100.115.245.194 'curl "http://100.115.245.194:8082/check_health?"''
12/09 17:20:47.624 DEBUG| utils:0219| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
12/09 17:20:47.628 DEBUG| utils:0219| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
12/09 17:20:47.633 DEBUG| utils:0219| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
12/09 17:20:47.640 DEBUG| utils:0219| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
12/09 17:20:47.647 DEBUG| utils:0219| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
12/09 17:20:47.655 DEBUG| utils:0219| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
12/09 17:20:47.664 DEBUG| utils:0219| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
12/09 17:20:47.669 DEBUG| utils:0219| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
12/09 17:20:47.682 DEBUG| utils:0219| Running 'ssh 100.115.185.227 'curl "http://100.115.185.227:8082/check_health?"''
12/09 17:20:47.691 DEBUG| utils:0219| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
12/09 17:20:47.706 DEBUG| utils:0219| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
12/09 18:29:04.053 ERROR| logging_manager:0626| Current thread 0x00007ffac360d740:

This looks similar to bugs we had previously: issue 824919 and issue 837448.
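For context, the "restricted subnet" message above just means the DUT's address falls inside 100.115.128.0/17, so only devservers in that range are candidates. A minimal sketch of that containment check (illustrative only; this is not the actual autotest dev_server.py code, and the function name here is made up):

# Illustrative sketch, not the real dev_server.py logic: check whether a host
# and its candidate devservers fall inside the restricted subnet from the log.
import ipaddress

RESTRICTED_SUBNET = ipaddress.ip_network("100.115.128.0/17")

def filter_devservers(dut_ip, devserver_ips):
    """Keep only devservers that share the DUT's restricted subnet."""
    if ipaddress.ip_address(dut_ip) not in RESTRICTED_SUBNET:
        return list(devserver_ips)  # unrestricted host: any devserver will do
    return [ip for ip in devserver_ips
            if ipaddress.ip_address(ip) in RESTRICTED_SUBNET]

# The DUT (100.115.136.190) and all 13 devservers in the log are in-range:
print(ipaddress.ip_address("100.115.136.190") in RESTRICTED_SUBNET)  # True
print(filter_devservers("100.115.136.190",
                        ["100.115.245.193", "100.115.219.136"]))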
Dec 14
coral/robo is also failing. Updating the title.
Dec 15
This line in the logs means it found a devserver that could work:

Pick http://100.115.245.193:8082

But then it queries all the other devservers in the subnet to find the least loaded one, and it gets stuck somewhere. This check on the devserver times out for me too:

curl "http://100.115.219.136:8082/check_health?"

So I guess there could be a problem when several devservers are in this state.
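The later stack trace in this bug shows the "least loaded" query fanning the check_health calls out to child processes and then waiting for all of them, so a single ssh/curl probe that never returns stalls everything. A rough sketch of that failure mode (illustrative only, not the actual autotest dev_server.py code):

# Rough sketch of the suspected failure mode (illustrative, not the real
# autotest code): health checks run in child processes and are joined with
# no timeout, so one unresponsive devserver blocks the whole selection.
import multiprocessing
import subprocess

def check_health(url, results):
    # Stands in for the "ssh <ip> 'curl .../check_health?'" seen in the logs.
    try:
        results[url] = subprocess.check_output(
            ["curl", "-s", url + "/check_health?"])
    except subprocess.CalledProcessError:
        pass  # treat a failed probe as "no load data"

def query_all_devservers(devserver_urls):
    manager = multiprocessing.Manager()
    results = manager.dict()
    procs = [multiprocessing.Process(target=check_health, args=(url, results))
             for url in devserver_urls]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # no timeout: a hung curl here hangs the whole selection
    return dict(results)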
Dec 18
+new infra deputy

Another example from today: https://stainless.corp.google.com/browse/chromeos-autotest-results/268364816-chromeos-test/
Dec 18
All of the coral runs on today's stable candidate are failing with this error: https://stainless.corp.google.com/search?view=matrix&first_date=2018-12-05&last_date=2018-12-19&suite=%5CQpaygen_au_stable%5CE&exclude_au=false&exclude_retried=false&owner=chromeos-test&row=test&col=build&board=coral&build=R71-11151.59.2
Dec 18
Following Issue 824919, I will reboot chromeos-skunk-5.mtv.corp.google.com first.
Dec 19
Bad: it doesn't come back from the reboot...
Dec 19
b/121222129 is tracking the machine reboot.
Dec 19
Three more runs have hit this recently:
https://stainless.corp.google.com/browse/chromeos-autotest-results/268691978-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/268364816-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/268381911-chromeos-test/

Are these all due to chromeos-skunk-5? Where can I check that?
Dec 19
Only coral belongs to chromeos-skunk-5. You can use lab-tools to check it:

xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep coral
212 chromeos-skunk-5.mtv.corp.google.com board:coral

It seems the other two runs are not coral. Maybe I need to restart more shards.
Dec 19
xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep caroline
227 cros-full-0005.mtv.corp.google.com board:caroline, board:veyron_speedy, board:stumpy

xixuan@xixuan0:~/chromiumos_labtools/src/third_party/autotest/files$ atest shard list | grep nami
213 chromeos-skunk-1.mtv.corp.google.com board:nami, board:atlas
Dec 20
I see this failure also on octopus and grunt:
https://stainless.corp.google.com/browse/chromeos-autotest-results/268315865-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/268691576-chromeos-test/

dhaddock@dhaddock:~/cros/src$ atest shard list | grep grunt
248 cros-full-0039.mtv.corp.google.com board:samus, board:grunt
dhaddock@dhaddock:~/cros/src$ atest shard list | grep octopus
240 cros-full-0018.mtv.corp.google.com board:gale, board:terra, board:octopus
Dec 20
That's... I will restart all of them this morning.
Dec 20
skunk-1.mtv is down after the reboot; tracking bug: https://buganizer.corp.google.com/issues/121325013
Dec 21
Soraka: https://stainless.corp.google.com/browse/chromeos-autotest-results/269074715-chromeos-test/
Sand: https://stainless.corp.google.com/browse/chromeos-autotest-results/269064578-chromeos-test/

Both share the same shard:

dhaddock@dhaddock:~/cros/src/third_party/autotest/files$ atest shard list | grep soraka
241 cros-full-0019.mtv.corp.google.com board:soraka, board:panther, board:sand, board:wizpig
Dec 26
Another soraka failure: https://stainless.corp.google.com/browse/chromeos-autotest-results/270533139-chromeos-test/
And one on snappy: https://stainless.corp.google.com/browse/chromeos-autotest-results/270226202-chromeos-test/

dhaddock@dhaddock:~/cros/src$ atest shard list | grep snappy
229 cros-full-0007.mtv.corp.google.com board:kevin, board:snappy, board:daisy_skate
Jan 2
Sorry, I was OOO. Is this still an issue?
Jan 3
We managed to get the first stable pushed, but it required regenerating payloads, repeated testing, etc. I wouldn't call our current system robust yet. I have a list of related bugs for the postmortem...
Jan 4
Yes, it is still happening. I've checked the paygen_au* failures since Jan 1 and found these:

Octopus: https://stainless.corp.google.com/browse/chromeos-autotest-results/273128644-chromeos-test/
Grunt: https://stainless.corp.google.com/browse/chromeos-autotest-results/273708880-kbleicher/
Sand: https://stainless.corp.google.com/browse/chromeos-autotest-results/272632039-chromeos-test/
Soraka: https://stainless.corp.google.com/browse/chromeos-autotest-results/272304648-chromeos-test/
Soraka: https://stainless.corp.google.com/browse/chromeos-autotest-results/273695537-chromeos-test/
Samus: https://stainless.corp.google.com/browse/chromeos-autotest-results/273674814-chromeos-test/
Sand: https://stainless.corp.google.com/browse/chromeos-autotest-results/273635013-chromeos-test/
Jan 4
Restarting shards is only kicking this problem down the road. It keeps reappearing. Can anybody spend some time figuring it out?
Jan 9
Another one from atlas: https://stainless.corp.google.com/browse/chromeos-autotest-results/274678294-chromeos-test/
Wizpig: https://stainless.corp.google.com/browse/chromeos-autotest-results/274864652-chromeos-test/
Jan 9
Bumping to new deputy
Jan 12
All of these runs are failing with this error, which means we will have to skip this board in the stable release: https://stainless.corp.google.com/search?view=list&row=test&col=build&first_date=2019-01-03&last_date=2019-01-17&owner=chromeos-test&suite=%5CQpaygen_au_stable%5CE&build=R71-11151.113.0&board=sand&status=FAIL&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=false&exclude_acts=true&exclude_retried=false&exclude_non_production=false
Jan 12
Looking at the atlas one in #22:

01/07 14:57:46.232 DEBUG| ssh_host:0310| Running (ssh) '/tmp/sysinfo/autoserv-hQa6qu/bin/autotestd_monitor /tmp/autoserv-JILux2 0 0' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
01/07 14:57:46.779 DEBUG| autotest:1283| Entered autotestd_monitor.
01/07 14:57:46.779 INFO | autotest:1342| Entered autotestd_monitor.
01/07 14:57:46.779 DEBUG| autotest:1283| Finished launching tail subprocesses.
01/07 14:57:46.780 INFO | autotest:1342| Finished launching tail subprocesses.
01/07 14:57:46.780 DEBUG| autotest:1283| Finished waiting on autotestd to start.
01/07 14:57:46.780 INFO | autotest:1342| Finished waiting on autotestd to start.
01/07 15:15:42.137 DEBUG| autotest:0958| Result exit status is 255.
Jan 12
14:57:46 DEBUG| Running 'mkdir -p /tmp/sysinfo/autoserv-hQa6qu/results/default/sysinfo/mnt/stateful_partition/unencrypted/preserve'
14:57:46 DEBUG| Running 'rsync --no-perms --chmod=ugo+r -a --safe-links --exclude=/log/**autoserv* /mnt/stateful_partition/unencrypted/preserve/log /tmp/sysinfo/autoserv-hQa6qu/results/default/sysinfo/mnt/stateful_partition/unencrypted/preserve'
14:59:41 ERROR| [stderr] rsync: read errors mapping "/mnt/stateful_partition/unencrypted/preserve/log/secure": Input/output error (5)

The test is hanging.
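This looks like a different hang from the devserver one: sysinfo collection is stuck on an unreadable file. A hypothetical mitigation sketch (not the autotest sysinfo code; the helper name and timeout value are made up here) would be to bound the rsync with a timeout so an I/O error can't wedge the whole test:

# Hypothetical mitigation sketch, not the autotest sysinfo implementation:
# run the rsync with a hard timeout so an unreadable file such as
# .../preserve/log/secure fails fast instead of hanging the test.
import subprocess

def rsync_with_timeout(src, dst, timeout_secs=300):
    cmd = ["rsync", "--no-perms", "--chmod=ugo+r", "-a", "--safe-links", src, dst]
    try:
        subprocess.run(cmd, check=True, timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        # Losing one sysinfo file is better than hanging the test.
        print("rsync of %s timed out after %ss; skipping" % (src, timeout_secs))
    except subprocess.CalledProcessError as e:
        print("rsync of %s exited with %d" % (src, e.returncode))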
Jan 12
The wizpig one in #22 is the devserver issue. Ah, I hate timeouts in Python.

01/08 05:21:25.296 DEBUG| dev_server:2820| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.ImageServer'>
01/08 05:21:25.296 INFO | dev_server:0809| Getting devservers for host: chromeos6-row2-rack20-host18
01/08 05:21:25.357 DEBUG| dev_server:0830| The host chromeos6-row2-rack20-host18 (100.115.131.48) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
01/08 05:21:25.360 DEBUG| dev_server:0781| Pick one healthy devserver from ['http://100.115.245.193:8082', 'http://100.115.245.194:8082', 'http://100.115.245.200:8082', 'http://100.115.245.201:8082', 'http://100.115.219.129:8082', 'http://100.115.219.139:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.135:8082', 'http://100.115.219.136:8082', 'http://100.115.219.137:8082', 'http://100.115.185.226:8082', 'http://100.115.185.228:8082']
01/08 05:21:25.360 DEBUG| dev_server:0785| Check health for http://100.115.245.193:8082
01/08 05:21:25.361 DEBUG| utils:0219| Running 'ssh 100.115.245.193 'curl "http://100.115.245.193:8082/check_health?"''
01/08 05:21:26.696 DEBUG| dev_server:0790| Pick http://100.115.245.193:8082
01/08 05:21:26.705 DEBUG| utils:0219| Running 'ssh 100.115.245.194 'curl "http://100.115.245.194:8082/check_health?"''
01/08 05:21:26.706 DEBUG| utils:0219| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/check_health?"''
01/08 05:21:26.708 DEBUG| utils:0219| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
01/08 05:21:26.711 DEBUG| utils:0219| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
01/08 05:21:26.714 DEBUG| utils:0219| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
01/08 05:21:26.714 DEBUG| utils:0219| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
01/08 05:21:26.715 DEBUG| utils:0219| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
01/08 05:21:26.718 DEBUG| utils:0219| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
01/08 05:21:26.718 DEBUG| utils:0219| Running 'ssh 100.115.219.137 'curl "http://100.115.219.137:8082/check_health?"''
01/08 05:21:26.722 DEBUG| utils:0219| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
01/08 05:21:26.725 DEBUG| utils:0219| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
01/08 05:21:26.745 DEBUG| utils:0219| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
01/08 05:21:28.711 DEBUG| dev_server:0386| RPC call get_load has timed out on devserver 100.115.219.133.
01/08 05:21:28.712 ERROR| dev_server:0531| Devserver call failed: "http://100.115.219.133:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
01/08 05:21:28.721 DEBUG| dev_server:0386| RPC call get_load has timed out on devserver 100.115.185.226.
01/08 05:21:28.722 ERROR| dev_server:0531| Devserver call failed: "http://100.115.185.226:8082/check_health?", timeout: 2.0 seconds, Error: retry exception (label="get_load"), timeout = 2s
01/08 06:28:55.085 ERROR| logging_manager:0626| Current thread 0x00007f2dee9f3740:
01/08 06:28:55.086 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/forking.py", line 135 in poll
01/08 06:28:55.087 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/forking.py", line 154 in wait
01/08 06:28:55.087 ERROR| logging_manager:0626| File "/usr/lib/python2.7/multiprocessing/process.py", line 145 in join
01/08 06:28:55.087 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2844 in get_least_loaded_devserver
01/08 06:28:55.087 ERROR| logging_manager:0626| File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 117 in _get_least_loaded_devserver
01/08 06:28:55.088 ERROR| logging_manager:0626| File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 368 in run_once
01/08 06:28:55.088 DEBUG| autoserv:0376| Received SIGTERM
01/08 06:28:55.088 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/test.py", line 371 in _call_run_once
01/08 06:28:55.088 DEBUG| autoserv:0379| Finished writing to pid_file. Killing process.
01/08 06:28:55.088 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/test.py", line 464 in execute
01/08 06:28:55.089 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/test.py", line 800 in _call_test_function
01/08 06:28:55.089 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/test.py", line 600 in _exec
01/08 06:28:55.089 ERROR| logging_manager:0626| File "/usr/local/autotest/client/common_lib/test.py", line 888 in runtest
01/08 06:28:55.089 ERROR| logging_manager:0626| File "/usr/local/autotest/server/test.py", line 309 in runtest
01/08 06:28:55.090 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 886 in group_func
01/08 06:28:55.090 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 910 in _run_group
01/08 06:28:55.090 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 898 in run_test
01/08 06:28:55.091 ERROR| logging_manager:0626| File "/usr/local/autotest/results/274864652-chromeos-test/chromeos6-row2-rack20-host18/control.srv", line 194 in run_test
01/08 06:28:55.091 ERROR| logging_manager:0626| File "/usr/local/autotest/server/subcommand.py", line 98 in parallel_simple
01/08 06:28:55.091 ERROR| logging_manager:0626| File "/usr/local/autotest/results/274864652-chromeos-test/chromeos6-row2-rack20-host18/control.srv", line 205 in <module>
01/08 06:28:55.091 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 1340 in _execute_code
01/08 06:28:55.092 ERROR| logging_manager:0626| File "/usr/local/autotest/server/server_job.py", line 817 in run
01/08 06:28:55.092 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 592 in run_autoserv
01/08 06:28:55.092 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 806 in main
01/08 06:28:55.092 ERROR| logging_manager:0626| File "/usr/local/autotest/server/autoserv", line 823 in <module>
Jan 15
Thanks for looking into this
Jan 17
Can we get some more context on this bug? This impacts some potential FSI release candidates on a new platform, and we're trying to understand whether this is a real issue with auto-update functionality or whether it is confined to autotest configuration. Unsurprisingly, there are probably a lot of teams interested in a bug that would block signing off an image for auto-update issues. Thanks!
Jan 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0bb60d52acc1b9640ce6d30280fa12e4e0036f4c

commit 0bb60d52acc1b9640ce6d30280fa12e4e0036f4c
Author: Allen Li <ayatane@chromium.org>
Date: Fri Jan 18 04:20:50 2019

autotest: Don't block on child processes in get_least_loaded_devserver

This spawns subprocesses that ultimately pass a timeout to socket.settimeout() and calls urlopen. This appears to not be reliable: https://stackoverflow.com/questions/27327787/python-urllib2-does-not-respect-timeout (See also bug)

To avoid blocking a test for an hour, we add a timeout to the join.

BUG=chromium:913695
TEST=None

Change-Id: I8a5e429a905614447116ce595d69e4c6d6e5b0a3
Reviewed-on: https://chromium-review.googlesource.com/1407807
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0bb60d52acc1b9640ce6d30280fa12e4e0036f4c/client/common_lib/cros/dev_server.py
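A minimal sketch of the pattern described in that commit message, i.e. bounding the join so stuck health-check children can no longer block a test for an hour. The timeout constant and the terminate() fallback below are assumptions for illustration, not necessarily what the CL does:

# Sketch of the "add a timeout to the join" pattern (illustrative; the real
# change is in client/common_lib/cros/dev_server.py, and the constant and
# terminate() fallback here are assumptions).
import multiprocessing

JOIN_TIMEOUT_SECS = 60  # assumed value for this sketch

def join_with_timeout(procs):
    """Wait for health-check child processes, but never indefinitely."""
    for p in procs:
        p.join(JOIN_TIMEOUT_SECS)
    for p in procs:
        if p.is_alive():
            # The child is stuck (e.g. urlopen ignoring its socket timeout);
            # give up on its load data rather than blocking the test.
            p.terminate()
            p.join()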