Lumpy failing AU test a lot due to not being able to find devserver in restricted subnet
Issue description

We are trying to get the latest stable build out:
https://cros-goldeneye.corp.google.com/chromeos/console/viewRelease?releaseName=M65-STABLE-CHROMEOS-2

One board (lumpy) is repeatedly failing the AU tests:
https://ubercautotest.corp.google.com/new_tko/#tab_id=spreadsheet_view&row=job_name%252Ctest_name%252Cplatform&column=status&show_incomplete=true&show_only_latest=false&show_invalid=true&condition=job_name%2520LIKE%2520%27lumpy-release/R65-10323.62.0/paygen%255C_au%255C_stable/%2525%27%2520AND%2520test_name%2520%253C%253E%2520%27SERVER_JOB%27%2520AND%2520test_name%2520NOT%2520LIKE%2520%27CLIENT%255C_JOB%2525%27%2520AND%2520platform%2520LIKE%2520%27lumpy%27

Looking at the logs, the test cannot reach a devserver:

03/21 18:55:42.558 INFO | dev_server:0798| Getting devservers for host: chromeos6-row2-rack7-host15
03/21 18:55:42.619 DEBUG| dev_server:0819| The host chromeos6-row2-rack7-host15 (100.115.129.23) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
03/21 18:55:42.620 DEBUG| dev_server:0770| Pick one healthy devserver from ['http://100.115.219.129:8082', 'http://100.115.219.130:8082', 'http://100.115.245.197:8082', 'http://100.115.219.132:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.136:8082', 'http://100.115.185.228:8082', 'http://100.115.219.139:8082', 'http://100.115.185.227:8082', 'http://100.115.185.226:8082', 'http://100.115.219.135:8082', 'http://100.115.245.201:8082']
03/21 18:55:42.620 DEBUG| dev_server:0774| Check health for http://100.115.219.129:8082
03/21 18:55:42.620 DEBUG| utils:0214| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
03/21 18:55:44.152 DEBUG| dev_server:0779| Pick http://100.115.219.129:8082
03/21 18:55:44.160 DEBUG| utils:0214| Running 'ssh 100.115.219.130 'curl "http://100.115.219.130:8082/check_health?"''
03/21 18:55:44.162 DEBUG| utils:0214| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/check_health?"''
03/21 18:55:44.162 DEBUG| utils:0214| Running 'ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"''
03/21 18:55:44.163 DEBUG| utils:0214| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
03/21 18:55:44.165 DEBUG| utils:0214| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
03/21 18:55:44.169 DEBUG| utils:0214| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
03/21 18:55:44.169 DEBUG| utils:0214| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
03/21 18:55:44.171 DEBUG| utils:0214| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
03/21 18:55:44.172 DEBUG| utils:0214| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''

Then it aborts the test. Assigning to the infra deputy.
,
Mar 22 2018
I'm skeptical that the problem is the devservers, per se. The cited log output is more or less normal: the list of devservers is the same list used for other DUTs in the chromeos6 lab, and those cases aren't showing failures. It also can't be the DUTs, because the DUT isn't involved in any of the logged actions.

However, something is going wrong during devserver selection. Comparing against a successful run, the failing logs stop before messages that should appear ever get logged. Here's what successful AU tests look like in the chromeos6 lab:

03/21 04:37:12.586 DEBUG| dev_server:0819| The host chromeos6-row2-rack24-host10 (100.115.131.128) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:17.
03/21 04:37:12.615 DEBUG| dev_server:0770| Pick one healthy devserver from ['http://100.115.219.129:8082', 'http://100.115.219.130:8082', 'http://100.115.245.197:8082', 'http://100.115.219.132:8082', 'http://100.115.219.133:8082', 'http://100.115.219.134:8082', 'http://100.115.219.136:8082', 'http://100.115.185.228:8082', 'http://100.115.219.139:8082', 'http://100.115.185.227:8082', 'http://100.115.185.226:8082', 'http://100.115.219.135:8082', 'http://100.115.245.201:8082']
03/21 04:37:12.615 DEBUG| dev_server:0774| Check health for http://100.115.219.129:8082
03/21 04:37:12.616 DEBUG| utils:0214| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
03/21 04:37:13.844 DEBUG| dev_server:0779| Pick http://100.115.219.129:8082
03/21 04:37:13.854 DEBUG| utils:0214| Running 'ssh 100.115.219.130 'curl "http://100.115.219.130:8082/check_health?"''
03/21 04:37:13.856 DEBUG| utils:0214| Running 'ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"''
03/21 04:37:13.865 DEBUG| utils:0214| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
03/21 04:37:13.866 DEBUG| utils:0214| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
03/21 04:37:13.868 DEBUG| utils:0214| Running 'ssh 100.115.185.228 'curl "http://100.115.185.228:8082/check_health?"''
03/21 04:37:13.871 DEBUG| utils:0214| Running 'ssh 100.115.219.139 'curl "http://100.115.219.139:8082/check_health?"''
03/21 04:37:13.876 DEBUG| utils:0214| Running 'ssh 100.115.185.226 'curl "http://100.115.185.226:8082/check_health?"''
03/21 04:37:13.876 DEBUG| utils:0214| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/check_health?"''
03/21 04:37:13.876 DEBUG| utils:0214| Running 'ssh 100.115.185.227 'curl "http://100.115.185.227:8082/check_health?"''
03/21 04:37:13.877 DEBUG| utils:0214| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/check_health?"''
03/21 04:37:13.881 DEBUG| utils:0214| Running 'ssh 100.115.245.201 'curl "http://100.115.245.201:8082/check_health?"''
03/21 04:37:13.874 DEBUG| utils:0214| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/check_health?"''
03/21 04:37:15.226 DEBUG| dev_server:2754| Network IO of devserver http://100.115.219.133:8082 is at 193405810 Bps, which is higher than the threshold of 83886080 bytes per second.
03/21 04:37:15.226 DEBUG| dev_server:2754| Network IO of devserver http://100.115.245.197:8082 is at 134559036 Bps, which is higher than the threshold of 83886080 bytes per second.
03/21 04:37:15.227 DEBUG| dev_server:2754| Network IO of devserver http://100.115.245.201:8082 is at 165468310 Bps, which is higher than the threshold of 83886080 bytes per second.
03/21 04:37:15.227 DEBUG| dev_server:2754| Network IO of devserver http://100.115.219.136:8082 is at 93308088 Bps, which is higher than the threshold of 83886080 bytes per second.
03/21 04:37:15.227 DEBUG| dev_server:2754| Network IO of devserver http://100.115.219.130:8082 is at 131392997 Bps, which is higher than the threshold of 83886080 bytes per second.
03/21 04:37:15.227 DEBUG|autoupdate_EndToEn:0120| Choosing the least loaded devserver: http://100.115.185.228:8082
03/21 04:37:15.228 INFO |autoupdate_EndToEn:0130| Devserver chosen for this run: 100.115.185.228
03/21 04:37:15.228 INFO |autoupdate_EndToEn:0142| Staging images onto autotest devserver (http://100.115.185.228:8082)

Note that in the failing run, the log stops after the check_health commands: none of the "Network IO ..." or "Choosing the least loaded devserver" messages ever appear.
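For anyone not familiar with this path, here's a rough sketch of what the selection step appears to do once the load numbers come back. It is inferred only from the log messages above, not copied from dev_server.py; the names are illustrative:

====
# Illustrative only: reconstructed from the "Network IO ... higher than the
# threshold" and "Choosing the least loaded devserver" messages above.
NETWORK_IO_THRESHOLD_BPS = 83886080  # the threshold quoted in the logs

def choose_devserver(loads):
    """loads maps devserver URL -> measured network IO in bytes per second."""
    if not loads:
        return None
    under = {url: bps for url, bps in loads.items()
             if bps < NETWORK_IO_THRESHOLD_BPS}
    candidates = under or loads
    return min(candidates, key=candidates.get)  # least loaded wins
====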
,
Mar 22 2018
I've gone to the shard and run the `ssh ... curl` commands for every devserver. They all pass with flying colors. So, in the current state, a lumpy stable build should work just fine, and I have no reason to believe the next lumpy stable-channel release build won't work exactly as well as I'm predicting right now.

Except... we've seen this failure two days running, and there's no reason to expect the third day will be any different. So, immovable object, meet irresistible force.

At the moment, all I can think to do is to add extra logging to the relevant code paths, then let it retry and see what we learn. That could take several days to track down, probably into next week. I'm open to better ideas...
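For anyone who wants to repeat the check, here's a minimal sketch of the same probe. The host list is copied from the logs in the description and the ssh/curl command mirrors what dev_server.py logs; the script itself is illustrative and not part of autotest:

====
# Illustrative helper, not autotest code: re-run the health probe that
# dev_server.py logs ("ssh <devserver> curl .../check_health") for every
# devserver listed in the failing logs.
import subprocess

DEVSERVERS = [
    '100.115.219.129', '100.115.219.130', '100.115.245.197',
    '100.115.219.132', '100.115.219.133', '100.115.219.134',
    '100.115.219.136', '100.115.185.228', '100.115.219.139',
    '100.115.185.227', '100.115.185.226', '100.115.219.135',
    '100.115.245.201',
]

for host in DEVSERVERS:
    cmd = ['ssh', host, 'curl "http://%s:8082/check_health?"' % host]
    try:
        subprocess.check_output(cmd)
        print('%s: OK' % host)
    except subprocess.CalledProcessError as e:
        print('%s: FAILED (exit %s)' % (host, e.returncode))
====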
,
Mar 22 2018
Was there some network issue during those hours (04:37 and 18:55)?
,
Mar 22 2018
> was there some network issue during those hours: 04:37 and 18:55 ?
A widespread network issue would have caused widespread test failures.
To not cause widespread failures, I think the problem would have had
to be either
1) Specific to connections between the single shard and the chromeos6
lab, or
2) Specific to the shard and only occurring at those two times
(but not other times in between).
I am (shall we say) skeptical of both scenarios.
,
Mar 22 2018
> I am (shall we say) skeptical of both scenarios.

Frankly, both theories strain my credulity...

First, there was no failure at 04:37. Those logs were from a successful run for a different board. I'm unable to find the logs for the build that caused the failure on 3/21 at 18:55, but that's meaningless: I have identical logs for failures on 3/20, 3/19, and 3/16. So, this problem has been going on for a while.

The shard is shared with the sentry board, so I'd expect this problem to be affecting that board as well. So, how are sentry tests doing? I'm especially interested in out-of-band tests from suite_scheduler, since (mostly) only those tests would run in the chromeos6 lab where all the lumpy DUTs are.
,
Mar 22 2018
Sentry looks great, though all runs are on chromeos4 DUTs:
https://ubercautotest.corp.google.com/new_tko/#tab_id=spreadsheet_view&row=job_name%252Ctest_name%252Cplatform&column=status&show_incomplete=true&show_only_latest=false&show_invalid=true&condition=job_name%2520LIKE%2520%27sentry-release/R65-10323.62.0/paygen%255C_au%255C_stable/%2525%27%2520AND%2520test_name%2520%253C%253E%2520%27SERVER_JOB%27%2520AND%2520test_name%2520NOT%2520LIKE%2520%27CLIENT%255C_JOB%2525%27%2520AND%2520platform%2520LIKE%2520%27sentry%27
,
Mar 22 2018
At least the paygen AU tests are. Out-of-band AU tests look generally fine too, though again it looks like they're all (or mostly) using chromeos4:
https://stainless.corp.google.com/search?exclude_cts=true&exclude_non_release=true&board=%5Esentry%24&test=%5Eautoupdate_&view=matrix&col=build&row=test&first_date=2018-03-16&last_date=2018-03-22
,
Mar 23 2018
> Out of band AU tests look generally fine too, though again it looks like they're all/mostly using chromeos4

<sigh> It turns out that virtually all of the chromeos6 sentry DUTs aren't working:
$ dut-status -b sentry | grep '^chromeos6-.* OK' | wc -l
3
So, we could have some sort of problem with the shard's access to
chromeos6. Except that I already manually tested for that, and I
didn't find a problem.
It still feels like we have to add debug logging, and see what we
can learn.
,
Mar 24 2018
I've had a look at the code to see where to add extra logging. It looks like we kick off a check_health call against each devserver using multiprocessing, and the run aborts because we never get a response from any of the checks:
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/common_lib/cros/dev_server.py?sq=package:%5Echromeos_(internal%7Cpublic)$&dr&l=2826

Each failure has a crashinfo folder in the logs. I have no idea where to start looking in there for what happened.

Interestingly, earlier on in the logs there are these lines:

03/22 15:19:55.497 DEBUG| dev_server:0774| Check health for http://100.115.219.129:8082
03/22 15:19:55.497 DEBUG| utils:0214| Running 'ssh 100.115.219.129 'curl "http://100.115.219.129:8082/check_health?"''
03/22 15:19:56.662 DEBUG| dev_server:0779| Pick http://100.115.219.129:8082

which suggests that the check_health passed:
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/common_lib/cros/dev_server.py?q=dev_server.py&sq=package:%5Echromeos_(internal%7Cpublic)$&dr&l=779
,
Mar 24 2018
OK. Digging in to figure out "where do we need to add logging to
learn more", I tracked the log messages down to client/common_lib/cros/dev_server.py,
in this function:
def get_least_loaded_devserver(devserver_type=ImageServer, hostname=None):
Logging from the failures makes it clear that the code called _in_ to this
function, but that the function didn't return through the expected pathway.
Looking inside that function, I find this snippet:
====
output = multiprocessing.Queue()
processes = []
for devserver in devservers:
    processes.append(multiprocessing.Process(
            target=devserver_type.get_devserver_load_wrapper,
            args=(devserver, TIMEOUT_GET_DEVSERVER_LOAD, output)))
for p in processes:
    p.start()
for p in processes:
    p.join()
loads = [output.get() for p in processes]
====
We know from the logs that that snippet must have started, and the logs
also make clear that the subprocesses started in this snippet ran and
produced output. Looking at the code that follows the snippet, most of
it shouldn't be able to block at all. However, the code above definitely
_can_ block. So, we've got some sort of deadlock problem with our
subprocesses.
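To make that concrete, here's a standalone toy (not autotest code) showing one way the quoted snippet can hang: if any child exits without putting a result on the queue, join() still returns because the child has terminated, but the final unbounded output.get() loop then blocks forever. The Python docs also warn about the mirror-image hazard of joining a child that has put data on a Queue before the queue has been drained.

====
# Toy reproduction of the suspected hang; running this script never finishes.
import multiprocessing

def flaky_worker(i, output):
    # Stand-in for get_devserver_load_wrapper: worker 0 dies before writing.
    if i == 0:
        raise RuntimeError('simulated crash before writing to the queue')
    output.put(i)

if __name__ == '__main__':
    output = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=flaky_worker,
                                         args=(i, output))
                 for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()   # returns fine; even the crashed child has terminated
    # Only two results were ever put on the queue, so the third get() blocks
    # indefinitely -- the same shape as the hang seen in the failing runs.
    loads = [output.get() for p in processes]
    print(loads)
====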
,
Mar 24 2018
Reading the Python documentation, it seems that the
code might block in one of two places.
Place 1:
for p in processes:
    p.join()
If one or more subprocesses fail to terminate, the code above will
wait indefinitely for the termination.
Place 2:
loads = [output.get() for p in processes]
If one or more of the subprocesses terminates without writing to the
queue, the code above will block indefinitely to wait for the data.
I'm more inclined to blame "Place 2", but I can't prove it without
more data.
Options:
* Add logging around the calls to see where the hangup happens.
* Pre-empt the debug by adding timeouts to the calls to both join()
  and get() (rough sketch below).
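For the second option, here's a rough sketch of what the timeouts might look like, reusing the names from the snippet above (processes, output, TIMEOUT_GET_DEVSERVER_LOAD). The ordering, timeout value, and error handling are illustrative, not a reviewed autotest change:

====
# Rough sketch only: bound every wait so a misbehaving child can't hang us.
import logging
try:
    from Queue import Empty   # Python 2, which autotest used at the time
except ImportError:
    from queue import Empty   # Python 3

for p in processes:
    p.start()

loads = []
for _ in processes:
    try:
        # A child that never writes now costs us a timeout, not a hang.
        loads.append(output.get(timeout=TIMEOUT_GET_DEVSERVER_LOAD))
    except Empty:
        logging.warning('Timed out waiting for a devserver load result.')
        break

for p in processes:
    p.join(TIMEOUT_GET_DEVSERVER_LOAD)   # bounded join
    if p.is_alive():
        logging.warning('Terminating stuck devserver load process.')
        p.terminate()
====

Draining the queue before joining also sidesteps the documented deadlock that can occur when joining a process that still has buffered Queue data.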
,
Mar 24 2018
Speaking of preemption, I logged in to the server, and poked around
in /tmp, and looked at what 'ps' could tell me. The shard has been
up for some 6 months, and has accumulated a lot of orphaned processes
and temporary file gunk.
So, maybe this:
https://www.youtube.com/watch?v=p85xwZ_OLX0
But that should wait for Monday, I reckon.
,
Mar 26 2018
Passing to this week's deputy. Recommended actions:
1. At some convenient moment, reboot the lumpy shard:
   chromeos-skunk-5.mtv
   If this makes the problem go away, we're done.
2. If this doesn't make the problem go away, add explicit
   timeout parameters to the calls to 'p.join()' and 'output.get()'
   called out in crbug.com/824919#c14.
3. Step 2 won't fix the problem, but it should guarantee better
   error behavior. Use the improved behavior to figure out how to
   move the ball a few more yards down the field.
,
Mar 26 2018
Passing to this week's primary deputy.
,
Mar 26 2018
Following Richard's suggestions; will reboot the shard in the next CQ run.