chameleon_host connection causes special task hang
Issue description
We found several instances of special tasks hanging because the connection to chameleon_host got stuck. For example:
http://chromeos-server6.mtv.corp.google.com/results/hosts/chromeos1-row5-rack1-host2/54298164-cleanup/debug/autoserv.DEBUG
04/20 16:49:58.066 DEBUG| servo:0225| Servo initialized, version is servo_v3
04/20 16:49:58.066 INFO | servo_host:0539| Sanity checks pass on servo host chromeos1-row5-rack1-host2-servo
04/20 16:49:58.192 DEBUG| abstract_ssh:0835| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 54851:localhost:9992 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host2-chameleon
04/20 16:49:58.264 DEBUG| abstract_ssh:0843| Started ssh tunnel, local = 54851 remote = 9992, pid = 25323
04/20 16:49:58.267 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.369 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.471 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.572 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.674 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.776 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/20 16:49:58.877 DEBUG| chameleon_host:0118| Connection is not ready yet ...
At least we need a timeout on that connection call. Assigning to cychiang for now; otherwise, please find an owner for chameleon-related issues. Thanks!
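As an illustration of what "a timeout on that connection call" could look like at the socket level, here is a minimal sketch that checks whether the local end of a tunnel accepts TCP connections within a bounded time. The helper name `port_is_ready` and its parameters are hypothetical and are not part of autotest; the real `ChameleonConnection` may connect in a different way.

```python
import socket


def port_is_ready(host, port, timeout_sec=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Hypothetical helper for illustration only; not autotest code.
    """
    try:
        # timeout_sec bounds each connect attempt made by create_connection.
        conn = socket.create_connection((host, port), timeout=timeout_sec)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    conn.close()
    return True
```

For the tunnel in the log above, a call such as `port_is_ready('localhost', 54851)` would return instead of blocking indefinitely, assuming the hang is in the TCP connect itself.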
Apr 22 2016
Hi Dan, could you please attach the log to the issue tracker? Somehow the file is missing on the server: "The requested URL /results/hosts/chromeos1-row5-rack1-host2/54298164-cleanup/debug/autoserv.DEBUG was not found on this server." I checked the code and it seems there is already a 30-second timeout on the call:
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/chameleon_host.py?type=cs&q=_wait_for_connection_established+package:%5Echromeos_public$&l=124
So it would be interesting to find out why that 30-second timeout did not work. Thanks!
Apr 22 2016
The full test logs are here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos1-row5-rack1-host2/54298164-cleanup/debug/ I forgot that link is on drone.
Apr 22 2016
I checked the log: it set the tunnel up before making the connection. In the lab, the tunnel should not be triggered, so some config was probably changed to force tunneling.
The code has a 30-second timeout and keeps retrying the connection to Chameleon. However, according to the log, the retry loop was interrupted, probably by some unexpected exception such as a tunnel error.
def _wait_for_connection_established(self):
    """Wait for ChameleonConnection through tunnel being established."""

    def _create_connection():
        """Create ChameleonConnection.

        @returns: True if success. False otherwise.
        """
        try:
            self._chameleon_connection = chameleon.ChameleonConnection(
                    'localhost', self._local_port)
        except chameleon.ChameleonConnectionError:
            logging.debug('Connection is not ready yet ...')
            return False

        logging.debug('Connection is up')
        return True

    success = utils.wait_for_value(
            _create_connection, expected_value=True, timeout_sec=30)
    if not success:
        raise ChameleonHostError('Can not connect to Chameleon')
Assigning to xixuan@ to investigate. It would be better to turn the tunneling config off first so that this issue does not happen again.
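One general way a polling timeout like the one above can fail to bound the total wait is sketched below: the deadline is only checked between attempts, so a single attempt that blocks holds up the whole loop. `wait_for_true` is a hypothetical stand-in written for this note; it is an assumption that `utils.wait_for_value` behaves similarly, and this is not confirmed to be the cause of the hang here.

```python
import time


def wait_for_true(func, timeout_sec, interval_sec=0.1):
    """Poll func() until it returns True or timeout_sec elapses.

    Hypothetical stand-in for a polling helper; not the autotest code.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        # If func() itself blocks, the deadline below is never reached.
        if func():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_sec)


def slow_attempt():
    """Simulate a connection attempt that takes longer than the timeout."""
    time.sleep(0.3)
    return False


start = time.monotonic()
result = wait_for_true(slow_attempt, timeout_sec=0.1)
elapsed = time.monotonic() - start
# elapsed exceeds timeout_sec because the attempt is not interruptible.
```

Under this assumption, the polled call needs its own timeout for the outer 30-second limit to hold.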
Apr 22 2016
Oh, the current config works like this:
if the host is in the lab and we don't enable the ssh_config, the tunnel won't be triggered.
Since we enabled the ssh_to_chameleon config, any connection to a chameleon host goes through the tunnel.
if self._is_in_lab and not ENABLE_SSH_TUNNEL_FOR_CHAMELEON:
    self._chameleon_connection = chameleon.ChameleonConnection(
            self.hostname, chameleon_port)
else:
    self._create_connection_through_tunnel()
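The condition above can be restated as a small truth table. This is only a sketch: `should_use_tunnel` and its parameter names are hypothetical and simply mirror `self._is_in_lab` and `ENABLE_SSH_TUNNEL_FOR_CHAMELEON` from the snippet.

```python
def should_use_tunnel(is_in_lab, tunnel_enabled):
    """Mirror the branch above: connect directly only in the lab with tunneling off."""
    return not (is_in_lab and not tunnel_enabled)


# Enumerate all four combinations of the two flags.
table = {
    (in_lab, enabled): should_use_tunnel(in_lab, enabled)
    for in_lab in (True, False)
    for enabled in (True, False)
}
```

Only the combination (in lab, tunneling disabled) connects directly; outside the lab the tunnel is used regardless of the flag.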
Apr 25 2016
The tunneling was used in the dev network and was not verified in the lab environment. I don't know why it was enabled in the lab, which led to this issue. It is not a Chameleon issue; it is probably related to the lab network. Xixuan, please investigate the cause, or simply disable the tunneling in the lab until this issue gets fixed.
Apr 25 2016
The issue can't be reproduced, so it's hard to tell whether it is fixed or not. At this point, we have to enable tunneling for the new ACL: https://buganizer.corp.google.com/u/0/issues/25934953. The current chameleon tunnel code path is different from the servo tunnel code path. I will first make chameleon use the same code path as servo, and then add some debugging logs, if possible, to report why it gets stuck.
Apr 25 2016
Apr 25 2016
Issue 600007 has been merged into this issue.
Apr 26 2016
May 5 2016
Jun 9 2016
Aug 12 2016
Closing. Please reopen if it's not fixed.
Comment 1 by dshi@chromium.org, Apr 21 2016