moblab: 'afe is not responding' during provision |
|||||
Issue description
Saw this twice last night:
- We provision a moblab DUT
- part of moblab_host's provision flow tries to ping its own AFE, and times out (300 seconds, so long enough)
- repair finds no problem.
- provision-repair loop.
+haddowk has stated his concern about our lab not giving enough time for a fresh installation of moblab to settle down. In the field, people usually leave it be for a bit before using it.
Detailed trace:
06/01 23:40:21.428 DEBUG| retry_util:0135| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
06/01 23:40:32.941 DEBUG| retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:40:54.605 DEBUG| retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:41:34.861 DEBUG| retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:42:56.454 DEBUG| retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:45:21.427 DEBUG| moblab_host:0291| AFE is not responding
06/01 23:45:21.427 ERROR| repair:0332| Failed: Legacy host verification checks
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/repair.py", line 55, in verify
host.verify_software()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 244, in verify_software
self._verify_duts()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 306, in _verify_duts
self.wait_afe_up()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 176, in wait_afe_up
self._check_afe()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 289, in _check_afe
self.afe.get_hosts()
File "/usr/local/autotest/server/frontend.py", line 538, in get_hosts
hosts = self.run('get_hosts', **query_args)
File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 126, in run
self, call, **dargs)
File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 106, in GenericRetry
time.sleep(sleep_time)
File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 87, in kill_us
raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 300.0 seconds.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/513961-provision/20170106232251/debug/
,
Jun 6 2017
The repair job following the failed provision in #1 is: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/521937-repair/ And the logs there show something promising: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/521937-repair/20170506120836/chromeos2-row2-rack8-host1/before_repair/log/apache2/ [Mon Jun 05 11:54:16.495312 2017] [:error] [pid 4279] [remote 127.0.0.1:0] mod_wsgi (pid=4279): Exception occurred processing WSGI script '/usr/local/autotest/frontend/frontend.wsgi'. [Mon Jun 05 11:54:16.495340 2017] [:error] [pid 4279] [remote 127.0.0.1:0] Traceback (most recent call last): [Mon Jun 05 11:54:16.495359 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/local/autotest/frontend/frontend.wsgi", line 23, in application [Mon Jun 05 11:54:16.495382 2017] [:error] [pid 4279] [remote 127.0.0.1:0] return _application(environ, start_response) [Mon Jun 05 11:54:16.495395 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/handlers/wsgi.py", line 255, in __call__ [Mon Jun 05 11:54:16.495414 2017] [:error] [pid 4279] [remote 127.0.0.1:0] response = self.get_response(request) [Mon Jun 05 11:54:16.495424 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/handlers/base.py", line 178, in get_response [Mon Jun 05 11:54:16.495441 2017] [:error] [pid 4279] [remote 127.0.0.1:0] response = self.handle_uncaught_exception(request, resolver, sys.exc_info()) [Mon Jun 05 11:54:16.495451 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/handlers/base.py", line 223, in handle_uncaught_exception [Mon Jun 05 11:54:16.495478 2017] [:error] [pid 4279] [remote 127.0.0.1:0] callback, param_dict = resolver.resolve500() [Mon Jun 05 11:54:16.495488 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 370, in resolve500 [Mon Jun 05 11:54:16.495505 2017] [:error] [pid 4279] [remote 127.0.0.1:0] return self._resolve_special('500') [Mon Jun 05 11:54:16.495515 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 361, in _resolve_special [Mon Jun 05 11:54:16.495531 2017] [:error] [pid 4279] [remote 127.0.0.1:0] return get_callable(callback), {} [Mon Jun 05 11:54:16.495541 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/utils/functional.py", line 29, in wrapper [Mon Jun 05 11:54:16.495558 2017] [:error] [pid 4279] [remote 127.0.0.1:0] result = func(*args) [Mon Jun 05 11:54:16.495568 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 97, in get_callable [Mon Jun 05 11:54:16.495584 2017] [:error] [pid 4279] [remote 127.0.0.1:0] mod = import_module(mod_name) [Mon Jun 05 11:54:16.495594 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/django/utils/importlib.py", line 35, in import_module [Mon Jun 05 11:54:16.495611 2017] [:error] [pid 4279] [remote 127.0.0.1:0] __import__(name) [Mon Jun 05 11:54:16.495620 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/local/autotest/frontend/afe/views.py", line 13, in <module> [Mon Jun 05 11:54:16.495637 2017] [:error] [pid 4279] [remote 127.0.0.1:0] dummy=object()) [Mon Jun 05 11:54:16.495647 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/local/autotest/client/common_lib/utils.py", line 1294, in import_site_module [Mon Jun 05 11:54:16.495664 2017] [:error] [pid 4279] [remote 127.0.0.1:0] return __import__(module, {}, {}, [short_module]) [Mon Jun 05 11:54:16.495674 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/local/autotest/frontend/afe/moblab_rpc_interface.py", line 53, in <module> [Mon Jun 05 11:54:16.495691 2017] [:error] [pid 4279] [remote 127.0.0.1:0] _GSUTIL_CMD = gs.GSContext.GetDefaultGSUtilBin() [Mon Jun 05 11:54:16.495700 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/gs.py", line 329, in GetDefaultGSUtilBin [Mon Jun 05 11:54:16.495719 2017] [:error] [pid 4279] [remote 127.0.0.1:0] tar_cache = cache.TarballCache(common_path, cache_user=cache_user) [Mon Jun 05 11:54:16.495731 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/cache.py", line 196, in __init__ [Mon Jun 05 11:54:16.495747 2017] [:error] [pid 4279] [remote 127.0.0.1:0] osutils.SafeMakedirsNonRoot(self._cache_dir, user=self._cache_user) [Mon Jun 05 11:54:16.495757 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/osutils.py", line 276, in SafeMakedirsNonRoot [Mon Jun 05 11:54:16.495774 2017] [:error] [pid 4279] [remote 127.0.0.1:0] created = should_chown = SafeMakedirs(path, mode=mode, sudo=True) [Mon Jun 05 11:54:16.495784 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/osutils.py", line 234, in SafeMakedirs [Mon Jun 05 11:54:16.495801 2017] [:error] [pid 4279] [remote 127.0.0.1:0] redirect_stderr=True, redirect_stdout=True) [Mon Jun 05 11:54:16.495811 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/cros_build_lib.py", line 320, in SudoRunCommand [Mon Jun 05 11:54:16.495828 2017] [:error] [pid 4279] [remote 127.0.0.1:0] return RunCommand(sudo_cmd, **kwargs) [Mon Jun 05 11:54:16.495837 2017] [:error] [pid 4279] [remote 127.0.0.1:0] File "/usr/lib64/python2.7/site-packages/chromite/lib/cros_build_lib.py", line 645, in RunCommand [Mon Jun 05 11:54:16.495854 2017] [:error] [pid 4279] [remote 127.0.0.1:0] raise RunCommandError(msg, cmd_result) [Mon Jun 05 11:54:16.495966 2017] [:error] [pid 4279] [remote 127.0.0.1:0] RunCommandError: return code: 1; command: sudo 'CROS_CACHEDIR=/mnt/moblab/cros_cache_apache' -- mkdir -p --mode 0775 /mnt/moblab/cros_cache_apache/common [Mon Jun 05 11:54:16.495972 2017] [:error] [pid 4279] [remote 127.0.0.1:0] [Mon Jun 05 11:54:16.495974 2017] [:error] [pid 4279] [remote 127.0.0.1:0] We trust you have received the usual lecture from the local System [Mon Jun 05 11:54:16.495977 2017] [:error] [pid 4279] [remote 127.0.0.1:0] Administrator. It usually boils down to these three things: [Mon Jun 05 11:54:16.495979 2017] [:error] [pid 4279] [remote 127.0.0.1:0] [Mon Jun 05 11:54:16.495983 2017] [:error] [pid 4279] [remote 127.0.0.1:0] #1) Respect the privacy of others. [Mon Jun 05 11:54:16.495985 2017] [:error] [pid 4279] [remote 127.0.0.1:0] #2) Think before you type. [Mon Jun 05 11:54:16.495990 2017] [:error] [pid 4279] [remote 127.0.0.1:0] #3) With great power comes great responsibility. [Mon Jun 05 11:54:16.495992 2017] [:error] [pid 4279] [remote 127.0.0.1:0] [Mon Jun 05 11:54:16.495994 2017] [:error] [pid 4279] [remote 127.0.0.1:0] sudo: no tty present and no askpass program specified [Mon Jun 05 11:54:16.495997 2017] [:error] [pid 4279] [remote 127.0.0.1:0] [Mon Jun 05 11:54:16.496000 2017] [:error] [pid 4279] [remote 127.0.0.1:0] cmd=['sudo', 'CROS_CACHEDIR=/mnt/moblab/cros_cache_apache', '--', 'mkdir', '-p', '--mode', '0775', '/mnt/moblab/cros_cache_apache/common'] Now if only this error had been visible from the failed provision job easily, I wouldn't have spent 1/2 hour chasing it down.
,
Jun 6 2017
Is this on R60 - I need to merge back a change this code has been changed on ToT
,
Jun 6 2017
This boils down to moblab user trying to run a sudo command as a result of a get_hosts rpc and being blocked because it isn't allowed to sudo +sbasi: Does moblab user even have sudo privileges? Or is this a question of not waiting long enough after a fresh install for the moblab user setup to complete...?
,
Jun 6 2017
Can you reference the CL that fixed this?
,
Jun 6 2017
Moblab user has full sudo privileges, the apache users does not - the move to the chromite gsutil has caused quite a few issues. https://chromium-review.googlesource.com/#/c/521739/ Is the change that should prevent the RPC interface trying to load the gsutil cmd at boot.
,
Jun 6 2017
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by pprabhu@chromium.org
, Jun 6 2017Status: Unconfirmed (was: Untriaged)