Issue 729083

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




moblab: 'afe is not responding' during provision

Project Member Reported by pprabhu@chromium.org, Jun 2 2017

Issue description

Saw this twice last night:

- We provision a moblab DUT.
- Part of moblab_host's provision flow pings the host's own AFE; that check times out (after 300 seconds, so it waited long enough).
- Repair then finds no problem.
- The host ends up in a provision-repair loop.

+haddowk has raised a concern that our lab does not give a fresh moblab installation enough time to settle down. In the field, people usually leave it alone for a while before using it.
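
For context, the failing check is essentially "call the AFE RPC server until it answers or a deadline passes". A minimal sketch of that behaviour in Python (the 300-second deadline and the urllib2 errors come from the trace below; the helper itself is hypothetical, not the actual moblab_host code):

import time
import urllib2  # the trace below shows urllib2 errors, so this sketch assumes Python 2

AFE_UP_TIMEOUT_SECS = 300  # matches the "waited 300.0 seconds" in the trace

def wait_afe_up(check_afe, timeout=AFE_UP_TIMEOUT_SECS, delay=10):
    """Poll check_afe() until it succeeds or the deadline passes.

    check_afe is expected to raise urllib2.URLError / urllib2.HTTPError
    while the AFE is still coming up (connection refused, HTTP 500, ...).
    """
    deadline = time.time() + timeout
    while True:
        try:
            check_afe()  # e.g. a get_hosts RPC against the local AFE
            return       # the AFE answered; provisioning can continue
        except (urllib2.URLError, urllib2.HTTPError) as e:
            if time.time() >= deadline:
                raise Exception('AFE is not responding: %s' % e)
            time.sleep(delay)  # back off and retry

So a freshly installed moblab that takes more than ~5 minutes to bring up its AFE will always fall out of this loop and fail provision.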

Detailed trace:
06/01 23:40:21.428 DEBUG|        retry_util:0135| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
06/01 23:40:32.941 DEBUG|        retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:40:54.605 DEBUG|        retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:41:34.861 DEBUG|        retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:42:56.454 DEBUG|        retry_util:0135| <class 'urllib2.HTTPError'>(HTTP Error 500: Internal Server Error)
06/01 23:45:21.427 DEBUG|       moblab_host:0291| AFE is not responding
06/01 23:45:21.427 ERROR|            repair:0332| Failed: Legacy host verification checks
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 55, in verify
    host.verify_software()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 244, in verify_software
    self._verify_duts()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 306, in _verify_duts
    self.wait_afe_up()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 176, in wait_afe_up
    self._check_afe()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 289, in _check_afe
    self.afe.get_hosts()
  File "/usr/local/autotest/server/frontend.py", line 538, in get_hosts
    hosts = self.run('get_hosts', **query_args)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 126, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 106, in GenericRetry
    time.sleep(sleep_time)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 87, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 300.0 seconds.


https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/513961-provision/20170106232251/debug/
 
Labels: Proj-Moblab
Status: Unconfirmed (was: Untriaged)
Again here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121608712-chromeos-test/chromeos2-row2-rack8-host1/

This was a trybot run with changes that didn't affect moblab at all (the changes were only in moblab_RunSuite, which never got a chance to run because of the provision failure).

What is worse, neither the Apache logs nor mobmonitor show anything amiss:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121608712-chromeos-test/chromeos2-row2-rack8-host1/sysinfo/var/log/apache2/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121608712-chromeos-test/chromeos2-row2-rack8-host1/sysinfo/var/log/mobmonitor/
Labels: -Pri-2 Pri-1
The repair job following the failed provision in #1 is: 
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/521937-repair/

And the logs there show something promising:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row2-rack8-host1/521937-repair/20170506120836/chromeos2-row2-rack8-host1/before_repair/log/apache2/


[Mon Jun 05 11:54:16.495312 2017] [:error] [pid 4279] [remote 127.0.0.1:0] mod_wsgi (pid=4279): Exception occurred processing WSGI script '/usr/local/autotest/frontend/frontend.wsgi'.
[Mon Jun 05 11:54:16.495340 2017] [:error] [pid 4279] [remote 127.0.0.1:0] Traceback (most recent call last):
[Mon Jun 05 11:54:16.495359 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/local/autotest/frontend/frontend.wsgi", line 23, in application
[Mon Jun 05 11:54:16.495382 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     return _application(environ, start_response)
[Mon Jun 05 11:54:16.495395 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/handlers/wsgi.py", line 255, in __call__
[Mon Jun 05 11:54:16.495414 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     response = self.get_response(request)
[Mon Jun 05 11:54:16.495424 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/handlers/base.py", line 178, in get_response
[Mon Jun 05 11:54:16.495441 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
[Mon Jun 05 11:54:16.495451 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/handlers/base.py", line 223, in handle_uncaught_exception
[Mon Jun 05 11:54:16.495478 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     callback, param_dict = resolver.resolve500()
[Mon Jun 05 11:54:16.495488 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 370, in resolve500
[Mon Jun 05 11:54:16.495505 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     return self._resolve_special('500')
[Mon Jun 05 11:54:16.495515 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 361, in _resolve_special
[Mon Jun 05 11:54:16.495531 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     return get_callable(callback), {}
[Mon Jun 05 11:54:16.495541 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/utils/functional.py", line 29, in wrapper
[Mon Jun 05 11:54:16.495558 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     result = func(*args)
[Mon Jun 05 11:54:16.495568 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/core/urlresolvers.py", line 97, in get_callable
[Mon Jun 05 11:54:16.495584 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     mod = import_module(mod_name)
[Mon Jun 05 11:54:16.495594 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/django/utils/importlib.py", line 35, in import_module
[Mon Jun 05 11:54:16.495611 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     __import__(name)
[Mon Jun 05 11:54:16.495620 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/local/autotest/frontend/afe/views.py", line 13, in <module>
[Mon Jun 05 11:54:16.495637 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     dummy=object())
[Mon Jun 05 11:54:16.495647 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/local/autotest/client/common_lib/utils.py", line 1294, in import_site_module
[Mon Jun 05 11:54:16.495664 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     return __import__(module, {}, {}, [short_module])
[Mon Jun 05 11:54:16.495674 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/local/autotest/frontend/afe/moblab_rpc_interface.py", line 53, in <module>
[Mon Jun 05 11:54:16.495691 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     _GSUTIL_CMD = gs.GSContext.GetDefaultGSUtilBin()
[Mon Jun 05 11:54:16.495700 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/gs.py", line 329, in GetDefaultGSUtilBin
[Mon Jun 05 11:54:16.495719 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     tar_cache = cache.TarballCache(common_path, cache_user=cache_user)
[Mon Jun 05 11:54:16.495731 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/cache.py", line 196, in __init__
[Mon Jun 05 11:54:16.495747 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     osutils.SafeMakedirsNonRoot(self._cache_dir, user=self._cache_user)
[Mon Jun 05 11:54:16.495757 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/osutils.py", line 276, in SafeMakedirsNonRoot
[Mon Jun 05 11:54:16.495774 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     created = should_chown = SafeMakedirs(path, mode=mode, sudo=True)
[Mon Jun 05 11:54:16.495784 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/osutils.py", line 234, in SafeMakedirs
[Mon Jun 05 11:54:16.495801 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     redirect_stderr=True, redirect_stdout=True)
[Mon Jun 05 11:54:16.495811 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/cros_build_lib.py", line 320, in SudoRunCommand
[Mon Jun 05 11:54:16.495828 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     return RunCommand(sudo_cmd, **kwargs)
[Mon Jun 05 11:54:16.495837 2017] [:error] [pid 4279] [remote 127.0.0.1:0]   File "/usr/lib64/python2.7/site-packages/chromite/lib/cros_build_lib.py", line 645, in RunCommand
[Mon Jun 05 11:54:16.495854 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     raise RunCommandError(msg, cmd_result)
[Mon Jun 05 11:54:16.495966 2017] [:error] [pid 4279] [remote 127.0.0.1:0] RunCommandError: return code: 1; command: sudo 'CROS_CACHEDIR=/mnt/moblab/cros_cache_apache' -- mkdir -p --mode 0775 /mnt/moblab/cros_cache_apache/common
[Mon Jun 05 11:54:16.495972 2017] [:error] [pid 4279] [remote 127.0.0.1:0] 
[Mon Jun 05 11:54:16.495974 2017] [:error] [pid 4279] [remote 127.0.0.1:0] We trust you have received the usual lecture from the local System
[Mon Jun 05 11:54:16.495977 2017] [:error] [pid 4279] [remote 127.0.0.1:0] Administrator. It usually boils down to these three things:
[Mon Jun 05 11:54:16.495979 2017] [:error] [pid 4279] [remote 127.0.0.1:0] 
[Mon Jun 05 11:54:16.495983 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     #1) Respect the privacy of others.
[Mon Jun 05 11:54:16.495985 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     #2) Think before you type.
[Mon Jun 05 11:54:16.495990 2017] [:error] [pid 4279] [remote 127.0.0.1:0]     #3) With great power comes great responsibility.
[Mon Jun 05 11:54:16.495992 2017] [:error] [pid 4279] [remote 127.0.0.1:0] 
[Mon Jun 05 11:54:16.495994 2017] [:error] [pid 4279] [remote 127.0.0.1:0] sudo: no tty present and no askpass program specified
[Mon Jun 05 11:54:16.495997 2017] [:error] [pid 4279] [remote 127.0.0.1:0] 
[Mon Jun 05 11:54:16.496000 2017] [:error] [pid 4279] [remote 127.0.0.1:0] cmd=['sudo', 'CROS_CACHEDIR=/mnt/moblab/cros_cache_apache', '--', 'mkdir', '-p', '--mode', '0775', '/mnt/moblab/cros_cache_apache/common']


If only this error had been easily visible from the failed provision job, I wouldn't have spent half an hour chasing it down.
Is this on R60? I need to merge back a change; this code has been changed on ToT.
Status: haddowk (was: Unconfirmed)
This boils down to the moblab user trying to run a sudo command as a result of a get_hosts RPC and being blocked because it isn't allowed to sudo.

+sbasi: Does the moblab user even have sudo privileges?

Or is this a question of not waiting long enough after a fresh install for the moblab user setup to complete...?
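
For what it's worth, the failure mode in the Apache traceback above is reproducible with a much smaller example: any module that shells out to sudo at import time will break when Apache's WSGI worker (a non-privileged user with no tty) imports it. A rough illustration only, not the actual moblab code:

import subprocess

def _make_cache_dir(path):
    # Under Apache there is no tty and no askpass helper, so this fails with
    # "sudo: no tty present and no askpass program specified"; because the
    # failure happens during import, every subsequent request returns HTTP 500.
    subprocess.check_call(
        ['sudo', '--', 'mkdir', '-p', '--mode', '0775', path])

# Hypothetical module-level side effect, mirroring how moblab_rpc_interface.py
# resolves the gsutil binary (and thereby runs "sudo mkdir ...") on import:
# _make_cache_dir('/mnt/moblab/cros_cache_apache/common')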
Owner: haddowk@chromium.org
Status: Assigned (was: haddowk)
Can you reference the CL that fixed this?
The moblab user has full sudo privileges; the apache user does not. The move to the chromite gsutil has caused quite a few issues.

https://chromium-review.googlesource.com/#/c/521739/ is the change that should prevent the RPC interface from trying to load the gsutil cmd at boot.
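
Presumably that means deferring the gsutil lookup until an RPC actually needs it, instead of resolving it at import time. A rough sketch of that pattern (illustrative only, not necessarily what the CL does; only gs.GSContext.GetDefaultGSUtilBin() is taken from the traceback above):

_gsutil_cmd = None  # resolved lazily, not at module import time

def _get_gsutil_cmd():
    """Return the gsutil binary path, resolving it on first use."""
    global _gsutil_cmd
    if _gsutil_cmd is None:
        # Merely importing the RPC module under Apache no longer triggers the
        # "sudo mkdir" that the WSGI user cannot perform; the lookup happens
        # only when an RPC that needs gsutil is actually served.
        from chromite.lib import gs
        _gsutil_cmd = gs.GSContext.GetDefaultGSUtilBin()
    return _gsutil_cmd

That way a plain get_hosts call no longer depends on gsutil being resolvable under the Apache user.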


Status: Fixed (was: Assigned)
