skunk-1 scheduler failed after board:auron_paine was migrated to it
Issue description
I migrated an experimental paladin to skunk-1. It ran for a while and then failed with the following error. I've moved auron_paine back and am going to investigate the error.
12:00:11 INFO | os.environ: {'USERNAME': 'chromeos-test', 'SUDO_COMMAND': '/usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results --production', 'TERM': 'linux', 'SHELL': '/bin/bash', 'TZ': 'America/Los_Angeles', 'DJANGO_SETTINGS_MODULE': 'autotest_lib.frontend.settings', 'SUDO_UID': '0', 'SUDO_GID': '0', 'LOGNAME': 'chromeos-test', 'USER': 'chromeos-test', 'NO_GCE_CHECK': 'False', 'MAIL': '/var/mail/chromeos-test', 'PATH': '/usr/sbin:/usr/bin:/sbin:/bin', 'SUDO_USER': 'root', 'HOME': '/usr/local/google/home/chromeos-test'}
12:00:11 WARNI| Elasticsearch db deprecated, no metadata will be reported.
12:00:11 INFO | Metadata reporting thread is started.
12:00:11 INFO | Starting new HTTP connection (1): metadata.google.internal
12:00:11 INFO | 12:00:11 09/22/17> dispatcher starting
12:00:11 INFO | My PID is 35494
12:00:11 ERROR| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 164, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 222, in initialize
    _db_manager = scheduler_lib.ConnectionManager()
  File "/usr/local/autotest/server/site_utils.py", line 87, in __call__
    *args, **kwargs)
  File "/usr/local/autotest/scheduler/scheduler_lib.py", line 75, in __init__
    setup_django_environment.enable_autocommit()
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 21, in enable_autocommit
    _enable_autocommit_by_name('default')
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 14, in _enable_autocommit_by_name
    connections[name].cursor()
  File "/usr/local/autotest/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)")
12:00:11 INFO | Waiting for ts_mon flushing process to finish...
12:00:11 NOTIC| ts_mon was set up.
12:00:11 INFO | Attempting refresh to obtain initial access_token
12:00:11 INFO | Refreshing access_token
12:00:11 INFO | Finished waiting for ts_mon process.
12:00:11 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
12:00:11 ERROR| Exception escaping in monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 189, in main_without_exception_handling
    _drone_manager.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1365, in <module>
    main()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 189, in main_without_exception_handling
    _drone_manager.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
,
Sep 22 2017
After I moved the board back to chromeos-server111.mtv.corp.google.com, skunk-1 returned to normal. There was downtime on skunk-1, so deputies may have received alerts.
,
Sep 22 2017
11:13:04 INFO | Drones refreshed.
Exception occurred formatting message: 'Starting _run_cleanup' using args ()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1365, in <module>
    main()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 172, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 366, in tick
    self._run_cleanup()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 268, in wrapper
    self._log_tick_msg('Starting %s' % func.__name__)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1091, in _log_tick_msg
    logging.debug(msg)
  File "/usr/lib/python2.7/logging/__init__.py", line 1622, in debug
    root.debug(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1140, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 942, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 86, in _autotest_logging_handle_error
    traceback.print_stack()
--------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 875, in emit
    self.flush()
  File "/usr/lib/python2.7/logging/__init__.py", line 835, in flush
    self.stream.flush()
IOError: [Errno 28] No space left on device
Future logging formatting exceptions disabled.
It looks like skunk-1 doesn't have enough space under "/": the root filesystem is 96% full, with only 1.8G available. On skunk1 the extra disk is mounted at "/", but on skunk-1 it is mounted at "/dev". I'll file a ticket asking the lab to configure the filesystems on skunk-1 through skunk-5 like skunk1.
root@chromeos-skunk-1:/var/log# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            126G   12K  126G   1% /dev
tmpfs            26G   62M   26G   1% /run
/dev/dm-0        38G   34G  1.8G  96% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            126G  2.4M  126G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       472M   33M  415M   8% /boot
tmpfs           100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs           100K     0  100K   0% /var/lib/lxd/devlxd
nxia@chromeos-skunk1:/$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.8G   12K  7.8G   1% /dev
tmpfs           1.6G  222M  1.4G  14% /run
/dev/dm-0       229G   63G  157G  29% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            7.8G  1.4M  7.8G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       472M   72M  376M  17% /boot
objfsd          229G   63G  157G  29% /google/obj
tmpfs           100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs           100K     0  100K   0% /var/lib/lxd/devlxd
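The same check that df -h gives by hand could be scripted before migrating a board to a new shard. The following is only a sketch under assumptions: the helper names free_bytes and has_enough_space are made up for illustration, and the 10 GiB threshold is an arbitrary example value, not a documented requirement for the scheduler.

```python
import os


def free_bytes(path):
    """Returns the bytes available to unprivileged users on path's filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def has_enough_space(path, min_free_bytes):
    """True if the filesystem holding path has at least min_free_bytes free."""
    return free_bytes(path) >= min_free_bytes


if __name__ == '__main__':
    # Arbitrary illustrative threshold; pick one that fits the host.
    threshold = 10 * 1024 ** 3
    print('/ has enough space: %s' % has_enough_space('/', threshold))
```

os.statvfs is used here because it is available in both Python 2.7, which the logs above show the scheduler running on, and Python 3.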
,
Sep 27 2017
The disk problem has been resolved in b/66707209. I will try migrating the experimental build again.
,
Sep 28 2017
The shard-client on skunk-1 failed to send heartbeats after I migrated the board there. The issue is tracked in crbug.com/769551. I've moved the board auron_paine back to chromeos-server111.mtv.corp.google.com and will try again after crbug.com/769551 is resolved.
,
Sep 28 2017
The global_afe_hostname puppet config issue has been fixed in crbug.com/769551. I will test auron_paine on skunk-1 again.
,
Sep 29 2017
auron_paine-paladin has been working well on skunk-1 after the migration. https://uberchromegw.corp.google.com/i/chromeos/builders/auron_paine-paladin/builds/931
,
Sep 29 2017
,
Mar 12 2018
Comment 1 by nxia@chromium.org, Sep 22 2017