Schedulers not running on new shards
Issue description
During the lab migration last night a number of new shards were added. I've been using chromeos-server104.mtv to investigate.
On some (all?) of them, the scheduler is failing to start up. After updating the puppet configuration to set up these new servers and forcing a puppet run, all of the errors are gone except the following.
I read this as monitor_db can't connect to mysql, which kills the scheduler process.
03/24 13:27:05.785 INFO | status_server:0120| Status server running on ('0.0.0.0', 13467)
03/24 13:27:05.786 INFO | metadata_reporter:0148| Metadata reporting thread is started.
03/24 13:27:05.823 INFO | monitor_db:0213| 13:27:05 03/24/17> dispatcher starting
03/24 13:27:05.824 INFO | monitor_db:0214| My PID is 13934
03/24 13:27:05.908 ERROR| email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 177, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 235, in initialize
    _db_manager = scheduler_lib.ConnectionManager()
  File "/usr/local/autotest/server/site_utils.py", line 85, in __call__
    *args, **kwargs)
  File "/usr/local/autotest/scheduler/scheduler_lib.py", line 64, in __init__
    setup_django_environment.enable_autocommit()
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 22, in enable_autocommit
    _enable_autocommit_by_name('global')
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 14, in _enable_autocommit_by_name
    connections[name].cursor()
  File "/usr/local/autotest/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")
03/24 13:27:05.911 ERROR| email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 177, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 235, in initialize
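Error 2013 at 'reading initial communication packet' generally means the TCP connection opened but the server closed it (or stayed silent) before sending its greeting. A rough way to check that outside the scheduler is to open a socket to the database and see whether any handshake bytes arrive. This is only a sketch: the function name, the return labels, and the default port 3306 are assumptions, and the host to probe is whatever the shard's database config points at.

```python
import socket


def probe_mysql_handshake(host, port=3306, timeout=5.0):
    """Classify how far a connection to a MySQL-style server gets.

    Hypothetical helper. Returns one of:
      'unreachable'   - the TCP connect itself failed
      'no_handshake'  - connected, but the server closed the socket or
                        sent nothing before the timeout
      'ok'            - the server sent at least the start of a packet
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Nothing listening, firewall drop, or unresolvable hostname.
        return "unreachable"
    with sock:
        try:
            # A MySQL packet starts with a 3-byte length and a 1-byte
            # sequence id, so 4 bytes is enough to see that a greeting began.
            data = sock.recv(4)
        except OSError:
            # Timeout or connection reset while waiting for the greeting.
            return "no_handshake"
    return "ok" if data else "no_handshake"
```

A result of 'no_handshake' would match the traceback above and points at the server side (for example host-based access rules or a server that is not fully up) rather than at credentials, which are only checked after the greeting.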
,
Mar 24 2017
The broken boards:
chromeos-server104.mtv.corp.google.com: board:x86-zgb, board:celes, board:lars, board:cave, board:nyan_blaze
chromeos-server105.mtv.corp.google.com: board:zako, board:kefka, board:tricky, board:samus
,
Mar 24 2017
,
Mar 24 2017
/root/chromeos-admin/puppet/sync_and_run_puppet -f
,
Mar 24 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/80f9ee51d8280ca4c79b106aa45ada402e6e1a98
commit 80f9ee51d8280ca4c79b106aa45ada402e6e1a98
Author: Don Garrett <dgarrett@google.com>
Date: Fri Mar 24 21:33:07 2017
,
Mar 24 2017
We 'fixed' this by bringing up two new shards, transferring the DUTs over to them, and wiping 104 and 105. The new shards are chromeos-server108.mtv and chromeos-server109.mtv.
,
Mar 25 2017
,
Mar 27 2017
I feel there's something wrong with these newly added shards. The symptoms include:
1. crbug.com/705587 reports "Not enough DUTs" for some boards, including tricky and samus.
2. CQ failures due to "Not enough DUTs" on x86-zgb, elm, and kevin.
,
Mar 27 2017
Comment 1 by dgarr...@chromium.org
, Mar 24 2017