
Issue 705068

Starred by 1 user

Issue metadata

Status: Duplicate
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocking:
issue 690307




Schedulers not running on new shards

Project Member Reported by dgarr...@chromium.org, Mar 24 2017

Issue description

During the lab migration last night a number of new shards were added. I've been using chromeos-server104.mtv to investigate.

On some (all?) of them, the scheduler is failing to start up. After updating the puppet configuration to set up these new servers and forcing a puppet run, all of the errors are gone except the following.

I read this as monitor_db being unable to connect to MySQL, which kills the scheduler process.




03/24 13:27:05.785 INFO |     status_server:0120| Status server running on ('0.0.0.0', 13467)
03/24 13:27:05.786 INFO | metadata_reporter:0148| Metadata reporting thread is started.
03/24 13:27:05.823 INFO |        monitor_db:0213| 13:27:05 03/24/17> dispatcher starting
03/24 13:27:05.824 INFO |        monitor_db:0214| My PID is 13934
03/24 13:27:05.908 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 177, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 235, in initialize
    _db_manager = scheduler_lib.ConnectionManager()
  File "/usr/local/autotest/server/site_utils.py", line 85, in __call__
    *args, **kwargs)
  File "/usr/local/autotest/scheduler/scheduler_lib.py", line 64, in __init__
    setup_django_environment.enable_autocommit()
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 22, in enable_autocommit
    _enable_autocommit_by_name('global')
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 14, in _enable_autocommit_by_name
    connections[name].cursor()
  File "/usr/local/autotest/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")
03/24 13:27:05.911 ERROR|     email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 177, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 235, in initialize

 
Cc: akes...@chromium.org ihf@chromium.org shuqianz@chromium.org
This is happening on:
chromeos-server104.mtv
chromeos-server105.mtv

Not on:
chromeos-server98.mtv
chromeos-server99.mtv
chromeos-server100.mtv
chromeos-server101.mtv
chromeos-server102.mtv
chromeos-server103.mtv

The broken boards:

chromeos-server104.mtv.corp.google.com  board:x86-zgb, board:celes, board:lars, board:cave, board:nyan_blaze
chromeos-server105.mtv.corp.google.com  board:zako, board:kefka, board:tricky, board:samus

Comment 3 by ihf@chromium.org, Mar 24 2017

Blocking: 690307
/root/chromeos-admin/puppet/sync_and_run_puppet  -f

Comment 5 by bugdroid1@chromium.org, Mar 24 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/80f9ee51d8280ca4c79b106aa45ada402e6e1a98

commit 80f9ee51d8280ca4c79b106aa45ada402e6e1a98
Author: Don Garrett <dgarrett@google.com>
Date: Fri Mar 24 21:33:07 2017

Owner: shuqianz@chromium.org
We 'fixed' this by bringing up two new shards, transferring the DUTs over to them, and wiping 104 and 105.

chromeos-server108.mtv and chromeos-server109.mtv are the new ones.
Status: Fixed (was: Started)

Comment 8 by xixuan@chromium.org, Mar 27 2017

Status: Untriaged (was: Fixed)
I suspect something is wrong with these newly added shards. The symptoms include:

1. crbug.com/705587 reports "Not enough DUTs" for some boards, including tricky and samus.
2. CQ failures due to "Not enough DUTs" on x86-zgb, elm, kevin.


Mergedinto: 705633
Status: Duplicate (was: Untriaged)
