
Issue 767976

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocking:
issue 754036




skunk-1 scheduler failed after board:auron_paine was migrated to it

Project Member Reported by nxia@chromium.org, Sep 22 2017

Issue description

I migrated an experimental paladin to skunk-1. It ran for a while and then failed with the following error. I've moved auron_paine back and am going to investigate the error.

12:00:11 INFO | os.environ: {'USERNAME': 'chromeos-test', 'SUDO_COMMAND': '/usr/local/autotest/scheduler/monitor_db.py /usr/local/autotest/results --production', 'TERM': 'linux', 'SHELL': '/bin/bash', 'TZ': 'America/Los_Angeles', 'DJANGO_SETTINGS_MODULE': 'autotest_lib.frontend.settings', 'SUDO_UID': '0', 'SUDO_GID': '0', 'LOGNAME': 'chromeos-test', 'USER': 'chromeos-test', 'NO_GCE_CHECK': 'False', 'MAIL': '/var/mail/chromeos-test', 'PATH': '/usr/sbin:/usr/bin:/sbin:/bin', 'SUDO_USER': 'root', 'HOME': '/usr/local/google/home/chromeos-test'}
12:00:11 WARNI| Elasticsearch db deprecated, no metadata will be reported.
12:00:11 INFO | Metadata reporting thread is started.
12:00:11 INFO | Starting new HTTP connection (1): metadata.google.internal
12:00:11 INFO | 12:00:11 09/22/17> dispatcher starting
12:00:11 INFO | My PID is 35494
12:00:11 ERROR| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 164, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 222, in initialize
    _db_manager = scheduler_lib.ConnectionManager()
  File "/usr/local/autotest/server/site_utils.py", line 87, in __call__
    *args, **kwargs)
  File "/usr/local/autotest/scheduler/scheduler_lib.py", line 75, in __init__
    setup_django_environment.enable_autocommit()
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 21, in enable_autocommit
    _enable_autocommit_by_name('default')
  File "/usr/local/autotest/frontend/setup_django_environment.py", line 14, in _enable_autocommit_by_name
    connections[name].cursor()
  File "/usr/local/autotest/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)")
12:00:11 INFO | Waiting for ts_mon flushing process to finish...
12:00:11 NOTIC| ts_mon was set up.
12:00:11 INFO | Attempting refresh to obtain initial access_token
12:00:11 INFO | Refreshing access_token
12:00:11 INFO | Finished waiting for ts_mon process.
12:00:11 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
12:00:11 ERROR| Exception escaping in monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 189, in main_without_exception_handling
    _drone_manager.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1365, in <module>
    main()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 189, in main_without_exception_handling
    _drone_manager.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
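
Note on the second traceback: the AttributeError is a follow-on failure. The shutdown path calls `_drone_manager.shutdown()` even though `initialize()` failed before the drone manager was created, so `_drone_manager` is still None. A minimal sketch of a guarded cleanup, assuming hypothetical names (`safe_shutdown` and `FakeDroneManager` are illustrations only, not autotest code):

```python
class FakeDroneManager(object):
    """Hypothetical stand-in for the real drone manager, for illustration."""

    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True


def safe_shutdown(manager):
    """Call manager.shutdown() only if the manager was actually created.

    Returns True if shutdown() was called, False if there was nothing
    to shut down (for example, because initialization failed early).
    """
    if manager is None:
        return False
    manager.shutdown()
    return True
```

With a guard like this, the original OperationalError would stay the last error in the log instead of being followed by the AttributeError.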

 

Comment 1 by nxia@chromium.org, Sep 22 2017

Summary: skunk-1 scheduler failed after board:auron_paine was migrated to it (was: skunk-1 scheduler failed after migrating board:auron_paine to it)

Comment 2 by nxia@chromium.org, Sep 22 2017

Cc: akes...@chromium.org xixuan@chromium.org
After I moved the board back to chromeos-server111.mtv.corp.google.com, skunk-1 returned to normal. There was downtime on skunk-1, so deputies may have received alerts.

Comment 3 by nxia@chromium.org, Sep 22 2017



11:13:04 INFO | Drones refreshed.
Exception occurred formatting message: 'Starting _run_cleanup' using args ()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1365, in <module>
    main()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 96, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 172, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 366, in tick
    self._run_cleanup()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 268, in wrapper
    self._log_tick_msg('Starting %s' % func.__name__)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1091, in _log_tick_msg
    logging.debug(msg)
  File "/usr/lib/python2.7/logging/__init__.py", line 1622, in debug
    root.debug(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1140, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 942, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 86, in _autotest_logging_handle_error
    traceback.print_stack()
--------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 875, in emit
    self.flush()
  File "/usr/lib/python2.7/logging/__init__.py", line 835, in flush
    self.stream.flush()
IOError: [Errno 28] No space left on device
Future logging formatting exceptions disabled.




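It looks like skunk-1 doesn't have enough space under "/": per the df output below, the root filesystem is 96% full with only 1.8G available, which matches the "No space left on device" IOError above. On skunk1 the extra disk is mounted at "/", but on skunk-1 the extra disk is mounted at "/dev". I'll file a ticket asking the lab to configure the filesystems on skunk-1 through skunk-5 like skunk1.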


root@chromeos-skunk-1:/var/log# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            126G   12K  126G   1% /dev
tmpfs            26G   62M   26G   1% /run
/dev/dm-0        38G   34G  1.8G  96% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            126G  2.4M  126G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       472M   33M  415M   8% /boot
tmpfs           100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs           100K     0  100K   0% /var/lib/lxd/devlxd




nxia@chromeos-skunk1:/$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.8G   12K  7.8G   1% /dev
tmpfs           1.6G  222M  1.4G  14% /run
/dev/dm-0       229G   63G  157G  29% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            7.8G  1.4M  7.8G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       472M   72M  376M  17% /boot
objfsd          229G   63G  157G  29% /google/obj
tmpfs           100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs           100K     0  100K   0% /var/lib/lxd/devlxd
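
The comparison above could be turned into a quick pre-migration check. This is only a sketch under assumptions: the function names and the 10% threshold are hypothetical, and it uses Python 3's `shutil.disk_usage` rather than anything from autotest (which runs on Python 2.7 here).

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a total of zero; treat as no space.
        return 0.0
    return usage.free / usage.total


def has_enough_space(path="/", min_free_fraction=0.10):
    """True if at least `min_free_fraction` of `path` is free.

    The 10% default is an arbitrary assumption for illustration.
    """
    return free_fraction(path) >= min_free_fraction


if __name__ == "__main__":
    print("free fraction on /: %.2f" % free_fraction("/"))
    print("enough space: %s" % has_enough_space("/"))
```

Run on skunk-1 in the state shown above, a check like this would have flagged "/" before the board was moved, since a filesystem that is 96% full is well under a 10% free threshold.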


Comment 4 by nxia@chromium.org, Sep 27 2017

The disk problem has been resolved in b/66707209. I will try to migrate the experimental build again.

Comment 5 by nxia@chromium.org, Sep 28 2017

The shard-client on skunk-1 failed to send heartbeats after I migrated the board there. The issue is tracked at crbug.com/769551.

I've moved the board auron_paine back to chromeos-server111.mtv.corp.google.com. I will try again after crbug.com/769551 is resolved.

Comment 6 by nxia@chromium.org, Sep 28 2017

The global_afe_hostname puppet config issue has been resolved in crbug.com/769551. I will test auron_paine on skunk-1 again.

Comment 7 by nxia@chromium.org, Sep 29 2017

auron_paine-paladin has been working well on skunk-1 after the migration.

https://uberchromegw.corp.google.com/i/chromeos/builders/auron_paine-paladin/builds/931


Comment 8 by nxia@chromium.org, Sep 29 2017

Status: Fixed (was: Untriaged)

Comment 9 by nxia@chromium.org, Mar 12 2018

Blocking: 754036
