
Issue 864783


Issue metadata

Status: Archived
Owner: ----
Closed: Jul 19
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




scheduler crash on shutdown: metrics flush failed

Project Member Reported by pprabhu@chromium.org, Jul 17

Issue description

07/17 15:42:12.556 INFO |        monitor_db:0217| Shutdown request received.
07/17 15:42:12.557 INFO |        monitor_db:0217| Shutdown request received.
07/17 15:42:12.581 DEBUG|        monitor_db:1246| Starting _schedule_special_tasks
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'CumulativeSecondsDistribution'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'PercentageDistribution'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'FloatMetric'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'PercentageDistribution'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'FloatMetric'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'PercentageDistribution'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'FloatMetric'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'PercentageDistribution'
07/17 15:42:12.586 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'FloatMetric'
07/17 15:42:12.587 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'PercentageDistribution'
07/17 15:42:12.587 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'CumulativeMetric'
07/17 15:42:12.587 WARNI|           metrics:0091| Flushing process has been closed (exit code -15), skipped sending metric 'CumulativeSecondsDistribution'
07/17 15:42:12.587 ERROR|        monitor_db:0205| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 194, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 493, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 389, in tick
    self._schedule_special_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 307, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 813, in _schedule_special_tasks
    only_tasks_with_leased_hosts=not self._inline_host_acquisition):
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 493, in wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 452, in SecondsTimer
    m.add(dt, fields={k: f[k] for k in keys})
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 176, in func
    return getattr(self._instance, prop)(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 91, in enqueue
    self.metric)
  File "/usr/lib/python2.7/logging/__init__.py", line 1604, in warning
    root.warning(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1164, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/local/autotest/client/setup_modules.py", line 85, in _autotest_logging_handle_error
    '%r using args %r\n' % (record.msg, record.args))
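
A note on the `exit code -15` in the warnings above: Python's process APIs report a child killed by signal N as exit code -N on POSIX, and signal 15 is SIGTERM. That would be consistent with the flushing process having been terminated as part of the shutdown, though the log alone doesn't show who sent the signal. A minimal sketch of the convention, using a throwaway `subprocess` child as a stand-in (the real flushing process is not shown in this report, and `terminated_child_exit_code` is a hypothetical helper):

```python
import signal
import subprocess
import sys


def terminated_child_exit_code():
    """Start a sleeping child, send it SIGTERM, and return its exit code."""
    # Stand-in child process; it just sleeps until it is signalled.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    child.terminate()  # Sends SIGTERM on POSIX.
    return child.wait()


if __name__ == "__main__":
    code = terminated_child_exit_code()
    # On POSIX a signal death is reported as the negated signal number.
    print(code == -signal.SIGTERM)
```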

 
Labels: Hotlist-Deputy
The shard schedulers restarted a few times this afternoon.
I suspect this is related to the tsmon outage: our metrics pipeline may not be resilient to a large number of metric-sending failures.

The scheduler recovered on its own each time, and this isn't happening often enough to be alarming.
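
The traceback ends inside autotest's `_autotest_logging_handle_error` hook, which suggests that a failure to emit the "skipped sending metric" warning was re-raised into `dispatcher.tick()`. This is an assumption, since the final exception line isn't in the report. Below is a minimal sketch of that failure mode using only the standard `logging` module, plus one possible guard. `BrokenHandler`, `RaisingHandler`, `make_logger`, and `safe_warning` are hypothetical names for illustration, not chromite or autotest code.

```python
import logging


class BrokenHandler(logging.Handler):
    """Hypothetical handler whose emit() always fails, standing in for a
    log stream that went away during shutdown."""

    def emit(self, record):
        try:
            raise IOError("log stream closed")
        except Exception:
            # Same pattern the stdlib handlers use on emit failures.
            self.handleError(record)


class RaisingHandler(BrokenHandler):
    """Assumed behaviour of an error hook that re-raises instead of
    swallowing the failure."""

    def handleError(self, record):
        raise RuntimeError(
            "logging failed for %r using args %r" % (record.msg, record.args))


def make_logger(name, handler):
    """Build an isolated logger that only uses the given handler."""
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.WARNING)
    return logger


def safe_warning(logger, msg, *args):
    """Log a warning without letting a logging failure reach the caller.

    Returns True if the logging call completed, False if it raised.
    """
    try:
        logger.warning(msg, *args)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    logger = make_logger("sketch.raising", RaisingHandler())
    try:
        logger.warning("skipped sending metric %r", "FloatMetric")
    except RuntimeError as e:
        print("unguarded call raised: %s" % e)
    print("guarded call completed: %s"
          % safe_warning(logger, "skipped sending metric %r", "FloatMetric"))
```

With a guard like this around the warning, a logging failure would cost a log line rather than a scheduler tick. Whether swallowing the error is the right trade-off for the scheduler is a separate question.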
Status: Archived (was: Untriaged)
Not observed in the last two days.
