New issue
Advanced search Search tips
Starred by 1 user
Status: Assigned
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Sign in to add a comment
DB connection errors kill job_reporter
Project Member Reported by, Jan 24 Back to list
Build in question:

Test in question:


extract from log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/", line 152, in <module>
  File "/usr/local/autotest/venv/lucifer/cmd/", line 44, in main
    return _main(args)
  File "/usr/local/autotest/venv/lucifer/cmd/", line 77, in _main
    handler = _make_handler(args)
  File "/usr/local/autotest/venv/lucifer/cmd/", line 91, in _make_handler
    job = models.Job.objects.get(id=args.job_id)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/", line 382, in get
    num = len(clone)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/", line 90, in __len__
    self._result_cache = list(self.iterator())
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/", line 301, in iterator
    for row in compiler.results_iter():
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/sql/", line 775, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/sql/", line 839, in execute_sql
    cursor = self.connection.cursor()
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/backends/", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/backends/mysql/", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")

As a result, this otherwise passing test was eventually marked as aborted by the suite.

Do we need job_reporter to retry harder on temporary unavailablity of db? Which db was itunable to connect to actually?
job_reporter death = failed job is WAI.  Without digging into it, I hypothesize that MySQL restarted at this exact moment, causing the connection to a local Unix socket file to fail.  Otherwise, I can't think of a reason this error would occur.

Probably there is a way to add a retry in Django or MySQLdb somewhere.  Finding it might take a while since documentation for our version of Django doesn't even exist any more as far as I know.

+dshi, +pprabhu for insights on error and/or ways to retry.
I'm on the fence about retries here.
I do have two related requests:

- Add a "starting now" log message to job_reporter_output.log to anchor inferences about when the logs start.
I think the first import should be early enough

job_reporter: 2018-01-24 15:16:17,880:DEBUG:autotest:find_module:113:Dummying out autotest_lib.frontend.common import

But still, please add both a "starting" and a "done" log message. tko/parse logs are impossible to read because we never know if it finished successfully or not..

- This is another reason we want metrics from job_aborter. By spewing metrics about the different types of jobs abortion actions, we'll be able to peg the severity of this type of failure.
Labels: Hotlist-Lucifer
Retrying in Django can be problematic in the long term, as it can either hide some real issue or being too flake to actually work. I think retry the db call in job_reporter is reasonable to do, with some metrics reporting. 
Project Member Comment 5 by, Jan 25
The following revision refers to this bug:

commit bda3236d53b3f7d61302aa444163d92be9a037b1
Author: Allen Li <>
Date: Thu Jan 25 05:35:06 2018

[autotest] Add start/finish logging to lucifer scripts


Change-Id: I9ab3c62ae0cd7c8d8a264fa82d22db40f4201454
Commit-Ready: Allen Li <>
Tested-by: Allen Li <>
Reviewed-by: Prathmesh Prabhu <>


Are you saying to wrap every single DB call in job_reporter with a try/catch/retry block?  That doesn't sound like a sustainable solution.
Didn't know we have more than one db call, I think tko/ has only one or two calls.

What about retry job_reporter as a whole? Is there any other flakes we want to address?
Summary: DB connection errors kill job_reporter (was: job_reporter died, causing passing test to appear aborted)
Filed separate bug for missing TKO entries
Sign in to add a comment