New issue
Advanced search Search tips
Starred by 1 user
Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment
DB connection errors kill job_reporter
Project Member Reported by akes...@chromium.org, Jan 24 Back to list
Build in question: https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-paladin/builds/30842

Test in question:
http://cautotest/afe/#tab_id=view_job&object_id=172134104

job_reporter_output.log:
https://storage.cloud.google.com/chromeos-autotest-results/172134104-chromeos-test/chromeos6-row2-rack7-host12/job_reporter_output.log?_ga=2.36199200.-1235484419.1474328066

extract from log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 152, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 44, in main
    return _main(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 77, in _main
    handler = _make_handler(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 91, in _make_handler
    job = models.Job.objects.get(id=args.job_id)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/query.py", line 382, in get
    num = len(clone)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/query.py", line 90, in __len__
    self._result_cache = list(self.iterator())
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/query.py", line 301, in iterator
    for row in compiler.results_iter():
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 775, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 839, in execute_sql
    cursor = self.connection.cursor()
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-5e245727ed23dd962b0bb6d1c7a40f7d/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")




As a result, this otherwise passing test was eventually marked as aborted by the suite.

Do we need job_reporter to retry harder on temporary unavailablity of db? Which db was itunable to connect to actually?
 
Cc: pprabhu@chromium.org dshi@chromium.org
job_reporter death = failed job is WAI.  Without digging into it, I hypothesize that MySQL restarted at this exact moment, causing the connection to a local Unix socket file to fail.  Otherwise, I can't think of a reason this error would occur.

Probably there is a way to add a retry in Django or MySQLdb somewhere.  Finding it might take a while since documentation for our version of Django doesn't even exist any more as far as I know.

+dshi, +pprabhu for insights on error and/or ways to retry.
I'm on the fence about retries here.
I do have two related requests:

- Add a "starting now" log message to job_reporter_output.log to anchor inferences about when the logs start.
I think the first import should be early enough

job_reporter: 2018-01-24 15:16:17,880:DEBUG:autotest:find_module:113:Dummying out autotest_lib.frontend.common import

But still, please add both a "starting" and a "done" log message. tko/parse logs are impossible to read because we never know if it finished successfully or not..

- This is another reason we want metrics from job_aborter. By spewing metrics about the different types of jobs abortion actions, we'll be able to peg the severity of this type of failure.
Labels: Hotlist-Lucifer
Retrying in Django can be problematic in the long term, as it can either hide some real issue or being too flake to actually work. I think retry the db call in job_reporter is reasonable to do, with some metrics reporting. 
Project Member Comment 5 by bugdroid1@chromium.org, Jan 25
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bda3236d53b3f7d61302aa444163d92be9a037b1

commit bda3236d53b3f7d61302aa444163d92be9a037b1
Author: Allen Li <ayatane@chromium.org>
Date: Thu Jan 25 05:35:06 2018

[autotest] Add start/finish logging to lucifer scripts

BUG=chromium:805724
TEST=None

Change-Id: I9ab3c62ae0cd7c8d8a264fa82d22db40f4201454
Reviewed-on: https://chromium-review.googlesource.com/885106
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/bda3236d53b3f7d61302aa444163d92be9a037b1/venv/lucifer/cmd/job_reporter.py
[modify] https://crrev.com/bda3236d53b3f7d61302aa444163d92be9a037b1/venv/lucifer/cmd/job_aborter.py

Are you saying to wrap every single DB call in job_reporter with a try/catch/retry block?  That doesn't sound like a sustainable solution.
Didn't know we have more than one db call, I think tko/parse.py has only one or two calls.

What about retry job_reporter as a whole? Is there any other flakes we want to address?
Summary: DB connection errors kill job_reporter (was: job_reporter died, causing passing test to appear aborted)
Filed separate bug for missing TKO entries https://bugs.chromium.org/p/chromium/issues/detail?id=806065
Sign in to add a comment