Issue 859578

Issue metadata

Status: Duplicate
Owner:
Closed: Jul 2
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



afe connection flake can kill job_reporter

Project Member Reported by xixuan@chromium.org, Jul 2

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18999

This is not actually a 'suite aborted' failure; it is caused by tko flakiness.

The "lucifer/job_reporter_output.log" file in some of the "aborted" jobs, such as
https://stainless.corp.google.com/browse/chromeos-autotest-results/213105375-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/213105047-chromeos-test/

shows a MySQL connection error:

ifer_run_job: 2018/06/30 04:00:05 Sending event parsing
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 209, in <module>
    sys.exit(main(sys.argv))
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 46, in main
    ret = _main(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 93, in _main
    return _run_autotest_job(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 106, in _run_autotest_job
    ret = _run_lucifer_job(handler, args, job)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 148, in _run_lucifer_job
    event_handler=event_handler, args=command_args)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 93, in run_event_command
    _handle_subprocess_events(event_handler, proc)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 112, in _handle_subprocess_events
    _handle_output_line(event_handler, line)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 127, in _handle_output_line
    event_handler(event, message)
  File "/usr/local/autotest/venv/lucifer/handlers.py", line 57, in __call__
    handler(msg)
  File "/usr/local/autotest/venv/lucifer/handlers.py", line 72, in _handle_gathering
    status=models.HostQueueEntry.Status.GATHERING)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/query.py", line 567, in update
    rows = query.get_compiler(self.db).execute_sql(None)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 839, in execute_sql
    cursor = self.connection.cursor()
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")

As a result, the tko database has no records for these jobs, which makes dynamic_suite think they were aborted.
 
Labels: -Pri-2 -Chase-Pending Chase Pri-1
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
Summary: afe connection flake can kill job_reporter (was: tko flakiness cause misleading "suite timed out".)
action: add a retry at this call site / use a retrying wrapper / something like that.
Mergedinto: 805724
Status: Duplicate (was: Assigned)
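
The "retrying wrapper" suggested in the action note could look roughly like the sketch below. It is only an illustration: `retry_on`, its parameters, `FakeOperationalError`, and `make_flaky_update` are hypothetical names that do not come from lucifer or autotest, and the fake update stands in for the Django `update()` call in `_handle_gathering` so the example runs without a database.

```python
import functools
import time


def retry_on(exceptions, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Return a decorator that retries the wrapped call on the given exceptions.

    All names and defaults here are hypothetical; they are not taken from
    lucifer or autotest.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    sleep(wait)      # pause before the next attempt
                    wait *= backoff  # grow the pause each time
        return wrapper
    return decorator


class FakeOperationalError(Exception):
    """Stand-in for the MySQL OperationalError seen in the traceback."""


def make_flaky_update(failures):
    """Build a fake update that fails `failures` times, then reports 1 row."""
    state = {"calls": 0}

    def update():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeOperationalError(
                2002, "Can't connect to local MySQL server")
        return 1

    return update, state


if __name__ == "__main__":
    update, state = make_flaky_update(failures=2)
    safe_update = retry_on(FakeOperationalError, attempts=3, delay=0.1)(update)
    print(safe_update(), state["calls"])
```

In job_reporter the wrapped callable would presumably be the status update that raised in the traceback, and the exception to catch would be the `OperationalError` shown there. Whether a plain retry is enough, or the connection also needs to be reset between attempts, is not something this sketch settles.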