New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 792812 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

guado_moblab paladin: AFE timeout because of DB migration failure.

Project Member Reported by akes...@chromium.org, Dec 7 2017

Issue description

Cc: dgarr...@chromium.org pho...@chromium.org
marking experimental via tree status

Comment 3 Deleted

Looking at result logs for 8057, moblab's apache access.log has:

127.0.0.1 - - [06/Dec/2017:11:16:23 -0800] "POST /afe/server/noauth/rpc/?method=get_hosts HTTP/1.1" 500 2424
127.0.0.1 - - [06/Dec/2017:11:16:23 -0800] "POST /afe/server/noauth/rpc/?method=get_hosts HTTP/1.1" 500 2424

So the AFE was returning 500 here.
I SSH'ed into chromeos2-row1-rack8-host1.cros and pinged the AFE.
Here's the error being reported.

localhost ~ # cat /etc/lsb-release  | grep CHROMEOS_RELEASE_VERSION 
CHROMEOS_RELEASE_VERSION=10194.0.0-rc2
# This is from https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8063


localhost ~ # curl http://localhost/afe/server/noauth/rpc/?method=get_hosts
<html><head><body>
<!-- starting with sys.exc_info but hey, it's python -->
Type:  <br />
Value: (1146, &quot;Table &#39;chromeos_autotest_db.auth_user&#39; doesn&#39;t exist&quot;) <br />
Traceback: Traceback (most recent call last):
  File &quot;/usr/lib64/python2.7/site-packages/django/core/handlers/base.py&quot;, line 92, in get_response
    response = middleware_method(request)
  File &quot;/usr/local/autotest/frontend/apache_auth.py&quot;, line 83, in process_request
    password=&#39;&#39;)
  File &quot;/usr/lib64/python2.7/site-packages/django/contrib/auth/__init__.py&quot;, line 59, in authenticate
    user = backend.authenticate(**credentials)
  File &quot;/usr/local/autotest/frontend/apache_auth.py&quot;, line 21, in authenticate
    user = User.objects.get(username=username)
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/manager.py&quot;, line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/query.py&quot;, line 382, in get
    num = len(clone)
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/query.py&quot;, line 90, in __len__
    self._result_cache = list(self.iterator())
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/query.py&quot;, line 301, in iterator
    for row in compiler.results_iter():
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py&quot;, line 775, in results_iter
    for rows in self.execute_sql(MULTI):
  File &quot;/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py&quot;, line 840, in execute_sql
    cursor.execute(sql, params)
  File &quot;/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py&quot;, line 130, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File &quot;/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py&quot;, line 120, in execute
    return self.cursor.execute(query, args)
  File &quot;/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py&quot;, line 205, in execute
    self.errorhandler(self, exc, value)
  File &quot;/usr/lib64/python2.7/site-packages/MySQLdb/connections.py&quot;, line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1146, &quot;Table &#39;chromeos_autotest_db.auth_user&#39; doesn&#39;t exist&quot;)
 <br />
</body></head></html>

Owner: shuqianz@chromium.org
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/807528
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/809327

Two CLs that were chumped in the first failing build.
Charlene: Can you double check if these are to blame.
Confirmed by tryjobs, the failure is caused by these two CLs. 
Cc: dshi@chromium.org
Status: Fixed (was: Untriaged)
Status: Available (was: Fixed)
Summary: guado molab needs to wait longer time for the fresh installation (was: guado_moblab-paladin [Test-Logs]: moblab_RunSuite: ERROR: Unhandled AutoservError: Moblab AFE is not responding: Timeout occurred- waited 300.0 seconds.)
After discussed with xixuan@, dshi@, and deep dived into the code and logs, I think these two CLs are not the root cause of this failure. The root cause is that we don't wait long enough after a fresh installation for the moblab database setup. 

Here is the proof:
From comment #5, the chromeos_autotest_db.auth_user is not setup yet. 
From the scheduler log of the moblab: 
2/06 11:17:50.978 ERROR|        monitor_db:0192| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 181, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 378, in tick
    self._run_cleanup()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 278, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 405, in _run_cleanup
    self._periodic_cleanup.run_cleanup_maybe()
  File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 48, in run_cleanup_maybe
    self._cleanup()
  File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 75, in _cleanup
    self._django_session_cleanup()
  File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 263, in _django_session_cleanup
    self._db.execute(sql)
  File "/usr/local/autotest/database/database_connection.py", line 312, in execute
    results = self._backend.execute(query, parameters)
  File "/usr/local/autotest/database/database_connection.py", line 132, in execute
    parameters=parameters)
  File "/usr/local/autotest/database/database_connection.py", line 54, in execute
    self._cursor.execute(query, parameters)
  File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 130, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1146, "Table 'chromeos_autotest_db.django_session' doesn't exist")

The chromeos_autotest_db.django_session is not setup yet. 

Therefore, the database has not been setup yet when the scheduler kicked off the suite, then the suite failed.

I manually run the migration script, there is no errors. 

Owner: haddowk@chromium.org
Keith, can you take a look?
You are right that migration failed.
You're not quite right about timeout. I ran that curl initiated query a long time after provision.

Looking at https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/160902578-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite/sysinfo/var/log/bootup/ (look at moblab-database-init.log)

Applying migration 120_add_job_handoff_table up
Applying migration 121_update_afe_stable_versions_table_add_archive_url up
Applying migration 122_add_job_handoff_completed_columns up
Applying migration 123_add_static_labels_tables up
Traceback (most recent call last):
  File "/usr/local/autotest/database/migrate.py", line 483, in <module>
    main()
  File "/usr/local/autotest/database/migrate.py", line 444, in main
    manager.do_sync_db(version)
  File "/usr/local/autotest/database/migrate.py", line 356, in do_sync_db
    self.migrate_to_version_or_latest(version)
  File "/usr/local/autotest/database/migrate.py", line 344, in migrate_to_version_or_latest
    self.migrate_to_latest()
  File "/usr/local/autotest/database/migrate.py", line 305, in migrate_to_latest
    self.migrate_to_version(latest_version)
  File "/usr/local/autotest/database/migrate.py", line 260, in migrate_to_version
    self.do_migration(migration, migrate_up)
  File "/usr/local/autotest/database/migrate.py", line 228, in do_migration
    migration.migrate_up(self)
  File "/usr/local/autotest/database/migrate.py", line 72, in migrate_up
    self._execute_migration(self._UP_ATTRIBUTES, manager)
  File "/usr/local/autotest/database/migrate.py", line 63, in _execute_migration
    manager.execute_script(sql)
  File "/usr/local/autotest/database/migrate.py", line 141, in execute_script
    self.execute(statement)
  File "/usr/local/autotest/database/migrate.py", line 128, in execute
    return self._database.execute(query, parameters)
  File "/usr/local/autotest/database/database_connection.py", line 304, in execute
    results = self._backend.execute(query, parameters)
  File "/usr/local/autotest/database/database_connection.py", line 54, in execute
    self._cursor.execute(query, parameters)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1071, 'Specified key was too long; max key length is 767 bytes')


So, the migration failed because 124-XXX migration never finished because of this error.
Owner: shuqianz@chromium.org
Sorry, the problem is here: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/807528/4/frontend/migrations/123_add_static_labels_tables.py#11

According to https://dev.mysql.com/doc/refman/5.7/en/create-index.html an index (and hence a key) doesn't support total byte size > 767. A VARCHAR(750) field can contain more that 767 bytes, depending on the encoding.
I think explicitly setting a sane prefix size for the key should solve this problem.
i.e.: 
UNIQUE KEY name (name(50)),

I have not idea why we accept label names of 750 characters... but most of them can't differ for more than 50 chars anyway.

The DB migration script initializes the tables at schema #51: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/database/schema_051.sql?l=330
and applies the migrations from that point onwards.
So we never really run the migration that'd setup the afe_labels table. Now, why did the original setup not fail. My guess is that we hacked around it somehow.
Actually, this is a mariadb vs mysql db problem. We'd have run the original migrations on mysqldb, that's why they never caused any problme. Or we ran them before we started using InnoDB engine... 100 other reasons are possible.

This is why you should re-run all your migrations during tests :)
Status: Fixed (was: Available)
Summary: guado_moblab paladin: AFE timeout because of DB migration failure. (was: guado molab needs to wait longer time for the fresh installation)
This problem is fixed with the reverts.
Let's take discussion back to the original bug for landing those CLs.
If you think this is fixed, you should remove th builder from experimental. (doing that now)
The latest build when I marked this fixed was red due to BuildPackages error :/
So I didn't have any green builds to confirm.

I was going to come check this morning...
Fair enough. In this case the BuildPackages failure was due to an inflight CL (I think, I -Vd one anyway). Fortunately those CQ runs failed for other reasons and it didn't land.
 Issue 792667  has been merged into this issue.

Sign in to add a comment