guado_moblab paladin: AFE timeout because of DB migration failure. |
||||||||
Issue descriptionno common CL in these failures: https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8057 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8058 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8059 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8060 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8061
,
Dec 7 2017
marking experimental via tree status
,
Dec 7 2017
Looking at result logs for 8057, moblab's apache access.log has: 127.0.0.1 - - [06/Dec/2017:11:16:23 -0800] "POST /afe/server/noauth/rpc/?method=get_hosts HTTP/1.1" 500 2424 127.0.0.1 - - [06/Dec/2017:11:16:23 -0800] "POST /afe/server/noauth/rpc/?method=get_hosts HTTP/1.1" 500 2424 So the AFE was returning 500 here.
,
Dec 7 2017
I SSH'ed into chromeos2-row1-rack8-host1.cros and pinged the AFE. Here's the error being reported. localhost ~ # cat /etc/lsb-release | grep CHROMEOS_RELEASE_VERSION CHROMEOS_RELEASE_VERSION=10194.0.0-rc2 # This is from https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8063 localhost ~ # curl http://localhost/afe/server/noauth/rpc/?method=get_hosts <html><head><body> <!-- starting with sys.exc_info but hey, it's python --> Type: <br /> Value: (1146, "Table 'chromeos_autotest_db.auth_user' doesn't exist") <br /> Traceback: Traceback (most recent call last): File "/usr/lib64/python2.7/site-packages/django/core/handlers/base.py", line 92, in get_response response = middleware_method(request) File "/usr/local/autotest/frontend/apache_auth.py", line 83, in process_request password='') File "/usr/lib64/python2.7/site-packages/django/contrib/auth/__init__.py", line 59, in authenticate user = backend.authenticate(**credentials) File "/usr/local/autotest/frontend/apache_auth.py", line 21, in authenticate user = User.objects.get(username=username) File "/usr/lib64/python2.7/site-packages/django/db/models/manager.py", line 143, in get return self.get_query_set().get(*args, **kwargs) File "/usr/lib64/python2.7/site-packages/django/db/models/query.py", line 382, in get num = len(clone) File "/usr/lib64/python2.7/site-packages/django/db/models/query.py", line 90, in __len__ self._result_cache = list(self.iterator()) File "/usr/lib64/python2.7/site-packages/django/db/models/query.py", line 301, in iterator for row in compiler.results_iter(): File "/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py", line 775, in results_iter for rows in self.execute_sql(MULTI): File "/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql cursor.execute(sql, params) File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 130, in execute six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2]) File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 120, in execute return self.cursor.execute(query, args) File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute self.errorhandler(self, exc, value) File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue DatabaseError: (1146, "Table 'chromeos_autotest_db.auth_user' doesn't exist") <br /> </body></head></html>
,
Dec 7 2017
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/807528 https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/809327 Two CLs that were chumped in the first failing build. Charlene: Can you double check if these are to blame.
,
Dec 7 2017
Confirmed by tryjobs, the failure is caused by these two CLs.
,
Dec 7 2017
,
Dec 7 2017
,
Dec 7 2017
After discussed with xixuan@, dshi@, and deep dived into the code and logs, I think these two CLs are not the root cause of this failure. The root cause is that we don't wait long enough after a fresh installation for the moblab database setup.
Here is the proof:
From comment #5, the chromeos_autotest_db.auth_user is not setup yet.
From the scheduler log of the moblab:
2/06 11:17:50.978 ERROR| monitor_db:0192| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 181, in main_without_exception_handling
dispatcher.tick()
File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 378, in tick
self._run_cleanup()
File "/usr/local/autotest/scheduler/monitor_db.py", line 278, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 405, in _run_cleanup
self._periodic_cleanup.run_cleanup_maybe()
File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 48, in run_cleanup_maybe
self._cleanup()
File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 75, in _cleanup
self._django_session_cleanup()
File "/usr/local/autotest/scheduler/monitor_db_cleanup.py", line 263, in _django_session_cleanup
self._db.execute(sql)
File "/usr/local/autotest/database/database_connection.py", line 312, in execute
results = self._backend.execute(query, parameters)
File "/usr/local/autotest/database/database_connection.py", line 132, in execute
parameters=parameters)
File "/usr/local/autotest/database/database_connection.py", line 54, in execute
self._cursor.execute(query, parameters)
File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 130, in execute
six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
File "/usr/lib64/python2.7/site-packages/django/db/backends/mysql/base.py", line 120, in execute
return self.cursor.execute(query, args)
File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
self.errorhandler(self, exc, value)
File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
DatabaseError: (1146, "Table 'chromeos_autotest_db.django_session' doesn't exist")
The chromeos_autotest_db.django_session is not setup yet.
Therefore, the database has not been setup yet when the scheduler kicked off the suite, then the suite failed.
I manually run the migration script, there is no errors.
,
Dec 7 2017
Keith, can you take a look?
,
Dec 7 2017
You are right that migration failed. You're not quite right about timeout. I ran that curl initiated query a long time after provision. Looking at https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/160902578-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite/sysinfo/var/log/bootup/ (look at moblab-database-init.log) Applying migration 120_add_job_handoff_table up Applying migration 121_update_afe_stable_versions_table_add_archive_url up Applying migration 122_add_job_handoff_completed_columns up Applying migration 123_add_static_labels_tables up Traceback (most recent call last): File "/usr/local/autotest/database/migrate.py", line 483, in <module> main() File "/usr/local/autotest/database/migrate.py", line 444, in main manager.do_sync_db(version) File "/usr/local/autotest/database/migrate.py", line 356, in do_sync_db self.migrate_to_version_or_latest(version) File "/usr/local/autotest/database/migrate.py", line 344, in migrate_to_version_or_latest self.migrate_to_latest() File "/usr/local/autotest/database/migrate.py", line 305, in migrate_to_latest self.migrate_to_version(latest_version) File "/usr/local/autotest/database/migrate.py", line 260, in migrate_to_version self.do_migration(migration, migrate_up) File "/usr/local/autotest/database/migrate.py", line 228, in do_migration migration.migrate_up(self) File "/usr/local/autotest/database/migrate.py", line 72, in migrate_up self._execute_migration(self._UP_ATTRIBUTES, manager) File "/usr/local/autotest/database/migrate.py", line 63, in _execute_migration manager.execute_script(sql) File "/usr/local/autotest/database/migrate.py", line 141, in execute_script self.execute(statement) File "/usr/local/autotest/database/migrate.py", line 128, in execute return self._database.execute(query, parameters) File "/usr/local/autotest/database/database_connection.py", line 304, in execute results = self._backend.execute(query, parameters) File "/usr/local/autotest/database/database_connection.py", line 54, in execute self._cursor.execute(query, parameters) File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute self.errorhandler(self, exc, value) File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue _mysql_exceptions.OperationalError: (1071, 'Specified key was too long; max key length is 767 bytes') So, the migration failed because 124-XXX migration never finished because of this error.
,
Dec 7 2017
Sorry, the problem is here: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/807528/4/frontend/migrations/123_add_static_labels_tables.py#11 According to https://dev.mysql.com/doc/refman/5.7/en/create-index.html an index (and hence a key) doesn't support total byte size > 767. A VARCHAR(750) field can contain more that 767 bytes, depending on the encoding. I think explicitly setting a sane prefix size for the key should solve this problem. i.e.: UNIQUE KEY name (name(50)), I have not idea why we accept label names of 750 characters... but most of them can't differ for more than 50 chars anyway. The DB migration script initializes the tables at schema #51: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/database/schema_051.sql?l=330 and applies the migrations from that point onwards. So we never really run the migration that'd setup the afe_labels table. Now, why did the original setup not fail. My guess is that we hacked around it somehow.
,
Dec 7 2017
Actually, this is a mariadb vs mysql db problem. We'd have run the original migrations on mysqldb, that's why they never caused any problme. Or we ran them before we started using InnoDB engine... 100 other reasons are possible. This is why you should re-run all your migrations during tests :)
,
Dec 7 2017
This problem is fixed with the reverts. Let's take discussion back to the original bug for landing those CLs.
,
Dec 8 2017
If you think this is fixed, you should remove th builder from experimental. (doing that now)
,
Dec 8 2017
The latest build when I marked this fixed was red due to BuildPackages error :/ So I didn't have any green builds to confirm. I was going to come check this morning...
,
Dec 8 2017
Fair enough. In this case the BuildPackages failure was due to an inflight CL (I think, I -Vd one anyway). Fortunately those CQ runs failed for other reasons and it didn't land.
,
Dec 9 2017
Issue 792667 has been merged into this issue. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by akes...@chromium.org
, Dec 7 2017