StagingLab: new DB column and its use in the same push broke prod, but not staging
Issue description: We did a lab push after a passing push test, but quickly had to land this revert CL and do an emergency push: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/894823. Why didn't the push test catch this problem?
Jan 31 2018
From jkop: The revert crrev.com/c/894823 was necessary because a field was added before the DB migration it required. This was not caught in staging, and should have been.
Jan 31 2018
The root cause here is that the CL in question added a field to a Django model for which the actual column will only exist in the database after a DB migration. The CL that adds the column is in the same push, but the DB migration hadn't been run. This should have caused failures on the staging lab as well, unless somebody ran the DB migration there. Allen, did you run the DB migration on staging?
Jan 31 2018
Jan 31 2018
re #3 nope
Jan 31 2018
Re-assigning to Allen, who is the lucifer expert. The staging lab has failed to catch lucifer bugs at least twice that I remember; one was the DUT outage several weeks ago. As you implement lucifer, can you add test cases to the test_push.py script in the meantime?
Jan 31 2018
Not a lucifer issue; it's a staging lab issue. Handing back, since shuqianz, as secondary, was already working on fixing the staging lab. It went from failing too much to passing too much. Maybe pprabhu could take a look as well; he's been cleaning up the test push code. Perhaps some exception is getting dropped?
Jan 31 2018
This issue seems unrelated to the lucifer work; it's just a database migration issue not caught in the lab. Re-assigning to myself to check the staging lab.
Jan 31 2018
Re #7: test_push itself promotes the git-refs for autotest/chromite, so this isn't affected by my work, which only changed how errors are propagated back to the test_push task. I looked at the passing tests' logs and I don't see any indication that anything was wrong at all!
Jan 31 2018
chromeos-test@chromeos-staging-master2:/usr/local/autotest$ less ./logs/scheduler.log.2018-01-30-13.05.33
.....
01/30 13:15:22.913 INFO | drone_manager:0515| Drones refreshed.
01/30 13:15:22.915 DEBUG| monitor_db:1224| Starting _send_to_lucifer
01/30 13:15:22.931 DEBUG| ssh_host:0301| Running (ssh) '/usr/local/autotest/bin/job_reporter --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --jobdir /usr/local/autotest/leases --job-id 10341 --autoserv-exit 0 -- -resultsdir /usr/local/autotest/results/10341-chromeos-test/chromeos2-row1-rack2-host3 -autotestdir /usr/local/autotest -watcherpath /opt/infra-tools/usr/bin/lucifer_watcher </dev/null >>/usr/local/autotest/results/10341-chromeos-test/chromeos2-row1-rack2-host3/job_reporter_output.log 2>&1 &' from '_send_to_lucifer|_send_parsing_to_lucifer|spawn_parsing_job_handler|spawn|run|run_very_slowly'
01/30 13:15:22.991 ERROR| monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 977, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1031, in _send_parsing_to_lucifer
    models.JobHandoff.objects.create(job=job)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 149, in create
    return self.get_query_set().create(**kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 402, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 546, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 650, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 1661, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1054, "Unknown column 'drone' in 'field list'")
01/30 13:15:22.994 INFO | ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
01/30 13:15:33.990 INFO | ts_mon_config:0213| Finished waiting for ts_mon process.
The staging lab actually caught this issue, but the test still passed. Deep diving....
Jan 31 2018
This caused a failed push. At least understanding what the problem is here is Chase-worthy, imo.
Jan 31 2018
What is that _send_to_lucifer step? Is it the end of the test? It seems like a job-reporting step that runs after a job is finished. Is that the reason all the tests passed?
Jan 31 2018
+1 to Chase-Pending, same reasoning
Jan 31 2018
That is where a job gets handed off to lucifer to "do the rest of the steps". In this case, that should have been GATHERING and PARSING. It's the same in prod: as of this afternoon (if Allen landed his CL), we do those two steps via lucifer.
Jan 31 2018
The scheduler crashed and kept respawning itself, due to the missing drone column, while the 1pm testing push was running. But all the tests actually passed.
Jan 31 2018
Ah, so the tests were actually finished. It just failed to parse the results.
Jan 31 2018
So parsing/gathering failure cannot fail the test?
Jan 31 2018
The error occurs after it hands off to lucifer, so parsing and gathering ran normally.
Jan 31 2018
I believe the action item here is to add a check that the scheduler wasn't crashing during the test run. Ironically, lucifer did end up successfully finishing the gathering and parsing for the test despite the scheduler dying.
Feb 5 2018
Fix in progress: don't pass the staging test if important services are crashy.
Feb 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0de876d8a1af7557506311511f21471bb7cc4804

commit 0de876d8a1af7557506311511f21471bb7cc4804
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Thu Feb 08 00:30:31 2018

autotest: add service check during testing push

Previously there is no service check during testing push. This CL adds check to see whether a given service is in the crash loop during testing push.

BUG=chromium:807476
TEST=unittest

Change-Id: I178556a7082744cf98fecfd94b61ca61bd15312b
Reviewed-on: https://chromium-review.googlesource.com/896026
Reviewed-by: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Commit-Queue: Shuqian Zhao <shuqianz@chromium.org>
Trybot-Ready: Shuqian Zhao <shuqianz@chromium.org>

[modify] https://crrev.com/0de876d8a1af7557506311511f21471bb7cc4804/site_utils/test_push.py
Feb 12 2018
Comment 1 by shuqianz@chromium.org, Jan 31 2018