New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 807476 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

StagingLab: new DB column and use in the same push broke prod, but not staging

Project Member Reported by dgarr...@chromium.org, Jan 31 2018

Issue description

We did a lab push after a passing push test, but quickly had to land this revert CL and do an emergency push.

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/894823


Why didn't the push test catch this problem?
 
Cc: ayatane@chromium.org shuqianz@chromium.org
 Issue 807477  has been merged into this issue.
From jkop:

The revert crrev.com/c/894823 was necessary because a field was added before the DB migration it required. This was not caught in staging, and should have been.
The root cause here is that the CL in question added a field to a django model for which the actual column will only exist in the database after a DB migration.

The CL that adds the column is in the same push, but db migration hasn't been run.

This should have caused failures on the staging lab as well, unless somebody ran the DB migration there. Allen, did you run DB migration on staging?
Summary: StagingLab: new DB column and use in the same push broke prod, but not staging (was: Lab push test failed to find errors that broke the lab.)
re #3 nope
Owner: ayatane@chromium.org
Re-assign to Allen, who is lucifier expert. It seems the staging lab haven't caught the lucifier bugs at least twice I remember, one is the DUT outage several weeks before. As you implement the lucifier, can you add test cases in the meantime in the test_push.py script?
Owner: shuqianz@chromium.org
Not a lucifer issue, it's a staging lab issue.

Handing back since shuqianz as secondary was working on fixing the staging lab already.  It went from failing to passing too much.

Maybe pprabhu could take a look as well, he's been cleaning up the test push code. Perhaps some exception is getting dropped?
For this issue, it seems unrelated to the lucifer work, just a database migration issue not caught in the lab. Re-assign to myself to check the staging lab
Re #7: test_push itself promotes the git-refs for autotest/chromite, so this isn't affected by my work that only changed how the errors are propagated back to the test_push task.

I looked at the passing tests' logs and I don't see any indication that anything was wrong at all!
chromeos-test@chromeos-staging-master2:/usr/local/autotest$ less ./logs/scheduler.log.2018-01-30-13.05.33

.....
/30 13:15:22.913 INFO |     drone_manager:0515| Drones refreshed.
01/30 13:15:22.915 DEBUG|        monitor_db:1224| Starting _send_to_lucifer
01/30 13:15:22.931 DEBUG|          ssh_host:0301| Running (ssh) '/usr/local/autotest/bin/job_reporter --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --jobdir /usr/local/autotest/leases --job-id 10341 --autoserv-exit 0 -- -resultsdir /usr/local/a
utotest/results/10341-chromeos-test/chromeos2-row1-rack2-host3 -autotestdir /usr/local/autotest -watcherpath /opt/infra-tools/usr/bin/lucifer_watcher </dev/null >>/usr/local/autotest/results/10341-chromeos-test/chromeos2-row1-rack2-host3/job_reporter_out
put.log 2>&1 &' from '_send_to_lucifer|_send_parsing_to_lucifer|spawn_parsing_job_handler|spawn|run|run_very_slowly'
01/30 13:15:22.991 ERROR|        monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 977, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1031, in _send_parsing_to_lucifer
    models.JobHandoff.objects.create(job=job)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 149, in create
    return self.get_query_set().create(**kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 402, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 546, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 650, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 1661, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1054, "Unknown column 'drone' in 'field list'")
01/30 13:15:22.994 INFO |     ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
01/30 13:15:33.990 INFO |     ts_mon_config:0213| Finished waiting for ts_mon process.


The staging lab actually caught this issue, but the test still passed. Deep diving....
Labels: Chase-Pending
This caused a failed push. At least knowing what is the problem here is Chase worthy, imo.
What is that send_to_lucifer step? Is it the end of the test? It seems like a job report thing that after a job is finished. That's the reason that all the tests passed? 

Comment 13 by jkop@chromium.org, Jan 31 2018

+1 to Chase-Pending, same reasoning
That is where a job gets handed off to lucifer to "do the rest of the steps". In this case, it should have been GATHERING and PARSING.

It's the same in prod, as of this afternoon (if allen landed his CL), we do those two steps via lucifer.
The scheduler crashed and kept respawn itself due to the missing drone column when the 1pm testing push was running. But all the tests actually passed.
Ah, so the tests were actually finished. It just failed to parse the results.
So parsing/gathering failure cannot fail the test?
The error occurs after it hands off to lucifer, so parsing and gathering ran normally.
I believe the action item here is to add a check that the scheduler wasn't crashing during the test run.

Ironically, lucifer did end up successfully finishing the gathering and parsing for the test despite the scheduler dying.
Labels: -Pri-2 -Chase-Pending Chase Pri-1
Status: Assigned (was: Untriaged)
fix in progress: don't pass staging test if important services are crashy.
Project Member

Comment 21 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0de876d8a1af7557506311511f21471bb7cc4804

commit 0de876d8a1af7557506311511f21471bb7cc4804
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Thu Feb 08 00:30:31 2018

autotest: add service check during testing push

Previously there is no service check during testing push. This CL adds
check to see whether a given service is in the crash loop during testing
push.

BUG= chromium:807476 
TEST=unittest

Change-Id: I178556a7082744cf98fecfd94b61ca61bd15312b
Reviewed-on: https://chromium-review.googlesource.com/896026
Reviewed-by: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Commit-Queue: Shuqian Zhao <shuqianz@chromium.org>
Trybot-Ready: Shuqian Zhao <shuqianz@chromium.org>

[modify] https://crrev.com/0de876d8a1af7557506311511f21471bb7cc4804/site_utils/test_push.py

Status: Fixed (was: Assigned)

Sign in to add a comment