guado_moblab, chromeos-server118,120: host_scheduler crashing due to DB migration mismatch
Reported by jrbarnette@chromium.org, Dec 19 2017
Issue description
This guado_moblab-paladin failed:
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8170
The test suite kicked off one test that never finished:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163694583
The abort happened because of a timeout; I don't know what took
so long. Roughly, here's a high-level timeline:
05:49:56 - Suite job created
05:51:10 - Test job created
06:05:56 - Test job begins execution
07:22:40 - Suite job aborts the test
07:25:46 - Test job completes execution
So, two problems:
1) There's ~15 minutes between "test job created" and "test job begins".
2) The test job ran for 1h20m.
Dec 19 2017
Third failure:
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8172
Same basic timeline.
Dec 19 2017
The scheduler is crashing. From host_scheduler.log.2017-12-19-13.35.17:
12/19 13:35:16.071 INFO | host_scheduler:0432| Setting signal handler
12/19 13:35:16.072 WARNI| metadata_reporter:0142| Elasticsearch db deprecated, no metadata will be reported.
12/19 13:35:16.073 INFO | metadata_reporter:0150| Metadata reporting thread is started.
12/19 13:35:16.081 INFO | connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
12/19 13:35:16.341 INFO | config:0024| Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
12/19 13:35:16.342 ERROR| config:0244| ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
12/19 13:35:16.343 NOTIC| cros_logging:0038| ts_mon was set up.
12/19 13:35:16.446 INFO | host_scheduler:0388| Calling new tick.
12/19 13:35:16.446 INFO | host_scheduler:0389| Leasing hosts for frontend tasks.
12/19 13:35:16.449 INFO | host_scheduler:0391| Finding hosts for new jobs.
12/19 13:35:16.451 WARNI| scheduler_models:0138| /usr/local/autotest/scheduler/scheduler_models.py:138: DeprecationWarning: object() takes no parameters
return super(DBObject, cls).__new__(cls, id=id, **kwargs)
12/19 13:35:16.453 DEBUG| host_scheduler:0186| Minimum duts to get for suites (suite_id: min_duts): {1L: 0}
12/19 13:35:16.456 INFO | metadata_reporter:0164| Waiting up to 5 seconds for metadata reporting thread to complete.
Dec 19 2017
Stack trace:
13:58:11 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/host_scheduler.py", line 520, in <module>
main()
File "/usr/local/autotest/scheduler/host_scheduler.py", line 499, in main
host_scheduler.tick()
File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/host_scheduler.py", line 392, in tick
self._schedule_jobs()
File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/host_scheduler.py", line 333, in _schedule_jobs
for acquisition in self.find_hosts_for_jobs(unverified_host_jobs):
File "/usr/local/autotest/scheduler/host_scheduler.py", line 282, in find_hosts_for_jobs
for host, job in zip(hosts, host_jobs):
File "/usr/local/autotest/scheduler/rdb_lib.py", line 79, in acquire_hosts
job_query_manager = JobQueryManager(queue_entries, suite_min_duts)
File "/usr/local/autotest/scheduler/rdb_lib.py", line 34, in __init__
self._labels = self.query_manager._get_labels(self._job_deps)
File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/query_managers.py", line 382, in _get_labels
where="id IN (SELECT label_id FROM afe_hosts_labels)")
File "/usr/local/autotest/scheduler/scheduler_models.py", line 329, in fetch
return [cls(id=row[0], row=row) for row in rows]
File "/usr/local/autotest/scheduler/scheduler_models.py", line 169, in __init__
self._update_fields_from_row(row)
File "/usr/local/autotest/scheduler/scheduler_models.py", line 227, in _update_fields_from_row
self._assert_row_length(row)
File "/usr/local/autotest/scheduler/scheduler_models.py", line 192, in _assert_row_length
self.__table, row, len(row), self._fields, len(self._fields)))
AssertionError: table = afe_labels, row = (1L, u'sparse_coverage_5', u'', 0, 0, 0, None, 0)/8, fields = ('id', 'name', 'kernel_config', 'platform', 'invalid', 'only_if_needed', 'atomic_group_id')/7
host_scheduler doesn't record its own stack trace in its logs, which is entirely stupid.
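To make the failure concrete: the assertion fires because the freshly migrated DB returns eight columns for afe_labels while the scheduler code still declares seven fields; the eighth value corresponds to the replaced_by_static_label column added by migration 124. A minimal sketch of the check (illustrative only, not the actual autotest code; names and values are copied from the AssertionError above):

# Illustrative sketch, not the real scheduler_models.DBObject code.
# The field tuple and row literal are taken from the AssertionError above:
# the code knows 7 columns, but the migrated DB returns 8.
AFE_LABELS_FIELDS = ('id', 'name', 'kernel_config', 'platform',
                     'invalid', 'only_if_needed', 'atomic_group_id')

row_from_db = (1, u'sparse_coverage_5', u'', 0, 0, 0, None, 0)

def assert_row_length(table, row, fields):
    # Roughly what DBObject._assert_row_length does per the trace: a plain
    # length comparison, so any drift between the code's schema and the DB
    # schema aborts the scheduler tick.
    assert len(row) == len(fields), (
        'table = %s, row = %s/%d, fields = %s/%d' %
        (table, row, len(row), fields, len(fields)))

assert_row_length('afe_labels', row_from_db, AFE_LABELS_FIELDS)  # raises, as in the trace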
Dec 19 2017
This is still a problem with the DB migrations that were committed last night. How did they get past the guado_moblab paladin? Is that paladin still marked experimental?

The problem is in DBModels.fetch_rows: it should be fetching only the columns required by |self._fields|. https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/823543 tried to fix this, but didn't get all the instances of the problem. A local AFE test with host_scheduler enabled should have caught this, though.
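For illustration, a minimal sketch of the fix described above (hypothetical code, not the actual fetch_rows implementation): build the SELECT column list from the model's declared fields instead of selecting every column, so the row width always matches |self._fields| even after a migration adds columns.

# Hypothetical sketch of the fix described above; not the real autotest code.
# Selecting the declared columns explicitly keeps len(row) == len(fields)
# regardless of extra columns added by later migrations.
def fetch_rows(cursor, table, fields, where=None):
    """Fetch rows from |table| containing exactly the columns in |fields|."""
    query = 'SELECT %s FROM %s' % (', '.join(fields), table)
    if where:
        query += ' WHERE %s' % where
    cursor.execute(query)
    return cursor.fetchall()

# Example call mirroring the query in the trace above:
# rows = fetch_rows(cursor, 'afe_labels', AFE_LABELS_FIELDS,
#                   where='id IN (SELECT label_id FROM afe_hosts_labels)')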
Dec 19 2017
This is the offending line: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/scheduler_models.py?l=312

Short term considerations:
- This will not affect prod until we do a DB migration.
- We haven't had a push for a few days now, and people are waiting for one.
- We got a successful test_push this morning.
- Even if I revert the DB migration CLs, I'd have to wait for another successful test_push to be sure they're not getting pushed to prod.

I am planning on simply pushing these CLs right now, since they don't affect prod, and letting the revert / fix play out. This is OK as long as we don't do a DB migration in the lab before a green moblab run + a successful test_push run.
Dec 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b8280cef9d858373ab2e9851dc2e2d08183c05f7

commit b8280cef9d858373ab2e9851dc2e2d08183c05f7
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Dec 19 22:28:41 2017

Revert "autotest: add replaced_by_static_label column to afe_labels"

This reverts commit a69b15cdde11dbbdd96be9fd95eb2b3c7c45d76c.
This breaks moblab.

BUG=chromium:796210
TEST=None

Change-Id: Ieb01bce545bb0b4bb14505e6f766c592f7fc7a90
Reviewed-on: https://chromium-review.googlesource.com/834649
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[delete] https://crrev.com/9e32f5c3746c9ee0e575015f05fe7e54fa77d842/frontend/migrations/124_add_replaced_column_to_afe_labels.py
Dec 19 2017
Hmm... The revert in #9 doesn't affect the current push, but will un-break moblab. xixuan@, you can re-land with the fix to scheduler_models. Sorry :(
Dec 21 2017
Issue 797124 has been merged into this issue.
Dec 22 2017
Last test push actually passed, sans 797157:

[chromeos-staging-master2.hot.corp.google.com] out: All tests completed successfully, the prod branch of the following repos is ready to be pushed to the hash list below.
[chromeos-staging-master2.hot.corp.google.com] out: autotest: f6bc7c9cbd9275dfa6690ac416bcf23b270bbeff
[chromeos-staging-master2.hot.corp.google.com] out: chromite: 819449f76d5181aad593b2bac973682676675b4f
[chromeos-staging-master2.hot.corp.google.com] out:

I also need to manually make sure that none of the CLs that xixuan@ has been working on that need the DB migration are in the new prod-next branch, because we obviously haven't migrated the DB yet.
Dec 22 2017
DB migrations have been pushed to prod. Now I'll wait half an hour for sanity, then run the DB migration in prod.
Dec 22 2017
I got stuck trying to apply DB migration 124 (add column to the afe_labels table). That migration tries to lock the whole afe_labels table and just hangs there, blocking everything else -- IOW, it brings the DB to a grinding halt.

The main DB now has up to migration 123 applied, but 124 is not applied. Shards have not been migrated. I give up for now. I have some ideas, but I'll let xixuan / shuqianz drive this further once everyone is back.
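For context, a hypothetical reconstruction of migration 124 (the real 124_add_replaced_column_to_afe_labels.py was deleted in the revert above; the table and column names come from the revert's commit message, while the UP_SQL/DOWN_SQL layout and the column type are assumptions). An ALTER TABLE like this needs at least a metadata lock on afe_labels, and on older MySQL versions a full table copy, so with live scheduler traffic on that table it can hang and stall everything queued behind it, which matches the behaviour described above.

# Hypothetical reconstruction of frontend/migrations/124_add_replaced_column_to_afe_labels.py;
# the real file was deleted in the revert above.  Table and column names are from the
# commit message; the UP_SQL/DOWN_SQL layout is assumed from the usual autotest
# migration convention, and the column type is a guess.
UP_SQL = """
ALTER TABLE afe_labels
  ADD COLUMN replaced_by_static_label TINYINT(1) NOT NULL DEFAULT 0;
"""

DOWN_SQL = """
ALTER TABLE afe_labels
  DROP COLUMN replaced_by_static_label;
"""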
Dec 22 2017
My theatrics caused a blip in the system: http://shortn/_klbeH7TtbJ. For ~15 minutes the DB was locked up, leading to failed heartbeats, and jobs would not have made progress. This _shouldn't_ cause any havoc in the CQ, but I can't be sure.
Jan 22 2018
This has long been fixed. We decided not to do that migration at all and pulled it back.
Comment 1 by jrbarnette@chromium.org, Dec 19 2017

Escalating: This happened in an earlier paladin run:
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8168
This test suite:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163608339
Similar looking timeline:
00:00:42 - Suite job created
00:01:28 - Test job created
00:16:48 - Test job begins execution
01:33:59 - Suite job aborts the test
01:37:01 - Test job completes execution