host scheduler in a crash loop on chromeos-server118
Reported by jrbarnette@chromium.org, Dec 21 2017
Issue description
host-scheduler on chromeos-server118 is dying/respawning
repeatedly. Here's the exception:
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 520, in <module>
    main()
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 499, in main
    host_scheduler.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 392, in tick
    self._schedule_jobs()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 333, in _schedule_jobs
    for acquisition in self.find_hosts_for_jobs(unverified_host_jobs):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 282, in find_hosts_for_jobs
    for host, job in zip(hosts, host_jobs):
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 79, in acquire_hosts
    job_query_manager = JobQueryManager(queue_entries, suite_min_duts)
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 34, in __init__
    self._labels = self.query_manager._get_labels(self._job_deps)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/query_managers.py", line 393, in _get_labels
    where="id IN (%s)" % job_string_label_list)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 329, in fetch
    return [cls(id=row[0], row=row) for row in rows]
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 169, in __init__
    self._update_fields_from_row(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 227, in _update_fields_from_row
    self._assert_row_length(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 192, in _assert_row_length
    self.__table, row, len(row), self._fields, len(self._fields)))
AssertionError: table = afe_labels, row = (210L, u'pool:bvt', u'', 0, 0, 0, None, 0)/8, fields = ('id', 'name', 'kernel_config', 'platform', 'invalid', 'only_if_needed', 'atomic_group_id')/7
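For readers skimming the traceback: the assertion message says the database returned 8 values for an afe_labels row while the scheduler's model knows only 7 fields. The following is a minimal sketch of that kind of check, not the actual code in scheduler_models.py; `check_row_length` and `extra_value_count` are hypothetical names, and the row and field tuples are copied from the error message (with the Python 2 `L` and `u` prefixes dropped).

```python
# Values copied from the AssertionError above.
FIELDS = ('id', 'name', 'kernel_config', 'platform', 'invalid',
          'only_if_needed', 'atomic_group_id')
ROW = (210, 'pool:bvt', '', 0, 0, 0, None, 0)


def check_row_length(table, row, fields):
    """Hypothetical stand-in for the scheduler's row-length check."""
    if len(row) != len(fields):
        raise AssertionError(
            'table = %s, row = %s/%d, fields = %s/%d'
            % (table, row, len(row), fields, len(fields)))


def extra_value_count(row, fields):
    """Return how many more values the row has than the model has fields."""
    return len(row) - len(fields)
```

Calling `check_row_length('afe_labels', ROW, FIELDS)` raises, because the row carries one value more than the model expects.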
Dec 21 2017
... and, right on target, I see that this is pretty much the same as 796210. The fix is different, though, so let's call it not a dup.
Dec 21 2017
(to get the link) ... Same bug as 796210. I believe the fix is to push to prod.
Dec 21 2017
The reference of interest is bug 796210, a.k.a. crbug.com/796210, a.k.a. https://bugs.chromium.org/p/chromium/issues/detail?id=796210
Dec 21 2017
I _think_ I know what the problem is here. Prod is currently behind by a couple of (KI) DB migrations, because some CLs required to make those migrations safe were not in prod at the time of the last push. Those CLs are now in, and the next push-to-prod will contain them. But servers 118 and 120 were provisioned in the meantime, and all DB migrations run during provisioning, so these two ended up with a newer schema than the prod code expects, and broke.
120 doesn't have any load yet; 118 was given load in the interim, but has no more boards assigned to it. So there shouldn't be any impact on prod, but we can't use these shards until the next push.
There is a test_push running right now. Given the number of DB changes xixuan@ has been making, I will not push without a green test_push.
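As an illustration of the skew described above (migrations applied at provision time, older code still deployed), here is a rough sketch of a preflight check that lists columns the database has but the code does not know about. It uses an in-memory SQLite database purely for demonstration instead of the real MySQL AFE database; `table_columns`, `schema_skew`, `make_demo_db`, and the column name `hypothetical_new_column` are all assumptions, since the thread does not say which column the migration added.

```python
import sqlite3

# Fields the deployed scheduler code expects, copied from the traceback.
EXPECTED_FIELDS = ('id', 'name', 'kernel_config', 'platform', 'invalid',
                   'only_if_needed', 'atomic_group_id')


def table_columns(conn, table):
    """Return the table's column names in schema order (SQLite-specific)."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return tuple(r[1] for r in conn.execute('PRAGMA table_info(%s)' % table))


def schema_skew(conn, table, expected_fields):
    """Return columns present in the database but unknown to the code."""
    return tuple(c for c in table_columns(conn, table)
                 if c not in expected_fields)


def make_demo_db(migrated):
    """Build a demo afe_labels table, optionally with one extra column."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE afe_labels ('
        'id INTEGER PRIMARY KEY, name TEXT, kernel_config TEXT, '
        'platform INTEGER, invalid INTEGER, only_if_needed INTEGER, '
        'atomic_group_id INTEGER)')
    if migrated:
        # Hypothetical column name standing in for the real migration.
        conn.execute('ALTER TABLE afe_labels '
                     'ADD COLUMN hypothetical_new_column INTEGER DEFAULT 0')
    return conn
```

On the unmigrated demo database `schema_skew` returns an empty tuple; on the migrated one it returns the extra column, which is the situation a freshly provisioned shard would be in until the next push.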
Comment 1 by jrbarnette@chromium.org, Dec 21 2017