Issue 797124

Issue metadata

Status: Duplicate
Merged: issue 796210
Owner:
Closed: Dec 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 796614

host scheduler in a crash loop on chromeos-server118

Reported by jrbarnette@chromium.org, Dec 21 2017

Issue description

host-scheduler on chromeos-server118 is dying/respawning
repeatedly.  Here's the exception:

Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 520, in <module>
    main()
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 499, in main
    host_scheduler.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 392, in tick
    self._schedule_jobs()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 333, in _schedule_jobs
    for acquisition in self.find_hosts_for_jobs(unverified_host_jobs):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 282, in find_hosts_for_jobs
    for host, job in zip(hosts, host_jobs):
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 79, in acquire_hosts
    job_query_manager = JobQueryManager(queue_entries, suite_min_duts)
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 34, in __init__
    self._labels = self.query_manager._get_labels(self._job_deps)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/query_managers.py", line 393, in _get_labels
    where="id IN (%s)" % job_string_label_list)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 329, in fetch
    return [cls(id=row[0], row=row) for row in rows]
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 169, in __init__
    self._update_fields_from_row(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 227, in _update_fields_from_row
    self._assert_row_length(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 192, in _assert_row_length
    self.__table, row, len(row), self._fields, len(self._fields)))
AssertionError: table = afe_labels, row = (210L, u'pool:bvt', u'', 0, 0, 0, None, 0)/8, fields = ('id', 'name', 'kernel_config', 'platform', 'invalid', 'only_if_needed', 'atomic_group_id')/7
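
The exception above reports a row of 8 values checked against a list of 7 field names. A minimal sketch of that kind of check follows. It is not the autotest code: `assert_row_length` here is a hypothetical stand-in, and only the field names and the row values are taken from the message.

```python
# Hypothetical stand-in for the scheduler's row check; not the autotest code.
# The field names are copied from the exception message.
FIELDS = ('id', 'name', 'kernel_config', 'platform', 'invalid',
          'only_if_needed', 'atomic_group_id')


def assert_row_length(table, row, fields=FIELDS):
    """Raise AssertionError when a row's width differs from the field list."""
    if len(row) != len(fields):
        raise AssertionError(
            'table = %s, row = %s/%d, fields = %s/%d'
            % (table, row, len(row), fields, len(fields)))


# Row values copied from the exception: eight values for seven fields.
row = (210, 'pool:bvt', '', 0, 0, 0, None, 0)

try:
    assert_row_length('afe_labels', row)
    mismatch = None
except AssertionError as err:
    mismatch = str(err)

print(mismatch)
```

Any row fetched with one more column than the code's field list fails this way, which would explain why every tick of the scheduler dies at the same place.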

 
Blocking: 796614 794630
... and, right on target, I see that this is pretty much the
same as 796210.  The fix is different, though, so let's call it
not a dup.

Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
(to get the link) ... Same bug as 796210.

I believe the fix is to push to prod.

I _think_ I know what the problem is here.
Prod is currently behind by a couple of (KI) DB migrations. It is behind because some CLs required to make those migrations safe were not in prod at the time of the last push. They are now in, and the next push-to-prod will contain them.

But servers 118 and 120 were provisioned during this window, and all DB migrations run during provisioning, so the databases on these two are ahead of the code running in prod. That is why these two broke.

120 doesn't have any load yet, and 118 was given load in the interim, but has no more boards assigned to it. So there shouldn't be any impact on prod.
But we can't use these shards till the next push.
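
The mismatch described above (a database migrated past what the deployed code expects) can be reproduced in miniature. This is only a sketch under stated assumptions: it uses an in-memory `sqlite3` database as a stand-in for the real one, `table_columns` is a hypothetical helper, and the added column name is made up, since the actual migration is not named in this thread.

```python
import sqlite3

# Fields the deployed (older) code expects, copied from the exception message.
EXPECTED_FIELDS = ('id', 'name', 'kernel_config', 'platform', 'invalid',
                   'only_if_needed', 'atomic_group_id')


def table_columns(conn, table):
    """Return the column names of `table` as the database currently has them."""
    cursor = conn.execute('SELECT * FROM %s LIMIT 0' % table)
    return tuple(desc[0] for desc in cursor.description)


# In-memory stand-in for the shard's database (assumption, not the real setup).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE afe_labels (id INTEGER PRIMARY KEY, name TEXT, '
    'kernel_config TEXT, platform INTEGER, invalid INTEGER, '
    'only_if_needed INTEGER, atomic_group_id INTEGER)')
in_sync = table_columns(conn, 'afe_labels') == EXPECTED_FIELDS

# Simulate a migration applied at provision time. The column name is made up.
conn.execute(
    'ALTER TABLE afe_labels ADD COLUMN hypothetical_new_column '
    'INTEGER DEFAULT 0')
columns_after = table_columns(conn, 'afe_labels')
extra = [c for c in columns_after if c not in EXPECTED_FIELDS]

print(in_sync, len(columns_after), extra)
```

A comparison like this, run before a freshly provisioned shard is given load, would flag a schema that is ahead of the deployed code instead of letting the scheduler crash on its first query.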

There is a test_push running right now. Given the number of DB changes xixuan@ has been doing, I will not push without a green test_push.
Mergedinto: 796210
Status: Duplicate (was: Assigned)
Blocking: -794630