New issue
Advanced search Search tips

Issue 796210 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 797135
issue 797142
issue 797157

Blocking:
issue 794630



Sign in to add a comment

guado_moblab, chromeos-server118,120: host_scheduler crashing due to DB migration mismatch

Reported by jrbarnette@chromium.org, Dec 19 2017

Issue description

This guado_moblab-paladin failed:
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8170

The test suite kicked off one test that never finished:
    http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163694583

The abort happened because of a timeout; I don't know what took
so long.  Roughly, here's a high level timeline:

05:49:56 - Suite job created
05:51:10 - Test job created
06:05:56 - Test job begins execution
07:22:40 - Suite job aborts the test
07:25:46 - Test job completes execution

So, two problems:
 1) There's ~15 minutes between "test job created" and "test job begins".
 2) The test job ran for 1h20m.

 
Labels: -Pri-3 Pri-2
Escalating:  This happened in an earlier paladin run:
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8168

This test suite:
    http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163608339

Similar looking timeline:
00:00:42 - Suite job created
00:01:28 - Test job created
00:16:48 - Test job begins execution
01:33:59 - Suite job aborts the test
01:37:01 - Test job completes execution

Labels: -Pri-2 Pri-1
Third failure:
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/8172

Same basic timeline.


Owner: haddowk@chromium.org
Status: Assigned (was: Untriaged)
The scheduler is crashing

from host_scheduler.log.2017-12-19-13.35.17

12/19 13:35:16.071 INFO |    host_scheduler:0432| Setting signal handler
12/19 13:35:16.072 WARNI| metadata_reporter:0142| Elasticsearch db deprecated, no metadata will be reported.
12/19 13:35:16.073 INFO | metadata_reporter:0150| Metadata reporting thread is started.
12/19 13:35:16.081 INFO |    connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
12/19 13:35:16.341 INFO |            config:0024| Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
12/19 13:35:16.342 ERROR|            config:0244| ts_mon monitoring is disabled because the endpoint provided is invalid or not supported: 
12/19 13:35:16.343 NOTIC|      cros_logging:0038| ts_mon was set up.
12/19 13:35:16.446 INFO |    host_scheduler:0388| Calling new tick.
12/19 13:35:16.446 INFO |    host_scheduler:0389| Leasing hosts for frontend tasks.
12/19 13:35:16.449 INFO |    host_scheduler:0391| Finding hosts for new jobs.
12/19 13:35:16.451 WARNI|  scheduler_models:0138| /usr/local/autotest/scheduler/scheduler_models.py:138: DeprecationWarning: object() takes no parameters
  return super(DBObject, cls).__new__(cls, id=id, **kwargs)

12/19 13:35:16.453 DEBUG|    host_scheduler:0186| Minimum duts to get for suites (suite_id: min_duts): {1L: 0}
12/19 13:35:16.456 INFO | metadata_reporter:0164| Waiting up to 5 seconds for metadata reporting thread to complete.

Cc: pprabhu@chromium.org
Stack trace:

13:58:11 INFO | Waiting up to 5 seconds for metadata reporting thread to complete.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 520, in <module>
    main()
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 499, in main
    host_scheduler.tick()
  File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 392, in tick
    self._schedule_jobs()
  File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 333, in _schedule_jobs
    for acquisition in self.find_hosts_for_jobs(unverified_host_jobs):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 282, in find_hosts_for_jobs
    for host, job in zip(hosts, host_jobs):
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 79, in acquire_hosts
    job_query_manager = JobQueryManager(queue_entries, suite_min_duts)
  File "/usr/local/autotest/scheduler/rdb_lib.py", line 34, in __init__
    self._labels = self.query_manager._get_labels(self._job_deps)
  File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/query_managers.py", line 382, in _get_labels
    where="id IN (SELECT label_id FROM afe_hosts_labels)")
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 329, in fetch
    return [cls(id=row[0], row=row) for row in rows]
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 169, in __init__
    self._update_fields_from_row(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 227, in _update_fields_from_row
    self._assert_row_length(row)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 192, in _assert_row_length
    self.__table, row, len(row), self._fields, len(self._fields)))
AssertionError: table = afe_labels, row = (1L, u'sparse_coverage_5', u'', 0, 0, 0, None, 0)/8, fields = ('id', 'name', 'kernel_config', 'platform', 'invalid', 'only_if_needed', 'atomic_group_id')/7


host_scheduler isn't dropping its own stack trace in its logs which is entirely stupid.
Owner: xixuan@chromium.org
This is still a problem with the db migrations that were committed last night.
How did they get past guado_moblab paladin? Is the paladin marked experimental still?

The problem is in DBModels.fetch_rows
It should be fetching only the columns required by |self._fields|.
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/823543 tried to fix this, but didn't get all the instances of the problem.

But, a local AFE test with host_scheduler enabled should have caught this.
This is the offending line: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/scheduler_models.py?l=312

----

Short term considerations:
- This will not affect prod until we do a DB migration.
- We haven't had a push for a few days now, and people are waiting for one.
- We got a successful test_push this morning.
- Even if I revert the DB migration CLs, I'd have to wait for another successful test_push to be sure they're not getting pushed to prod.

I am planning on simply push these CLs right now as they don't affect prod, and let the revert / fix play it out. This is OK as long as we don't do a DB migration in the lab before a green moblab run + successful test_push run.
Project Member

Comment 9 by bugdroid1@chromium.org, Dec 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b8280cef9d858373ab2e9851dc2e2d08183c05f7

commit b8280cef9d858373ab2e9851dc2e2d08183c05f7
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Dec 19 22:28:41 2017

Revert "autotest: add replaced_by_static_label column to afe_labels"

This reverts commit a69b15cdde11dbbdd96be9fd95eb2b3c7c45d76c.

This breaks moblab

BUG= chromium:796210 
TEST=None

Change-Id: Ieb01bce545bb0b4bb14505e6f766c592f7fc7a90
Reviewed-on: https://chromium-review.googlesource.com/834649
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[delete] https://crrev.com/9e32f5c3746c9ee0e575015f05fe7e54fa77d842/frontend/migrations/124_add_replaced_column_to_afe_labels.py

Hmm... Rever in #9 doesn't affect the current push, but will un-break moblab.
xixuan@ you can re-land with the fix to scheduler_models.
Sorry :(
Summary: guado_moblab, chromeos-server118,120: host_scheduler crashing due to DB migration mismatch (was: Unexplained ABORT from guado_moblab-paladin)
 Issue 797124  has been merged into this issue.
Blocking: 794630
Blockedon: 797142 797135
push-to-prod is blocked on multiple problems.
Blockedon: 797157
Cc: -pprabhu@chromium.org xixuan@chromium.org
Owner: pprabhu@chromium.org
Last test push actually passed, sans 797157

[chromeos-staging-master2.hot.corp.google.com] out: All tests completed successfully, the prod branch of the following repos is ready to be pushed to the hash list below.
[chromeos-staging-master2.hot.corp.google.com] out: autotest: f6bc7c9cbd9275dfa6690ac416bcf23b270bbeff
[chromeos-staging-master2.hot.corp.google.com] out: chromite: 819449f76d5181aad593b2bac973682676675b4f
[chromeos-staging-master2.hot.corp.google.com] out: 

I also need to manually make sure that none of the CLs that xixuan@ has been working on that need the DB migration are in the new prod-next branch, because we obviously haven't migrated the DB yet.
DB migrations have been pushed to prod. Now wait for 1/2 hour for sanity, then I'll run DB migration in prod.
I got stuck trying to apply db migration 124
(add column to afe_labels table) That tries to lock the whole afe_labels table and just hangs there and blocks everything else -- IOW, brings the DB to a grinding halt.

The main DB now has up to migration 123 applied, but 124 is not applied.
Shards have not been migrated.

I give up for now. I have some ideas, but I'll let xixuan / shuqianz drive this further once everyone is back.

My theatrics caused a blip in the system: http://shortn/_klbeH7TtbJ

For ~15 minutes DB was locked up leading to failed heartbeats and jobs would not have made progress. This _shouldn't_ cause any havoc in the CQ, but I can't be sure.
Owner: xixuan@chromium.org
Status: Fixed (was: Assigned)
This is long fixed. We decided to not do that migration at all and pulled it back.

Sign in to add a comment