Issue 796718


Issue metadata

Status: Fixed
Owner: ----
Closed: May 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

master scheduler crashing with error message "... already has a host agent ..."

Reported by jrbarnette@chromium.org, Dec 20 2017

Issue description

Since the most recent push to prod on the morning of 12/19,
the master scheduler has crashed 11 times with an exception
like this:
12/20 00:30:10.156 ERROR|        monitor_db:0204| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 193, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 387, in tick
    self._schedule_running_host_queue_entries()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 307, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1006, in _schedule_running_host_queue_entries
    for agent_task in self._get_queue_entry_agent_tasks(to_schedule=True):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 588, in _get_queue_entry_agent_tasks
    agent_task = self._get_agent_task_for_queue_entry(entry)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 632, in _get_agent_task_for_queue_entry
    self._check_for_duplicate_host_entries(task_entries)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 655, in _check_for_duplicate_host_entries
    self._assert_host_has_no_agent(task_entry)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 666, in _assert_host_has_no_agent
    % (entry, entry.host, agent.task))
SchedulerError: While scheduling HQE: 164289594, for job: 163902690 and host: chromeos2-row7-rack7-host11 has status:Starting [active], host <autotest_lib.scheduler.rdb_hosts.RDBClientHostWrapper object at 0x7f5530035ad0> already has a host agent <autotest_lib.scheduler.prejob_task.ProvisionTask object at 0x7f553049cc10>

It's not yet clear whether this is causing any adverse impacts, but
as a practical matter, there's no way that this is _good_.
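
For reference, the check that fires here is _assert_host_has_no_agent()
in monitor_db.py: before wrapping a host queue entry in a new agent task,
the dispatcher asserts that the entry's host has no agent attached already.
A rough sketch of that guard, reconstructed from the traceback above rather
than copied from the source (the _host_agents mapping and the class shape
are my assumptions, not the real code):

    class SchedulerError(Exception):
        """Raised when the dispatcher finds an inconsistent scheduling state."""


    class Dispatcher(object):
        def __init__(self):
            # Maps host id -> list of agents currently working on that host.
            self._host_agents = {}

        def _assert_host_has_no_agent(self, entry):
            # Sketch only: the real monitor_db.py check differs in detail.
            if entry.host is None:
                return
            for agent in self._host_agents.get(entry.host.id, []):
                raise SchedulerError(
                    'While scheduling %s, host %s already has a host agent %s'
                    % (entry, entry.host, agent.task))

Per the error message above, the tick is finding an HQE still in Starting
whose host already carries a ProvisionTask agent, so something is creating
a second agent for the same host.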
 
Issue 257809 has been merged into this issue.

The blamelist for the push looks like this:
    $ git log --oneline 178a6928a..9e32f5c37
    9e32f5c37 cheets_CTS_N: Remove CtsCallLogTestCases from notest-packages.
    80f0ebe44 container: coalesce some sudo commands.
    6ad5aba53 tradefed_test: tune login timeouts.
    a69b15cdd autotest: add replaced_by_static_label column to afe_labels
    66db77036 [autotest] Improve run locally steps for AU OOBE test.
    10d0c4192 autotest: add two new tables and one new column for afe_labels migration
    e83f6eac0 GraphicsTest: Report subtest failures to chromeperf
    51b0c6015 autotest: remove support for 'only_if_needed_labels'.
    181afe798 [autotest] Make job_id mandatory
    8929a3b15 contrib/loadtest: Add options to test DUTs a specific number of times.
    f0897b9f5 contrib/summarize_loadtest.py: Default to stdin input.
    0f4014f1c contrib/loadtest: Lock to prevent concurrent loadtest runs.
    946e167f6 contrib/loadtest: Improve command line arguments
    507862213 contrib/loadtest: Add pre and post ping tests
    7a2a7b919 contrib/loadtest: Log trigger_response for easier debugging
    ead2fc8a7 contrib/loadtest: Add a ping test prior to starting test
    a1c9c6a54 contrib/loadtest: Add way to filter duts being provisioned.
    9c3827809 contrib/loadtest: Script to summarize/search loadtest results.
    9751d78c1 contrib/loadtest: Test harness to generate devserver load.
    03554c281 firmware_Cr50ConsoleCommands: detach rma features from ccd_v1
    7588644dd Remove some tests from suite:kernel_per-build_regression
    c3d6e620a cr50_stress_flaky: remove firmware_Cr50Update.erase_nvmem
    b93d2ec06 Remove logic for parsing webrtc_media_stats log.
    e90d1f741 Test various codec in MediaRecorder
    cd15b374b Remove hack from CL:756291
    7cb4f4df2 Change MEDIA_CAVDA_INIT_STATUS from "Media.ChromeArc.." to "Media.GpuArc.."
    2478d8582 chrome_cr50: use on/off instead of enable/disable
    1c941f54a tradefed_test: tune the wait for filelock.
    e48b9a39f kernel_CheckArmErrata: make CheckArmErrata NA for non-ARM platforms
    76a4c8880 [autotest] Allow autoupdater.py to handle non interactive updates.
    593b3a993 cli: Improve error message on missing SSO credentials.
    92a542abb stop collecting files from /dev/pstore
    6260024ee chrome_cr50: change ccd_set_testlab to set_ccd_testlab
    b9757938e chrome_cr50: compare entire level string
    9498d6791 chrome_cr50: rename ccd_set_level to set_ccd_level
    26fb101e4 Use cros_config as source for test model label
    df7c93285 tradefed_test: reboot on failed login.
    de5572798 Revert "GraphicsTest: Report subtest failures to chromeperf"
    335f92947 container_pool: Only log retrieveds container name
    7f08ac933 autotest: Fix quick provision stage error format string
    c510e01ce [autotest] Optimize LXC continer IDs.
    9b18c715f [autotest] Enable container pool metrics.
    710731dfd container_pool: Remove client reliance on cxn

A scan through that list makes me think there's nothing in it
that could possibly explain the symptom.

Moreover, the problem has occurred twice in the more distant but
still recent past:
    $ grep -l SchedulerError scheduler.log*
    scheduler.log.2017-12-07-11.02.57
    scheduler.log.2017-12-08-02.56.22
    scheduler.log.2017-12-19-15.47.55
    scheduler.log.2017-12-20-00.31.20
    scheduler.log.2017-12-20-00.34.45
    scheduler.log.2017-12-20-00.44.12
    scheduler.log.2017-12-20-02.47.36
    scheduler.log.2017-12-20-08.04.00
    scheduler.log.2017-12-20-08.24.23
    scheduler.log.2017-12-20-08.26.46
    scheduler.log.2017-12-20-09.56.09
    scheduler.log.2017-12-20-10.00.31
    scheduler.log.2017-12-20-13.18.54

Incidents on 12/7 and 12/8, then the current spate of incidents
after the push to prod.
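
To see the clustering at a glance, the rotated logs can be bucketed by
date. A throwaway sketch (not part of autotest; it assumes the
scheduler.log.YYYY-MM-DD-HH.MM.SS naming visible above):

    import collections
    import glob

    counts = collections.Counter()
    for path in glob.glob('scheduler.log.*'):
        with open(path) as f:
            if not any('SchedulerError' in line for line in f):
                continue
        # Rotated names end in YYYY-MM-DD-HH.MM.SS; first 10 chars are the date.
        stamp = path.split('scheduler.log.', 1)[1]
        counts[stamp[:10]] += 1

    for day in sorted(counts):
        print('%s  %d' % (day, counts[day]))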

Status: Fixed (was: Untriaged)
I'm sure this was fixed somehow, since the scheduler has been up for longer than ten seconds in the past five months.
