master scheduler crashing with error message "... already has a host agent ..."
Reported by
jrbarnette@chromium.org,
Dec 20 2017
|
||
Issue description
Since the most recent push to prod on the morning of 12/19,
the master scheduler has crashed 11 times with an exception
like this:
12/20 00:30:10.156 ERROR| monitor_db:0204| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/monitor_db.py", line 193, in main_without_exception_handling
dispatcher.tick()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 387, in tick
self._schedule_running_host_queue_entries()
File "/usr/local/autotest/scheduler/monitor_db.py", line 307, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/autotest/scheduler/monitor_db.py", line 1006, in _schedule_running_host_queue_entries
for agent_task in self._get_queue_entry_agent_tasks(to_schedule=True):
File "/usr/local/autotest/scheduler/monitor_db.py", line 588, in _get_queue_entry_agent_tasks
agent_task = self._get_agent_task_for_queue_entry(entry)
File "/usr/local/autotest/scheduler/monitor_db.py", line 632, in _get_agent_task_for_queue_entry
self._check_for_duplicate_host_entries(task_entries)
File "/usr/local/autotest/scheduler/monitor_db.py", line 655, in _check_for_duplicate_host_entries
self._assert_host_has_no_agent(task_entry)
File "/usr/local/autotest/scheduler/monitor_db.py", line 666, in _assert_host_has_no_agent
% (entry, entry.host, agent.task))
SchedulerError: While scheduling HQE: 164289594, for job: 163902690 and host: chromeos2-row7-rack7-host11 has status:Starting [active], host <autotest_lib.scheduler.rdb_hosts.RDBClientHostWrapper object at 0x7f5530035ad0> already has a host agent <autotest_lib.scheduler.prejob_task.ProvisionTask object at 0x7f553049cc10>
It's not yet clear with this is causing any adverse impacts, but
as a practical matter, there's no way that this is _good_.
,
Dec 20 2017
The blamelist for the change looks like this:
$ git log --oneline 178a6928a..9e32f5c37
9e32f5c37 cheets_CTS_N: Remove CtsCallLogTestCases from notest-packages.
80f0ebe44 container: coalesce some sudo commands.
6ad5aba53 tradefed_test: tune login timeouts.
a69b15cdd autotest: add replaced_by_static_label column to afe_labels
66db77036 [autotest] Improve run locally steps for AU OOBE test.
10d0c4192 autotest: add two new tables and one new column for afe_labels migration
e83f6eac0 GraphicsTest: Report subtest failures to chromeperf
51b0c6015 autotest: remove support for 'only_if_needed_labels'.
181afe798 [autotest] Make job_id mandatory
8929a3b15 contrib/loadtest: Add options to test DUTs a specific number of times.
f0897b9f5 contrib/summarize_loadtest.py: Default to stdin input.
0f4014f1c contrib/loadtest: Lock to prevent concurrent loadtest runs.
946e167f6 contrib/loadtest: Improve command line arguments
507862213 contrib/loadtest: Add pre and post ping tests
7a2a7b919 contrib/loadtest: Log trigger_response for easier debugging
ead2fc8a7 contrib/loadtest: Add a ping test prior to starting test
a1c9c6a54 contrib/loadtest: Add way to filter duts being provisioned.
9c3827809 contrib/loadtest: Script to summarize/search loadtest results.
9751d78c1 contrib/loadtest: Test harness to generate devserver load.
03554c281 firmware_Cr50ConsoleCommands: detach rma features from ccd_v1
7588644dd Remove some tests from suite:kernel_per-build_regression
c3d6e620a cr50_stress_flaky: remove firmware_Cr50Update.erase_nvmem
b93d2ec06 Remove logic for parsing webrtc_media_stats log.
e90d1f741 Test various codec in MediaRecorder
cd15b374b Remove hack from CL:756291
7cb4f4df2 Change MEDIA_CAVDA_INIT_STATUS from "Media.ChromeArc.." to "Media.GpuArc.."
2478d8582 chrome_cr50: use on/off instead of enable/disable
1c941f54a tradefed_test: tune the wait for filelock.
e48b9a39f kernel_CheckArmErrata: make CheckArmErrata NA for non-ARM platforms
76a4c8880 [autotest] Allow autoupdater.py to handle non interactive updates.
593b3a993 cli: Improve error message on missing SSO credentials.
92a542abb stop collecting files from /dev/pstore
6260024ee chrome_cr50: change ccd_set_testlab to set_ccd_testlab
b9757938e chrome_cr50: compare entire level string
9498d6791 chrome_cr50: rename ccd_set_level to set_ccd_level
26fb101e4 Use cros_config as source for test model label
df7c93285 tradefed_test: reboot on failed login.
de5572798 Revert "GraphicsTest: Report subtest failures to chromeperf"
335f92947 container_pool: Only log retrieveds container name
7f08ac933 autotest: Fix quick provision stage error format string
c510e01ce [autotest] Optimize LXC continer IDs.
9b18c715f [autotest] Enable container pool metrics.
710731dfd container_pool: Remove client reliance on cxn
A scan through this makes me think there's nothing in that list
that could possibly explain the symptom.
Moreover, the problem has occurred twice in the more distant but
still recent past:
$ grep -l SchedulerError scheduler.log*
scheduler.log.2017-12-07-11.02.57
scheduler.log.2017-12-08-02.56.22
scheduler.log.2017-12-19-15.47.55
scheduler.log.2017-12-20-00.31.20
scheduler.log.2017-12-20-00.34.45
scheduler.log.2017-12-20-00.44.12
scheduler.log.2017-12-20-02.47.36
scheduler.log.2017-12-20-08.04.00
scheduler.log.2017-12-20-08.24.23
scheduler.log.2017-12-20-08.26.46
scheduler.log.2017-12-20-09.56.09
scheduler.log.2017-12-20-10.00.31
scheduler.log.2017-12-20-13.18.54
Incidents on 12/7 and 12/8, then the current spate of incidents
after push to prod.
,
May 17 2018
I'm sure this was fixed somehow since the scheduler has been up for longer than ten seconds in the past five months |
||
►
Sign in to add a comment |
||
Comment 1 by akes...@chromium.org
, Dec 20 2017