lab master scheduler crashlooping | AttributeError: 'NoneType' object has no attribute 'id'
Issue description:

Quite a few red paladin bots, starting around midnight Apr 21/22. Examples:

https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/5708
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/882
https://uberchromegw.corp.google.com/i/chromeos/builders/elm-paladin/builds/2530

Logs show this timeout:

BackgroundFailure: <class 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds

which is hanging on the swarming.py subprocess when getting killed:

Traceback (most recent call last):
  File "/b/cbuild/repository/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 628, in Run
    self.PerformStage()
  File "/b/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 416, in PerformStage
    ('Build', build_id, 'BuildStage', self._build_stage_id)})
  File "/b/cbuild/repository/chromite/lib/failures_lib.py", line 190, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/commands.py", line 1019, in RunHWTestSuite
    _HWTestWait(cmd, job_id, **swarming_args)
  File "/b/cbuild/repository/chromite/cbuildbot/commands.py", line 1259, in _HWTestWait
    **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/swarming_lib.py", line 153, in RunSwarmingCommandWithRetries
    return retry_util.RetryCommand(RunSwarmingCommand, max_retry, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/retry_util.py", line 224, in RetryCommand
    return GenericRetry(ShouldRetry, max_retry, functor, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/retry_util.py", line 122, in GenericRetry
    ret = functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/swarming_lib.py", line 92, in RunSwarmingCommand
    result = cros_build_lib.RunCommand(swarming_cmd, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/cros_build_lib.py", line 593, in RunCommand
    (cmd_result.output, cmd_result.error) = proc.communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 796, in communicate
    self.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
  File "/b/cbuild/repository/chromite/lib/cros_build_lib.py", line 340, in _KillChildProcess
    if not signals.RelaySignal(original_handler, signum, frame):
  File "/b/cbuild/repository/chromite/lib/signals.py", line 26, in RelaySignal
    handler(signum, frame)
  File "/b/cbuild/repository/chromite/lib/commandline.py", line 733, in _DefaultHandler
    signum, 'Received signal %i; shutting down' % (signum,))
_ShutDownException: (15, 'Received signal 15; shutting down')

Marking this Pri-1 since it blocks the CQ.
Apr 24 2017
This looks like it is breaking lab tests all over: the 59 release builders are all red, failing HWTest sanity. This may be considered a Pri-0 if the lab is non-functional.
Apr 24 2017
All our dashboards say that the lab has stopped doing work. The deputy (lucky guy!) gets the first swipe at this.
Apr 24 2017
First-time deputy here! I'm trying to figure out what to do now, but some advice would be greatly appreciated.
Apr 24 2017
Master scheduler not ticking. https://viceroy.corp.google.com/chromeos/deputy-view
Apr 24 2017
master scheduler errors:
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 336, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 491, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 506, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 560, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/monitor_db.py", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 368, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 166, in _set_ids
    self.host_ids = [entry.host.id for entry in queue_entries]
AttributeError: 'NoneType' object has no attribute 'id'
Apr 24 2017
prejob_task.py:368 is part of ProvisionTask. Is that the only PreJobTask that is showing this problem?
class ProvisionTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.PROVISION

    def __init__(self, task):
        # Provisioning requires that we be associated with a job/queue entry
        assert task.queue_entry, "No HQE associated with provision task!"
        # task.queue_entry is an afe model HostQueueEntry object.
        # self.queue_entry is a scheduler models HostQueueEntry object, but
        # it gets constructed and assigned in __init__, so it's not available
        # yet. Therefore, we're stuck pulling labels off of the afe model
        # so that we can pass the --provision args into the __init__ call.
        labels = {x.name for x in task.queue_entry.job.labels}
        _, provisionable = provision.Provision.partition(labels)
        extra_command_args = ['--provision',
                              '--job-labels', ','.join(provisionable)]
        super(ProvisionTask, self).__init__(task, extra_command_args)
        self._set_milestone(labels)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])
Apr 24 2017
(Note: the problem presumably is that we're passing a hostless queue_entry into _set_ids there.)
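To make that concrete, here is a minimal sketch of the failure mode (FakeEntry is a made-up stand-in; the crashing expression is the real one from agent_task.py:166):

# A queue entry whose host lookup returned None, reduced to a plain object.
class FakeEntry(object):
    def __init__(self, job_id, host):
        self.job_id = job_id
        self.host = host  # None when the HQE has no host assigned

queue_entries = [FakeEntry(job_id=12345, host=None)]  # arbitrary job id
host_ids = [entry.host.id for entry in queue_entries]
# AttributeError: 'NoneType' object has no attribute 'id'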
Apr 24 2017
-> xixuan@ is driving this right now
Apr 24 2017
+dshi for insight. Update: the logs don't tell us which job was being processed when it crashed, which makes this a pain to trace. xixuan has locally modified the scheduler to log this.
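For reference, the local patch is roughly of this shape; this is a reconstruction, not the actual diff (only the _set_ids(host=..., queue_entries=[...]) call site is known from the traceback, and the log lines quoted in a later comment show 'entry:' / 'entry.job_id:' debug output from agent_task):

import logging

# Reconstructed sketch of agent_task.AgentTask._set_ids with debug logging
# inserted before the statement that dereferences entry.host, so the
# offending job id lands in the scheduler log just before any crash.
def _set_ids(self, host=None, queue_entries=None):
    if queue_entries and queue_entries != [None]:
        self.host_ids = []
        for entry in queue_entries:
            logging.debug('entry: %s', entry)
            logging.debug('entry.job_id: %s', entry.job_id)
            self.host_ids.append(entry.host.id)
    else:
        assert host
        self.host_ids = [host.id]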
Apr 24 2017
the problematic job id is 113768638. http://cautotest/afe/#tab_id=view_job&object_id=113768638
Apr 24 2017
I see this coming from both ProvisionTask and GatherLogsTask.
Apr 24 2017
Strange that that is the problematic job, since it is a shard job. Why does the master scheduler know/care about a shard job...
Apr 24 2017
Also, that job thinks it belongs on chromeos-server101, whereas board:pyro actually belongs to chromeos-server110.
Apr 24 2017
This outage coincides with when pyro, reef, and ultima were shuffled around between shards. See crbug.com/713319
Apr 24 2017
select * from afe_jobs where id=113768638\G
    shard_id: NULL

Someone manually changed shard_id already? It needs to be set to chromeos-server110.
Apr 24 2017
I updated that afe_job with shard_id set to 155
Apr 24 2017
I updated it to NULL. I assume it will be picked up by shard 155 automatically later?
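For the record, the two manual tweaks above amount to roughly this (reconstructed from the comments, not copied from a console session):

UPDATE afe_jobs SET shard_id = 155 WHERE id = 113768638;
-- then reverted, so the master's shard-assignment logic can hand the job out again:
UPDATE afe_jobs SET shard_id = NULL WHERE id = 113768638;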
Apr 24 2017
The issue is more complex than just the shard_id; see updates on bug 713319.
Apr 24 2017
Scheduler is back. Seems like the surgery on 713319 did the trick?
Apr 24 2017
entry.job_id 113729414L also crashed the scheduler.
Apr 24 2017
04/24 11:33:14.933 DEBUG| scheduler_models:0210| host_entry_id: 113729414L
04/24 11:33:14.934 DEBUG| scheduler_models:0211| host_entry_row: (113729414L, u'chromeos-test', u'ultima-release/R60-9481.0.0/wifi_endtoend/network_WiFi_RoamSuspendEndToEnd', 20L, u'# Copyright 2015 The Chromium OS Authors. All rights reserved.\n# Use of this source code is governed by a BSD-style license that can be\n# found in the LICENSE file.\n\nAUTHOR = \'bmahadev, krisr, harpreet\'\nTIME = \'SHORT\'\nNAME = \'network_WiFi_RoamSuspendEndToEnd\'\nTEST_TYPE = \'Server\'\nATTRIBUTES = \'suite:wifi_endtoend, suite:wifi_matfunc_bcm4371\'\nDEPENDENCIES = \'servo, wificell\'\n\nDOC = """\nWiFi_RoamSuspendEndToEnd test configures two APs with the same ssid and runs the\nclient side network_RoamWifiEndToEnd test which uses these APs to test\nWiFi roaming between two Open Networks after a Suspend/Resume.\n"""\n\nfrom autotest_lib.server import utils\n\nargs_dict = utils.args_to_dict(args)\nservo_args = hosts.CrosHost.get_servo_arguments(args_dict)\n\ndef run(machine):\n host = hosts.create_host(machine, servo_args=servo_args)\n job.run_test(\'network_WiFi_RoamSuspendEndToEnd\',\n host=host,\n raw_cmdline_args=args)\n\nparallel_simple(run, machines)\n', 1L, datetime.datetime(2017, 4, 21, 14, 54, 34), 1L, 24L, 0, u'', 1, 0, 1, 72L, None, None, 60L, 113669280L, 0L, 1, 1845L, 148L, None)
04/24 11:33:14.934 DEBUG| scheduler_models:0283| after updating: set(['control_type', 'max_runtime_hrs', 'shard_id', 'created_on', 'timeout_mins', 'owner', 'max_runtime_mins', 'reboot_after', 'id', 'control_file', 'test_retry', 'run_verify', 'priority', 'synch_count', 'require_ssp', 'run_reset', 'reboot_before', 'email_list', 'parse_failed_repair', 'drone_set_id', 'name', 'parent_job_id', 'timeout', 'parameterized_job_id'])
04/24 11:33:14.934 DEBUG| scheduler_models:0490| Cannot get host id.
04/24 11:33:14.935 DEBUG| agent_task:0168| entry: <autotest_lib.scheduler.scheduler_models.HostQueueEntry object at 0x7fe57f676fd0>
04/24 11:33:14.935 DEBUG| agent_task:0169| entry.job_id: 113729414L
04/24 11:33:14.938 ERROR| email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 364, in tick
    self._schedule_special_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 282, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 731, in _schedule_special_tasks
    self.add_agent_task(self._get_agent_task_for_special_task(task))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 266, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 170, in _set_ids
    self.host_ids.append(entry.host.id)
AttributeError: 'NoneType' object has no attribute 'id'
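Same root cause as before: entry.host is None when _set_ids walks the queue entries. A defensive variant that tolerates host-less entries would look roughly like this (a sketch of one possible hardening, not the fix that actually landed):

def _set_ids(self, host=None, queue_entries=None):
    if queue_entries and queue_entries != [None]:
        self.host_ids = []
        for entry in queue_entries:
            if entry.host is None:
                # Host assignment was lost (e.g. during the shard
                # migration); skip the entry instead of crashing the
                # whole scheduler tick.
                continue
            self.host_ids.append(entry.host.id)
    else:
        assert host
        self.host_ids = [host.id]

Whether silently skipping (versus aborting the affected special task) is the right behavior is a separate question; the actual resolution here was the shard surgery tracked in crbug.com/713319.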
Apr 24 2017
Issue 714707 has been merged into this issue.
Apr 24 2017
Issue 714400 has been merged into this issue.
Apr 25 2017
Scheduler is up and running again.
Apr 25 2017
Issue 712925 has been merged into this issue.