Starred by 10 users

Issue metadata

Status: Archived
Closed: Apr 2017
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug


lab master scheduler crashlooping | AttributeError: 'NoneType' object has no attribute 'id'

Project Member Reported by, Apr 24 2017

Issue description

Quite a few red paladin bots, starting around midnight Apr 21/22. Examples:

Logs show this timeout:

BackgroundFailure: <class 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds

Which is hanging on the subprocess when getting killed:

Traceback (most recent call last):
  File "/b/cbuild/repository/chromite/lib/", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/stages/", line 628, in Run
  File "/b/cbuild/repository/chromite/cbuildbot/stages/", line 416, in PerformStage
    ('Build', build_id, 'BuildStage', self._build_stage_id)})
  File "/b/cbuild/repository/chromite/lib/", line 190, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/", line 1019, in RunHWTestSuite
    _HWTestWait(cmd, job_id, **swarming_args)
  File "/b/cbuild/repository/chromite/cbuildbot/", line 1259, in _HWTestWait
  File "/b/cbuild/repository/chromite/cbuildbot/", line 153, in RunSwarmingCommandWithRetries
    return retry_util.RetryCommand(RunSwarmingCommand, max_retry, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/", line 224, in RetryCommand
    return GenericRetry(ShouldRetry, max_retry, functor, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/", line 122, in GenericRetry
    ret = functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/", line 92, in RunSwarmingCommand
    result = cros_build_lib.RunCommand(swarming_cmd, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/", line 593, in RunCommand
    (cmd_result.output, cmd_result.error) = proc.communicate(input)
  File "/usr/lib/python2.7/", line 796, in communicate
  File "/usr/lib/python2.7/", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid,, 0)
  File "/usr/lib/python2.7/", line 476, in _eintr_retry_call
    return func(*args)
  File "/b/cbuild/repository/chromite/lib/", line 340, in _KillChildProcess
    if not signals.RelaySignal(original_handler, signum, frame):
  File "/b/cbuild/repository/chromite/lib/", line 26, in RelaySignal
    handler(signum, frame)
  File "/b/cbuild/repository/chromite/lib/", line 733, in _DefaultHandler
    signum, 'Received signal %i; shutting down' % (signum,))
_ShutDownException: (15, 'Received signal 15; shutting down')

Marking this Pri-1 since it blocks the CQ.
It looks like the corresponding autotest jobs get scheduled, but remain queued and are never run.

For example:

Summary: HWTest times out due to autotest jobs stuck in queued state (was: HWTest times out)

This looks like it is breaking lab tests all over; the 59 release builders are all red, failing HWTest sanity. This may be considered Pri-0 if the lab is non-functional.
Labels: -Pri-1 Pri-0
Status: Assigned
All our dashboards say that the lab has stopped doing work.

The deputy (lucky guy!) gets the first swipe at this.

First-time deputy here! I'm trying to figure out what to do now, but some ADVICE would be greatly appreciated.
Master scheduler not ticking.

Comment 8 by, Apr 24 2017

master scheduler errors:

EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/", line 179, in main_without_exception_handling
  File "/usr/local/autotest/scheduler/", line 336, in initialize
  File "/usr/local/autotest/scheduler/", line 491, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/", line 506, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/", line 560, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/", line 368, in __init__
    self._set_ids(, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/", line 166, in _set_ids
    self.host_ids = [ for entry in queue_entries]
AttributeError: 'NoneType' object has no attribute 'id'
Summary: lab master scheduler crashlooping | AttributeError: 'NoneType' object has no attribute 'id' (was: HWTest times out due to autotest jobs stuck in queued state)

That __init__ is part of ProvisionTask. Is that the only PreJobTask that is showing this problem?

class ProvisionTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.PROVISION

    def __init__(self, task):
        # Provisioning requires that we be associated with a job/queue entry
        assert task.queue_entry, "No HQE associated with provision task!"
        # task.queue_entry is an afe model HostQueueEntry object.
        # self.queue_entry is a scheduler models HostQueueEntry object, but
        # it gets constructed and assigned in __init__, so it's not available
        # yet.  Therefore, we're stuck pulling labels off of the afe model
        # so that we can pass the --provision args into the __init__ call.
        labels = {x.name for x in task.queue_entry.job.labels}
        _, provisionable = provision.Provision.partition(labels)
        extra_command_args = ['--provision',
                              '--job-labels', ','.join(provisionable)]
        super(ProvisionTask, self).__init__(task, extra_command_args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])
(Note: the problem presumably is that we're passing a hostless queue_entry into _set_ids there.)
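A minimal sketch (not the actual Autotest code) of that failure mode: if a special task's queue entry has `host=None`, a comprehension that dereferences `entry.host.id` raises exactly the AttributeError seen in the traceback. The `set_host_ids_defensive` variant is a hypothetical guard, not the fix that was applied.

```python
class Host(object):
    def __init__(self, host_id):
        self.id = host_id


class HostQueueEntry(object):
    def __init__(self, host=None):
        self.host = host  # None for a hostless (e.g. shard-orphaned) entry


def set_host_ids(queue_entries):
    # Mirrors the failing comprehension: entry.host is None -> AttributeError
    return [entry.host.id for entry in queue_entries]


def set_host_ids_defensive(queue_entries):
    # Hypothetical guard: skip hostless entries instead of crashing the tick
    return [entry.host.id for entry in queue_entries if entry.host is not None]


try:
    set_host_ids([HostQueueEntry(host=None)])
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'id'

print(set_host_ids_defensive([HostQueueEntry(Host(7)), HostQueueEntry(None)]))  # [7]
```

Because the scheduler reconstructs agent tasks for all active special tasks at startup, one such entry makes the crash repeat on every restart, which is why the process crashloops rather than failing once.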
-> xixuan@ is driving this right now
+dshi for insight

Update: the logs don't tell us which job was being processed when the scheduler crashed, which makes this a pain to trace. xixuan has locally modified the scheduler to log this.
The problematic job id is 113768638.


I see this coming from both ProvisionTask and GatherLogsTask.
Strange that that is the problematic job, since it is a shard job. Why does the master scheduler know/care about a shard job...
Also, that job thinks it belongs on chromeos-server101, whereas board:pyro actually belongs to chromeos-server110.
This outage coincides with when pyro, reef, and ultima were shuffled around between shards. See

Comment 19 by, Apr 24 2017

select * from afe_jobs where id=113768638\G;

shard_id: NULL

Someone manually changed shard_id already? It needs to be set to chromeos-server110.

Comment 20 by, Apr 24 2017

I updated that afe_job with shard_id set to 155.
I updated it with NULL. I assume it will be picked up by shard 155 later automatically?
The issue is more complex than just the shard_id; see updates on bug 713319.
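A hedged sketch of the afe_jobs surgery described above, using Python's sqlite3 in place of the real AFE MySQL database. The table and column names come from the comments; the mapping of shard id 155 to chromeos-server110 is an assumption for illustration and would need to be confirmed against afe_shards in the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE afe_jobs (id INTEGER PRIMARY KEY, shard_id INTEGER)")
# The stuck job from the comments, with its shard assignment lost (NULL).
conn.execute("INSERT INTO afe_jobs (id, shard_id) VALUES (113768638, NULL)")

# Point the job back at the shard that owns board:pyro (assumed here to be
# shard id 155, i.e. chromeos-server110 per the comments above).
conn.execute("UPDATE afe_jobs SET shard_id = 155 WHERE id = 113768638")

row = conn.execute(
    "SELECT shard_id FROM afe_jobs WHERE id = 113768638").fetchone()
print(row[0])  # 155
```

On the real master this would be a plain MySQL UPDATE; the sketch only illustrates the state transition (shard_id NULL -> 155) being debated in the thread.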
Scheduler is back. Seems like the surgery on 713319 did the trick?
 Issue 714716  has been merged into this issue.
entry.job_id 113729414L also crashed the scheduler:
04/24 11:33:14.933 DEBUG|  scheduler_models:0210| host_entry_id: 113729414L
04/24 11:33:14.934 DEBUG|  scheduler_models:0211| host_entry_row: (113729414L, u'chromeos-test', u'ultima-release/R60-9481.0.0/wifi_endtoend/network_WiFi_RoamSuspendEndToEnd', 20L, u'# Copyright 2015 The Chromium OS Authors. All rights reserved.\n# Use of this source code is governed by a BSD-style license that can be\n# found in the LICENSE file.\n\nAUTHOR = \'bmahadev, krisr, harpreet\'\nTIME = \'SHORT\'\nNAME = \'network_WiFi_RoamSuspendEndToEnd\'\nTEST_TYPE = \'Server\'\nATTRIBUTES = \'suite:wifi_endtoend, suite:wifi_matfunc_bcm4371\'\nDEPENDENCIES = \'servo, wificell\'\n\nDOC = """\nWiFi_RoamSuspendEndToEnd test configures two APs with the same ssid and runs the\nclient side network_RoamWifiEndToEnd test which uses these APs to test\nWiFi roaming between two Open Networks after a Suspend/Resume.\n"""\n\nfrom autotest_lib.server import utils\n\nargs_dict = utils.args_to_dict(args)\nservo_args = hosts.CrosHost.get_servo_arguments(args_dict)\n\ndef run(machine):\n    host = hosts.create_host(machine, servo_args=servo_args)\n    job.run_test(\'network_WiFi_RoamSuspendEndToEnd\',\n                 host=host,\n                 raw_cmdline_args=args)\n\nparallel_simple(run, machines)\n', 1L, datetime.datetime(2017, 4, 21, 14, 54, 34), 1L, 24L, 0, u'', 1, 0, 1, 72L, None, None, 60L, 113669280L, 0L, 1, 1845L, 148L, None)
04/24 11:33:14.934 DEBUG|  scheduler_models:0283| after updating: set(['control_type', 'max_runtime_hrs', 'shard_id', 'created_on', 'timeout_mins', 'owner', 'max_runtime_mins', 'reboot_after', 'id', 'control_file', 'test_retry', 'run_verify', 'priority', 'synch_count', 'require_ssp', 'run_reset', 'reboot_before', 'email_list', 'parse_failed_repair', 'drone_set_id', 'name', 'parent_job_id', 'timeout', 'parameterized_job_id'])
04/24 11:33:14.934 DEBUG|  scheduler_models:0490| Cannot get host id.
04/24 11:33:14.935 DEBUG|        agent_task:0168| entry: <autotest_lib.scheduler.scheduler_models.HostQueueEntry object at 0x7fe57f676fd0>
04/24 11:33:14.935 DEBUG|        agent_task:0169| entry.job_id: 113729414L
04/24 11:33:14.938 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/", line 185, in main_without_exception_handling
  File "/usr/local/autotest/site-packages/chromite/lib/", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/", line 364, in tick
  File "/usr/local/autotest/scheduler/", line 282, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/", line 731, in _schedule_special_tasks
  File "/usr/local/autotest/scheduler/", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/", line 266, in __init__
    self._set_ids(, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/", line 170, in _set_ids

 Issue 714707  has been merged into this issue.
 Issue 714690  has been merged into this issue.
 Issue 714400  has been merged into this issue.
Status: Fixed
Scheduler is up and running again.

Comment 31 by, Apr 25 2017

Issue 712925 has been merged into this issue.

Comment 32 by, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61
Status: Archived
