Starred by 10 users
Status: Fixed
Owner:
Closed: Apr 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug



lab master scheduler crashlooping | AttributeError: 'NoneType' object has no attribute 'id'
Project Member Reported by mnissler@chromium.org, Apr 24
Quite a few red paladin bots, starting around midnight Apr 21/22. Examples:

https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/5708
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/882
https://uberchromegw.corp.google.com/i/chromeos/builders/elm-paladin/builds/2530

Logs show this timeout:

BackgroundFailure: <class 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds

The stage is hanging on the swarming.py subprocess; the traceback below is from when it gets killed:

Traceback (most recent call last):
  File "/b/cbuild/repository/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 628, in Run
    self.PerformStage()
  File "/b/cbuild/repository/chromite/cbuildbot/stages/test_stages.py", line 416, in PerformStage
    ('Build', build_id, 'BuildStage', self._build_stage_id)})
  File "/b/cbuild/repository/chromite/lib/failures_lib.py", line 190, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/commands.py", line 1019, in RunHWTestSuite
    _HWTestWait(cmd, job_id, **swarming_args)
  File "/b/cbuild/repository/chromite/cbuildbot/commands.py", line 1259, in _HWTestWait
    **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/swarming_lib.py", line 153, in RunSwarmingCommandWithRetries
    return retry_util.RetryCommand(RunSwarmingCommand, max_retry, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/retry_util.py", line 224, in RetryCommand
    return GenericRetry(ShouldRetry, max_retry, functor, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/retry_util.py", line 122, in GenericRetry
    ret = functor(*args, **kwargs)
  File "/b/cbuild/repository/chromite/cbuildbot/swarming_lib.py", line 92, in RunSwarmingCommand
    result = cros_build_lib.RunCommand(swarming_cmd, *args, **kwargs)
  File "/b/cbuild/repository/chromite/lib/cros_build_lib.py", line 593, in RunCommand
    (cmd_result.output, cmd_result.error) = proc.communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 796, in communicate
    self.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
  File "/b/cbuild/repository/chromite/lib/cros_build_lib.py", line 340, in _KillChildProcess
    if not signals.RelaySignal(original_handler, signum, frame):
  File "/b/cbuild/repository/chromite/lib/signals.py", line 26, in RelaySignal
    handler(signum, frame)
  File "/b/cbuild/repository/chromite/lib/commandline.py", line 733, in _DefaultHandler
    signum, 'Received signal %i; shutting down' % (signum,))
_ShutDownException: (15, 'Received signal 15; shutting down')
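
For context, a minimal standalone sketch of the mechanism in that traceback (an illustration only, not chromite code): after the silent timeout the background task is sent SIGTERM, the handler turns the signal into an exception, and that exception surfaces from inside proc.communicate(), which had been blocked waiting on swarming.py.

import signal
import subprocess
import sys

class ShutDownException(Exception):
    """Raised by the SIGTERM handler to abort whatever call is blocking."""

def _handler(signum, frame):
    # Analogous to the handler at the bottom of the traceback: turn the
    # signal into an exception so it propagates out of the blocked wait().
    raise ShutDownException(signum, 'Received signal %i; shutting down' % signum)

signal.signal(signal.SIGTERM, _handler)

proc = subprocess.Popen(['sleep', '10000'])
try:
    # Blocks in os.waitpid(), just like RunCommand -> proc.communicate() above.
    proc.communicate()
except ShutDownException as e:
    proc.kill()
    print('Interrupted while waiting on the child: %r' % (e,))
    sys.exit(1)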

Marking this Pri-1 since it blocks the CQ.
 
It looks like the corresponding autotest jobs get scheduled, but they remain queued and never run.

For example:

https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-paladin/builds/2354
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=113806486

Summary: HWTest swarming.py times out due to autotest jobs stuck in queued state (was: HWTest swarming.py times out)
Cc: jrbarnette@chromium.org bhthompson@chromium.org nxia@chromium.org

This looks like it is breaking lab tests all over: the 59 release builders are all red, failing HWTest sanity. This may be considered Pri-0 if the lab is non-functional.
Cc: hidehiko@chromium.org
Cc: xixuan@chromium.org
Labels: -Pri-1 Pri-0
Owner: chingcodes@chromium.org
Status: Assigned
All our dashboards say that the lab has stopped doing work.

The deputy (lucky guy!) gets the first swipe at this.

First-time deputy here! I'm trying to figure out what to do now, but some advice would be greatly appreciated.
Master scheduler not ticking. 
https://viceroy.corp.google.com/chromeos/deputy-view


master scheduler errors:

EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 336, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 491, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 506, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 560, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/monitor_db.py", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 368, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 166, in _set_ids
    self.host_ids = [entry.host.id for entry in queue_entries]
AttributeError: 'NoneType' object has no attribute 'id'
Summary: lab master scheduler crashlooping | AttributeError: 'NoneType' object has no attribute 'id' (was: HWTest swarming.py times out due to autotest jobs stuck in queued state)
prejob_task.py:368 is part of ProvisionTask. Is that the only PreJobTask that is showing this problem?

class ProvisionTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.PROVISION

    def __init__(self, task):
        # Provisioning requires that we be associated with a job/queue entry
        assert task.queue_entry, "No HQE associated with provision task!"
        # task.queue_entry is an afe model HostQueueEntry object.
        # self.queue_entry is a scheduler models HostQueueEntry object, but
        # it gets constructed and assigned in __init__, so it's not available
        # yet.  Therefore, we're stuck pulling labels off of the afe model
        # so that we can pass the --provision args into the __init__ call.
        labels = {x.name for x in task.queue_entry.job.labels}
        _, provisionable = provision.Provision.partition(labels)
        extra_command_args = ['--provision',
                              '--job-labels', ','.join(provisionable)]
        super(ProvisionTask, self).__init__(task, extra_command_args)
        self._set_milestone(labels)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])
(note, the problem presumably is that we're passing a hostless queue_entry into _set_ids there)
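
For illustration, a minimal self-contained sketch of the kind of guard that would avoid the crash (hypothetical; FakeHost, FakeQueueEntry, and set_ids are stand-ins, not the actual Autotest patch): skip entries whose host is still None instead of dereferencing entry.host.id.

class FakeHost(object):
    def __init__(self, host_id):
        self.id = host_id


class FakeQueueEntry(object):
    def __init__(self, entry_id, host):
        self.id = entry_id
        self.host = host  # None while the entry is still queued / hostless.


def set_ids(queue_entries):
    # The original behavior was effectively:
    #   host_ids = [entry.host.id for entry in queue_entries]
    # which raises AttributeError when entry.host is None.
    host_ids = [entry.host.id for entry in queue_entries
                if entry.host is not None]
    queue_entry_ids = [entry.id for entry in queue_entries]
    return host_ids, queue_entry_ids


entries = [FakeQueueEntry(1, FakeHost(42)), FakeQueueEntry(2, None)]
print(set_ids(entries))  # ([42], [1, 2]) instead of an AttributeError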
Cc: chingcodes@chromium.org
Owner: xixuan@chromium.org
-> xixuan@ is driving this right now
Cc: dshi@chromium.org
+dshi for insight

Update: the logs don't tell us which job was being processed when the scheduler crashed, which makes this a pain to trace. xixuan has locally modified the scheduler to log this.
the problematic job id is 113768638.

http://cautotest/afe/#tab_id=view_job&object_id=113768638


I see this coming from both ProvisionTask and GatherLogsTask.
Strange that that is the problematic job, since it's a shard job. Why does the master scheduler know/care about a shard job...
Also, that job thinks it belongs on chromeos-server101, whereas board:pyro actually belongs to chromeos-server110.
Cc: shuqianz@chromium.org
This outage coincides with when pyro, reef, and ultima were shuffled around between shards. See crbug.com/713319
select * from afe_jobs where id=113768638\G;

shard_id: NULL

someone manually changed shard_id already? It needs to be set to chromeos-server110
I updated that afe_job with shard_id set to 155
I updated it to NULL. I assume it will be picked up by shard 155 later automatically?
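
For reference, the manual surgery described in the last few comments amounts to something like the following (a sketch only: the afe_jobs table and shard_id column come from the query above, while the shard id 155 and its mapping to chromeos-server110 are taken from these comments, not independently verified):

-- Inspect which shard the stuck job is currently assigned to.
SELECT id, name, shard_id FROM afe_jobs WHERE id = 113768638\G

-- Hand the job back to the master (shard_id = NULL) so it can be picked up
-- again later, per the assumption above; setting shard_id = 155 directly is
-- the other option mentioned.
UPDATE afe_jobs SET shard_id = NULL WHERE id = 113768638;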
The issue is more complex than just the shard_id; see the updates on bug 713319.
Scheduler is back. Seems like the surgery on 713319 did the trick?
Cc: warx@chromium.org michae...@chromium.org
Issue 714716 has been merged into this issue.
entry.job_id 113729414L also crashed the scheduler:
04/24 11:33:14.933 DEBUG|  scheduler_models:0210| host_entry_id: 113729414L
04/24 11:33:14.934 DEBUG|  scheduler_models:0211| host_entry_row: (113729414L, u'chromeos-test', u'ultima-release/R60-9481.0.0/wifi_endtoend/network_WiFi_RoamSuspendEndToEnd', 20L, u'# Copyright 2015 The Chromium OS Authors. All rights reserved.\n# Use of this source code is governed by a BSD-style license that can be\n# found in the LICENSE file.\n\nAUTHOR = \'bmahadev, krisr, harpreet\'\nTIME = \'SHORT\'\nNAME = \'network_WiFi_RoamSuspendEndToEnd\'\nTEST_TYPE = \'Server\'\nATTRIBUTES = \'suite:wifi_endtoend, suite:wifi_matfunc_bcm4371\'\nDEPENDENCIES = \'servo, wificell\'\n\nDOC = """\nWiFi_RoamSuspendEndToEnd test configures two APs with the same ssid and runs the\nclient side network_RoamWifiEndToEnd test which uses these APs to test\nWiFi roaming between two Open Networks after a Suspend/Resume.\n"""\n\nfrom autotest_lib.server import utils\n\nargs_dict = utils.args_to_dict(args)\nservo_args = hosts.CrosHost.get_servo_arguments(args_dict)\n\ndef run(machine):\n    host = hosts.create_host(machine, servo_args=servo_args)\n    job.run_test(\'network_WiFi_RoamSuspendEndToEnd\',\n                 host=host,\n                 raw_cmdline_args=args)\n\nparallel_simple(run, machines)\n', 1L, datetime.datetime(2017, 4, 21, 14, 54, 34), 1L, 24L, 0, u'', 1, 0, 1, 72L, None, None, 60L, 113669280L, 0L, 1, 1845L, 148L, None)
04/24 11:33:14.934 DEBUG|  scheduler_models:0283| after updating: set(['control_type', 'max_runtime_hrs', 'shard_id', 'created_on', 'timeout_mins', 'owner', 'max_runtime_mins', 'reboot_after', 'id', 'control_file', 'test_retry', 'run_verify', 'priority', 'synch_count', 'require_ssp', 'run_reset', 'reboot_before', 'email_list', 'parse_failed_repair', 'drone_set_id', 'name', 'parent_job_id', 'timeout', 'parameterized_job_id'])
04/24 11:33:14.934 DEBUG|  scheduler_models:0490| Cannot get host id.
04/24 11:33:14.935 DEBUG|        agent_task:0168| entry: <autotest_lib.scheduler.scheduler_models.HostQueueEntry object at 0x7fe57f676fd0>
04/24 11:33:14.935 DEBUG|        agent_task:0169| entry.job_id: 113729414L
04/24 11:33:14.938 ERROR|     email_manager:0082| Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 185, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 364, in tick
    self._schedule_special_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 282, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 731, in _schedule_special_tasks
    self.add_agent_task(self._get_agent_task_for_special_task(task))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 266, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 170, in _set_ids
    self.host_ids.append(entry.host.id)
AttributeError: 'NoneType' object has no attribute 'id'

Issue 714707 has been merged into this issue.
Cc: gkihumba@chromium.org
Issue 714690 has been merged into this issue.
Issue 714400 has been merged into this issue.
Status: Fixed
scheduler is up and running again.
Issue 712925 has been merged into this issue.
Labels: VerifyIn-60
Labels: VerifyIn-61