New issue
Advanced search Search tips

Issue 912776 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Dec 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 912675



Sign in to add a comment

swarming when using external scheduler gets AttributeError: 'NoneType' object has no attribute 'key'

Project Member Reported by akes...@chromium.org, Dec 7

Issue description

   5: {
    logMessage:  "Using external scheduler address: qscheduler-swarming-dev.appspot.com id: qs1 for bot crossk-chromeos4-row7-rack8-host13"     
    severity:  "DEBUG"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/external_scheduler.py"      
     functionName:  "assign_task"      
     line:  "118"      
    }
    time:  "2018-12-07T00:19:06.188671Z"     
   }
   6: {
    logMessage:  "POST https://qscheduler-swarming-dev.appspot.com/prpc/swarming.ExternalScheduler/AssignTasks"     
    severity:  "INFO"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/components/net.py"      
     functionName:  "request_async"      
     line:  "145"      
    }
    time:  "2018-12-07T00:19:06.190116Z"     
   }
   7: {
    logMessage:  "Got task id 419ed0ca91c37910"     
    severity:  "INFO"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py"      
     functionName:  "_get_task_from_external_scheduler"      
     line:  "724"      
    }
    time:  "2018-12-07T00:19:06.483820Z"     
   }
   8: {
    logMessage:  "Determined request_key, result_key Key('TaskRequest', 8927844147652708462), Key('TaskRequest', 8927844147652708462, 'TaskResultSummary', 1)"     
    severity:  "INFO"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py"      
     functionName:  "_get_task_from_external_scheduler"      
     line:  "732"      
    }
    time:  "2018-12-07T00:19:06.484111Z"     
   }
   9: {
    logMessage:  "Determined try_number, slice_number 1 0"     
    severity:  "INFO"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py"      
     functionName:  "_get_task_from_external_scheduler"      
     line:  "744"      
    }
    time:  "2018-12-07T00:19:06.492312Z"     
   }
   10: {
    logMessage:  "bot_reap_task(crossk-chromeos4-row7-rack8-host13) in 0.314s: 1 iterated, 0 reenqueued, 0 expired, 0 stale_index, 0 failured"     
    severity:  "DEBUG"     
    sourceLocation: {
     file:  "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py"      
     functionName:  "bot_reap_task"      
     line:  "1050"      
    }
    time:  "2018-12-07T00:19:06.502661Z"     
   }
   11: {
    logMessage:  "'NoneType' object has no attribute 'key'
Traceback (most recent call last):
  File "third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "components/utils.py", line 122, in dispatch_and_report
    return old_dispatcher(*args, **kwargs)
  File "gae_ts_mon/config.py", line 256, in dispatch
    time_fn=time_fn)
  File "gae_ts_mon/config.py", line 218, in _instrumented_dispatcher
    ret = dispatcher(request, response)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "components/auth/handler.py", line 193, in dispatch
    super(AuthenticatingHandler, self).dispatch()
  File "third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "handlers_bot.py", line 557, in post
    res.dimensions, res.version, res.lease_expiration_ts)
  File "server/task_scheduler.py", line 999, in bot_reap_task
    slice_index = task_to_run.task_to_run_key_slice_index(to_run.key)
AttributeError: 'NoneType' object has no attribute 'key'"     
Blocking: 912675
This didn't occur when I was running local tests; but I was only using 1-slice tasks, whereas the above test used a 2-slice task I think.

I'm not certain what is responsible. Will try to reproduce locally.

M-A any insight?
The to_run object that is somehow coming out as None is being determined in _get_task_from_external_schedulers with the following code

  logging.info('Determined try_number, slice_number %s %s', try_number,
               slice_number)
  to_run_key = task_to_run.request_to_task_to_run_key(request, try_number,
                                                      slice_number)
  to_run = to_run_key.get()
  # TODO(akeshet/maruel): Figure out how to unpack a TaskToRun from the
  # returned result_key.
  return [(request, to_run)]

Oh I misread the code and sent https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1366647/

I think this change is still good but I'll have to read more closely.
IIUC, you are constructing a TaskToRun ndb.Key for a TaskToRun that was never enqueued. The entities are created with task_to_run.new_task_to_run().
Project Member

Comment 7 by bugdroid1@chromium.org, Dec 7

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/e022900eba76b03292d3af5133253cf5ac3a4f99

commit e022900eba76b03292d3af5133253cf5ac3a4f99
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Dec 07 16:45:22 2018

[swarming] Work around inconsistent index

It's possible in rare cases that the TaskRequest entity exist but not
the TaskToRun, yet the index for the TaskToRun exists!

R=qyearsley@chromium.org
Bug:  912776 
Change-Id: I07a6552b49bae672a2223b36800ae5eba2cb5675
Reviewed-on: https://chromium-review.googlesource.com/c/1366647
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/e022900eba76b03292d3af5133253cf5ac3a4f99/appengine/swarming/server/task_to_run.py

#6 Why was no TaskToRun enqueued? I see what appears to be a a loop that creates then for all but the last slice, in schedule_task

def schedule_request(request, secret_bytes):
  <snip>
  if not dupe_summary:
    # The task has to run.
    index = 0
    while index < request.num_task_slices:
      # This needs to be extremely fast.
      to_run = task_to_run.new_task_to_run(request, 1, index)
      #  Make sure there's capacity if desired.
      t = request.task_slice(index)
      if (t.wait_for_capacity or
          bot_management.has_capacity(t.properties.dimensions)):
        # It's pending at this index now.
        result_summary.current_task_slice = index
        break
      index += 1

    if index == request.num_task_slices:
      # Skip to_run since it's not enqueued.
      to_run = None
      # Same rationale as deduped task.
      secret_bytes = None
      # Instantaneously denied.
      result_summary.abandoned_ts = result_summary.created_ts
      result_summary.state = task_result.State.NO_RESOURCE

One possible solution in progress at https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1370894

As discussed face to face, another option is https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1370895

After hacking a bit on the latter, I am concerned that it adds a lot of code that will be additional maintenance burden, and am inclined to the former (or a variant of it that addresses consistency issues with it).
Project Member

Comment 10 by bugdroid1@chromium.org, Dec 12

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/592f250900745056a8a2ed3873616dc8ecdb60d5

commit 592f250900745056a8a2ed3873616dc8ecdb60d5
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Dec 12 01:03:56 2018

swarming: ensure the external scheduler's slice is active

BUG= chromium:912776 
TEST=Local tests with dev instance

Change-Id: I2e3690bf0670bfad9e1e30914ff8895982a95970
Reviewed-on: https://chromium-review.googlesource.com/c/1372351
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/592f250900745056a8a2ed3873616dc8ecdb60d5/appengine/swarming/server/task_scheduler.py

^ awaiting push-to-staging, I believe the issue should be resolved afterward
Live as 3942-64a8782.
Status: Fixed (was: Assigned)

Sign in to add a comment