swarming when using external scheduler gets AttributeError: 'NoneType' object has no attribute 'key' |
|||
Issue descriptionhttps://pantheon.corp.google.com/errors/CL7A1e6jwaSyfA?time=P1D&project=chromium-swarm-dev&organizationId=433637338589 https://pantheon.corp.google.com/logs/viewer?resource=gae_app%2Fmodule_id%2Fdefault%2Fversion_id%2F3925-aa67cae&logName=projects%2Fchromium-swarm-dev%2Flogs%2Fappengine.googleapis.com%252Frequest_log&expandAll=true×tamp=2018-12-07T00%3A18%3A07.690000000Z&interval=NO_LIMIT&filters=protoPayload.requestId%3A5c09bbf000ff0a2c1d5f33d4a10001737e6368726f6d69756d2d737761726d2d6465760001333932352d61613637636165000100&project=chromium-swarm-dev&organizationId=433637338589&minLogLevel=0&customFacets&limitCustomFacetWidth=true&scrollTimestamp=2018-12-07T00%3A16%3A48.666653000Z&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.zone%3D%22us9%22%0Aresource.labels.project_id%3D%22chromium-swarm-dev%22%0Aresource.labels.version_id%3D%223925-aa67cae%22%0Aresource.labels.module_id%3D%22default%22%0Atimestamp%3D%222018-12-07T00%3A16%3A48.666653000Z%22%0AinsertId%3D%225c09bbf10002aa98f7ad657e%22
,
Dec 7
,
Dec 7
This didn't occur when I was running local tests; but I was only using 1-slice tasks, whereas the above test used a 2-slice task I think. I'm not certain what is responsible. Will try to reproduce locally. M-A any insight?
,
Dec 7
The to_run object that is somehow coming out as None is being determined in _get_task_from_external_schedulers with the following code
logging.info('Determined try_number, slice_number %s %s', try_number,
slice_number)
to_run_key = task_to_run.request_to_task_to_run_key(request, try_number,
slice_number)
to_run = to_run_key.get()
# TODO(akeshet/maruel): Figure out how to unpack a TaskToRun from the
# returned result_key.
return [(request, to_run)]
,
Dec 7
Oh I misread the code and sent https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1366647/ I think this change is still good but I'll have to read more closely.
,
Dec 7
IIUC, you are constructing a TaskToRun ndb.Key for a TaskToRun that was never enqueued. The entities are created with task_to_run.new_task_to_run().
,
Dec 7
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/e022900eba76b03292d3af5133253cf5ac3a4f99 commit e022900eba76b03292d3af5133253cf5ac3a4f99 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Dec 07 16:45:22 2018 [swarming] Work around inconsistent index It's possible in rare cases that the TaskRequest entity exist but not the TaskToRun, yet the index for the TaskToRun exists! R=qyearsley@chromium.org Bug: 912776 Change-Id: I07a6552b49bae672a2223b36800ae5eba2cb5675 Reviewed-on: https://chromium-review.googlesource.com/c/1366647 Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> Commit-Queue: Quinten Yearsley <qyearsley@chromium.org> [modify] https://crrev.com/e022900eba76b03292d3af5133253cf5ac3a4f99/appengine/swarming/server/task_to_run.py
,
Dec 10
#6 Why was no TaskToRun enqueued? I see what appears to be a a loop that creates then for all but the last slice, in schedule_task
def schedule_request(request, secret_bytes):
<snip>
if not dupe_summary:
# The task has to run.
index = 0
while index < request.num_task_slices:
# This needs to be extremely fast.
to_run = task_to_run.new_task_to_run(request, 1, index)
# Make sure there's capacity if desired.
t = request.task_slice(index)
if (t.wait_for_capacity or
bot_management.has_capacity(t.properties.dimensions)):
# It's pending at this index now.
result_summary.current_task_slice = index
break
index += 1
if index == request.num_task_slices:
# Skip to_run since it's not enqueued.
to_run = None
# Same rationale as deduped task.
secret_bytes = None
# Instantaneously denied.
result_summary.abandoned_ts = result_summary.created_ts
result_summary.state = task_result.State.NO_RESOURCE
,
Dec 11
One possible solution in progress at https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1370894 As discussed face to face, another option is https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1370895 After hacking a bit on the latter, I am concerned that it adds a lot of code that will be additional maintenance burden, and am inclined to the former (or a variant of it that addresses consistency issues with it).
,
Dec 12
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/592f250900745056a8a2ed3873616dc8ecdb60d5 commit 592f250900745056a8a2ed3873616dc8ecdb60d5 Author: Aviv Keshet <akeshet@chromium.org> Date: Wed Dec 12 01:03:56 2018 swarming: ensure the external scheduler's slice is active BUG= chromium:912776 TEST=Local tests with dev instance Change-Id: I2e3690bf0670bfad9e1e30914ff8895982a95970 Reviewed-on: https://chromium-review.googlesource.com/c/1372351 Reviewed-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/592f250900745056a8a2ed3873616dc8ecdb60d5/appengine/swarming/server/task_scheduler.py
,
Dec 12
^ awaiting push-to-staging, I believe the issue should be resolved afterward
,
Dec 12
Live as 3942-64a8782.
,
Dec 12
|
|||
►
Sign in to add a comment |
|||
Comment 1 by akes...@chromium.org
, Dec 75: { logMessage: "Using external scheduler address: qscheduler-swarming-dev.appspot.com id: qs1 for bot crossk-chromeos4-row7-rack8-host13" severity: "DEBUG" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/external_scheduler.py" functionName: "assign_task" line: "118" } time: "2018-12-07T00:19:06.188671Z" } 6: { logMessage: "POST https://qscheduler-swarming-dev.appspot.com/prpc/swarming.ExternalScheduler/AssignTasks" severity: "INFO" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/components/net.py" functionName: "request_async" line: "145" } time: "2018-12-07T00:19:06.190116Z" } 7: { logMessage: "Got task id 419ed0ca91c37910" severity: "INFO" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py" functionName: "_get_task_from_external_scheduler" line: "724" } time: "2018-12-07T00:19:06.483820Z" } 8: { logMessage: "Determined request_key, result_key Key('TaskRequest', 8927844147652708462), Key('TaskRequest', 8927844147652708462, 'TaskResultSummary', 1)" severity: "INFO" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py" functionName: "_get_task_from_external_scheduler" line: "732" } time: "2018-12-07T00:19:06.484111Z" } 9: { logMessage: "Determined try_number, slice_number 1 0" severity: "INFO" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py" functionName: "_get_task_from_external_scheduler" line: "744" } time: "2018-12-07T00:19:06.492312Z" } 10: { logMessage: "bot_reap_task(crossk-chromeos4-row7-rack8-host13) in 0.314s: 1 iterated, 0 reenqueued, 0 expired, 0 stale_index, 0 failured" severity: "DEBUG" sourceLocation: { file: "/base/data/home/apps/s~chromium-swarm-dev/3925-aa67cae.414501407744757878/server/task_scheduler.py" functionName: "bot_reap_task" line: "1050" } time: "2018-12-07T00:19:06.502661Z" } 11: { logMessage: "'NoneType' object has no attribute 'key' Traceback (most recent call last): File "third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__ rv = self.handle_exception(request, response, e) File "third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__ rv = self.router.dispatch(request, response) File "components/utils.py", line 122, in dispatch_and_report return old_dispatcher(*args, **kwargs) File "gae_ts_mon/config.py", line 256, in dispatch time_fn=time_fn) File "gae_ts_mon/config.py", line 218, in _instrumented_dispatcher ret = dispatcher(request, response) File "third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher return route.handler_adapter(request, response) File "third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__ return handler.dispatch() File "components/auth/handler.py", line 193, in dispatch super(AuthenticatingHandler, self).dispatch() File "third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch return self.handle_exception(e, self.app.debug) File "third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch return method(*args, **kwargs) File "handlers_bot.py", line 557, in post res.dimensions, res.version, res.lease_expiration_ts) File "server/task_scheduler.py", line 999, in bot_reap_task slice_index = task_to_run.task_to_run_key_slice_index(to_run.key) AttributeError: 'NoneType' object has no attribute 'key'"