
Issue 830979

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




[Findit] KeyError: 'task_id' when callback run_test_swarming_task_pipeline.py

Project Member Reported by chanli@chromium.org, Apr 10 2018

Issue description

Traceback (most recent call last):
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/first_party/gae_ts_mon/config.py", line 256, in dispatch
    time_fn=time_fn)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/first_party/gae_ts_mon/config.py", line 218, in _instrumented_dispatcher
    ret = dispatcher(request, response)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27/8fbda9e1e00609ef_unzipped/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/infra_api_clients/../third_party/pipeline/pipeline.py", line 2782, in post
    self.get()
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/infra_api_clients/../third_party/pipeline/pipeline.py", line 2786, in get
    self.run_callback()
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/infra_api_clients/../third_party/pipeline/pipeline.py", line 2863, in run_callback
    callback_result = perform_callback()
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/infra_api_clients/../third_party/pipeline/pipeline.py", line 2853, in perform_callback
    return stage._callback_internal(kwargs)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/infra_api_clients/../third_party/pipeline/pipeline.py", line 1099, in _callback_internal
    return self.callback(**kwargs)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/gae_libs/pipelines.py", line 447, in callback
    returned_value = self.CallbackImpl(arg, parameters)
  File "/base/data/home/apps/s~findit-for-me/waterfall-backend:14839-f7d41cf.408837037076478115/pipelines/test_failure/run_test_swarming_task_pipeline.py", line 52, in CallbackImpl
    task_id = parameters['task_id']
KeyError: 'task_id'

Error reporting: https://pantheon.corp.google.com/errors/CNnZ5JmspZCIVA?time=P30D&project=findit-for-me

 

Comment 1 by chanli@chromium.org, Apr 16 2018

One noticeable thing is that the tasks related to this error all started within a very short time of being created.

For example, for https://findit-for-me.appspot.com/waterfall/failure?url=https://luci-milo.appspot.com/buildbot/chromium.memory/Android%20CFI/714, the task was created at 4/14/2018, 2:05:59 PM (PDT), started at 4/14/2018, 2:05:59 PM (PDT), and completed at 4/14/2018, 2:06:10 PM (PDT).

I think the root cause is that we trigger a task with callback_info, and the pipeline is called back before RunImpl finishes and saves task_id in the parameters.

I'm thinking of returning an error in this case to make the pipeline retry. WDYT?
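
A minimal sketch of the retry idea, for illustration only. It assumes the callback parameters arrive as a plain dict and that raising an exception is enough to make the caller retry. `RetryCallbackError` and `callback_impl` are hypothetical names, not the actual Findit code.

```python
class RetryCallbackError(Exception):
    """Hypothetical signal: the callback arrived before task_id was saved."""


def callback_impl(parameters):
    """Return the task id, or ask the caller to retry if it is not saved yet.

    `parameters` is assumed to be the dict of callback parameters that the
    run step fills in after triggering the task.
    """
    # Use .get() instead of parameters['task_id'] so that a missing key is an
    # expected, retryable condition rather than an unhandled KeyError.
    task_id = parameters.get('task_id')
    if not task_id:
        raise RetryCallbackError('task_id is not saved yet; retry the callback')
    return task_id
```

With this shape, a callback that races ahead of the run step fails in a controlled way and can be delivered again once task_id has been saved.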


Comment 2 by chanli@chromium.org, Apr 16 2018

An alternative would be on the handler side: after we get the pipeline by from_id, we check whether task_id is in pipeline.GetCallbackParameters(). If not, IIUC, we can raise an exception and wait for pubsub to resend the message.

Comment 3 by st...@chromium.org, Apr 16 2018

Re #1: returning an error would be fine if the status of the task has not gone past running. (You may want to double-check how the code handles status updates of a Swarming task. If we skip when it is completed or errored, we would be in trouble.)

Re #2: that would surface the errors in the HTTP handlers. Since Roberto is setting up the 5XX ratio, this would be a problem. Returning a non-2XX status, like 404, might work.
We discussed returning a 404 as the most appropriate response.
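
A rough sketch of the handler-side check, assuming the handler can read the callback parameters as a dict and that a non-2XX response leads pubsub to resend the message. `callback_status` and `required_keys` are hypothetical names used only for illustration.

```python
def callback_status(callback_parameters, required_keys=('task_id',)):
    """Pick an HTTP status code for a callback request.

    Returns 404 when a required parameter has not been saved yet, so the
    request is not counted as a 5XX error, and 200 otherwise.
    """
    missing = [key for key in required_keys if key not in callback_parameters]
    if missing:
        # Not ready: answer with a non-2XX status instead of raising, which
        # would otherwise surface as a 5XX from the HTTP handler.
        return 404
    return 200
```

For example, `callback_status({})` yields 404 while `callback_status({'task_id': 'abc123'})` yields 200.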

Comment 5 by st...@chromium.org, Apr 16 2018

Mind also fixing https://cs.chromium.org/chromium/infra/appengine/findit/pipelines/compile_failure/run_compile_try_job_pipeline.py?l=52 ?

Actually, if we want the build URL, we need to handle the running status properly. For a Swarming task, we can already construct the URL from the task ID, so it's fine to handle only the completed and errored statuses.

Comment 6 by chanli@chromium.org, Apr 16 2018

Working on both.

Comment 7 by bugdroid1@chromium.org, Apr 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/071f214a57acdd1fce3f843b4b22280fcbdf9762

commit 071f214a57acdd1fce3f843b4b22280fcbdf9762
Author: Chan <chanli@chromium.org>
Date: Tue Apr 17 16:39:48 2018

[Findit] Retry callback if required callback parameters are not ready.

Uses an error to tell async pipelines to retry callback if required callback parameters are not ready.


Bug:  830979 
Change-Id: I6a29f236a2ca512d2e111063f8eeb696c253ea07
Reviewed-on: https://chromium-review.googlesource.com/1014562
Reviewed-by: Shuotao Gao <stgao@chromium.org>
Commit-Queue: Chan Li <chanli@chromium.org>

[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/test_failure/run_test_swarming_task_pipeline.py
[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/flake_failure/run_flake_swarming_task_pipeline.py
[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/compile_failure/run_compile_try_job_pipeline.py
[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/test_failure/test/run_test_swarming_task_pipeline_test.py
[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/compile_failure/test/run_compile_try_job_pipeline_test.py
[modify] https://crrev.com/071f214a57acdd1fce3f843b4b22280fcbdf9762/appengine/findit/pipelines/flake_failure/test/run_flake_swarming_task_pipeline_test.py

Comment 8 by chanli@chromium.org, Apr 18 2018

Status: Fixed (was: Started)
