Issue 728716

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature



When a bot reaped a task but never sent a task_update, retry the task as if it hadn't run

Project Member Reported by mar...@chromium.org, Jun 1 2017

Issue description

This generally happens because the server blew up, not because of the bot. Change the bot to send a ping immediately after it reaps a task, to confirm that it received the task. That ping makes TaskRunResult.modified_ts differ from .started_ts, so modified_ts != started_ts indicates that the bot indeed got the task.

This should help get one more 9 of reliability.
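
A minimal sketch of the timestamp check described above, not the actual Swarming code. RunResult is a hypothetical stand-in for TaskRunResult, and reap(), on_bot_ping() and bot_confirmed_task() are illustrative names. It assumes both timestamps are set to the same value when the task is reaped and that every ping or task_update from the bot bumps modified_ts to a later time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RunResult:
    """Hypothetical stand-in for TaskRunResult; only the two timestamps matter."""
    started_ts: datetime
    modified_ts: datetime


def reap(now: datetime) -> RunResult:
    # Assumption: at reap time both timestamps are equal, since the bot
    # has not reported anything yet.
    return RunResult(started_ts=now, modified_ts=now)


def on_bot_ping(run: RunResult, now: datetime) -> None:
    # Assumption: any ping or task_update from the bot bumps modified_ts.
    run.modified_ts = now


def bot_confirmed_task(run: RunResult) -> bool:
    # True once the bot has reported at least once after the reap.
    return run.modified_ts != run.started_ts


t0 = datetime(2017, 6, 1, 12, 0, 0)
run = reap(t0)
confirmed_before_ping = bot_confirmed_task(run)
on_bot_ping(run, t0 + timedelta(seconds=1))
confirmed_after_ping = bot_confirmed_task(run)
```

If the server fails while answering the poll and the bot never learns about the task, no ping arrives, the two timestamps stay equal, and the task can be treated as if it had never run.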
 
Comment 1 by bugdroid1@chromium.org, Jun 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/5251e45e518f606fba7a28ec465cc56a8e18e018

commit 5251e45e518f606fba7a28ec465cc56a8e18e018
Author: maruel <maruel@chromium.org>
Date: Sat Jun 17 00:54:15 2017

Fix non-idempotent task that /poll reaped but returned HTTP 500.

This case makes the task get BOT_DIED, which is really poor user experience.
Instead cleverly detect that the bot never gave a task update, and retry the
task anyway, as it's guaranteed that no user code was run.

This should help gain one 9 of reliability as /poll return path is a critical
path that significantly affected BOT_DIED levels when retries on
idempotent:false were disabled.

Change the BOT_DIED latency from 5 minutes to 2 minutes.

Make more items unicode instead of str, which makes unit test error diffing
easier to read. Most of this CL is test changes.

Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but
it is not set in new_run_result() anymore.

The main downside of the current implementation is that it doesn't work with
'id' locked task, which will be addressed afterward to keep this CL simple.

R=vadimsh@chromium.org
BUG= 728716 

Review-Url: https://codereview.chromium.org/2914803004

[modify] https://crrev.com/5251e45e518f606fba7a28ec465cc56a8e18e018/appengine/swarming/server/task_result.py
[modify] https://crrev.com/5251e45e518f606fba7a28ec465cc56a8e18e018/appengine/swarming/server/task_result_test.py
[modify] https://crrev.com/5251e45e518f606fba7a28ec465cc56a8e18e018/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/5251e45e518f606fba7a28ec465cc56a8e18e018/appengine/swarming/server/task_scheduler_test.py
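
The retry rule from the commit message above could look roughly like the following sketch. It is an illustration under stated assumptions, not the code from task_scheduler.py: Outcome, DEAD_BOT_TIMEOUT and handle_silent_bot() are hypothetical names, and the branch that retries idempotent tasks is inferred from the remark that retries on idempotent:false were disabled.

```python
from datetime import datetime, timedelta
from enum import Enum

# The 2 minute value comes from the commit message; the constant name is hypothetical.
DEAD_BOT_TIMEOUT = timedelta(minutes=2)


class Outcome(Enum):
    STILL_RUNNING = "still_running"
    RETRY = "retry"
    BOT_DIED = "bot_died"


def handle_silent_bot(now: datetime, last_seen_ts: datetime,
                      got_task_update: bool, idempotent: bool) -> Outcome:
    """Decides what to do with a task whose bot has gone quiet."""
    if now - last_seen_ts < DEAD_BOT_TIMEOUT:
        # The bot is not considered dead yet.
        return Outcome.STILL_RUNNING
    if not got_task_update:
        # The bot never confirmed the task, so no user code ran:
        # retrying is safe even for a non-idempotent task.
        return Outcome.RETRY
    if idempotent:
        # Assumption: idempotent tasks were already retried before this change.
        return Outcome.RETRY
    return Outcome.BOT_DIED


now = datetime(2017, 6, 17, 1, 0, 0)
stale = now - timedelta(minutes=3)
never_confirmed = handle_silent_bot(now, stale, got_task_update=False, idempotent=False)
ran_user_code = handle_silent_bot(now, stale, got_task_update=True, idempotent=False)
```

Here never_confirmed is Outcome.RETRY and ran_user_code is Outcome.BOT_DIED: only the task that may have run user code on the bot is reported as BOT_DIED.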

Comment 2 by mar...@chromium.org, Oct 30 2017

Status: Fixed (was: Assigned)
