New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 642511 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

chromium-swarm.appspot.com returning 503 -- BadValueError: state must be COMPLETED on deduped task

Project Member Reported by chromium...@appspot.gserviceaccount.com, Aug 30 2016

Issue description

"[trigger] (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 32 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyIQsSBUZsYWtlIhZbdHJpZ2dlcl0gKHdpdGggcGF0Y2gpDA.



This flaky test/step was previously tracked in  issue 604954 .
 
Labels: -Sheriff-Chromium Infra-Troopers
Summary: chromium-swarm.appspot.com returning 503 -- "[trigger] events_unittests (with patch)" is flaky (was: "[trigger] (with patch)" is flaky)
A sampling of the flakes all show the same cause. Troopers, could you take a look?

14470 2016-08-30 12:07:16.851 E: Unable to open given url, https://chromium-swarm.appspot.com/_ah/api/swarming/v1/tasks/new, after 30 attempts.
503 Server Error: Service Unavailable for url: https://chromium-swarm.appspot.com/_ah/api/swarming/v1/tasks/new
----------
Alternate-protocol: 443:quic
X-xss-protection: 1; mode=block
X-content-type-options: nosniff
Content-encoding: gzip
Transfer-encoding: chunked
Expires: Tue, 30 Aug 2016 12:07:16 GMT
Server: GSE
Cache-control: private, max-age=0
Date: Tue, 30 Aug 2016 12:07:16 GMT
X-frame-options: SAMEORIGIN
Alt-svc: quic=":443"; ma=2592000; v="36,35,34,33,32,31,30"
Content-type: application/json; charset=UTF-8

{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "backendError",
    "message": "Internal Server Error"
   }
  ],
  "code": 503,
  "message": "Internal Server Error"
 }
}

Comment 2 by s...@google.com, Aug 30 2016

Cc: vadimsh@chromium.org smut@chromium.org mar...@chromium.org
Found a row of HTTP 500s: https://screenshot.googleplex.com/3y1VvNQ5Kr6.png (https://pantheon.corp.google.com/logs/viewer?project=chromium-swarm&minLogLevel=0&expandAll=false&resource=appengine.googleapis.com&logName=%2Fprojects%2Fchromium-swarm%2Flogs%2Fappengine.googleapis.com%252Frequest_log&filters=status:500&filters=text:%2F_ah%2Fspi%2FSwarmingTasksService.new&advancedFilter=)


Request:
<NewTaskRequest (/base/data/home/apps/s~chromium-swarm/2300-1e18b15.395200628328903120/handlers_endpoints.py:296)
 expiration_secs: 3600
 name: u'events_unittests (with patch)/Ubuntu-12.04/d0f6e5bbba/linux_chromium_rel_ng/289029'
 parent_task_id: u''
 priority: 30
 properties: <TaskProperties
 command: []
 dimensions: [<StringPair
 key: u'cpu'
 value: u'x86-64'>, <StringPair
 key: u'gpu'
 value: u'none'>, <StringPair
 key: u'os'
 value: u'Ubuntu-12.04'>, <StringPair
 key: u'pool'
 value: u'Chrome'>]
 env: []
 execution_timeout_secs: 3600
 extra_args: [u'--test-launcher-summary-output=${ISOLATED_OUTDIR}/output.json']
 grace_period_secs: 30
 idempotent: True
 inputs_ref: <FilesRef
 isolated: u'd0f6e5bbbac03262d2c142de3e3683b0cc3a0050'
 isolatedserver: u'https://isolateserver.appspot.com'
 namespace: u'default-gzip'>
 io_timeout_secs: 1200>
 tags: [u'buildername:linux_chromium_rel_ng', u'buildnumber:289029', u'data:d0f6e5bbbac03262d2c142de3e3683b0cc3a0050', u'master:tryserver.chromium.linux', u'name:events_unittests', u'os:Ubuntu-12.04', u'patch_project:chromium', u'project:chromium', u'purpose:ManualTS', u'purpose:pre-commit', u'rietveld:https://codereview.chromium.org/2260303002/#ps220001', u'slavename:slave643-c4', u'stepname:events_unittests (with patch) on Ubuntu-12.04']
 user: u'shivanisha@chromium.org'>

----

Stack trace:
Traceback (most recent call last):
  File "third_party/protorpc-1.0/protorpc/wsgi/service.py", line 181, in protorpc_service_app
    response = method(instance, request)
  File "gae_ts_mon/config.py", line 274, in decorated
    ret = fn(service, *args, **kwargs)
  File "components/auth/endpoints_support.py", line 202, in wrapper
    return method(service, *args, **kwargs)
  File "third_party/endpoints-1.0/endpoints/api_config.py", line 1331, in invoke_remote
    return remote_method(service_instance, request)
  File "third_party/protorpc-1.0/protorpc/remote.py", line 414, in invoke_remote_method
    response = method(service_instance, request)
  File "components/auth/endpoints_support.py", line 161, in wrapper
    return func(service, *args, **kwargs)
  File "components/auth/api.py", line 952, in wrapper
    return func(*args, **kwargs)
  File "handlers_endpoints.py", line 306, in new
    result_summary = task_scheduler.schedule_request(request)
  File "server/task_scheduler.py", line 451, in schedule_request
    request, get_new_keys, extra=[task, result_summary])
  File "components/utils.py", line 633, in sync
    return async_fn(*args, **kwargs).get_result()
  File "appengine/ext/ndb/tasklets.py", line 383, in get_result
    self.check_success()
  File "appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
    value = gen.throw(exc.__class__, exc, tb)
  File "components/datastore_utils/monotonic.py", line 116, in insert_async
    if (yield txn.transaction_async(run, retries=0)):
  File "appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
    value = gen.throw(exc.__class__, exc, tb)
  File "components/datastore_utils/txn.py", line 46, in transaction_async
    result = yield ndb.transaction_async(callback, **ctx_options)
  File "appengine/ext/ndb/tasklets.py", line 430, in _help_tasklet_along
    value = gen.send(val)
  File "appengine/ext/ndb/context.py", line 1016, in transaction
    result = yield result
  File "appengine/ext/ndb/tasklets.py", line 430, in _help_tasklet_along
    value = gen.send(val)
  File "components/datastore_utils/monotonic.py", line 102, in run
    yield ndb.put_multi_async(entities)
  File "appengine/ext/ndb/model.py", line 3937, in put_multi_async
    return [entity.put_async(**ctx_options) for entity in entities]
  File "appengine/ext/ndb/model.py", line 3466, in _put_async
    self._pre_put_hook()
  File "server/task_result.py", line 599, in _pre_put_hook
    'state must be COMPLETED on deduped task')
BadValueError: state must be COMPLETED on deduped task
Components: -Tests>Flaky Infra>Platform>Swarming
Labels: -Pri-1 -Infra-Troopers Pri-2
Owner: mar...@chromium.org
Status: Assigned (was: Untriaged)
Summary: chromium-swarm.appspot.com returning 503 -- BadValueError: state must be COMPLETED on deduped task (was: chromium-swarm.appspot.com returning 503 -- "[trigger] events_unittests (with patch)" is flaky)
Based on server logs, events_unittests with isolate d0f6e5bbbac03262d2c142de3e3683b0cc3a0050 is the only such "broken" task and it stopped being used after ~9:00 AM PST today (probably the binary has changed, and now Swarming dedups it from some other task, that's in some valid state).

So the immediate danger is gone (Pri-1 -> Pri-2). But there's a bug somewhere that allowed the broken state to happen in the first place. Tentatively assigning to M-A to investigate.
Project Member

Comment 5 by bugdroid1@chromium.org, Aug 31 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/62dac3d6729d3389437fb2710e3191eb8834f3db

commit 62dac3d6729d3389437fb2710e3191eb8834f3db
Author: maruel <maruel@chromium.org>
Date: Wed Aug 31 21:57:21 2016

Add more safety checks.

A rootless index has no consistency guarantee. So ensure that the returned value
is good. There has been at least one case on production where the index returned
an object matching properties_hash != None but had failed.

R=vadimsh@chromium.org
BUG= chromium:642511 

Review-Url: https://codereview.chromium.org/2298053002

[modify] https://crrev.com/62dac3d6729d3389437fb2710e3191eb8834f3db/appengine/swarming/server/task_result.py
[modify] https://crrev.com/62dac3d6729d3389437fb2710e3191eb8834f3db/appengine/swarming/server/task_scheduler.py

Project Member

Comment 6 by bugdroid1@chromium.org, Aug 31 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/62dac3d6729d3389437fb2710e3191eb8834f3db

commit 62dac3d6729d3389437fb2710e3191eb8834f3db
Author: maruel <maruel@chromium.org>
Date: Wed Aug 31 21:57:21 2016

Add more safety checks.

A rootless index has no consistency guarantee. So ensure that the returned value
is good. There has been at least one case on production where the index returned
an object matching properties_hash != None but had failed.

R=vadimsh@chromium.org
BUG= chromium:642511 

Review-Url: https://codereview.chromium.org/2298053002

[modify] https://crrev.com/62dac3d6729d3389437fb2710e3191eb8834f3db/appengine/swarming/server/task_result.py
[modify] https://crrev.com/62dac3d6729d3389437fb2710e3191eb8834f3db/appengine/swarming/server/task_scheduler.py

Comment 7 by mar...@chromium.org, Aug 31 2016

Status: Fixed (was: Assigned)

Sign in to add a comment