
Issue 601899

Starred by 3 users

Issue metadata

Status: Archived
Owner: ----
Closed: Jul 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Find a way not to fail reporting events from CQ

Project Member Reported by aga...@chromium.org, Apr 8 2016

Issue description

Stacktrace:
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/gae_ts_mon/config.py", line 243, in dispatch
    time_fn=time_fn)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/gae_ts_mon/config.py", line 205, in _instrumented_dispatcher
    ret = dispatcher(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/shared/utils.py", line 29, in headered_json_handler
    result = handler(self, *args)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 31, in get
    return summarize_patch(issue, patch, now)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 37, in summarize_patch
    for raw_attempt in get_raw_attempts(issue, patch)][::-1]
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 105, in summarize_attempt
    durations['running_all_jobs'] = timestamp - verifier_start_timestamp
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Errors started at about 12:30: https://pantheon.corp.google.com/errors/13179318467544052417?time=PT6H&refresh=off&sample=12031498970441013394&project=chromium-cq-status

The error is in code written by alancutter, but may be due to a change in the data format it is parsing.
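The failing subtraction in the traceback could be guarded along these lines. This is only a minimal sketch: the helper name `running_all_jobs_duration` and the sample timestamps are hypothetical, and the thread does not show the fix that was actually deployed.

```python
def running_all_jobs_duration(timestamp, verifier_start_timestamp):
    """Return elapsed seconds, or None when the verifier_start event is missing.

    Hypothetical helper for illustration; not the code in patch_summary.py.
    """
    if verifier_start_timestamp is None:
        # The start event was never recorded, so no duration can be computed.
        return None
    return timestamp - verifier_start_timestamp


durations = {}
# Simulate an attempt whose verifier_start event was never received.
elapsed = running_all_jobs_duration(1460116800.0, None)
if elapsed is not None:
    durations['running_all_jobs'] = elapsed
```

With a guard like this, an attempt with a missing start event simply has no `running_all_jobs` entry instead of raising a TypeError and returning a 500.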
 
Owner: serg...@chromium.org
Status: Assigned (was: Unconfirmed)
Status: Fixed (was: Assigned)
Deployed a new version; https://chromium-cq-status.appspot.com/patch-summary/1870183003/1 now opens without any errors.
Labels: -Infra-CQ Infra-CommitQueue
Owner: ----
Status: Untriaged (was: Fixed)
Summary: Find a way not to fail reporting events to CQ (was: chromium-cq-status is throwing 500s in patch_summary.summarize_attempt())
Regarding why verifier_start may be missing: I've done some static analysis of the code in async_push.py. We put events into a queue to be sent to chromium-cq-status. If AE returns an error, we put them back at the end of the queue and try again later. Although this may look like infinite retries, we actually discard events if CQ is restarted before the retry happens. As a result, we may have reported verifier_fail but dropped verifier_start, due to an AE error followed by a CQ restart.
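The failure mode described above can be sketched roughly as follows. This is not the real async_push.py: the class `InMemoryPushQueue`, the `send` callable, and the simulated error are all assumptions made for illustration.

```python
import collections


class InMemoryPushQueue(object):
    """Hypothetical stand-in for an in-memory retry queue."""

    def __init__(self, send):
        self._send = send  # callable(event) -> True on success
        self._pending = collections.deque()

    def put(self, event):
        self._pending.append(event)

    def flush_once(self):
        """Try each currently queued event once; requeue failures at the end."""
        for _ in range(len(self._pending)):
            event = self._pending.popleft()
            if not self._send(event):
                self._pending.append(event)

    def pending(self):
        return list(self._pending)


def flaky_send_factory(delivered):
    """Build a sender that fails on its first call, then succeeds."""
    state = {'calls': 0}

    def send(event):
        state['calls'] += 1
        if state['calls'] == 1:
            return False  # simulate a server error on the first request
        delivered.append(event)
        return True

    return send


delivered = []
queue = InMemoryPushQueue(flaky_send_factory(delivered))
queue.put('verifier_start')
queue.put('verifier_fail')
queue.flush_once()

# verifier_start failed and was requeued; verifier_fail was delivered.
# If the process restarts now, the queue is rebuilt empty and these are lost.
lost_on_restart = queue.pending()
```

Because the pending events live only in process memory, anything still waiting for a retry at restart time is dropped, which matches the observed case of a verifier_fail with no verifier_start.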

I'll add this to CQ team's daily triage, so we can have a look at this together and decide how to proceed.
Cc: -serg...@chromium.org
IMO, won't fix in current CQ codebase. Yes, it happens, but it doesn't cause outages + no information is really missing - it's in event mon + in CQ logs.
Labels: -Infra-Troopers
This doesn't belong to trooper queue.
Labels: -Pri-1 Pri-3
Status: Available (was: Untriaged)
Summary: Find a way not to fail reporting events from CQ (was: Find a way not to fail reporting events to CQ)
I'll keep it open, but mark P3.
Components: Infra>CQ
Labels: -Infra-CommitQueue
Cc: andyb...@chromium.org
This seems squarely in CQ SLO territory.
Cc: katthomas@chromium.org
+Katie since she’s spearheading CQ SLO efforts.
Cc: -andyb...@chromium.org
Components: -Infra>CQ Infra>Platform>CQdaemon

Comment 15 by efoo@chromium.org, Aug 31 2017

Components: Infra>Platform>CQ

Comment 16 by efoo@chromium.org, Aug 31 2017

Components: -Infra>Platform>CQdaemon
Status: Archived (was: Available)
This can't be done in today's CQ design because CQ's only form of "datastore" is Gerrit's comments, which could be used to implement two-phase commit but would double the load on Gerrit.
