Find a way not to fail reporting events from CQ
Issue description
Stacktrace:
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/gae_ts_mon/config.py", line 243, in dispatch
    time_fn=time_fn)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/gae_ts_mon/config.py", line 205, in _instrumented_dispatcher
    ret = dispatcher(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/shared/utils.py", line 29, in headered_json_handler
    result = handler(self, *args)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 31, in get
    return summarize_patch(issue, patch, now)
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 37, in summarize_patch
    for raw_attempt in get_raw_attempts(issue, patch)][::-1]
  File "/base/data/home/apps/s~chromium-cq-status/732aa73.391724554431514264/handlers/patch_summary.py", line 105, in summarize_attempt
    durations['running_all_jobs'] = timestamp - verifier_start_timestamp
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
Errors started at about 12:30: https://pantheon.corp.google.com/errors/13179318467544052417?time=PT6H&refresh=off&sample=12031498970441013394&project=chromium-cq-status
The error is in code written by alancutter, but may be due to a change in the data format it is parsing.
,
Apr 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra.git/+/f5967cbaf89645a9576cfcbdb5867f1c717895d0

commit f5967cbaf89645a9576cfcbdb5867f1c717895d0
Author: sergiyb <sergiyb@chromium.org>
Date: Fri Apr 08 20:51:41 2016

Only try to compute running_all_jobs time if we know when try-job verifier started

R=agable@chromium.org
BUG= 601899

Review URL: https://codereview.chromium.org/1877453003

[modify] https://crrev.com/f5967cbaf89645a9576cfcbdb5867f1c717895d0/appengine/chromium_cq_status/handlers/patch_summary.py
[add] https://crrev.com/f5967cbaf89645a9576cfcbdb5867f1c717895d0/appengine/chromium_cq_status/handlers/test/patch_summary_test.expected/PatchSummaryTest.test_no_verifier_start.json
[modify] https://crrev.com/f5967cbaf89645a9576cfcbdb5867f1c717895d0/appengine/chromium_cq_status/handlers/test/patch_summary_test.py
[add] https://crrev.com/f5967cbaf89645a9576cfcbdb5867f1c717895d0/appengine/chromium_cq_status/handlers/test/resources/patch_no_verifier_start.json
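
For readers who do not want to open the review, the kind of guard the commit title describes could look roughly like the sketch below. This is not the committed code: the helper name `record_running_all_jobs` is hypothetical, and only `durations`, `timestamp`, `verifier_start_timestamp` and the `running_all_jobs` key are taken from the traceback above.

```python
def record_running_all_jobs(durations, timestamp, verifier_start_timestamp):
    """Hypothetical helper: record the duration only when the start is known.

    verifier_start_timestamp is None when the verifier_start event never
    reached chromium-cq-status; subtracting None from a float is what raised
    the TypeError in the traceback.
    """
    if verifier_start_timestamp is not None:
        durations['running_all_jobs'] = timestamp - verifier_start_timestamp
    return durations


# The start event was dropped: the key is simply left out, no exception.
print(record_running_all_jobs({}, 10.0, None))
# The start event is known: the duration is recorded as before.
print(record_running_all_jobs({}, 10.0, 4.0))
```

With a guard of this shape, a patch whose verifier_start event is missing still gets a summary; it just lacks the running_all_jobs entry.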
,
Apr 8 2016
Deployed new version, https://chromium-cq-status.appspot.com/patch-summary/1870183003/1 now opens without any errors.
,
Apr 8 2016
,
Apr 8 2016
Regarding why verifier_start may be missing: I did some static analysis of the code in async_push.py. We put events into a queue to be sent to chromium-cq-status. If AE returns an error, we put them back at the end of the queue and try again later. Although this may look like infinite retries, we actually discard events if CQ is restarted before the retry happens. As a result, we may have reported verifier_fail but dropped verifier_start because of an AE error followed by a CQ restart. I'll add this to the CQ team's daily triage so we can look at it together and decide how to proceed.
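
The failure mode described above can be reproduced with a small toy model. The sketch below is not the code from async_push.py; the class `InMemoryEventQueue`, the function `simulate_restart_loss`, and the use of `IOError` to stand in for an AE error are all assumptions made for illustration. It only shows that a retry queue held in process memory loses whatever is still pending when the process restarts.

```python
from collections import deque


class InMemoryEventQueue(object):
    """Hypothetical retry queue that lives only in process memory."""

    def __init__(self, send):
        self._send = send          # callable that raises IOError on failure
        self._pending = deque()

    def put(self, event):
        self._pending.append(event)

    def flush_once(self):
        """Try each pending event once; failed events go back to the end."""
        for _ in range(len(self._pending)):
            event = self._pending.popleft()
            try:
                self._send(event)
            except IOError:
                self._pending.append(event)

    def pending(self):
        return list(self._pending)


def simulate_restart_loss():
    """Return (delivered, lost) for one failed send followed by a restart."""
    delivered = []
    failures_left = {'verifier_start': 1}  # first send of this event fails

    def send(event):
        if failures_left.get(event, 0) > 0:
            failures_left[event] -= 1
            raise IOError('simulated AE error')
        delivered.append(event)

    queue = InMemoryEventQueue(send)
    queue.put('verifier_start')
    queue.put('verifier_fail')
    queue.flush_once()

    # verifier_start is still waiting for its retry when the process restarts.
    lost = queue.pending()
    queue = InMemoryEventQueue(send)  # the restart begins with an empty queue
    queue.flush_once()                # nothing left to retry
    return delivered, lost


print(simulate_restart_loss())
```

In this model the receiver ends up with verifier_fail but never sees verifier_start, which matches the missing timestamp seen in the patch summary handler.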
,
Apr 11 2016
,
Apr 11 2016
IMO, won't fix in current CQ codebase. Yes, it happens, but it doesn't cause outages + no information is really missing - it's in event mon + in CQ logs.
,
Apr 11 2016
This doesn't belong to trooper queue.
,
Apr 15 2016
I'll keep it open, but mark P3.
,
Apr 26 2016
,
Oct 25 2016
This seems squarely in CQ SLO territory.
,
Oct 25 2016
+Katie since she’s spearheading CQ SLO efforts.
,
Jan 18 2017
,
Jan 23 2017
,
Aug 31 2017
,
Aug 31 2017
,
Jul 2
This can't be done in today's CQ design because CQ's only form of "datastore" is Gerrit comments, which could be used to implement two-phase commit but would double the load on Gerrit.
Comment 1 by serg...@chromium.org, Apr 8 2016
Status: Assigned (was: Unconfirmed)