Analysis of compile failures is not started automatically on Findit side
Issue description

Example: http://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/37790

There are two possible causes:
1. Findit didn't receive the requests from alerts-dispatcher (unlikely).
2. Findit silently dropped the requests for some reason (most likely).

Alerts-dispatcher sends one request for each failed build, while builder_alerts sends a single request for all the failed builds. Findit handles requests asynchronously through a task queue with max_concurrent_requests=1.
,
Oct 14 2016
Another data point: https://build.chromium.org/p/chromium/builders/Mac/builds/20166
The analysis was not triggered until I manually ran it on http://findit-for-me.appspot.com/
,
Oct 14 2016
The root cause is that analysis requests from alerts-dispatcher are queued up in the task queue "waterfall-serial-queue", which is configured with max_concurrent_requests=1. The task queue also seems to pick analysis requests for processing in random order, because on SoM we do see a result for one failure but no result for another.

The queue was introduced to speed up the response to requests from builder_alerts. builder_alerts sent a single HTTP request for all the failures, at a rate of 1 to 3 HTTP requests per minute. It could take 10 seconds or more to process the analysis request for one single failed build and decide whether a new analysis is needed (new steps could have failed since the last completed analysis).

To start the analysis for one failed build, Findit needs to get the metadata of the failed build from the buildbot master through an HTTP request (to avoid DDoS, an interval of 10 seconds is enforced between requests to the same master), and do a transactional ndb write to avoid concurrent analyses of the same failure. The queue separates such HTTP requests and transactional ndb writes from the direct non-transactional read of existing results, so that the response is quicker.

max_concurrent_requests=1 made sense only for builder_alerts, because it sent requests much less frequently and never concurrently. I set max_concurrent_requests=1 because it doesn't make much sense to process two batches of build failures concurrently, as they most likely contain the same failures.

It doesn't make sense at all for alerts-dispatcher, though, because alerts-dispatcher sends parallel requests, one per failed build. When there is a burst of failures on Waterfall, analysis requests queue up quickly.

https://screenshot.googleplex.com/41kbJKmTM79.png
https://screenshot.googleplex.com/2nVFAdc5FKp.png
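To make the queueing effect concrete, here is a rough back-of-the-envelope sketch. It is not Findit code: the function name, the burst size of 30 failures, and the concurrency of 10 are assumptions chosen for illustration. Only the figure of about 10 seconds per request comes from the description above. The model also ignores the 10-second interval enforced per buildbot master, which could still serialize requests to the same master.

```python
import math


def drain_time_seconds(num_requests, seconds_per_request, max_concurrent_requests):
    """Estimate the time until the last queued request finishes.

    Simplified model: every request takes the same time, and the queue
    runs them in batches of max_concurrent_requests.
    """
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    batches = math.ceil(num_requests / max_concurrent_requests)
    return batches * seconds_per_request


# Hypothetical burst: 30 failed builds, one request each, ~10 s per request.
burst_size = 30
serial = drain_time_seconds(burst_size, 10, 1)
concurrent = drain_time_seconds(burst_size, 10, 10)

print("max_concurrent_requests=1:  last request finishes after %d s" % serial)
print("max_concurrent_requests=10: last request finishes after %d s" % concurrent)
```

Under these assumptions a serial queue leaves the last request of the burst waiting for minutes, which is consistent with results showing up for some failures on SoM but not for others.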
,
Oct 15 2016
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra.git/+/545805bdd28931c2e13ca4e77ef623bf5e2df0aa

commit 545805bdd28931c2e13ca4e77ef623bf5e2df0aa
Author: stgao <stgao@chromium.org>
Date: Sat Oct 15 00:55:54 2016

[Findit] Process analysis requests of Waterfall failures concurrently.

This might not be the best fix, but it should work.
Integration through Pub/Sub might be a better solution to decouple alerts-dispatcher and findit: no direct http requests from alerts-dispatcher to findit.

BUG= 655232
Review-Url: https://codereview.chromium.org/2425453002

[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/common/constants.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/findit_api.py
[rename] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/handlers/process_failure_analysis_requests.py
[rename] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/handlers/test/process_failure_analysis_requests_test.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/main.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/queue.yaml
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/waterfall-backend.yaml
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/waterfall/build_failure_analysis_pipelines.py
,
Oct 15 2016
This bug should be fixed by the CL above. However, we might want a better solution for the integration between Sheriff-o-Matic (actually alerts-dispatcher) and Findit, such as using Pub/Sub. That idea is tracked in bug 656228.
Comment 1 by st...@chromium.org
, Oct 12 2016