
Issue 655232

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




Analysis of compile failures is not started automatically on Findit side

Project Member Reported by st...@chromium.org, Oct 12 2016

Issue description

Example: http://build.chromium.org/p/chromium.linux/builders/Android%20Arm64%20Builder%20%28dbg%29/builds/37790

There might be two possible causes:
1. Findit didn't receive the requests from alerts-dispatcher (unlikely)
2. Findit dropped the requests silently for some reason. (most likely)
   Alerts-dispatcher sends a request for each failed build, while builder_alerts sends a single request for all the failed builds.
   Findit handles requests asynchronously with a task queue with max_concurrent_requests=1
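
To make the difference between the two senders concrete, here is a minimal Python sketch. It is not Findit's code: `build_tasks` is a hypothetical helper, the "Linux/1" build is made up for illustration, and the sketch only counts how many tasks each sending style would put on a queue that processes one task at a time.

```python
def build_tasks(failed_builds, one_request_per_build):
    """Return the tasks a sender would put on the queue.

    Each task is the list of builds carried by one http request.
    (Hypothetical helper, for illustration only.)
    """
    if one_request_per_build:
        # alerts-dispatcher style: one request, hence one task, per failed build.
        return [[build] for build in failed_builds]
    # builder_alerts style: a single request covering all failed builds.
    return [list(failed_builds)]


# Two builds from this bug plus one made-up build.
failed = ["Android Arm64 Builder (dbg)/37790", "Mac/20166", "Linux/1"]

per_build_tasks = build_tasks(failed, one_request_per_build=True)
batched_tasks = build_tasks(failed, one_request_per_build=False)

print(len(per_build_tasks))  # 3 tasks waiting behind each other
print(len(batched_tasks))    # 1 task
```

With a serial queue, the per-build style makes the backlog grow with the number of failed builds, while the batched style always queues a single task.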
 

Comment 1 by st...@chromium.org, Oct 12 2016

Cc: chanli@chromium.org lijeffrey@chromium.org
I checked the logs on the Findit side, and it turned out that case 2 above happened only once, on Step 16, as shown in the log.
https://screenshot.googleplex.com/xm1hV8BpSaF.png

And there is no error log in the currently deployed version.

Thus it is possible that Findit didn't receive the requests from alerts-dispatcher.

Comment 2 by st...@chromium.org, Oct 14 2016

And another datapoint https://build.chromium.org/p/chromium/builders/Mac/builds/20166

The analysis was not triggered until I manually ran it on http://findit-for-me.appspot.com/

Comment 3 by st...@chromium.org, Oct 14 2016

The root cause is that analysis requests from alerts-dispatcher are queued up in the task queue "waterfall-serial-queue", which is configured with max_concurrent_requests=1.
And the task queue seems to pick analysis requests for processing in random order, because we do see a result for one failure but no result for another on SoM.

The queue was used to speed up the response to requests from builder_alerts.
builder_alerts sent a single http request for all the failures, and it sent at the rate of 1~3 http requests per minute.

It could take 10+ seconds to process the analysis request for a single failed build and decide whether a new analysis is needed (new steps could have failed since the last completed analysis).
To start the analysis for one failed build, Findit needs to get the metadata of the failed build from the buildbot master through an http request (to avoid DDoS, an interval of 10 seconds is enforced between requests to the same master), and do a transactional ndb write to avoid concurrent analyses of the same failure.
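
The two safeguards described above (a minimum interval between requests to the same master, and a transactional write so that only one analysis starts per failure) can be sketched roughly as follows. This is an assumption-laden illustration, not Findit's implementation: `MasterThrottle` and `AnalysisRegistry` are hypothetical names, and an in-memory lock stands in for the ndb transaction.

```python
import threading

MIN_INTERVAL_SECONDS = 10  # interval quoted in the comment above


class MasterThrottle:
    """Tracks the earliest time each master may be queried again (hypothetical)."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS):
        self._min_interval = min_interval
        self._next_allowed = {}

    def seconds_to_wait(self, master, now):
        """Reserve the next slot for `master` and return how long to wait."""
        allowed_at = max(now, self._next_allowed.get(master, now))
        self._next_allowed[master] = allowed_at + self._min_interval
        return allowed_at - now


class AnalysisRegistry:
    """In-memory stand-in for a transactional 'create if absent' write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()

    def try_start(self, master, builder, build_number):
        """Return True only for the first caller for a given failed build."""
        key = (master, builder, build_number)
        with self._lock:  # plays the role of the datastore transaction
            if key in self._running:
                return False
            self._running.add(key)
            return True


throttle = MasterThrottle()
first_wait = throttle.seconds_to_wait("chromium.linux", now=0)
second_wait = throttle.seconds_to_wait("chromium.linux", now=3)

registry = AnalysisRegistry()
started = registry.try_start("chromium", "Mac", 20166)
duplicate = registry.try_start("chromium", "Mac", 20166)
```

Here the second request to the same master three seconds later has to wait seven more seconds, and the duplicate analysis request is rejected. Waits like this are one reason a single request can take 10+ seconds.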

The queue separates such http requests and transactional ndb writes from the direct non-transactional read of existing results, for a quicker response.

The max_concurrent_requests=1 setting made sense only for builder_alerts, because it sent requests much less frequently and never concurrently.
I set max_concurrent_requests=1 because it doesn't make much sense to process two batches of build failures concurrently, as they most likely contain the same failures.

But it totally doesn't make sense for alerts-dispatcher, because it sends parallel requests -- one per failed build.
And when there is a burst of failures on Waterfall, analysis requests are queued up quickly.
https://screenshot.googleplex.com/41kbJKmTM79.png
https://screenshot.googleplex.com/2nVFAdc5FKp.png
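
A back-of-the-envelope estimate shows how quickly the backlog grows. The 10 seconds per request comes from the numbers above; the burst size of 30 and the concurrency of 10 are assumptions chosen only for illustration, and `drain_time_seconds` is a hypothetical helper.

```python
def drain_time_seconds(num_requests, seconds_per_request, max_concurrent):
    """Estimate how long a queue needs to work through a burst of requests.

    Assumes every request takes the same time and that up to
    `max_concurrent` requests are processed at once.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # Ceiling division: number of sequential "rounds" of processing.
    rounds = -(-num_requests // max_concurrent)
    return rounds * seconds_per_request


serial = drain_time_seconds(30, 10, max_concurrent=1)
concurrent = drain_time_seconds(30, 10, max_concurrent=10)
print(serial)      # 300
print(concurrent)  # 30
```

Under these assumptions the last request in the burst waits about five minutes with max_concurrent_requests=1, versus about half a minute when ten requests can run concurrently.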

Comment 4 by bugdroid1@chromium.org, Oct 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/545805bdd28931c2e13ca4e77ef623bf5e2df0aa

commit 545805bdd28931c2e13ca4e77ef623bf5e2df0aa
Author: stgao <stgao@chromium.org>
Date: Sat Oct 15 00:55:54 2016

[Findit] Process analysis requests of Waterfall failures concurrently.

This might not be the best fix, but it should work.
Integration through Pub/Sub might be a better solution to decouple alerts-dispatcher and findit: no direct http requests from alerts-dispatcher to findit.

BUG= 655232 

Review-Url: https://codereview.chromium.org/2425453002

[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/common/constants.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/findit_api.py
[rename] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/handlers/process_failure_analysis_requests.py
[rename] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/handlers/test/process_failure_analysis_requests_test.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/main.py
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/queue.yaml
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/waterfall-backend.yaml
[modify] https://crrev.com/545805bdd28931c2e13ca4e77ef623bf5e2df0aa/appengine/findit/waterfall/build_failure_analysis_pipelines.py

Comment 5 by st...@chromium.org, Oct 15 2016

Status: Fixed (was: Assigned)
This bug should have been fixed by the CL above.

But we might want a better solution for the integration between Sheriff-o-Matic (actually alerts-dispatcher) and Findit, like using Pub/Sub.
This idea is tracked in bug 656228.
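
For reference, the decoupling that a Pub/Sub integration would give can be sketched as below. This is only an in-process stand-in built on the standard library, not a real Pub/Sub client and not a proposed design: `Topic`, `publish`, and `pull` are hypothetical names, and the message fields are assumptions.

```python
import queue


class Topic:
    """In-process stand-in for a Pub/Sub topic (not a real Pub/Sub client)."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, message):
        # The publisher returns immediately; it never calls the consumer directly.
        self._messages.put(message)

    def pull(self, max_messages):
        """Return up to `max_messages` pending messages without blocking."""
        pulled = []
        while len(pulled) < max_messages:
            try:
                pulled.append(self._messages.get_nowait())
            except queue.Empty:
                break
        return pulled


topic = Topic()

# Publisher side (the role alerts-dispatcher would play): one message per failed build.
for build_number in (20166, 20167, 20168):
    topic.publish({"master": "chromium", "builder": "Mac", "build": build_number})

# Subscriber side (the role Findit would play): pull at its own pace.
first_batch = topic.pull(max_messages=2)
second_batch = topic.pull(max_messages=2)
```

The point of the sketch is that the sender no longer issues http requests to the consumer, so a burst of failures only lengthens the list of pending messages instead of piling up direct requests.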
