
Issue 716240


Issue metadata

Status: Archived
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




lib/classifier causes slowdown in ReportStage on master-paladin

Project Member Reported by akes...@chromium.org, Apr 27 2017

Issue description


Build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14414/
Stage: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F14414%2F%2B%2Frecipes%2Fsteps%2FReport%2F0%2Fstdout


This is a build that would otherwise self-destruct quickly. However, it is spending a long time in ReportStage, looping over the build slaves and trying to fetch their logs. Each fetch is timing out, probably because those builds are still in progress.
 
Cc: nxia@chromium.org
Cc: d...@chromium.org
+dnj in case he has any ideas.

For each unsuccessful stage, we currently do 3 retries with a 5-second sleep between attempts.

Various options (not mutually exclusive):
1. When the stage is marked as inflight in CIDB, skip the retries when retrieving logs, since retrying only makes sense if the logs are expected to be finished.
2. Decrease the retry sleeps.
3. Parallelize alert dispatching so that different builds and stages are handled concurrently.
4. Move all processing to an external service that is constantly generating alerts or is kicked off by the builds.

#1, and maybe #3, are probably the best options initially.
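
A minimal sketch of option #1, not the actual chromite code. The function name, the `fetch` callable, and the status strings are hypothetical, and it assumes "3 retries" means three further attempts after the first one:

```python
import time

# Hypothetical status names; the real CIDB constants may differ.
NO_RETRY_STATUSES = frozenset({"inflight", "aborted"})


def fetch_logs_with_retry(fetch, status, max_retries=3, sleep_secs=5,
                          sleep=time.sleep):
    """Call fetch(), retrying on TimeoutError unless the stage is unfinished.

    fetch: zero-argument callable returning the log text (hypothetical).
    status: stage status string as recorded in CIDB.
    """
    # Logs of inflight/aborted stages may never be terminated, so
    # retrying would only add sleep time without changing the outcome.
    retries = 0 if status in NO_RETRY_STATUSES else max_retries
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries:
                raise  # Out of attempts; let the caller decide.
            sleep(sleep_secs)
```

With the defaults above, a stage that always times out costs three sleeps (15 seconds) on top of the fetch timeouts, while an inflight stage fails after a single attempt.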

Not directly related, but worth doing:
- When returning logs, instead of timing out with an exception, return the partial logs along with an error message that is filled in when the results are incomplete.
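
One possible shape for the partial-logs idea, as a sketch only. `LogResult` and `read_log` are hypothetical names, and the sketch assumes the log stream is an iterator of text chunks that raises `TimeoutError` when it stalls:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LogResult:
    text: str
    # Filled in only when the log is incomplete.
    error: Optional[str] = None


def read_log(chunks: Iterable[str]) -> LogResult:
    """Collect chunks, keeping whatever arrived before a timeout."""
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
    except TimeoutError as e:
        # Return what we have instead of propagating the exception.
        return LogResult("".join(parts), error=f"incomplete log: {e}")
    return LogResult("".join(parts))
```

Callers can then classify whatever text is available and check `error` to decide how much to trust the result.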
I like #1 and #3.
I tried implementing #1 and it helps (cuts the time in half), but it still gets hung up on a build like:
http://vi/chromeos/build_details?build_id=1481843
https://uberchromegw.corp.google.com/i/chromeos/builders/falco-full-compile-paladin/builds/8974

This is a build whose status in CIDB was updated to "failed", but which was actually aborted according to buildbot and has incomplete logs.

Comment 5 by d...@chromium.org, Apr 28 2017

I think the "partial logs" field is reasonable. Remember, if the build legit crashes, the termination signal will never be sent, and the log will be considered streaming until it times out after a long time (day?).
Project Member

Comment 6 by bugdroid1@chromium.org, Apr 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/06cfd5621cb29f15c1c67873e5c83bc808f6644b

commit 06cfd5621cb29f15c1c67873e5c83bc808f6644b
Author: David Riley <davidriley@chromium.org>
Date: Sat Apr 29 09:42:49 2017

som_alerts_dispatcher: Don't always retry when retrieving logdog logs.

When som_alerts_dispatcher is gathering logs for classification, it
normal retries three times with a five second sleep to handle transient
LogDog issues.  For inflight/aborted builds, there's a good chance that
some of the log descriptors will not be closed so avoid blocking
waiting for termination when analyzing builds/stages in those states.

BUG= chromium:716240 
TEST=som_alerts_dispatcher CREDS 1481808,1000

Change-Id: I69b904eb05ed80ae48f94a274a6703330cbfcc76
Reviewed-on: https://chromium-review.googlesource.com/490608
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/06cfd5621cb29f15c1c67873e5c83bc808f6644b/lib/logdog.py
[modify] https://crrev.com/06cfd5621cb29f15c1c67873e5c83bc808f6644b/scripts/som_alerts_dispatcher.py

Project Member

Comment 7 by bugdroid1@chromium.org, May 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a

commit 0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a
Author: David Riley <davidriley@chromium.org>
Date: Tue May 02 02:18:16 2017

som_alerts_dispatcher: Parallelize build alert generation.

BUG= chromium:716240 
TEST=som_alerts_dispatcher

Change-Id: I974dc76eaec459e2ea159351fc05efcdffeb9350
Reviewed-on: https://chromium-review.googlesource.com/490541
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a/scripts/som_alerts_dispatcher.py
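
The commit message does not describe how the parallelization was done. As a rough illustration of the idea only, the sketch below uses the standard library's `ThreadPoolExecutor`; the function name, `make_alert` callable, and worker count are assumptions, and the real change may use a different mechanism:

```python
from concurrent.futures import ThreadPoolExecutor


def generate_alerts(build_ids, make_alert, max_workers=8):
    """Run make_alert(build_id) for each build concurrently.

    make_alert is a hypothetical callable that fetches logs and builds
    the alert for one build. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Threads suit this workload because each call mostly waits on
        # network I/O, so a slow build no longer blocks the others.
        return list(pool.map(make_alert, build_ids))
```

Note that `pool.map` re-raises the first exception from any worker when its result is consumed, so a per-build try/except inside `make_alert` would be needed to keep one bad build from discarding the rest.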

Status: Fixed (was: Assigned)
I still want to do partial logs, but I'll do that separately from this bug.

Comment 9 by dchan@chromium.org, Aug 1 2017

Labels: VerifyIn-61

Comment 10 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
