lib/classifier causes slowdown in ReportStage on master-paladin
Issue description
Build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14414/
Stage: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F14414%2F%2B%2Frecipes%2Fsteps%2FReport%2F0%2Fstdout
This is a build that would otherwise self-destruct quickly. However, it is spending a long time in ReportStage, looping over build slaves and trying to fetch their logs. Each fetch is timing out, probably because those builds are still in progress.
,
Apr 28 2017
+dnj in case he has any ideas.
We currently do 3 retries with a 5 second sleep between each for each unsuccessful stage.
Various options (not mutually exclusive):
1. When the stage is marked as inflight in CIDB, do not retry log retrieval on the expectation that the logs will be finished.
2. Decrease the retry sleeps.
3. Parallelize alert dispatching so that different builds and stages are handled in parallel.
4. Move all processing to an external service that is constantly generating alerts or is kicked off by the builds.
#1, and maybe #3, are probably the best options initially.
Not directly related, but worth doing:
- When returning logs, instead of timing out with an exception, return partial logs and an error message that is filled in when results are incomplete.
,
Apr 28 2017
I like #1 and #3.
,
Apr 28 2017
I tried implementing #1 and it helps (cuts the time in half), but it still gets hung up on a build like:
http://vi/chromeos/build_details?build_id=1481843
https://uberchromegw.corp.google.com/i/chromeos/builders/falco-full-compile-paladin/builds/8974
This is a build whose status CIDB has updated to "failed", but which was actually aborted according to buildbot and has incomplete logs.
,
Apr 28 2017
I think the "partial logs" field is reasonable. Remember, if the build legit crashes, the termination signal will never be sent, and the log will be considered streaming until it times out after a long time (day?).
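For illustration, the "partial logs" idea could look roughly like the sketch below. This is not the chromite code: `LogResult`, `read_log`, and the chunk iterator are hypothetical names, and it assumes a timeout surfaces as an `IOError` raised while iterating over the stream.

```python
import collections

# Hypothetical result type: the text read so far, whether the stream
# finished cleanly, and an error message when it did not.
LogResult = collections.namedtuple('LogResult', ['text', 'complete', 'error'])


def read_log(chunks):
    """Collect log chunks, returning partial text instead of raising.

    chunks is assumed to be an iterator of strings that may raise IOError
    (for example on a timeout) before the stream terminates.
    """
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
    except IOError as e:
        # Keep whatever was read and report why the log is incomplete.
        return LogResult(''.join(parts), False, str(e))
    return LogResult(''.join(parts), True, None)
```

A caller can then classify whatever text arrived and check `complete` to decide how much to trust the result.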
,
Apr 29 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/06cfd5621cb29f15c1c67873e5c83bc808f6644b

commit 06cfd5621cb29f15c1c67873e5c83bc808f6644b
Author: David Riley <davidriley@chromium.org>
Date: Sat Apr 29 09:42:49 2017

som_alerts_dispatcher: Don't always retry when retrieving logdog logs.

When som_alerts_dispatcher is gathering logs for classification, it normal retries three times with a five second sleep to handle transient LogDog issues. For inflight/aborted builds, there's a good chance that some of the log descriptors will not be closed so avoid blocking waiting for termination when analyzing builds/stages in those states.

BUG= chromium:716240
TEST=som_alerts_dispatcher CREDS 1481808,1000

Change-Id: I69b904eb05ed80ae48f94a274a6703330cbfcc76
Reviewed-on: https://chromium-review.googlesource.com/490608
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/06cfd5621cb29f15c1c67873e5c83bc808f6644b/lib/logdog.py
[modify] https://crrev.com/06cfd5621cb29f15c1c67873e5c83bc808f6644b/scripts/som_alerts_dispatcher.py
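A minimal sketch of the retry policy this change describes, not the actual diff: `fetch_logs`, the `fetch` callable, and the status strings are hypothetical, and it assumes "three retries" means one initial attempt plus three more.

```python
import time

# Hypothetical status names; the real CIDB constants may differ.
UNFINISHED_STATUSES = frozenset(['inflight', 'aborted'])


def fetch_logs(fetch, status, max_retries=3, sleep_secs=5, sleep=time.sleep):
    """Call fetch(), retrying only when the build/stage should be finished.

    fetch is a hypothetical zero-argument callable that returns the log
    text, or raises IOError on a transient failure or timeout.
    Returns None if every attempt fails.
    """
    # Inflight/aborted builds may never close their log streams, so a
    # single attempt is made instead of waiting for termination.
    attempts = 1 if status in UNFINISHED_STATUSES else 1 + max_retries
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError:
            # Sleep only if another attempt will follow.
            if attempt + 1 < attempts:
                sleep(sleep_secs)
    return None
```

The `sleep` parameter is there only so the delay can be stubbed out when trying the function.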
,
May 2 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a

commit 0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a
Author: David Riley <davidriley@chromium.org>
Date: Tue May 02 02:18:16 2017

som_alerts_dispatcher: Parallelize build alert generation.

BUG= chromium:716240
TEST=som_alerts_dispatcher

Change-Id: I974dc76eaec459e2ea159351fc05efcdffeb9350
Reviewed-on: https://chromium-review.googlesource.com/490541
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/0be2470d2b6e8b44a0c1127ae2eed5f42dcbdc6a/scripts/som_alerts_dispatcher.py
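As a rough illustration of parallelizing per-build alert generation, here is a sketch using the standard library's `concurrent.futures` thread pool. The actual change may well use a different mechanism; `generate_alerts`, `generate_one`, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_alerts(build_ids, generate_one, max_workers=8):
    """Run generate_one(build_id) for each build on a thread pool.

    generate_one is a hypothetical callable that fetches logs and builds
    the alert for one build. Threads suit this case because the work is
    dominated by waiting on network log fetches.
    Results are returned in the same order as build_ids.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_one, build_ids))
```

With this shape, one slow or timing-out build delays only its own worker rather than every build queued behind it.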
,
May 2 2017
I still want to do partial logs, but I'll do that separately from this bug.
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by akes...@chromium.org
, Apr 27 2017