
Issue 810589


Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: ----




Analysis is stuck in try-job loop at one revision 534726

Project Member Reported by st...@chromium.org, Feb 9 2018

Issue description

Comment 1 by st...@chromium.org, Feb 9 2018

Labels: -Pri-2 Pri-1
Please investigate and fix the bug, and make sure the bug is not in the new analysis pipeline.

Comment 2 by st...@chromium.org, Feb 9 2018

FYI: I've cancelled the analysis manually.

Comment 3 by st...@chromium.org, Feb 9 2018

Labels: -Pri-1 Pri-0
I guess P0 is more appropriate, as we plan to run more try-jobs in the next deployment.
The root cause has been identified. In process_flake_try_job_result_pipeline.py, an error in the returned try-job data was detected correctly but stored on the FlakeTryJob entity. FlakeTryJob has no .error field, so the value was set and then lost. NextCommitPositionPipeline checks FlakeTryJobData.error, which was never set because the error had been written to FlakeTryJob by mistake. The lookback algorithm therefore ran another bisect, requested the same commit position again, hit the same error, and so on, resulting in the infinite loop.
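
To illustrate the failure mode, here is a minimal Python sketch. It does not use the real Findit models or pipelines: TryJob, TryJobData, run_try_job, and analyze are hypothetical stand-ins, and the commit positions and pass rates are made up. It only shows how an error written to the wrong object leaves the bisect range unchanged, so the same commit position is requested on every pass.

```python
from typing import List, Optional


class TryJob:
    """Hypothetical stand-in for FlakeTryJob: it declares no error field."""


class TryJobData:
    """Hypothetical stand-in for FlakeTryJobData: the next step reads error here."""

    def __init__(self) -> None:
        self.error: Optional[dict] = None


def run_try_job(commit_position: int, buggy: bool) -> TryJobData:
    """Simulates a try job that always comes back with an error."""
    try_job, data = TryJob(), TryJobData()
    error = {"message": "try job failed at %d" % commit_position}
    if buggy:
        # Wrong object: the attribute is set here but never read again.
        try_job.error = error
    else:
        # Right object: the caller checks this field.
        data.error = error
    return data


def analyze(buggy: bool, max_iterations: int = 5) -> List[int]:
    """Returns the commit positions requested by a toy bisect loop."""
    # Assumed data: stable at 100 (pass rate 0.0 flaky), flaky at 108.
    data_points = {100: 0.0, 108: 0.5}
    requested = []
    for _ in range(max_iterations):
        lower = max(cp for cp, rate in data_points.items() if rate == 0.0)
        upper = min(cp for cp, rate in data_points.items() if rate > 0.0)
        commit_position = (lower + upper) // 2
        requested.append(commit_position)
        data = run_try_job(commit_position, buggy)
        if data.error:
            break  # The error is visible, so the analysis stops.
        # No data point was recorded, so the next pass bisects the same range.
    return requested


print(analyze(buggy=True))   # [104, 104, 104, 104, 104]
print(analyze(buggy=False))  # [104]
```

The max_iterations cap exists only so the sketch terminates; without it, the buggy variant would keep requesting position 104.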

Fixes:
1. Set error in the correct place
2. As a defense mechanism, when NextCommitPositionPipeline is called, ensure that the commit position that was just run indeed has a corresponding data point.
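
The second fix can be sketched roughly as below. This is an assumption about the shape of the check, not the actual Findit code: next_commit_position, MissingDataPointError, and the data_points mapping (commit position to flaky rate) are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional


class MissingDataPointError(Exception):
    """Hypothetical error: the last try job left no data point behind."""


def next_commit_position(
    data_points: Dict[int, float],
    previously_run: Optional[int] = None,
) -> Optional[int]:
    """Picks the next commit position to bisect, or None when finished.

    data_points maps commit position to flaky rate (0.0 means stable).
    previously_run is the commit position the last try job was asked to run.
    """
    # Defense mechanism: abort rather than re-request the same position.
    if previously_run is not None and previously_run not in data_points:
        raise MissingDataPointError(
            "no data point for commit position %d" % previously_run)

    lower = max(cp for cp, rate in data_points.items() if rate == 0.0)
    upper = min(cp for cp, rate in data_points.items() if rate > 0.0)
    if upper - lower <= 1:
        return None  # Adjacent positions: nothing left to bisect.
    return (lower + upper) // 2


# The try job at 104 produced a data point, so bisection continues.
print(next_commit_position({100: 0.0, 104: 0.5, 108: 0.5}, previously_run=104))  # 102

# The try job at 104 produced nothing, so the analysis aborts.
try:
    next_commit_position({100: 0.0, 108: 0.5}, previously_run=104)
except MissingDataPointError as e:
    print("aborted:", e)
```

Raising instead of silently continuing means a future bug that drops a data point ends the analysis after one wasted try job rather than looping.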

Comment 6 by bugdroid1@chromium.org, Feb 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/4b2a531231829ef6c44e6cca375fcc6568e54c2a

commit 4b2a531231829ef6c44e6cca375fcc6568e54c2a
Author: Jeffrey Li <lijeffrey@chromium.org>
Date: Thu Feb 15 13:37:21 2018

[Findit] Flake Analyzer - Fixing infinite loop in culprit analysis

The infinite loop was caused by an improper check for flake try jobs that
had undetected errors. The error is stored in the incorrect model, which
when read, would not be present.

1. Fix the error to be stored properly in FlakeTryJobData, not FlakeTryJob.
2. Implement a check to ensure the previously-ran commit position indeed
   has a corresponding data point before proceeding, else abort to prevent
   similar issues from over-utilizing resources.

Bug:  810589 
Change-Id: I8e5fac4447ff97b741c70787944e17a2909ff248
Reviewed-on: https://chromium-review.googlesource.com/920781
Reviewed-by: Shuotao Gao <stgao@chromium.org>
Commit-Queue: Jeffrey Li <lijeffrey@chromium.org>

[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/process_flake_try_job_result_pipeline.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/test/process_flake_try_job_result_pipeline_test.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/recursive_flake_try_job_pipeline.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/test/recursive_flake_try_job_pipeline_test.py

Status: Fixed (was: Assigned)
