Analysis is stuck in try-job loop at one revision 534726
Issue description

Page URL: https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWVyoAELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCJqY2hyb21pdW0ubWFjL01hYzEwLjkgVGVzdHMvNTM2MjAvY29udGVudF9icm93c2VydGVzdHMvVFdWdGIzSjVWSEpoWTJsdVoxUmxjM1F1UW5KdmQzTmxja2x1YVhScFlYUmxaRVIxYlhBPQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM

Description: This pipeline has been running for 2 days and is stuck in a loop at a single revision, 534726.

See the try jobs:
https://ci.chromium.org/p/chromium/builders/luci.chromium.findit/findit_variable/865
https://ci.chromium.org/p/chromium/builders/luci.chromium.findit/findit_variable/864
https://ci.chromium.org/p/chromium/builders/luci.chromium.findit/findit_variable/863
https://ci.chromium.org/p/chromium/builders/luci.chromium.findit/findit_variable/862

More can be found in the logs:
https://pantheon.corp.google.com/logs/viewer?project=findit-for-me&duration=PT1H&minLogLevel=0&expandAll=false&timestamp=2018-02-09T00:37:05.974698000Z&interval=NO_LIMIT&resource=gae_app%2Fmodule_id%2Fwaterfall-backend%2Fversion_id%2F13850-8d17315&filters=text:MemoryTracingTest.BrowserInitiatedDump&logName=projects%2Ffindit-for-me%2Flogs%2Fappengine.googleapis.com%252Frequest_log
Feb 9 2018
FYI: I've cancelled the analysis manually.
Feb 9 2018
I guess P0 is more appropriate, as we plan to run more try jobs in the next deployment.
Feb 15 2018
The root cause has been identified:

In process_flake_try_job_result_pipeline.py, there was an error in the returned try job data. It was detected correctly, but it was stored on the FlakeTryJob entity. FlakeTryJob does not have a .error field, so the value was set and then lost.

NextCommitPositionPipeline checks FlakeTryJobData.error, which was never set because the error had been written to FlakeTryJob by mistake. The lookback algorithm therefore ran a bisect again and requested the same commit position, which hit the same error, and so on, resulting in the infinite loop.

Fixes:
1. Set the error in the correct place.
2. As a defense mechanism, ensure in NextCommitPositionPipeline that the commit position that was just run has a corresponding data point.
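The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the actual Findit code: TryJobData, AnalysisAborted, and next_commit_position are hypothetical stand-ins, the data points are modeled as a plain dict from commit position to pass rate, and the midpoint calculation stands in for the real lookback/bisect logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class AnalysisAborted(Exception):
    """Hypothetical exception: the analysis cannot make progress and stops."""


@dataclass
class TryJobData:
    """Hypothetical stand-in for FlakeTryJobData; only the error is modeled."""
    error: Optional[str] = None


def next_commit_position(
    data_points: Dict[int, float],
    previously_run: Optional[int],
    try_job_data: TryJobData,
    lower: int,
    upper: int,
) -> int:
    """Pick the next commit position to run, or abort if no progress is possible.

    data_points maps commit position -> pass rate (an assumed representation).
    """
    # Fix 1: the error is read from the same entity it was stored on.
    if try_job_data.error is not None:
        raise AnalysisAborted("try job failed: %s" % try_job_data.error)

    # Fix 2: defensive check. If the commit position that was just run
    # produced no data point, asking for another bisect step would request
    # the same position again, so stop instead of looping.
    if previously_run is not None and previously_run not in data_points:
        raise AnalysisAborted(
            "no data point for commit position %d" % previously_run)

    # Placeholder for the real lookback/bisect logic.
    return (lower + upper) // 2
```

With this guard, a try job that fails without recording a data point ends the analysis on the next call rather than re-requesting the same commit position indefinitely.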
Feb 15 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/4b2a531231829ef6c44e6cca375fcc6568e54c2a

commit 4b2a531231829ef6c44e6cca375fcc6568e54c2a
Author: Jeffrey Li <lijeffrey@chromium.org>
Date: Thu Feb 15 13:37:21 2018

[Findit] Flake Analyzer - Fixing infinite loop in culprit analysis

The infinite loop was caused by an improper check for fake try jobs that had undetected errors. The error is stored in the incorrect model, which when read, would not be present.

1. Fix the error to be stored properly in FlakeTryJobData, not FlakeTryJob.
2. Implement a check to ensure the previously-ran commit position indeed has a corresponding data point before proceeding, else abort to prevent similar issues from over-utilizing resources.

Bug: 810589
Change-Id: I8e5fac4447ff97b741c70787944e17a2909ff248
Reviewed-on: https://chromium-review.googlesource.com/920781
Reviewed-by: Shuotao Gao <stgao@chromium.org>
Commit-Queue: Jeffrey Li <lijeffrey@chromium.org>

[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/process_flake_try_job_result_pipeline.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/test/process_flake_try_job_result_pipeline_test.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/recursive_flake_try_job_pipeline.py
[modify] https://crrev.com/4b2a531231829ef6c44e6cca375fcc6568e54c2a/appengine/findit/waterfall/flake/test/recursive_flake_try_job_pipeline_test.py
Feb 16 2018
Comment 1 by st...@chromium.org, Feb 9 2018