The analysis where this shows up:
https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWVy9AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCK9AWNocm9taXVtLm1hYy9NYWMxMC4xMSBUZXN0cy8xNjUzMy9icm93c2VyX3NpZGVfbmF2aWdhdGlvbl9icm93c2VyX3Rlc3RzIG9uIE1hYy0xMC4xMS9RWEJ3UTI5dWRISnZiR3hsY2s1bGQxQnliMlpwYkdWTllXNWhaMlZ0Wlc1MFFuSnZkM05sY2xSbGMzUXVURzlqYTJWa1VISnZabWxzWlZKbGIzQmxibGRwZEdoT2IxZHBibVJ2ZDNNPQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM
Pipeline UI:
https://findit-for-me.appspot.com/_ah/pipeline/status?root=9447783ad289404bbc4fb11d959f1e44&auto=false#pipeline-0435f577f402465489d824ae200066c4
The same build number, 16533, is returned over and over again, and I'm not sure when it will stop. I think this happens because the data points in the analysis are not being updated.
A good fix would be to check whether the build number returned by lookback is the same build number that NextBuildNumberPipeline received as an argument. If it is, abort the whole pipeline.
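A rough sketch of that check, for illustration only. The function and exception names here are hypothetical and are not taken from next_build_number_pipeline.py; the real pipeline would abort through whatever mechanism the pipeline framework provides rather than by raising a plain exception.

```python
class RepeatedBuildNumberError(Exception):
    """Hypothetical error: lookback proposed the build that was just analyzed."""


def check_next_build_number(current_build_number, next_build_number):
    """Return next_build_number, or raise if it repeats the current build.

    current_build_number: the build number the pipeline received as an argument.
    next_build_number: the build number returned by the lookback step.
    """
    if next_build_number == current_build_number:
        # Rerunning the same build cannot add a new data point, so
        # continuing would only repeat the same step indefinitely.
        raise RepeatedBuildNumberError(
            'lookback returned build %d again; aborting the pipeline'
            % current_build_number)
    return next_build_number
```

With the analysis above, lookback returning 16533 while the pipeline is already at 16533 would raise instead of scheduling another rerun; any other build number passes through unchanged.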
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560

commit e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560
Author: Brandon Wylie <wylieb@chromium.org>
Date: Tue Aug 15 20:08:04 2017

[Findit] Flake Analyzer - Fix for infinite nextbuildnumber loop.

Bug:755359
Change-Id: I7119d7c7bd61331a2484dc93a0a064815aa7421a
Reviewed-on: https://chromium-review.googlesource.com/615095
Commit-Queue: Brandon Wylie <wylieb@chromium.org>
Reviewed-by: Jeffrey Li <lijeffrey@chromium.org>

[modify] https://crrev.com/e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560/appengine/findit/waterfall/flake/next_build_number_pipeline.py
[modify] https://crrev.com/e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560/appengine/findit/waterfall/flake/test/recursive_flake_pipeline_test.py
[modify] https://crrev.com/e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560/appengine/findit/waterfall/flake/recursive_flake_pipeline.py
[modify] https://crrev.com/e7644aa3213ba7f7a8f0cf2d2faeaa51dbd95560/appengine/findit/waterfall/flake/test/next_build_number_pipeline_test.py
I checked a lot of recent analyses of flaky tests and found the following:

1. A lot of analyses are still in RUNNING status even though they were started a few days ago, while the pipeline indicates that they have completed. Of the 100 analyses listed at the link below, 65 are still in RUNNING status and 17 ran into errors.
https://findit-for-me.appspot.com/waterfall/list-flakes?cursor=CuIBChkKDHJlcXVlc3RfdGltZRIJCNyYpqzVrNUCEsABag9zfmZpbmRpdC1mb3ItbWVyrAELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCJ2Y2hyb21pdW0ubGludXgvTGludXggVGVzdHMvNTk2MjAvY29tcG9uZW50c191bml0dGVzdHMvUm1sbGJHUlVjbWxoYkhOUWNtOTJhV1JsY2xSbGMzUXVVSEp2ZG1sa1pWTjViblJvWlhScFkxUnlhV0ZzY3c9PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAUMGAAgAQ==&direction=next

2. The "infinite loop" itself stops after ~160 instances of the RecursiveFlakePipeline. The analysis Brandon gave above is a good example. A more recent one is:
https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWVynwELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCJpY2hyb21pdW0ud2luL1dpbjcgVGVzdHMgKGRiZykoMSkvNjIzMzIvYnJvd3Nlcl90ZXN0cy9VMlZ6YzJsdmJsSmxjM1J2Y21WVVpYTjBMbEpsYzNSdmNtVlhaV0pWU1ZObGRIUnBibWR6DAsSE01hc3RlckZsYWtlQW5hbHlzaXMYAQw
When it had just started, it looked like this: https://screenshot.googleplex.com/VNUhV7vzemT.png
Now it looks like this: https://screenshot.googleplex.com/h6vB4ZTbNT5.png
Pipeline status: https://findit-for-me.appspot.com/_ah/pipeline/status?root=0d9242e38eb5453087c8cdf321ddff0b&auto=false#pipeline-9ec735c4d228482ca845713073347e06

3. The "infinite loop" happens when Flake Analyzer tries to rerun at the same build point, one at which the test is stable rather than flaky.

Besides this specific bug, I'm surprised that our analysis pipeline is so unreliable. It seems better to fix this bigger issue before adding more features to Flake Analyzer.

Brandon, since you have already looked into this bug, I'm assigning it to you to follow up. Please file a meta bug for stabilizing Flake Analyzer, with sub-bugs for the breakdown tasks you can identify.
Update to point 2 above in comment #3: 160 seems low and wrong for the sample I gave in comment #3. It is still running and there have been 200+ instances of RecursiveFlakePipeline already (Children: 985 / 1182 done) https://findit-for-me.appspot.com/_ah/pipeline/status?root=0d9242e38eb5453087c8cdf321ddff0b&auto=false#pipeline-63586d7e541f4222993b3edd522dd7e6
Looks like the root cause of this is: https://bugs.chromium.org/p/chromium/issues/detail?id=756214
Comment 1 by wylieb@chromium.org, Aug 14 2017