
Issue 909079

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Feature request: CCI monitoring/alerts for false rejects.

Project Member Reported by erikc...@chromium.org, Nov 28

Issue description

go/cq-slo-dash has daily stats for false rejects. A recent spike [almost 10X!] in false rejects appears (?) to have gone unnoticed for a week until I manually checked the graphs. 
https://bugs.chromium.org/p/chromium/issues/detail?id=909074

Cross-checking against go/top-cq-flakes showed a clear spike in flakiness caused by fuchsia_x64. 

Given that we've been able to consistently keep false rejects below 10% since the introduction of 'retry with patch' and its associated fixes [~9/20], I think it would be appropriate to add monitoring/alerts so that a CCI trooper investigates if false rejects exceed 10%.

As we continue to drive down false rejects, I'd like to eventually bring that threshold even lower.
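
To make the request concrete, here's a minimal sketch of the check in Python. The data source is a placeholder for whatever go/cq-slo-dash already computes per day, and the print would be replaced by the real alerting hook; none of these names are existing APIs.

FALSE_REJECT_THRESHOLD = 0.10  # the 10% level proposed above

def find_threshold_breaches(daily_stats, threshold=FALSE_REJECT_THRESHOLD):
  """Returns (date, rate) pairs for days whose false-reject rate exceeds the threshold.

  daily_stats: iterable of (date, total_cq_attempts, false_rejects) tuples.
  """
  breaches = []
  for date, total, false_rejects in daily_stats:
    if total == 0:
      continue
    rate = false_rejects / float(total)
    if rate > threshold:
      breaches.append((date, rate))
  return breaches

if __name__ == '__main__':
  # Made-up numbers purely for illustration; real input would come from the dashboard.
  sample = [('2018-11-20', 400, 12), ('2018-11-26', 380, 55)]
  for date, rate in find_threshold_breaches(sample):
    # In production this would notify the CCI trooper instead of printing.
    print('ALERT: false-reject rate %.1f%% on %s exceeds %.0f%%' %
          (rate * 100, date, FALSE_REJECT_THRESHOLD * 100))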

+ jbudorick, wdyt?
 
Attachment: Screen Shot 2018-11-27 at 8.00.04 PM.png (119 KB)
This seems likely to just be excess trooper noise to me; we've known about significant issues w/ the bots over the last week (and have gotten a *ton* of alerts about them). We may already notice problems at a lower level than the false-reject rate, though.
The root cause for this bug was a bad Fuchsia SDK roll. 

> we've known about significant issues w/ the bots over the last week (and have gotten a *ton* of alerts about them)

Are you also referring to the bad SDK roll? And were CCI troopers able to easily pinpoint the root problem?

This particular problem sits at the intersection of Chromium sheriffs, CCI, and CATS, and variations of it are likely to recur. Basically:

1) A bad Chromium CL lands, causing flakes in multiple test suites and a large spike in INVALID_TEST_RESULTS.
2) Find-it starts filing bugs ~5 days ago [issue 907804].
3) Chromium sheriffs start disabling tests ~4 days ago. They repeatedly remove the Sheriff label, which Find-it keeps re-adding [c#14, c#15, c#16, c#22].
4) wez@ starts investigating 30 hours ago, discovers the root cause, lands a fix, and begins re-enabling tests.

In this case, Find-It did alert Chromium Sheriffs about the sudden, massive spike in flakiness, but the signal was lost in the noise. In particular, the spike in INVALID_TEST_RESULTS was not investigated at all. None of the sheriffs noticed the spike in flakiness across multiple test suites. 

If we had been monitoring false rejects, we could theoretically have discovered this problem quite easily and escalated. It took me less than 5 minutes of digging to realize that something was terribly wrong and that we needed to escalate ASAP to Fuchsia experts.



Cc: nednguyen@chromium.org
Taking this one step further: if a bad CL lands that causes a spike in flakiness due to INVALID_TEST_RESULTS but not TEST_FAILURE, then the spike in false rejects will go unnoticed by both Chromium sheriffs and CCI. The CATS team will notice, but currently there's no process for any of these three groups to follow up.

Example: back on 9/7, stgao@ filed a bug indicating that a large proportion of false rejects was due to INVALID_TEST_RESULTS [Issue 881991]. I started investigating alongside the 'retry with patch' improvements and discovered that INVALID_TEST_RESULTS was caused by bugs both in test suite runners and in infra-owned code. Some of these bugs had been around for a long time.

I can be convinced that we don't need monitoring for false rejects caused by TEST_FAILURE. That is potentially covered by Find-It's automatic bug filing + sheriffs [in this case it didn't work out well, but this was also a holiday week, so maybe it's the exception].

I do think that we need monitoring/alerts for false rejects caused by INVALID_TEST_RESULTS. Historically those have gone unnoticed. Unfortunately, this requires case-by-case investigation to determine whether the problem lies in Chromium code or infra code. +nednguyen [who might be in the best position to eventually own this]
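
As a rough illustration of what that monitoring could look like (a sketch only; the attempt record schema and the per-day threshold below are assumptions, not an existing CQ API), the idea is to bucket false rejects by failure reason and alert CCI only on the INVALID_TEST_RESULTS bucket:

INVALID_TEST_RESULTS = 'INVALID_TEST_RESULTS'
TEST_FAILURE = 'TEST_FAILURE'

def bucket_false_rejects(attempts):
  """Groups false-rejected CQ attempts by their failure reason.

  attempts: iterable of dicts with 'false_reject' (bool) and 'failure_reason'
  keys; the schema is assumed here purely for illustration.
  """
  buckets = {}
  for attempt in attempts:
    if not attempt['false_reject']:
      continue
    buckets.setdefault(attempt['failure_reason'], []).append(attempt)
  return buckets

def alert_on_invalid_results(buckets, daily_threshold=5):
  # Only the INVALID_TEST_RESULTS bucket is routed to CCI; TEST_FAILURE spikes
  # are left to Find-It + sheriffs, per the comments above. The threshold of 5
  # per day is an arbitrary placeholder.
  invalid = buckets.get(INVALID_TEST_RESULTS, [])
  if len(invalid) >= daily_threshold:
    print('CCI ALERT: %d false rejects with INVALID_TEST_RESULTS today' % len(invalid))
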
I'd argue that INVALID_TEST_RESULTS should result in some sort of infra failure that gets sent to the CCI team for frontline triage, though it doesn't currently do so.

Possibly it should go to the core-automation team and their trooper rotation, if they actually had one ;).
Cc: -nednguyen@chromium.org nedngu...@google.com
#4: a binary failing to emit a result JSON doesn't usually fit in CCI's scope, at least in my conception thereof -- I think the CCI/CCA boundary is at the test task boundary. In the absence of a CCA rotation, though, I could potentially be convinced.
agreed, that would be my thinking as well. The only other reason I'd put it in CCI's scope for now is that *someone* should notice it, and I think sheriffs would probably be confused by it. However, maybe I'm wrong and it should be sheriffs instead of CCI until CCA can deal w/ it.
> a binary failing to emit a result JSON doesn't usually fit in CCI's scope

Understood. Unfortunately, I have also seen examples where INVALID_TEST_RESULTS occurs due to a bug in infra code. Perhaps that is rare relative to the former? 

example 1: https://chromium-review.googlesource.com/c/chromium/tools/build/+/1234173 [chromium test recipe]
example 2: https://bugs.chromium.org/p/chromium/issues/detail?id=895027 [android test runner, I guess that's owned by CCA?]

Maybe I just don't know the scope of CCA (?)
chromium recipe is definitely CCI; android test runner is me for historical reasons but probably not CCI.

CCA is relatively new, so its scope isn't super clear yet.
Labels: Infra-Platform-Test
