
Issue 845299

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-06-01
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 840427
issue 843199




telemetry_perf_unittests is too flaky on CQ, causing many builds to be retried before they eventually pass

Project Member Reported by st...@chromium.org, May 21 2018

Issue description

telemetry_perf_unittests is one of the top sources of potential CQ false rejections.

Here are two lists of builds that failed due to Swarming task timeouts or similar causes, leading to many build/CQ retries:
1) Affecting 612 different CLs on 711 different CQ Luci builds: https://docs.google.com/spreadsheets/d/1MQXJDXW-dB4TjHwBvRiiax26YYXP8i5c1IF9f0Z7ZPU/edit#gid=1356601349
2) Affecting 208 different CLs on 252 different CQ Buildbot builds: https://docs.google.com/spreadsheets/d/1iAdws7P8-wfzAiLGRjgrpOXrlUW6L-0hXbCxLPUJKIw/edit#gid=1229683811
(Note that the two CL sets might overlap.)

In the limited set of builds I inspected, the flake is due to Swarming task timeouts.
The flake occurs mostly on linux_chromium_rel_ng, win7_chromium_rel_ng, and android_n5x_swarming_rel.
The flake has been occurring since 2018-04-13 19:01:08.69453 UTC, so it seems NOT to be due to a recent change like the one in bug 842978 or 843199 for Win7.
These findings are based on Chromium CQ data from the past 60 days.

(In the new flake detection, we will bring back detection of flaky test steps like this one, but our priority now is to detect flaky tests.)

 

Comment 1 by st...@chromium.org, May 21 2018

Labels: Hotlist-CQ-FalseRejection

Comment 2 by st...@chromium.org, May 22 2018

Description: (edited)

Comment 3 by st...@chromium.org, May 22 2018

Cc: benhenry@chromium.org sullivan@chromium.org jparent@chromium.org jbudorick@chromium.org dpranke@chromium.org
Components: Speed>Telemetry
Owner: nednguyen@chromium.org
Status: Assigned (was: Untriaged)
Ned, would you mind looking into this? telemetry_perf_unittests caused a lot of flaky builds in CQ as shown above.

Comment 4 by st...@chromium.org, May 22 2018

Description: (edited)

Comment 5 Deleted

How are you labelling flakes? I found https://ci.chromium.org/buildbot/tryserver.chromium.win/win7_chromium_rel_ng/167595 in the spreadsheet, which seems like a legitimate failure, not a flaky run.
Blockedon: 843199
It looks like almost all the recent flakes on Win7 are due to issue 843199.
Is it possible to have graph tracking of this data? I suspect there is more than one reason why telemetry_perf_unittests would be flaky, so having graph tracking to drive down the flakes would be great.
I also disagree with your assessment that "it is NOT due to a recent change like the one in bug 842978 or 843199 for Win7." In issue 843199, thakis@ mentioned that the problem is due to his lld switch commit, which landed a while ago.

Comment 10 by st...@chromium.org, May 22 2018

Re #6 and #9:

Criteria to label those steps as flaky:
1. For the same CL/patchset and the same CQ builder, there is one failed build and one successful build. The failed build is classified as a flaky build.
2. This bug ONLY includes flaky builds with the failure type INVALID_TEST_RESULTS (failed builds due to TEST_FAILURE are all excluded). From what I know so far, they are most likely due to Swarming task timeouts or expired Swarming tasks, though there might be more cases here.
3. The builds listed in the two spreadsheets above ALL had a failed "telemetry_perf_unittests (with patch)" step, like https://screenshot.googleplex.com/jSerhBEKO01.png
   If "telemetry_perf_unittests (without patch)" passed or "telemetry_perf_unittests (retry summary)" failed, those builds are excluded.
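
As a rough illustration only, the three criteria above could be expressed in Python along the following lines. The `Build` record, its field names, and the `flaky_builds` helper are hypothetical and are not the actual CQ analysis code or its data schema; only the step names and failure types come from the criteria themselves.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Step names as quoted in the criteria above.
WITH_PATCH = "telemetry_perf_unittests (with patch)"
WITHOUT_PATCH = "telemetry_perf_unittests (without patch)"
RETRY_SUMMARY = "telemetry_perf_unittests (retry summary)"


@dataclass(frozen=True)
class Build:
    """Hypothetical record for one CQ build; field names are assumptions."""
    cl: str
    patchset: int
    builder: str
    build_id: int
    succeeded: bool
    failure_type: Optional[str] = None  # "INVALID_TEST_RESULTS", "TEST_FAILURE", ...
    failed_steps: frozenset = frozenset()
    passed_steps: frozenset = frozenset()


def flaky_builds(builds):
    """Return the failed builds that satisfy criteria 1-3."""
    # Criterion 1: group by (CL, patchset, builder) and require at least one
    # successful build in the same group.
    groups = defaultdict(list)
    for build in builds:
        groups[(build.cl, build.patchset, build.builder)].append(build)

    flaky = []
    for group in groups.values():
        if not any(b.succeeded for b in group):
            continue
        for b in group:
            if b.succeeded:
                continue
            # Criterion 2: only INVALID_TEST_RESULTS, never TEST_FAILURE.
            if b.failure_type != "INVALID_TEST_RESULTS":
                continue
            # Criterion 3: the "(with patch)" step must have failed ...
            if WITH_PATCH not in b.failed_steps:
                continue
            # ... and the build is excluded if "(without patch)" passed or
            # "(retry summary)" failed.
            if WITHOUT_PATCH in b.passed_steps or RETRY_SUMMARY in b.failed_steps:
                continue
            flaky.append(b)
    return flaky
```

A failed build is therefore only counted when a sibling build for the same CL/patchset and builder passed, which is what distinguishes a flaky step from a legitimately broken patch.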

I understand this is tricky. If this still doesn't help much, please feel free to set up a meeting with me, and I can explain it better with a demo.

The build you linked might be due to the recent bug 842978 or 843199.
However, that bug started around May 15. If you look at the two spreadsheets, many of the failed builds on win7_chromium_rel_ng occurred well before that.
IIUC, bug 843199 only caused trouble on Win7. However, we still have a lot of failed builds on linux_chromium_rel_ng and some on android_n5x_swarming_rel.
That's why I concluded that it is not due to this recent change.


Re #8:
Each row in the two spreadsheets has a timestamp, so I believe you can build a time-based graph on top of that in the spreadsheet.
If that doesn't work out, I can add a commit position to each row too.
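
For anyone who prefers to do this outside the spreadsheet, here is a minimal sketch of bucketing rows by day and builder so the counts can be plotted over time. It assumes each exported row is a (timestamp, builder) pair with timestamps formatted like "2018-04-13 19:01:08.69453 UTC"; the `daily_flake_counts` helper and the row layout are hypothetical, and the sample rows are made up for illustration, not taken from the spreadsheets.

```python
from collections import Counter
from datetime import date


def daily_flake_counts(rows):
    """Count flaky builds per (day, builder).

    `rows` is an iterable of (timestamp, builder) pairs; the timestamp is
    assumed to start with an ISO date, e.g. "2018-04-13 19:01:08.69453 UTC".
    """
    counts = Counter()
    for timestamp, builder in rows:
        # Keep only the leading "YYYY-MM-DD" part of the timestamp.
        day = date.fromisoformat(timestamp.split()[0])
        counts[(day, builder)] += 1
    return counts


# Made-up rows, for illustration only.
sample_rows = [
    ("2018-05-01 10:00:00.1 UTC", "linux_chromium_rel_ng"),
    ("2018-05-01 23:59:59.9 UTC", "linux_chromium_rel_ng"),
    ("2018-05-02 00:00:01.0 UTC", "win7_chromium_rel_ng"),
]

for (day, builder), n in sorted(daily_flake_counts(sample_rows).items()):
    print(day.isoformat(), builder, n)
```

Keying the counts on the builder as well as the day makes it possible to track each platform separately.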


stgao@, as mentioned in #8, I doubt that this is a case of a single landed CL making the Telemetry test suite flaky. It's more likely that this suite contains two or three tests that have been flaky for a while, and that some CLs that landed caused even more flakes (e.g., issue 843199). So commit analysis alone is unlikely to help with this type of situation.

I believe the way to approach it is to fix one type of flake at a time, based on the stack trace of the failure. For this approach, it would be great to have graph tracking to know when we have driven the flakes down to a reasonable level, and which platforms/types of problem to focus on next.

Comment 12 by st...@chromium.org, May 22 2018

Description: (edited)

Comment 13 by st...@chromium.org, May 22 2018

Cc: kbr@chromium.org thakis@chromium.org
+thakis@: would you mind confirming whether issue 843199 could affect linux_chromium_rel_ng? We'd like to rule out this cause if possible, because 700+ builds on linux_chromium_rel_ng had a flaky telemetry_perf_unittests step: https://docs.google.com/spreadsheets/d/1MQXJDXW-dB4TjHwBvRiiax26YYXP8i5c1IF9f0Z7ZPU/edit#gid=1356601349

nednguyen@, I agree with you that there might be multiple root causes making telemetry_perf_unittests flaky, and that it's better to tackle one at a time.
However, I'd like to clarify one thing more explicitly: individual flaky tests in telemetry_perf_unittests (failure type: TEST_FAILURE) are EXCLUDED from the data linked in this bug. The data here is specific to telemetry_perf_unittests being flaky as a WHOLE step, resulting in the failure type INVALID_TEST_RESULTS (the root causes could be a Swarming task timeout, an expired Swarming task, etc.). It is possible that some specific tests in the suite cause a whole shard of telemetry_perf_unittests to time out, but that's beyond my knowledge.
It would be great if you or someone on your team could first investigate the 700+ builds on linux_chromium_rel_ng WITH a failed "telemetry_perf_unittests (with patch)" step but WITHOUT explicit test failures. This is one specific platform and one specific type of problem, as you wanted. As I described in the bug summary, NO other platform had failures on such a massive scale except win7, for which the root cause might be issue 843199.
As of now, those 700+ failed builds on linux_chromium_rel_ng, with timestamps, are the best information I can provide from my end. You may create a graph based on the timestamps if that would help.
Personally, I really appreciate any effort to drive potential CQ false rejections down! And I believe you and your team care a lot about that as well :)

(+kbr@ who might be interested in this bug as well)
NextAction: 2018-06-01
I clicked on a few links, and almost all the timeouts are due to issue 843199 and crbug.com/838504, which has been addressed (the test has been suppressed).

Given this, I would wait another 10 days. Can you then run your analysis again and see how the flake rate has changed?

Blockedon: 840427
Meanwhile, eyaich@'s effort of splitting all benchmark smoke tests out to a separate suite will also help stabilize things further.

Comment 16 by st...@chromium.org, May 22 2018

Many thanks Ned for looking into this!

And please feel free to ping me for updated data on the next action date!
Issue 843199 can't affect Linux.
#17: correct. Almost all Linux flakes are due to crbug.com/838504.
The NextAction date has arrived: 2018-06-01
Shuotao: can you run your analysis again and let me know how it has changed since then?
Description: (edited)
I've confirmed that the flakiness on linux_chromium_rel_ng is gone.

For win7_chromium_rel_ng and android_n5x_swarming_rel, I need more time to confirm, because the data is split between BigQuery (Luci builds) and Dremel (Buildbot builds), which requires some plumbing overhead.

Comment 23 by kbr@chromium.org, Jun 6 2018

win7_chromium_rel_ng is now LUCI-only. android_n5x_swarming_rel has been replaced with https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel which is LUCI-only.

First build of android_n5x_swarming_rel on Luci was at 2018-05-20 05:29:54.917 UTC
First build of win7_chromium_rel_ng on Luci was at 2018-05-31 01:20:21.583 UTC
linux_chromium_rel_ng has been on Luci for a long time.

For those three builders on the Luci side, no step-level flakiness of telemetry_perf_unittests has been detected in the past 21 days.
If you still need data on the Buildbot side, please let me know.

Comment 25 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 26 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Speed>Telemetry
