Lots of Android bots and a Linux bot started failing around 10:30 on 3/31/16
Issue description

Revision range first seen: 384361 - 384610
Bots: all Android bots

The actual test failure varies from test to test, but usually involves some sort of timeout: WebSocketTimeoutException, Javascript evaluation timeout, Dev tools timeout exception, etc. I've seen a Chrome crash screenshot in some of the test failures: http://bit.ly/1N4QzLg

Does anyone have any insight into what might be causing this?
Apr 11 2016
Sorry, that screenshot was supposed to be: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/profiler-file-id_1-2016-04-11_04-04-1012655.png
Apr 12 2016
Do you have any links to build outputs that we could look at? Something like https://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/33837
Apr 12 2016
Yep - here's the log output for a recent Galaxy S5 failure: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/2258

And here's the log output for a recent Linux failure: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf%20%285%29/builds/14539
Apr 15 2016
This is a pretty tricky bug. When I ran a normal bisect on the problem, it came back with this CL as the culprit: https://chromium.googlesource.com/chromium/src/+/846f623cb1ef78d90e620c2a7a6557544a86b692. The tricky part: that CL has already been reverted, and the failures continued.

The interesting thing about that CL: when I went back and looked at the perf waterfall run at the CL right after it, rasterize_and_record_micro.top_25_smooth passed. The CL seems to have caused a few flaky failures among other performance tests, and my best guess is that it also caused a flaky failure on rasterize_and_record_micro.top_25_smooth. That would explain why the bisect labeled it as the culprit even though it passed on the waterfall.

Another bisect searching for the *same* bug came back with this CL as the culprit: https://chromium.googlesource.com/chromium/src/+/eab85df0275b825e428954c4030e358ce968ecd9. Funny thing: *that* CL was also reverted a long time ago, and the problem didn't go away that time either.

My suspicion is that sometime after https://chromium.googlesource.com/chromium/src/+/846f623cb1ef78d90e620c2a7a6557544a86b692 landed (and while it was causing flake for the test on the waterfall), another CL landed that caused a permanent failure of rasterize_and_record_micro.top_25_smooth. I think we have a pretty good idea of when that "bad" CL must have landed by, because rather than having 3 or 4 tests failing on every waterfall run (caused by 846f623 flakes), we now only have one test failing: rasterize_and_record_micro.top_25_smooth. I'm going to use 65f484b3674d177e98ed04aa2ec30482e05a6c3a as my known "bad" CL.

What I'm doing now is running a manual bisect with the revert of 846f623 patched in, using all 3 Linux trybots. The first run will contain 3 try jobs run at:
- 57a75864e012b06db30f4b027e24f5fd758e46a3, the CL before 846f623cb1ef78d90e620c2a7a6557544a86b692
- 65f484b3674d177e98ed04aa2ec30482e05a6c3a, with the revert of 846f623 patched in
- the CL right in the middle of those two, with the revert of 846f623 patched in

When all of those jobs come back, I'll do a trisect of the known-bad range, dividing it into quarters and kicking off a try job of rasterize_and_record_micro.top_25_smooth at each quartile of the range (with the revert of 846f623 patched in). I'll keep doing that until we have it narrowed down to one CL. At some point, it may be necessary to patch in the revert of eab85df0275b825e428954c4030e358ce968ecd9, too.

I'll update this bug letting everyone know how it goes :)
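In case it helps anyone follow along, here's a minimal sketch of roughly how I'll pick the probe points for each subsequent step. The `quartile_probes` helper and the `revisions` list are just made-up names for illustration (assume `revisions` is the ordered commit list between the known-good and known-bad CLs), not part of any real tooling:

def quartile_probes(revisions):
    # Three probe points split the remaining range into four quarters,
    # so each step shrinks the search space by roughly 4x instead of
    # the 2x you get from a plain bisect.
    n = len(revisions)
    return [revisions[n // 4], revisions[n // 2], revisions[3 * n // 4]]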
Apr 15 2016
Wow, great detective work. Good luck Charlie!
Apr 15 2016
Tryjob for 57a75864e012b06db30f4b027e24f5fd758e46a3, our known GOOD changelist: https://codereview.chromium.org/1896433002/
Tryjob for 65f484b3674d177e98ed04aa2ec30482e05a6c3a, our known BAD changelist: https://codereview.chromium.org/1892103002/
Tryjob for 842eb06726eb451439cf0e54cbf16cebccc93192, the CL exactly between the two: https://codereview.chromium.org/1890143002/

For anyone interested in how to do this, I started by getting the list of commits between the two:

git log 57a75864e012b06db30f4b027e24f5fd758e46a3...65f484b3674d177e98ed04aa2ec30482e05a6c3a --pretty=oneline --reverse > /tmp/manual_trisect.txt

The file was 318 lines long (meaning there are 318 revisions to bisect through), so the revision halfway through is on the 159th line: 842eb06726eb451439cf0e54cbf16cebccc93192.

At all three of the CLs, I was forced to make a meaningless code change: `tools/perf/run_benchmark try` doesn't allow you to just run a test at a checked-in CL (for some unknown reason). To circumvent this, I started a branch at each of these CLs, added whitespace to a file, committed the whitespace change, and started a tryjob with:

tools/perf/run_benchmark try linux rasterize_and_record_micro.top_25_smooth

We'll see what they turn up.
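If anyone wants to double-check the midpoint, here's a minimal sketch, assuming the /tmp/manual_trisect.txt produced by the git log command above (one commit hash at the start of each line):

# Pick the revision roughly halfway through the list; for the 318-line
# file this is the 159th line (being off by one doesn't matter for a bisect).
with open("/tmp/manual_trisect.txt") as f:
    revisions = [line.split()[0] for line in f if line.strip()]
print(revisions[(len(revisions) - 1) // 2])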
Apr 15 2016
Also, to anyone interested in how long this is likely to take: the number of iterations needed to bisect N revisions is ceil(log2(N)), so a plain bisect would take ceil(log2(318)) = ceil(8.31) = 9 iterations to find the culprit CL in our range. However, because we're using 3 testers in parallel and cutting the range into quarters at each step, we only need ceil(log4(318)) = ceil(4.16) = 5 iterations.

... I think. Just trying to figure out how long we can expect this to take.
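For anyone who wants to check that math, a quick sketch:

import math

n = 318  # revisions in the known good -> known bad range
serial = math.ceil(math.log2(n))      # 1 probe per step: ceil(8.31...) = 9
parallel = math.ceil(math.log(n, 4))  # 3 probes per step: ceil(4.16...) = 5
print(serial, parallel)               # -> 9 5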
Apr 15 2016
Ah, and a couple of other important notes: we think that https://crrev.com/846f623cb1ef78d90e620c2a7a6557544a86b692 makes our test flaky, so we need to revert that CL for any tests that we run. However, that CL was actually reverted in http://crrev.com/e37fce4deefe8c872c4329bdf4233858aab8b9a5, which is also in our test range. In our 318-CL range, that means CL #1 was reverted at CL #101. Any revisions we need to test in that range will need a revert of https://crrev.com/846f623cb1ef78d90e620c2a7a6557544a86b692 patched in.
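To make that bookkeeping concrete, a tiny sketch, if I'm reading the range right (the line numbers are positions in the 318-line /tmp/manual_trisect.txt list; the helper name is made up):

# 846f623 is line 1 of the list and its revert (e37fce4) is line 101,
# so only probes before line 101 still contain the flaky CL and need
# the revert patched in; anything at or after line 101 already has the
# revert in tree.
FLAKY_CL_LINE = 1
REVERT_LANDED_LINE = 101

def needs_revert_patch(probe_line):
    return FLAKY_CL_LINE <= probe_line < REVERT_LANDED_LINE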
Apr 15 2016
Gah. Looks like my whitespace-only changes are failing to merge correctly... investigating.
Apr 15 2016
Not exactly sure why it's failing, but it looks like it's failing to find some line in FrameView.cpp (where I made the dummy change). I'm going to try this instead:

touch run_tryjobs
git add run_tryjobs
git commit -m "Run tryjobs."
tools/perf/run_benchmark try linux rasterize_and_record_micro.top_25_smooth

Using this strategy, here are the new jobs:
GOOD: https://codereview.chromium.org/1889193002
BAD: https://codereview.chromium.org/1890153002
MIDDLE: https://codereview.chromium.org/1891233004

Hopefully this will at least circumvent the problem.
Apr 15 2016
Closing this as a duplicate of 600377 - please follow along there if you're interested