
Issue 602413

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 600377
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug-Regression




Lots of Android bots and a Linux bot started failing around 10:30 on 3/31/16

Project Member Reported by charliea@chromium.org, Apr 11 2016

Issue description

Revision range first seen: 384361 - 384610
Bots: all Android bots

The actual test failure varies from test to test, but usually involves some sort of timeout: WebSocketTimeoutException, Javascript evaluation timeout, Dev tools timeout exception, etc.

I've seen a Chrome crash screenshot in some of the test failures: http://bit.ly/1N4QzLg

Does anyone have any insight into what might be causing this?
 
This line looks suspicious from the stack:

"Locked Domain Expired: Not valid after 2016-04-11T20:10:05.404"
Cc: rnep...@chromium.org stip@chromium.org

Comment 3 by stip@chromium.org, Apr 11 2016

Labels: Infra-Android
Status: Available (was: Untriaged)
Do you have any links to build outputs that we could look at? Something like https://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/33837
Owner: charliea@chromium.org
Yep - here's the log output for a recent Galaxy S5 failure: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/2258

And here's the log output for a recent Linux failure: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf%20%285%29/builds/14539

Comment 7 by m...@chromium.org, Apr 13 2016

Cc: m...@chromium.org
Issue 602767 has been merged into this issue.

Comment 8 by m...@chromium.org, Apr 13 2016

Cc: reve...@chromium.org
Issue 602770 has been merged into this issue.

Comment 9 by dtu@chromium.org, Apr 14 2016

Labels: OS-Android

Comment 10 Deleted

This is a pretty tricky bug. When I ran a normal bisect on the problem, it came back with this CL as the culprit: https://chromium.googlesource.com/chromium/src/+/846f623cb1ef78d90e620c2a7a6557544a86b692. The tricky part: this CL has been reverted and the failures continued.

The interesting thing about this CL: when I went back and looked at the actual perf waterfall run at the CL right after it, rasterize_and_record_micro.top_25_smooth passed. This CL seems to have caused a few flaky failures among other performance tests, and my best guess is that it also caused a flaky failure on rasterize_and_record_micro.top_25_smooth. That would explain why the bisect labeled it as the culprit even though it passed on the waterfall.

Another bisect searching for the *same* bug came back with this CL as the culprit: https://chromium.googlesource.com/chromium/src/+/eab85df0275b825e428954c4030e358ce968ecd9

Funny thing: *that* CL was also reverted a long time ago, and the problem didn't go away that time either.

My suspicion is that sometime after https://chromium.googlesource.com/chromium/src/+/846f623cb1ef78d90e620c2a7a6557544a86b692 was submitted (and while it was causing flake for the test on the waterfall), another CL landed that caused a permanent failure for rasterize_and_record_micro.top_25_smooth. I think we have a pretty good idea of when that "bad" CL landed, because rather than having 3 or 4 tests failing every waterfall run (caused by 846f623 flakes), we now only have one test failing: rasterize_and_record_micro.top_25_smooth. I'm going to use 65f484b3674d177e98ed04aa2ec30482e05a6c3a as my known "bad" CL.

What I'm doing now is running a manual bisect with the revert of that CL patched in. I'll be using all 3 Linux trybots to do this. The first run will contain 3 try jobs run at:

  - 57a75864e012b06db30f4b027e24f5fd758e46a3, the CL before 846f623cb1ef78d90e620c2a7a6557544a86b692
  - 65f484b3674d177e98ed04aa2ec30482e05a6c3a, with the revert of 846f623 patched in
  - the CL right in the middle of those two, with the revert of 846f623 patched in

When all of those jobs come back, I'll run a trisect of the known-bad range, dividing it into quarters and kicking off a try job of rasterize_and_record_micro.top_25_smooth at each quartile of the range (with the revert of 846f623 patched in). I'll continue to do this until we have it narrowed down to one CL.
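
For anyone curious how the quartile points get picked each round, here's a minimal Python sketch. It assumes the ordered commit list is already in hand; the helper is hypothetical, not part of any Chromium tooling.

  # Sketch (hypothetical helper): given the ordered list of revisions from
  # known good to known bad, pick the three interior quartile points to try next.
  def quartile_revisions(revisions):
      n = len(revisions)
      # Endpoints are already known good/bad, so test the 1/4, 1/2, and 3/4 marks.
      return [revisions[(n * i) // 4] for i in (1, 2, 3)]

After each round, whichever quarter still brackets the good-to-bad flip becomes the new range.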

At some point, it may be necessary to patch in the revert of eab85df0275b825e428954c4030e358ce968ecd9, too. 

I'll update this bug letting everyone know how it goes :)
Cc: dtu@chromium.org
Wow, great detective work. Good luck Charlie!
Tryjob for 57a75864e012b06db30f4b027e24f5fd758e46a3, our known GOOD changelist: https://codereview.chromium.org/1896433002/

Tryjob for 65f484b3674d177e98ed04aa2ec30482e05a6c3a, our known BAD changelist: https://codereview.chromium.org/1892103002/

Tryjob for 842eb06726eb451439cf0e54cbf16cebccc93192, the CL exactly between the two: https://codereview.chromium.org/1890143002/

For anyone interested in how to do this, I started out by getting the list of commits in between the two:

git log 57a75864e012b06db30f4b027e24f5fd758e46a3...65f484b3674d177e98ed04aa2ec30482e05a6c3a --pretty=oneline --reverse > /tmp/manual_trisect.txt

The file was 318 lines long (meaning there are 318 revisions that we need to bisect through), which means that the revision halfway through is the 159th line: 842eb06726eb451439cf0e54cbf16cebccc93192.
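
If you'd rather script the midpoint lookup than count lines by hand, a quick sketch against the /tmp/manual_trisect.txt file generated above:

  # Read the oneline git log generated above and print the midpoint revision.
  with open('/tmp/manual_trisect.txt') as f:
      hashes = [line.split()[0] for line in f if line.strip()]
  middle = len(hashes) // 2      # 159 for the 318 revisions here
  print(hashes[middle - 1])      # 1-based line 159: 842eb067...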

At all three of the CLs, I was forced to make a meaningless code change: `tools/perf/run_benchmark try` doesn't let you just run a test at a checked-in CL (for some unknown reason). To work around this, I started a branch at each of these CLs, added whitespace to a file, committed the whitespace change, and started a tryjob with:

  tools/perf/run_benchmark try linux rasterize_and_record_micro.top_25_smooth

We'll see what they turn up.
Also, for anyone interested in how long this is likely to take: the number of iterations needed to bisect N revisions is ceil(log2(N)). That means a plain bisect would take ceil(log2(318)) = ceil(8.31288295528) = 9 iterations to find the correct CL in our range. However, because we're using 3 testers in parallel and cutting the range into quarters with each step, we only need ceil(log4(318)) = ceil(4.16) = 5 iterations.

... I think. Just trying to figure out how long we can expect this to take.
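
Here's the quick sanity check on that math (just Python's math module):

  import math
  n = 318
  print(math.ceil(math.log2(n)))    # 9 -> serial bisect, one job per round
  print(math.ceil(math.log(n, 4)))  # 5 -> 3 parallel jobs, range quartered each round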

Comment 15 Deleted

Ah, and a couple of other important notes: we think that https://crrev.com/846f623cb1ef78d90e620c2a7a6557544a86b692 makes our test flaky, so we need to revert that CL for any tests that we do.

However, https://crrev.com/846f623cb1ef78d90e620c2a7a6557544a86b692 was actually reverted in http://crrev.com/e37fce4deefe8c872c4329bdf4233858aab8b9a5, which is also in our test range. In our 318-CL range, that means CL #1 was reverted at CL #101. Any revision we need to test in that range (between CL #1 and CL #101) will need the revert of https://crrev.com/846f623cb1ef78d90e620c2a7a6557544a86b692 patched in.
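
Mechanically, testing a revision in that range with the revert patched in looks roughly like this (a sketch using plain git via subprocess; the branch naming is arbitrary, and the revert may need manual conflict resolution):

  # Sketch (hypothetical helper): check out a revision from the bad range and
  # patch in the revert of 846f623 before starting the try job.
  import subprocess

  FLAKY_CL = '846f623cb1ef78d90e620c2a7a6557544a86b692'

  def checkout_with_revert(revision):
      subprocess.check_call(['git', 'checkout', '-b', 'test-' + revision[:10], revision])
      # Revert the flake-inducing CL on top of the revision under test.
      subprocess.check_call(['git', 'revert', '--no-edit', FLAKY_CL])

After that, the try job gets kicked off with the same tools/perf/run_benchmark try command as before.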
Gah. Looks like my whitespace-only changes are failing to merge correctly... investigating.
Not exactly sure why it's failing, but it looks like it can't find some line in FrameView.cpp (where I made the dummy change). I'm going to try this instead:

  touch run_tryjobs
  git add run_tryjobs
  git commit -m "Run tryjobs."
  tools/perf/run_benchmark try linux rasterize_and_record_micro.top_25_smooth

Using this strategy, here are the new jobs:

GOOD: https://codereview.chromium.org/1889193002
BAD: https://codereview.chromium.org/1890153002
MIDDLE: https://codereview.chromium.org/1891233004

Hopefully this should at least circumvent the problem.
Summary: Lots of Android bots and a Linux bot started failing around 10:30 on 3/31/16 (was: Lots of Android bots started failing around 10:30 on 3/31/16)
Mergedinto: 600377
Status: Duplicate (was: Available)
Closing this as a duplicate of issue 600377 - please follow along there if you're interested.
Components: Infra>Client>Android
Labels: -Infra-Android
