Frequent purple builds on android_blink_rel
Issue description

The android_blink_rel bot fails frequently, much of the time with "infra failure". According to a quick query (go/android-blink-rel-failure-query), in the last 1000 builds this bot had the following results:

SUCCESS: 627 times
FAILURE: 153 times
RETRY: 3 times
INFRA_FAILURE: 217 times

I think the next step should be to investigate details of a typical-looking purple build on https://build.chromium.org/p/tryserver.chromium.android/builders/android_blink_rel/, but I'm not really sure what to look at. "Killing device forwarder" seems to appear quite often in the webkit_tests log, along with lots of other log lines that appear to indicate some trouble communicating with the device. John, do you have any suggestions for what to look at next?
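For reference, the result counts quoted in the issue description can be turned into rates with a few lines of Python. This is only a sketch of the arithmetic, not the query behind go/android-blink-rel-failure-query; the `result_rates` helper is a hypothetical name used for illustration.

```python
from collections import Counter

# Result counts quoted in the issue description (last 1000 builds).
results = Counter({
    "SUCCESS": 627,
    "FAILURE": 153,
    "RETRY": 3,
    "INFRA_FAILURE": 217,
})


def result_rates(counts):
    """Return each result's share of all builds, as a percentage.

    Hypothetical helper for illustration; not part of any Chromium tool.
    """
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


rates = result_rates(results)

# Print results from most to least common.
for name, pct in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {pct:5.1f}%")
```

On these counts, infra failures account for 21.7% of builds and 62.7% of builds succeed.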
Jun 20 2017
I'm the sheriff and I think I noticed this happening quite a bit today. Here are a few recent builds where webkit_tests seems to have succeeded (stdio shows the tests run and ends with "Testing completed, Exit status: 0") but the bot turned purple:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_blink_rel/builds/2729
https://build.chromium.org/p/tryserver.chromium.android/builders/android_blink_rel/builds/2724
https://build.chromium.org/p/tryserver.chromium.android/builders/android_blink_rel/builds/2718
Jun 20 2017
It looks to me like the tests are completing but the process is then hanging, as if there are subprocesses lying around. I don't see similar failures on the "WebKit Android (Nexus 4)" bot (which has 3 N4's, compared to the trybot's 7 N5's). @jbudorick - at the very least, it strikes me as bad that we have very different machine configs here. Should we maybe upgrade the waterfall to have 7 N5's? Or upgrade them both to N5X's?
Jun 20 2017
qyearsley: sorry I didn't get back to you. Missed this one along the way. dpranke: It is indeed bad that we have different machine configs between the two. The ideal end state would be flipping these into the N5X swarming pool, but there are a few blockers for that (including either enabling short-term SELinux disabling within a task or fixing the remainder of https://bugs.chromium.org/p/chromium/issues/detail?id=567947, as well as whatever else will be required to get this running on swarming). In the short term, moving WebKit Android (Nexus 4) to a 7xN5 bot would be ideal (along w/ a rename). I'm not sure how many spare N5s we have lying around, though...
Jul 11 2017
Moar qq: still seeing ~30% exception rate for android_blink_rel, turning rebaselines into hazardous events for my poor keyboard.
Jul 11 2017
fmalita@, do you know whether the rebaselines you're doing involve tests that are listed in https://cs.chromium.org/chromium/src/third_party/WebKit/LayoutTests/SmokeTests? Hopefully now, most rebaselines should be unaffected by android (since android only runs a small number of the tests) -- you should be able to just run rebaseline-cl and it should usually do the right thing if there are no android results available. In general though, this is definitely still an issue, but I don't think it'll be easy to solve since I assume some of this is flakiness with the devices and their set-up, not in the test runner etc. Note the bug for eventually adding newer versions of Android and new devices is bug 733860.
Jul 12 2017
Thanks for the SmokeTests pointer. Some of the Skia rebaselines are massive (hundreds, sometimes thousands of tests), so I'd be wary of ignoring missing Android results. For smaller rebaselines, it sounds like it should be fairly safe, though.
Dec 10
android_blink_rel is now running on swarming w/ N5s.
Comment 1 by qyears...@chromium.org, May 25 2017