Flaky infra/device failures on L Phone tester
Issue description
https://build.chromium.org/p/chromium.android/builders/Lollipop%20Phone%20Tester
Lots of timeouts, usually for content_browsertests. Might be due to https://chromium-swarm.appspot.com/bot?id=build167-b1--device4 which seems to time out on every content_browsertests run.
,
Nov 21 2017
Issue 787389 has been merged into this issue.
,
Nov 21 2017
Can we simply remove this device from the bots until we are sure the problem is solved? It is causing a lot of false test failures.
,
Nov 21 2017
I guess the reflash didn't help. I removed the device, so it shouldn't be poisoning the pool.
,
Nov 28 2017
bpastene: we are still seeing timeouts on this L tester and suspect an infra issue.
,
Dec 1 2017
Issue 787366 has been merged into this issue.
,
Dec 5 2017
The tester still has more failures than successes. I've observed issues of 3 types:
* Timeout errors on random devices
* Multiple failures on a single device: https://chromium-swarm.appspot.com/bot?id=build167-b1--device1&show_all_events=true&sort_stats=total%3Adesc
* Just random flaky tests
,
Dec 5 2017
Thanks for the report. I'm still trying to determine what causes the random timeouts. The timeout is set at 30 minutes while each shard usually only takes ~15 min to complete, so I'm not sure if bumping the timeout would help much:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1512517140000&f=buildername%3ALollipop%20Phone%20Tester&f=name%3Achrome_public_test_apk&l=50&n=true&s=created_ts%3Adesc&st=1512430740000

And regarding single phones behaving badly, I've got a change on staging that will theoretically help that type of failure:
https://chrome-internal-review.googlesource.com/520328
Will promote to stable when things look good.

As for random test flakes, that's not really in scope here. I'm aiming for infrastructure-related failures in this bug.
,
Dec 13 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/d509fad0504eebe6eee586bb226808a8e0ee3a92 commit d509fad0504eebe6eee586bb226808a8e0ee3a92 Author: Benjamin Pastene <bpastene@chromium.org> Date: Wed Dec 13 20:12:32 2017
,
Dec 13 2017
,
Jan 6 2018
,
Jan 6 2018
Looking at this more, it seems that many of the frequent timeouts on this bot are caused by "Input event injection" timeouts. See this logcat for instance:
https://luci-logdog.appspot.com/v/?s=chromium%2Fandroid%2Fswarming%2Flogcats%2F3ada088ec098d511%2F%2B%2Flogcat_logcat_org.chromium.chrome.browser.ntp.cards.NewTabPageRecyclerViewTest.testDismissStatusCardWithContextMenu_20180104T130057-UTC_042b12ca3089fa54

This is a problem that has plagued our L testers for a while. I'm still unsure what causes it (+bsheedy in case he ever figured it out), but I've noticed that if there's still enough time left before the timeout cutoff in the task, the auto-recover mechanism before the last attempt actually fixes whatever's wrong. See https://chromium-swarm.appspot.com/task?id=3ada088ec098d510 for example, where a bunch of tests failed with input timeouts on the first two attempts and passed on the third after the phone was recovered and rebooted.

It could be that increasing the 30 min task timeout would give the devices enough time to fail-fail-reboot-pass. Right now they run out of time halfway.
,
Jan 8 2018
Unfortunately, I never did find the root cause of this or a good solution. I think at some point, I tracked the input event injection timeouts to a particular line in the Android source code where some expression was never evaluating to true, but couldn't figure out how it got into that state.
,
Jan 9 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/5dabc8d5b5cd48b1f6783142c327dd511f6ff117

commit 5dabc8d5b5cd48b1f6783142c327dd511f6ff117
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Jan 09 18:52:58 2018

android: Recover devices after every attempt when testing devices on L.

See these tasks where dozens of tests finish with FAIL-FAIL-PASS. It seems the final recovery fixes *something* that's causing the tests to fail. No clue what it is, but this will at least cure the problem sooner.
https://chromium-swarm.appspot.com/task?id=3af051d44c12d910
https://chromium-swarm.appspot.com/task?id=3aefb75ae7714a10
https://chromium-swarm.appspot.com/task?id=3aef865376444810

Bug: 787056
Change-Id: I7688b13e33edfa9034532cd6ba47f18c5ca2827a
Reviewed-on: https://chromium-review.googlesource.com/855177
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528060}

[modify] https://crrev.com/5dabc8d5b5cd48b1f6783142c327dd511f6ff117/build/android/pylib/local/device/local_device_test_run.py
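For readers who want the idea behind "recover after every attempt" spelled out, here is a rough sketch. It is not the code from local_device_test_run.py; `run_with_recovery`, `run_once`, `recover`, and `FakeDevice` are hypothetical names made up for illustration, and the "wedged until recovered" device model is an assumption based on the FAIL-FAIL-PASS pattern described above.

```python
from typing import Callable, Dict, List


def run_with_recovery(
    tests: List[str],
    run_once: Callable[[List[str]], Dict[str, bool]],
    recover: Callable[[], None],
    max_attempts: int = 3,
) -> Dict[str, List[str]]:
    """Retry failed tests, recovering the device after every failed attempt.

    Hypothetical sketch: recovery runs after each attempt that leaves
    failures behind, rather than only before the final attempt.
    """
    results: Dict[str, List[str]] = {t: [] for t in tests}
    pending = list(tests)
    for attempt in range(max_attempts):
        if not pending:
            break
        outcome = run_once(pending)  # maps test name -> passed?
        for t in pending:
            results[t].append("PASS" if outcome[t] else "FAIL")
        pending = [t for t in pending if not outcome[t]]
        # Recover only if there is something left to retry and an attempt left.
        if pending and attempt < max_attempts - 1:
            recover()
    return results


class FakeDevice:
    """Stand-in device (assumption): every test fails until it is recovered."""

    def __init__(self) -> None:
        self.wedged = True
        self.recoveries = 0

    def run(self, tests: List[str]) -> Dict[str, bool]:
        return {t: not self.wedged for t in tests}

    def recover(self) -> None:
        self.wedged = False
        self.recoveries += 1
```

With a device modeled this way, each test's result history becomes FAIL-PASS instead of FAIL-FAIL-PASS, which is where the time saving would come from if the real devices behave similarly.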
,
Jan 11 2018
After the change in #14, I'm not seeing much purple on chrome_public_test_apks.

... but maybe that's because it's now failing differently:
https://chromium-swarm.appspot.com/task?id=3afd4b43575e6010
https://chromium-swarm.appspot.com/task?id=3afcbb675bbe5510
https://chromium-swarm.appspot.com/task?id=3afc83098b78f010
https://chromium-swarm.appspot.com/task?id=3afd1cf64b6ef910

All those are hitting "Device or resource busy" errors when trying to create dirs/files on the device. I wonder if we've changed device provisioning or data_deps recently...
,
Jan 12 2018
Keeps happening: https://chromium-swarm.appspot.com/task?id=3b031c9c9c374e10 failed with - mkdir failed for /storage/emulated/legacy/chromium_tests_root/, Device or resource busy
,
Jan 12 2018
https://chromium-swarm.appspot.com/task?id=3b03747382b2d010: mkdir failed for /storage/emulated/legacy/chromium_tests_root/components/, Device or resource busy
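In case it helps with triage, a small sketch for bucketing failure lines by the two infra signatures seen in this bug. The substrings "Input event injection" and "Device or resource busy" come from the comments above; the bucket names and the `classify`/`tally` helpers are hypothetical, and real logs may word these errors differently.

```python
import re
from collections import Counter
from typing import Iterable

# Signatures taken from the failure text quoted in this bug. The bucket
# names on the left are made up for illustration.
SIGNATURES = {
    "input_injection_timeout": re.compile(r"Input event injection", re.IGNORECASE),
    "device_busy": re.compile(r"Device or resource busy"),
}


def classify(line: str) -> str:
    """Return the first matching bucket name, or "other" if none match."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return "other"


def tally(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each bucket."""
    return Counter(classify(line) for line in lines)
```

Running `tally` over the failure lines of a task would show at a glance whether it died from input timeouts, busy-device errors, or something else.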
,
Jan 12 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/464960c84cce8bcc929b451d3a7257c22a153d03 commit 464960c84cce8bcc929b451d3a7257c22a153d03 Author: Benjamin Pastene <bpastene@chromium.org> Date: Fri Jan 12 19:51:30 2018
,
Jan 12 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/49c9c6edf258b2ee52745bca61ecbb637f7a7aa8 commit 49c9c6edf258b2ee52745bca61ecbb637f7a7aa8 Author: Benjamin Pastene <bpastene@chromium.org> Date: Fri Jan 12 20:19:59 2018
,
Jan 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7f962d1f920030fa101325bef67fa88bf5cc6833

commit 7f962d1f920030fa101325bef67fa88bf5cc6833
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Sat Jan 13 01:30:28 2018

android: Take bugreports on intstrumentation test device-setup failures.

Bug: 787056
Change-Id: I72a6963340f56b2483a05167ccdcc078420479f4
Reviewed-on: https://chromium-review.googlesource.com/865434
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#529136}

[modify] https://crrev.com/7f962d1f920030fa101325bef67fa88bf5cc6833/build/android/pylib/local/device/local_device_instrumentation_test_run.py
,
Jan 19 2018
Going to call this a success. Last few builds on the bot are sans purple: https://ci.chromium.org/buildbot/chromium.android/Lollipop%20Phone%20Tester/
Comment 1 by bpastene@chromium.org
, Nov 20 2017