Issue 787056

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug

Blocked on:
issue 794723



Flaky infra/device failures on L Phone tester

Project Member Reported by bpastene@chromium.org, Nov 20 2017

Issue description

https://build.chromium.org/p/chromium.android/builders/Lollipop%20Phone%20Tester

Lots of timeouts, usually for content_browsertests. Might be due to https://chromium-swarm.appspot.com/bot?id=build167-b1--device4 which seems to time out on all its content_browsertest runs.
 
No idea what's wrong with that device. I could repro on it, but it was failing with all sorts of errors. Decided to just reflash it. We'll see if that cures it.
Issue 787389 has been merged into this issue.
Can we simply remove this device from the bots until we are sure the problem is solved? It is causing a lot of false test failures.
I guess the reflash didn't help. I removed the device, so it shouldn't be poisoning the pool.

Comment 5 by pasko@google.com, Nov 28 2017

bpastene: we still observe timeouts on this L tester, suspecting an infra issue
Issue 787366 has been merged into this issue.
The tester still has more failures than successes

I've observed issues of 3 types:
* Timeout errors on random devices
* Multiple failures on a single device: https://chromium-swarm.appspot.com/bot?id=build167-b1--device1&show_all_events=true&sort_stats=total%3Adesc
* Just random flaky tests
  
Thanks for the report. I'm still trying to determine what causes the random timeouts. The timeout is set at 30 minutes while each shard usually only takes ~15 min to complete, so I'm not sure if bumping the timeout would help much:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1512517140000&f=buildername%3ALollipop%20Phone%20Tester&f=name%3Achrome_public_test_apk&l=50&n=true&s=created_ts%3Adesc&st=1512430740000

And regarding single phones behaving badly, I've got a change on staging that should, in theory, help with that type of failure: https://chrome-internal-review.googlesource.com/520328
Will promote to stable when things look good.

As for random test flakes, that's not really in scope here. I'm aiming for infrastructure-related failures in this bug.

Comment 9 by bugdroid1@chromium.org, Dec 13 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/d509fad0504eebe6eee586bb226808a8e0ee3a92

commit d509fad0504eebe6eee586bb226808a8e0ee3a92
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Wed Dec 13 20:12:32 2017

Blockedon: 794723
Cc: jbudorick@chromium.org bpastene@chromium.org
Issue 793442 has been merged into this issue.
Cc: bsheedy@chromium.org
Looking at this more, it seems that a large share of the frequent timeouts on this bot are caused by "Input event injection" timeouts.

See this logcat for instance:
https://luci-logdog.appspot.com/v/?s=chromium%2Fandroid%2Fswarming%2Flogcats%2F3ada088ec098d511%2F%2B%2Flogcat_logcat_org.chromium.chrome.browser.ntp.cards.NewTabPageRecyclerViewTest.testDismissStatusCardWithContextMenu_20180104T130057-UTC_042b12ca3089fa54

This problem has plagued our L testers for a while. I'm still unsure what causes it (+bsheedy in case he ever figured it out), but I've noticed that if enough time remains before the task's timeout cutoff, the auto-recover mechanism that runs before the last attempt actually fixes whatever's wrong.

See https://chromium-swarm.appspot.com/task?id=3ada088ec098d510 for example where a bunch of tests failed with input timeouts on the first two attempts, and passed on the third after the phone was recovered and rebooted. Could be that increasing that 30min task timeout would give the devices enough time to fail-fail-reboot-pass. Right now they run out of time halfway.
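
To make the timing argument above concrete, here is a minimal sketch of the budget check. Only the 30-minute task timeout comes from this thread; the per-attempt durations, the recovery time, and the function name `fits_in_timeout` are made-up values for illustration, not measurements from the bot.

```python
def fits_in_timeout(attempt_minutes, recovery_minutes, timeout_minutes):
    """Return True if every attempt, plus one recovery before the last
    attempt, fits inside the task timeout.

    All inputs are in minutes. A single attempt needs no recovery.
    """
    recovery = recovery_minutes if len(attempt_minutes) > 1 else 0
    return sum(attempt_minutes) + recovery <= timeout_minutes


# Hypothetical fail-fail-reboot-pass sequence: 12 + 10 + 5 (recovery) + 8 = 35.
print(fits_in_timeout([12, 10, 8], recovery_minutes=5, timeout_minutes=30))  # False
print(fits_in_timeout([12, 10, 8], recovery_minutes=5, timeout_minutes=45))  # True
```

With these assumed numbers the sequence overruns a 30-minute cutoff but fits in a longer one, which is the scenario described above.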
Unfortunately, I never did find the root cause of this or a good solution. I think at some point, I tracked the input event injection timeouts to a particular line in the Android source code where some expression was never evaluating to true, but couldn't figure out how it got into that state.

Comment 14 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5dabc8d5b5cd48b1f6783142c327dd511f6ff117

commit 5dabc8d5b5cd48b1f6783142c327dd511f6ff117
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Jan 09 18:52:58 2018

android: Recover devices after every attempt when testing devices on L.

See these tasks where dozens of tests finish with FAIL-FAIL-PASS. It
seems the final recovery fixes *something* that's causing the tests to
fail. No clue what it is, but this will at least cure the problem
sooner.
https://chromium-swarm.appspot.com/task?id=3af051d44c12d910
https://chromium-swarm.appspot.com/task?id=3aefb75ae7714a10
https://chromium-swarm.appspot.com/task?id=3aef865376444810

Bug:  787056 
Change-Id: I7688b13e33edfa9034532cd6ba47f18c5ca2827a
Reviewed-on: https://chromium-review.googlesource.com/855177
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528060}
[modify] https://crrev.com/5dabc8d5b5cd48b1f6783142c327dd511f6ff117/build/android/pylib/local/device/local_device_test_run.py
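
The idea in the commit message above can be sketched roughly as follows. This is not the code from local_device_test_run.py; `run_with_retries`, `FlakyDevice`, and the `recover_every_attempt` flag are hypothetical names used only to contrast "recover before the final attempt" with "recover after every failed attempt".

```python
def run_with_retries(run_attempt, recover, max_tries, recover_every_attempt):
    """Run up to max_tries attempts.

    Returns the 1-based number of the attempt that passed, or None if
    every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        if run_attempt():
            return attempt
        if attempt == max_tries:
            break  # No attempts left, so recovering would not help.
        before_final = attempt == max_tries - 1
        if recover_every_attempt or before_final:
            recover()
    return None


class FlakyDevice:
    """Toy device that fails every attempt until it has been recovered."""

    def __init__(self):
        self.healthy = False

    def run(self):
        return self.healthy

    def recover(self):
        self.healthy = True


old = FlakyDevice()
print(run_with_retries(old.run, old.recover, 3, recover_every_attempt=False))  # 3

new = FlakyDevice()
print(run_with_retries(new.run, new.recover, 3, recover_every_attempt=True))  # 2
```

In this toy model the old behavior yields FAIL-FAIL-PASS, while recovering after every attempt yields FAIL-PASS, so less of the task's time budget is spent on attempts that cannot succeed.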

After the change in #14, I'm not seeing much purple on chrome_public_test_apks.

... but maybe that's because it's now failing differently:
https://chromium-swarm.appspot.com/task?id=3afd4b43575e6010
https://chromium-swarm.appspot.com/task?id=3afcbb675bbe5510
https://chromium-swarm.appspot.com/task?id=3afc83098b78f010
https://chromium-swarm.appspot.com/task?id=3afd1cf64b6ef910

All those are hitting "Device or resource busy" errors when trying to create dirs/files on the device. I wonder if we've changed device provisioning or data_deps recently...
Keeps happening.

https://chromium-swarm.appspot.com/task?id=3b031c9c9c374e10 failed with:
  mkdir failed for /storage/emulated/legacy/chromium_tests_root/, Device or resource busy

https://chromium-swarm.appspot.com/task?id=3b03747382b2d010 failed with:
  mkdir failed for /storage/emulated/legacy/chromium_tests_root/components/, Device or resource busy
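
For triage, one way to see which on-device paths hit this error is to count them across task logs. This is a rough sketch that assumes log lines look like the two messages quoted above; `busy_paths` and the regex are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines shaped like the two "mkdir failed" messages quoted above.
MKDIR_BUSY = re.compile(
    r"mkdir failed for (?P<path>\S+), Device or resource busy")


def busy_paths(log_lines):
    """Count how often each path appears in a 'mkdir ... busy' error."""
    counts = Counter()
    for line in log_lines:
        match = MKDIR_BUSY.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


sample = [
    "mkdir failed for /storage/emulated/legacy/chromium_tests_root/, "
    "Device or resource busy",
    "mkdir failed for /storage/emulated/legacy/chromium_tests_root/components/, "
    "Device or resource busy",
    "some unrelated log line",
]
for path, count in busy_paths(sample).items():
    print(count, path)
```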

Comment 18 by bugdroid1@chromium.org, Jan 12 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/464960c84cce8bcc929b451d3a7257c22a153d03

commit 464960c84cce8bcc929b451d3a7257c22a153d03
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Fri Jan 12 19:51:30 2018


Comment 19 by bugdroid1@chromium.org, Jan 12 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/49c9c6edf258b2ee52745bca61ecbb637f7a7aa8

commit 49c9c6edf258b2ee52745bca61ecbb637f7a7aa8
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Fri Jan 12 20:19:59 2018


Comment 20 by bugdroid1@chromium.org, Jan 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f962d1f920030fa101325bef67fa88bf5cc6833

commit 7f962d1f920030fa101325bef67fa88bf5cc6833
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Sat Jan 13 01:30:28 2018

android: Take bugreports on instrumentation test device-setup failures.

Bug:  787056 
Change-Id: I72a6963340f56b2483a05167ccdcc078420479f4
Reviewed-on: https://chromium-review.googlesource.com/865434
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#529136}
[modify] https://crrev.com/7f962d1f920030fa101325bef67fa88bf5cc6833/build/android/pylib/local/device/local_device_instrumentation_test_run.py
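
The general pattern behind a change like this can be sketched as below: capture diagnostics when a setup step fails, then let the failure propagate. This is an assumption about the shape of the approach, not the code in local_device_instrumentation_test_run.py; `run_setup_steps` and the `take_bugreport` callback are hypothetical names.

```python
def run_setup_steps(steps, take_bugreport):
    """Run each device-setup step in order.

    If a step raises, call take_bugreport() to capture device state and
    then re-raise so the failure is still reported to the caller.
    """
    for step in steps:
        try:
            step()
        except Exception:
            take_bugreport()
            raise


def failing_step():
    # Stand-in for a setup step such as creating a directory on the device.
    raise OSError("Device or resource busy")


reports = []
try:
    run_setup_steps([lambda: None, failing_step],
                    take_bugreport=lambda: reports.append("bugreport"))
except OSError as error:
    print("setup failed:", error)
print("bugreports taken:", len(reports))
```

Re-raising keeps the test run's failure status unchanged; the only new behavior is that a bugreport exists for later debugging.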

Status: Fixed (was: Assigned)
Going to call this a success. Last few builds on the bot are sans purple:
https://ci.chromium.org/buildbot/chromium.android/Lollipop%20Phone%20Tester/
