
Issue 634052 link

Starred by 9 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocking:
issue 650674




Device flakiness on chromium.perf: Android Galaxy S5

Project Member Reported by simonhatch@chromium.org, Aug 3 2016

Issue description

Link to buildbot status page:
https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29

32089721b6a351d1: missing [logdog]
 
Labels: Type-Bug

Comment 2 by vhang@chromium.org, Aug 3 2016

Owner: vhang@chromium.org
Status: Assigned (was: Untriaged)
Can this wait until next week? bpastene has a script that files a bug listing all the offline phones every Monday, and Hwops handles it. If it can wait, this bot will get fixed sometime next week.
Cc: sullivan@chromium.org
(Similarly to  issue 634054 )

We're seeing lots of device issues (purple) on Android Galaxy S5:

Android Galaxy S5 Perf (1): https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/3434
Android Galaxy S5 Perf (2): https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3004
Android Galaxy S5 Perf (3): https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%283%29/builds/2842


Could someone look at them?

#2: +cc sullivan. No, I don't think this can wait. We need the bots to run benchmarks constantly to make sure that Chrome doesn't regress performance. Could the script run more often? Every day? Once an hour?
Issue 634027 has been merged into this issue.
Ping. Android Galaxy S5 Perf (1) still has some devices offline: https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/3446
Owner: pschmidt@chromium.org
Re: Android Galaxy S5 Perf (1)  see https://bugs.chromium.org/p/chromium/issues/detail?id=634027

I thought one of the major benefits of having multiple devices was to have redundancy?

I'll have a look at Android Galaxy S5 Perf (2) later on today.
The Android perf tests really have no redundancy for when a device goes offline. When a device fails, none of the tests allocated to that device will run. This is because we need the same tests to run on the same device from run to run. Different devices yield different values for the same test, so we cannot do between-run comparisons if we are running on different devices.
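The device-affinity constraint described above can be illustrated with a minimal sketch. It assumes a hypothetical scheme in which each benchmark is pinned to a device by a stable hash of its name; the sharding logic the perf bots actually use is not shown in this thread, and `assign_device` and `plan_run` are illustrative names.

```python
import hashlib


def assign_device(benchmark, devices):
    """Pin a benchmark to one device using a stable hash of its name.

    Hypothetical scheme: the same benchmark always maps to the same
    device as long as the device list is unchanged.
    """
    digest = hashlib.sha256(benchmark.encode("utf-8")).hexdigest()
    return devices[int(digest, 16) % len(devices)]


def plan_run(benchmarks, devices, offline):
    """Return (runnable, skipped) for one run.

    A benchmark whose pinned device is offline is skipped rather than
    moved, because results from a different device are not comparable.
    """
    runnable, skipped = {}, []
    for name in benchmarks:
        device = assign_device(name, devices)
        if device in offline:
            skipped.append(name)
        else:
            runnable[name] = device
    return runnable, skipped
```

With this kind of pinning, one offline device silently removes every benchmark assigned to it from the run, which is why a single missing phone matters even when several others are healthy.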
Understood.  

What I notice with the Galaxys is that they tend to "come and go" on their own. In Android Galaxy S5 Perf (1) I unplugged one live device and one that was missing came back. Reconnect that one and the missing device goes missing again. So there is extreme flakiness. Trying to narrow down the culprit(s) is a cat-and-mouse game.

Side note. It almost appears that trying to reset a particular device in the device_recovery step does more harm than good, stability-wise.

It's definitely the most fragile platform you have.



Ping on this--it looks like these devices are still offline:

https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29
Device 32085d1787be514b
Device 32089721b6a351d1
Cc: pschmidt@chromium.org
 Issue 632750  has been merged into this issue.
Summary: Device flakiness on chromium.perf: Android Galaxy S5 (was: Device offline on chromium.perf: Android Galaxy S5 Perf (2))
Changing title of bug since we are all over the map.

Side note. The device_recovery step is what seems to be hosing these S5's.

Re: build21-b1 see https://bugs.chromium.org/p/chromium/issues/detail?id=634027 for context.

The 5 devices have been stable. Yesterday I flashed the two replacements and let them charge overnight (these don't support bc). New devices are 32082067745c515f & 3208df23b0c251e1. So the full complement is now:

Checking 3208851faca351f3...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 3208e0600bb251f3...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 3208cf5e05b2517f...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 320861234c117165...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 3208584f952c61ef...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 32082067745c515f...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
Checking 3208df23b0c251e1...
samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys
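A check like the one above could be scripted roughly as follows. This is a sketch under assumptions: it presumes the printed string is the `ro.build.fingerprint` system property read over adb, and the helper names are hypothetical. The script that actually produced the "Checking ..." output is not shown in this bug.

```python
import subprocess

# Fingerprint reported for every device in the listing above.
EXPECTED = "samsung/k3gxx/k3g:5.0/LRX21T/G900HXXE1BOH4:eng/test-keys"


def fingerprint_command(serial):
    """Build the adb invocation that reads one device's build fingerprint."""
    return ["adb", "-s", serial, "shell", "getprop", "ro.build.fingerprint"]


def check_devices(serials, run=subprocess.check_output):
    """Return {serial: fingerprint}; `run` is injectable for testing."""
    results = {}
    for serial in serials:
        output = run(fingerprint_command(serial))
        results[serial] = output.decode("utf-8").strip()
    return results


def mismatched(results, expected=EXPECTED):
    """Serials whose fingerprint differs from the expected build."""
    return sorted(s for s, fp in results.items() if fp != expected)
```

Calling `mismatched(check_devices(serials))` on the host would list any phone that was not flashed to the same build as the rest.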

build22-b1:

device_recovery at play here:  I'm going to switch out the hub to see if the situation improves.

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3060   all  good

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3061  device_recovery blacklists one for "USB failure"

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3062  all good

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3063  device_recovery blacklists 4 devices for offline/missing/offline/usb failure

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3064  all good.

I'll go back to build23-b1 after I'm done with the hub swap on build22-b1.
build23-b1:  Same device_recovery induced flakiness here.  Devices sometimes flagged as missing/offline/USB error.  Replacing the hub/cables on this slave.
Status: Started (was: Assigned)
build23-b1: hub and usb cables replaced. Effective starting with https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%283%29/builds/2891
Revisiting build21-b1:  

Replaced what look like flaky devices 320861234c117165 & 3208cf5e05b2517f with 32085a73842c615b and 3208e0a226fa51b7. This will be effective starting with https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/3499
Cc: jbudorick@chromium.org stip@chromium.org
Looking at build22-b1:  +cc jbudorick, stip for some input.

Here the device_recovery step seems to blacklist random S5's based on USB failures. It does report offline devices correctly.  I've already swapped out the hub and cables.  This step I believe is providing a false negative in a lot of cases.

A couple of examples:
https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3075/steps/device_recovery/logs/stdio

32081d5f765c510d,3208531995be5145,3208dd33a9c25169,32085d1787be514b blacklisted due to "USB failure"

and the very next build  https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3076/steps/device_recovery/logs/stdio

32081d5f765c510d blacklisted due to "USB failure"

The next build it's blacklisting 3208dd33a9c25169 for the same reason. https://build.chromium.org/p/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3077/steps/device_recovery/logs/stdio
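One way to tell a genuinely bad device from recovery-induced churn is to tally blacklist events per serial across consecutive builds. The sketch below uses only the three builds quoted above; the function names are hypothetical, and the data is entered by hand rather than parsed from the stdio logs.

```python
from collections import Counter

# Blacklist events quoted above: build number -> serials blacklisted
# by device_recovery for "USB failure".
BLACKLISTED = {
    3075: ["32081d5f765c510d", "3208531995be5145",
           "3208dd33a9c25169", "32085d1787be514b"],
    3076: ["32081d5f765c510d"],
    3077: ["3208dd33a9c25169"],
}


def blacklist_counts(events):
    """Count in how many builds each serial was blacklisted."""
    counts = Counter()
    for serials in events.values():
        counts.update(set(serials))
    return counts


def intermittent(events):
    """Serials blacklisted in some builds but not in every build."""
    total = len(events)
    return sorted(s for s, n in blacklist_counts(events).items() if n < total)
```

In this sample no device is blacklisted in all three builds, which is consistent with the suspicion that the step flags devices somewhat at random rather than identifying one consistently failing phone.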




Issue 637277 has been merged into this issue.
On build22-b1, 32089721b6a351d1 seems to be a continuous bad apple. Just replaced it with 320851777626611b. This will be effective starting with https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3114
Cc: bpastene@chromium.org
Components: Infra>Client>Android Infra>Client>Perf
bpastene: can you take a look at these bots as well? I got confused by the comments in bug 638679, these are the ones that have been down for 2 weeks (other ones are critical for BattOr testing).
Cc: benhenry@chromium.org
I talked to Peter in the hallway yesterday. He mentioned that he's going to talk to stip/ben about this.
Cc: aiolos@chromium.org
 Issue 639885  has been merged into this issue.
 Issue 639887  has been merged into this issue.
 Issue 638743  has been merged into this issue.
 Issue 638739  has been merged into this issue.
On Android Galaxy S5 Perf (1) I noticed that 32082067745c515f is flakey.  Just replaced it with 3208e623c90051f7.  Effective starting with https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/3640
I'm seeing two more phones down on "Android Galaxy S5 Perf (3)"; 3208cd5005b25183 and 32089f2db2a351c5.

Is it useful to report these in here? Should I file a separate bug?

(I'm reporting these because I'm on the perf bot rotation today)
Might as well file it here. These devices come and go depending on how the device_recovery step treats them.
An update. https://codereview.chromium.org/2295933002 was landed that disabled usb resets in the device_recovery step.   Yesterday I cleaned up the actual stale devices on the slaves so now they report all devices available.   Let's see what happens over the weekend.

Effective starting with:

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/3761

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29/builds/3387

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%283%29/builds/3160
Cc: charliea@google.com
It looks like this is still continuing: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20(1)

What the heck are we supposed to do here?

Samsung Galaxy S5 Perf (1) hasn't had a green run in the last 200 runs (https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29?numbuilds=400). Nor has Samsung Galaxy S5 Perf (2) (https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%282%29?numbuilds=200) or Samsung Galaxy S5 Perf (3) (https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%283%29?numbuilds=200).

Even attempting to keep these up is a pretty big burden on the perfbot health sheriffs and infra labs. sullivan@ and nednguyen@, any idea what we should do here?


John, did we turn USB resetting back on for all Android bots? If so, we could try making it so that only Samsung devices skip the USB reset.

Comment 33 by stip@chromium.org, Sep 19 2016

From https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.perf%2FAndroid_Galaxy_S5_Perf__1_%2F4061%2F%2B%2Frecipes%2Fsteps%2Fv8.browsing_mobile_ignition%2F0%2Fstdout:
CRITICAL:root:STDERR: [0917/135846:ERROR:host_forwarder_main.cc(392)] ERROR: Connection to device failed.ERROR: Existing controllers:ERROR:   42931:43241

This appears to be device 3208e0600bb251f3, which is not in our weekly ticket (https://gutsv3.corp.google.com/#ticket/23252763). bpastene@, can you investigate why we're not flagging this?
Our usb story is not consistent.  Let me fix that up.

On build21-b1 (Android Galaxy S5 Perf (1))  the devices are connected to a usb 2.0 hub and onto a usb 2.0 host controller.   These devices appear to be more stable?

On build22-b1 (Android Galaxy S5 Perf (2)) and build23-b1 (Android Galaxy S5 Perf (3)) the devices are connected to a usb 3.0 hub and onto a usb 2.0 host controller.   The devices here are much more flaky.

As a test I'm going to switch build23-b1 to a host that supports usb 3.0 
#32: I turned it back on for chromium.perf in https://codereview.chromium.org/2318203002
Re #33: Because device_status on that build saw all devices as healthy: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Android%20Galaxy%20S5%20Perf%20%281%29/builds/4061/steps/device_status/logs/json.output

Look at that device's steps on that build. It runs fine, but the forwarder always crashes after pulling /proc/net/tcp. Whoever owns the forwarder should take a look.
Owner: jbudorick@chromium.org
Status: Assigned (was: Started)
While this does look like a forwarder issue (so I'm self-assigning), we only pull /proc/net/tcp as part of failure diagnosis. It's not the cause of the failure.
Given the recent runs on this bot, it looks like this might be a device issue and a forwarder issue. Will stop by the lab tomorrow.
No clear sign of device malfeasance this morning in the lab.
Going to have to try to catch this in the middle of a run in which it's failing to forward to grab the host forwarder daemon log. Unfortunately, 251f3 is blacklisted in the current run.
Project Member

Comment 41 by bugdroid1@chromium.org, Sep 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/883442d271d1257505fb99d0802b5a1a0c201d51

commit 883442d271d1257505fb99d0802b5a1a0c201d51
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Sep 23 01:28:56 2016

Roll src/third_party/catapult/ b803018ac..7bd10eda4 (9 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/b803018ac776..7bd10eda47f1

$ git log b803018ac..7bd10eda4 --date=short --no-merges --format='%ad %ae %s'
2016-09-22 nednguyen Make use_live_traffic in FakeNetworkController default to False
2016-09-22 jbudorick [Android] Attempt to grab the forwarder daemon logs on map failure.
2016-09-22 aiolos Remove warning when a ref build is set as monitored.
2016-09-22 charliea [trace model] Add .range accessor for Event
2016-09-22 nednguyen [Telemetry] Enable typ's discovery flags for telemetry's unittest_runner framework
2016-09-22 sullivan Add ability to query for test patterns of length 8.
2016-09-22 bccheng Explicitly initialize the network controller
2016-09-22 nednguyen Add logging to _FileLock to debug race condition when multiple processes download a same file
2016-09-22 nednguyen [Telemetry] Start ts_proxy_server with host=None when --use-live-site flag is enabled

BUG= 634052 , 643649 ,647340,643320

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2365803002
Cr-Commit-Position: refs/heads/master@{#420534}

[modify] https://crrev.com/883442d271d1257505fb99d0802b5a1a0c201d51/DEPS

Project Member

Comment 43 by bugdroid1@chromium.org, Sep 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e70970b32c2061fc14fa37a40276f484d772287

commit 2e70970b32c2061fc14fa37a40276f484d772287
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sat Sep 24 09:29:34 2016

Roll src/third_party/catapult/ a8deb272b..efbf303a5 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/a8deb272b550..efbf303a5360

$ git log a8deb272b..efbf303a5 --date=short --no-merges --format='%ad %ae %s'
2016-09-23 jbudorick [devil] update the forwarder binaries.

BUG= 634052 

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2368813002
Cr-Commit-Position: refs/heads/master@{#420838}

[modify] https://crrev.com/2e70970b32c2061fc14fa37a40276f484d772287/DEPS

Blocking: 650674
Project Member

Comment 45 by bugdroid1@chromium.org, Sep 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3d77e97fb46b4f0a9f9255c30b617abf6721ad33

commit 3d77e97fb46b4f0a9f9255c30b617abf6721ad33
Author: jbudorick <jbudorick@chromium.org>
Date: Fri Sep 30 14:59:17 2016

[Android] Add --unmap-all to forwarder2.

In some scenarios (e.g., single-device restart), we want to unmap all
ports forwarded from a given device up to the host and clear the existing
cached adb port for that device. We want to be able to do this even if
the calling process doesn't know all of those ports. This change adds
the --unmap-all command to forwarder2 to support such use cases.

BUG= 634052 , 650674 

Review-Url: https://codereview.chromium.org/2381063004
Cr-Commit-Position: refs/heads/master@{#422113}

[modify] https://crrev.com/3d77e97fb46b4f0a9f9255c30b617abf6721ad33/tools/android/forwarder2/host_forwarder_main.cc
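The use case in the commit message above can be pictured with a toy model. This is not forwarder2's implementation: `ForwarderState` and its fields are hypothetical and only illustrate why an "unmap everything for this device" operation helps when the caller does not know which ports were forwarded.

```python
class ForwarderState:
    """Toy stand-in for per-device forwarding bookkeeping (hypothetical)."""

    def __init__(self):
        self.port_maps = {}        # serial -> {device_port: host_port}
        self.cached_adb_port = {}  # serial -> cached adb port

    def map(self, serial, device_port, host_port):
        """Record one forwarded port for a device."""
        self.port_maps.setdefault(serial, {})[device_port] = host_port

    def unmap(self, serial, device_port):
        """Remove one mapping; the caller must already know the port."""
        self.port_maps.get(serial, {}).pop(device_port, None)

    def unmap_all(self, serial):
        """Drop every mapping and the cached adb port for one device.

        Returns the device ports that were removed; other devices are
        left untouched.
        """
        removed = self.port_maps.pop(serial, {})
        self.cached_adb_port.pop(serial, None)
        return sorted(removed)
```

After a single-device restart, a caller in this model would call `unmap_all(serial)` once instead of tracking and unmapping each port individually.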

Project Member

Comment 46 by bugdroid1@chromium.org, Oct 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dccd754c3b5cc5be5c809ffd6a9b742053f25c76

commit dccd754c3b5cc5be5c809ffd6a9b742053f25c76
Author: jbudorick <jbudorick@chromium.org>
Date: Sat Oct 01 01:51:20 2016

[Android] Run shell commands from the forwarder without passing fds.

The forwarder daemon was running commands with system(). This would give
the newly forked process copies of the same file handles held by the
daemon, notably including the unix domain socket.

If the adb server wasn't already running and the daemon called an adb
command, the adb server would be forked from the adb client process
with those same file handles -- including the unix domain socket. This
would interfere both with shutting down the host forwarder daemon
(as we'd see the unix domain socket still held by the adb server) and
with subsequent attempts to bring it up (same reason).

BUG= 634052 , 650674 

Review-Url: https://codereview.chromium.org/2374183008
Cr-Commit-Position: refs/heads/master@{#422263}

[modify] https://crrev.com/dccd754c3b5cc5be5c809ffd6a9b742053f25c76/tools/android/forwarder2/host_forwarder_main.cc
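The file-descriptor leak described in the commit message above can be reproduced in miniature. The forwarder itself is C++, so this Python sketch is an analogy rather than the actual fix: it assumes a POSIX host, uses a socket pair in place of the daemon's unix domain socket, and marks the descriptor inheritable to mimic what a child started by system() would receive.

```python
import os
import socket
import subprocess
import sys

# The child exits 0 if the given fd number is open in it, non-zero otherwise.
CHILD = "import os, sys; os.fstat(int(sys.argv[1]))"


def child_sees_fd(fd, close_fds):
    """Spawn a child process and report whether it inherited `fd`."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd)],
        close_fds=close_fds,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def demo():
    """Return (leaked, contained) for one socket held by the parent."""
    left, right = socket.socketpair()
    try:
        # Python creates fds non-inheritable by default; opt in here to
        # imitate a plain C descriptor without close-on-exec set.
        os.set_inheritable(left.fileno(), True)
        leaked = child_sees_fd(left.fileno(), close_fds=False)
        contained = child_sees_fd(left.fileno(), close_fds=True)
    finally:
        left.close()
        right.close()
    return leaked, contained
```

On a POSIX system `demo()` is expected to return `(True, False)`: the child holds the parent's socket unless descriptors are closed before exec. A long-lived child such as the adb server would keep that socket open after the daemon exits, which matches the shutdown and restart problems the commit describes.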

Project Member

Comment 47 by bugdroid1@chromium.org, Oct 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/84526ade9b6d246a8834309d0519d2255c0db91d

commit 84526ade9b6d246a8834309d0519d2255c0db91d
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sat Oct 01 08:06:33 2016

Roll src/third_party/catapult/ f00b66029..507bed462 (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f00b66029517..507bed4626dd

$ git log f00b66029..507bed462 --date=short --no-merges --format='%ad %ae %s'
2016-09-30 jbudorick [telemetry] Update {device,host}_forwarder binaries.
2016-09-30 jbudorick [devil] Use --unmap-all in Forwarder.UnmapAllDevicePorts.

BUG= 634052 , 650674 , 634052 , 650674 

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2378773016
Cr-Commit-Position: refs/heads/master@{#422308}

[modify] https://crrev.com/84526ade9b6d246a8834309d0519d2255c0db91d/DEPS

 Issue 652251  has been merged into this issue.
Project Member

Comment 49 by bugdroid1@chromium.org, Oct 3 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome-golo/chrome-golo.git/+/b303c1e5595e356c281e0d066934a50225fe7dc0

commit b303c1e5595e356c281e0d066934a50225fe7dc0
Author: pschmidt <pschmidt@google.com>
Date: Mon Oct 03 19:22:19 2016

Cc: benjhayden@chromium.org u...@chromium.org eakuefner@chromium.org
 Issue 638404  has been merged into this issue.
Re #51: I've been fiddling with that bot; don't look at it for an indication of how Galaxy's are performing in the lab.
#51: beware that this issue is for the S5 on chromium.perf, not chromium.perf.fyi.
Project Member

Comment 55 by bugdroid1@chromium.org, Oct 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/284d0f102f55bf7c629a297d89c04b8fde110020

commit 284d0f102f55bf7c629a297d89c04b8fde110020
Author: martiniss <martiniss@chromium.org>
Date: Mon Oct 24 21:03:51 2016

Disable galaxy and new mac perf bots for SOM

These should be re-enabled once they're sheriffable

BUG= 634052 , 639530

Review-Url: https://codereview.chromium.org/2286973002

[modify] https://crrev.com/284d0f102f55bf7c629a297d89c04b8fde110020/scripts/slave/gatekeeper.json

Comment 57 by zh...@chromium.org, Nov 15 2016

Status: Fixed (was: Assigned)
