New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 768485 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug



Sign in to add a comment

telemetry_perf_unittests continuously timing out on specific android devices

Project Member Reported by bpastene@chromium.org, Sep 25 2017

Issue description

Ned would know more about this. I don't know much about unittests.
It's possible that some Telemetry test is being stuck here, lemme check.
Hmhh, I ran ./bin/list_telemetry_unittests on the 5 logs, and nothing stands out. Different tests was being stuck for each of those.

Ex:
********** All passed telemetry unittests:

benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:media:facebook_photos 1021.5496
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:news:wikipedia 16.5032
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:games:miniclip 16.4803
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:tools:stackoverflow 16.2789
benchmarks.benchmark_unittest.BenchmarkOptionsTest.memory.top_10_mobile 0.0132
benchmarks.blink_perf_unittest.ComputeTraceEventsMetricsForBlinkPerfTest.testAsyncTraceEventMetricsOverlapping 0.0006

******** All skipped telemetry unittests:

benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.power.typical_10_mobile
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.browse:news:nytimes
core.benchmark_finders_unittest.TestGetBenchmarkNamesForFile.testListSimpleBenchmarksDefinedInOneFileComplex

******* Unfinished telemetry unittests:

benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:facebook
log4.txt
122 KB View Download
Cc: dpranke@chromium.org
Dirk: a big weakness of typ is that when we send SIGKILL to the it, it doesn't show the stack trace of where the program is currently at (even for the case we run typ with --jobs=1, I believe). This would help a lot with figuring out what test got stuck.
Owner: bpastene@chromium.org
Status: Assigned (was: Untriaged)
Could be something wrong with the phone. I'll trigger a reflash and hopefully that'll heal whatever's wrong with it.

BTW, it also might be helpful to prepend a timestamp to each line in the logs like:
https://chromium-swarm.appspot.com/task?id=38d3eb8d68c95810
Or even the time-since-test-start like:
https://chromium-swarm.appspot.com/task?id=38d3ebb021682a10

That way we can tell what's taking so long.
The reflash didn't help. Turns out it's not just that device having issues, but the whole host:
https://chromium-swarm.appspot.com/bot?id=build317-m1--device6
https://chromium-swarm.appspot.com/bot?id=build317-m1--device5
Even if it doesn't timeout, it still takes significantly longer to complete than the average across other bots.

telemetry_perf_unittests does network I/O during the test, right? I wonder if something's wrong in the network stack on the host, slowing down any network calls inside the containers. I'll poke around.
It seems it's always benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse tests that are the reason for the slow down. Their runtimes are always much longer than the average:

benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:tumblr_infinite_scroll passed 880.3343s:
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:news:reddit passed 474.3115s:
benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:tech:discourse_infinite_scroll passed 457.6424s:

On normal bots they usually take 60-80 seconds. Ned, is there anything particular about these benchmarks? Any reason you can think why they'd be so slow on this host's devices?
It's also happening to this device: https://chromium-swarm.appspot.com/bot?id=build321-m1--device1

What the heck is going on?
Given the current swarming outage ( crbug.com/770384 ), I'd pretty much ignore any results from this afternoon.
Cc: perezju@chromium.org
+Juan: let's see if this still happens. If it does, we might need to enable trace uploading to see where time is spent for these tests.
Cc: bpastene@chromium.org nedngu...@google.com
 Issue 771148  has been merged into this issue.
Labels: -Pri-2 Pri-1
Summary: telemetry_perf_unittests continuously timing out on specific android devices (was: telemetry_perf_unittests continuously timing out on a single android bot (build317-m1--device7))
And here's some more:
https://chromium-swarm.appspot.com/bot?id=build381-m4--device4
https://chromium-swarm.appspot.com/bot?id=build382-m4--device6
https://chromium-swarm.appspot.com/bot?id=build380-m4--device7

This is really proving disruptive and seems to be continuously getting worse.
Cc: sullivan@chromium.org
The way to debug this is enable uploading traces to cloud storage, so we can grab the trace from log & look at why the tests are taking so long.

Before I do it, just want to make sure that we don't blow our cloud storage budget.

A telemetry_perf_unittests has about 50 sh stories under tests, each produce a trace file of size ~10Mb. So that mean 500 Mb of data going to chrome-telemetry-output/ bucket for each CQ run.

Does anyone know whether this is too excessive? 
500 Mb / build is a fair amount, but not horrible; it should be fine.
After taking one of the devices offline, running the test manually, and attaching a debugger, I think I've found what the tests are getting hung up on: (logging seems to be a black hole in these tests)
https://codesearch.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=086a927c4c2fa0a96e54ccc17879b82d673e6ffd&l=201

Could be that a lot of our devices are running warmer so it takes a while to cool down to that 35 degree threshold. +rnephew@ fyi. If needed, is there any reason not to bump that temperature limit to 40 degrees? That's what a swarming bot quarantines on. I'll keep poking.
Cc: rnep...@chromium.org
actual +rnephew@
Interesting. Nice find!

If this doesn't fix things, Ned, we need to figure out a plan to hunt this down.
Nice find.

#20: my plan to hunt this down is to enable trace uploading at mentioned in #16. We capture how a test spend it time in the trace.

Example: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/browse_chrome_omnibox_2017-10-03_11-54-37_33716.html (See Process 25832)
Cc: -nednguyen@chromium.org
I can confirm that removing the wait-for-battery-temp line makes one of the memory benchmarks pass in only 130.5397s. With it in, it takes much longer. (I launched it ~10 minutes ago and it hasn't finished yet.)

Looking at the monitoring data, there's been a noticeable increase in battery temperatures for N5Xs in Atlanta:
http://shortn/_EetrXhpPti

About a week ago, the p999 temp visibly jumped from about 34 Celsius to 37. That correlates pretty well with when these tests started to timeout. It may be that the subset of devices the tests keep failing on are responsible for that jump. Indeed, looking at a few of the devices mentioned here seems to confirm it:
http://shortn/_Rs632gDL3v
http://shortn/_trnxZRWeSu

I'll file a ticket to get the tray fans for some of these devices looked at. Chances are they shut down and need repair/replacement.
Yep, logs are full of messages about waiting for temp.

 Current battery temperature: 357
  condition 'cool_device' not met
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?'
  Current battery temperature: 357
  condition 'cool_device' not met
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?'
  Current battery temperature: 355
  condition 'cool_device' not met
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?'
  Current battery temperature: 352
  condition 'cool_device' not met
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?'
  Current battery temperature: 350
  condition 'cool_device' met

The risk in turning it up is that it makes the device more likely to enter thermal throttling. But since this is for the unit tests that shouldn't be an issue. There is also the risk of battery swelling. On the perf waterfall thats more of a risk since we run those devices pretty hard. I'm guessing the risk of bumping it up to 40 is minimal. But it might be a good idea to check on them from time to time for awhile.
I think this points to a need for better monitoring of temperature on the devices. We should be able to track how long devices are spending cooling down, and then be able to distinguish "normal" behavior from devices possibly needing repair/replacement.
I noted on the trace in #21: Is that wait happening *after* launching the browser, in the middle of the story?

Surely we would like to wait for the device to cool down _before_ running the story? Otherwise this is another source of noise for our measurements.
#26: I believe it happen before launching the browser (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=672cabd12c680e8c75d0614c59d97479cbdfc47a&l=201). So you're right that we would be missing this kind of event in the trace. I think later we may need to tweak python tracing so that we capture things that happen before the story run.
Got it. I was confused by the long ~ 10 seconds gap between "TimelineBasedPageTest.RunNavigateSteps" and the first "ActionRunner.Wait", but that seems to be the legitimate code that is there to interact with the omnibox: https://cs.chromium.org/chromium/src/tools/perf/page_sets/system_health/chrome_stories.py?rcl=ef62bce2f7c336f4a0e8c173c463c2c4695d4841&l=42
time_wait.png
38.4 KB View Download
Re #25: I'm not sure what you mean. We already have monitoring data on device temps and quarantine and alert off it. Our quarantine threshold is 40 C, while the tests here wait to be 35 C.

Meanwhile, the hardware repair is taking its time: t/28980895. I'll drain the entire tray for now so at least they're not taking load.
Project Member

Comment 30 by bugdroid1@chromium.org, Oct 6 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/b95a5319aa54fd47e6461923b10553739da76fdd

commit b95a5319aa54fd47e6461923b10553739da76fdd
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Fri Oct 06 22:04:08 2017

Project Member

Comment 31 by bugdroid1@chromium.org, Oct 10 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/991015dd41decb117fca949e05a45eda09a27cb2

commit 991015dd41decb117fca949e05a45eda09a27cb2
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Oct 10 00:13:33 2017

Project Member

Comment 32 by bugdroid1@chromium.org, Oct 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/6292824c3c8eadfe72e69fd8204aab3c6ea8637b

commit 6292824c3c8eadfe72e69fd8204aab3c6ea8637b
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Wed Oct 11 21:29:46 2017

swarming: Ignore the how_chill dimension.

TBR=maruel@chromium.org
Bug:  768485 
Change-Id: If4ce3ddfc3803ce86df83fe4855585070e63b746
Reviewed-on: https://chromium-review.googlesource.com/713437
Reviewed-by: Benjamin Pastene <bpastene@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>

[modify] https://crrev.com/6292824c3c8eadfe72e69fd8204aab3c6ea8637b/appengine/swarming/swarming_bot/bot_code/bot_main.py
[modify] https://crrev.com/6292824c3c8eadfe72e69fd8204aab3c6ea8637b/appengine/swarming/ts_mon_metrics.py

Project Member

Comment 33 by bugdroid1@chromium.org, Oct 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/1373ef1eabde6c1df9c115433c3fd9b1600610cf

commit 1373ef1eabde6c1df9c115433c3fd9b1600610cf
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Wed Oct 11 21:49:39 2017

Project Member

Comment 34 by bugdroid1@chromium.org, Oct 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/6054207e0a3bd88bd2f87907a7056bba8a47aee5

commit 6054207e0a3bd88bd2f87907a7056bba8a47aee5
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Wed Oct 11 23:44:58 2017

Comment 35 by kbr@chromium.org, Oct 12 2017

In this try job there seems to have been a problem starting the forwarder:

https://ci.chromium.org/buildbot/tryserver.chromium.android/android_n5x_swarming_rel/280375

The following telemetry_perf_unittests failed:

* benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:facebook_infinite_scroll
* benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.load:media:soundcloud
* benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.load:social:twitter
* measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails

Log excerpt:

[18/21] measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails queued
[19/21] measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails failed unexpectedly 5.1711s:
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( pm list packages );echo %$?'
  Chose browser: PossibleAndroidBrowser(browser_type=android-chromium)
  Installing /b/swarming/w/ir/out/Release/apks/ChromePublic.apk on device if needed.
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/aapt dump xmltree /b/swarming/w/ir/out/Release/apks/ChromePublic.apk AndroidManifest.xml
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/md5sum_host /b/swarming/w/ir/out/Release/apks/ChromePublic.apk
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( ps | grep -F org.chromium.chrome; echo "PIPESTATUS: ${PIPESTATUS[@]}" );echo %$?'
  Setting permissions for org.chromium.chrome.
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( ( p=org.chromium.chrome;for q in android.permission.ACCESS_COARSE_LOCATION android.permission.ACCESS_FINE_LOCATION android.permission.CAMERA android.permission.GET_ACCOUNTS android.permission.READ_EXTERNAL_STORAGE android.permission.RECORD_AUDIO android.permission.WRITE_EXTERNAL_STORAGE;do pm grant "$p" "$q";echo "~X~$q~X~$?~X~";done )>/data/local/tmp/temp_file-e0d451b177560 2>&1 );echo %$?'
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 pull /data/local/tmp/temp_file-e0d451b177560 /b/swarming/w/it2OhyVw/tmpclSlzF/tmp_ReadFileWithPull
  [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell 'rm -f /data/local/tmp/temp_file-e0d451b177560'
  Traceback (most recent call last):
    File "/b/swarming/w/ir/tools/perf/measurements/thread_times_unittest.py", line 50, in testWithSilkDetails
      results = self.RunMeasurement(measurement, ps, options=self._options)
    File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/testing/page_test_test_case.py", line 71, in RunMeasurement
      measurement, ps, options, results, metadata=EmptyMetadataForTest())
    File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 187, in Run
      test, finder_options.Copy(), story_set)
    File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
      return func(*args, **kwargs)
    File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 89, in __init__
      use_live_traffic=use_live_traffic)
    File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
      return func(*args, **kwargs)
    File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/core/network_controller.py", line 21, in InitializeIfNeeded
      self._network_controller_backend.InitializeIfNeeded(use_live_traffic)
    File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py", line 52, in InitializeIfNeeded
      assert bool(self._ts_proxy_server) == bool(self._forwarder)


Is this the same basic problem? Seems different. Should I file it separately?

Comment 36 by kbr@chromium.org, Oct 12 2017

Cc: kbr@chromium.org
Project Member

Comment 37 by bugdroid1@chromium.org, Oct 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e7fd7dadbc7f5810503dc47e048e39b67e2652b8

commit e7fd7dadbc7f5810503dc47e048e39b67e2652b8
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Thu Oct 12 02:38:02 2017

Trigger telemetry_perf_unittests on N5Xs that are relatively cool.

That dimensions splits the fleet in two based on if their batteries
are above or below 30 C. This will be the only test that explicitly
targets the cooler half.

Bug:  768485 
Change-Id: I66af03763e27032ab1ca96cb6e1b8ff03dd881c8
Reviewed-on: https://chromium-review.googlesource.com/714448
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
Cr-Commit-Position: refs/heads/master@{#508232}
[modify] https://crrev.com/e7fd7dadbc7f5810503dc47e048e39b67e2652b8/testing/buildbot/chromium.android.json

Status: Fixed (was: Assigned)
After the change in #37, the test has stopped timing out excessively. All recent time outs seem to be due to actual CLs themselves:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&et=1507938120000&f=name%3Atelemetry_perf_unittests&f=state%3ATIMED_OUT&f=device_type%3ANexus%205X%20(bullhead)&l=1000&s=created_ts%3Adesc&st=1507851720000

Going to close this out


Re #35: yeah, that looks like another issue. I'd file a new bug if you're still seeing that.

Sign in to add a comment