telemetry_perf_unittests continuously timing out on specific android devices |
|||||||||||
Issue descriptionhttps://chromium-swarm.appspot.com/bot?id=build317-m1--device7 It keeps timing out everytime it picks up a telemetry_perf_unittests shard: https://chromium-swarm.appspot.com/task?id=38d0291d6e936410 https://chromium-swarm.appspot.com/task?id=38d3182635c97b10 https://chromium-swarm.appspot.com/task?id=38d2a80489668310 https://chromium-swarm.appspot.com/task?id=38d25cb5d8e4de10 https://chromium-swarm.appspot.com/task?id=38d0cdac87c07210 Probably an infra issue, but +cc some speed people: any idea why it's timing out?
,
Sep 25 2017
It's possible that some Telemetry test is being stuck here, lemme check.
,
Sep 25 2017
Hmhh, I ran ./bin/list_telemetry_unittests on the 5 logs, and nothing stands out. Different tests was being stuck for each of those. Ex: ********** All passed telemetry unittests: benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:media:facebook_photos 1021.5496 benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:news:wikipedia 16.5032 benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:games:miniclip 16.4803 benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:tools:stackoverflow 16.2789 benchmarks.benchmark_unittest.BenchmarkOptionsTest.memory.top_10_mobile 0.0132 benchmarks.blink_perf_unittest.ComputeTraceEventsMetricsForBlinkPerfTest.testAsyncTraceEventMetricsOverlapping 0.0006 ******** All skipped telemetry unittests: benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.power.typical_10_mobile benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.browse:news:nytimes core.benchmark_finders_unittest.TestGetBenchmarkNamesForFile.testListSimpleBenchmarksDefinedInOneFileComplex ******* Unfinished telemetry unittests: benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:facebook
,
Sep 25 2017
Dirk: a big weakness of typ is that when we send SIGKILL to the it, it doesn't show the stack trace of where the program is currently at (even for the case we run typ with --jobs=1, I believe). This would help a lot with figuring out what test got stuck.
,
Sep 25 2017
Could be something wrong with the phone. I'll trigger a reflash and hopefully that'll heal whatever's wrong with it. BTW, it also might be helpful to prepend a timestamp to each line in the logs like: https://chromium-swarm.appspot.com/task?id=38d3eb8d68c95810 Or even the time-since-test-start like: https://chromium-swarm.appspot.com/task?id=38d3ebb021682a10 That way we can tell what's taking so long.
,
Sep 29 2017
The reflash didn't help. Turns out it's not just that device having issues, but the whole host: https://chromium-swarm.appspot.com/bot?id=build317-m1--device6 https://chromium-swarm.appspot.com/bot?id=build317-m1--device5 Even if it doesn't timeout, it still takes significantly longer to complete than the average across other bots. telemetry_perf_unittests does network I/O during the test, right? I wonder if something's wrong in the network stack on the host, slowing down any network calls inside the containers. I'll poke around.
,
Sep 30 2017
It seems it's always benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse tests that are the reason for the slow down. Their runtimes are always much longer than the average: benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:tumblr_infinite_scroll passed 880.3343s: benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:news:reddit passed 474.3115s: benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:tech:discourse_infinite_scroll passed 457.6424s: On normal bots they usually take 60-80 seconds. Ned, is there anything particular about these benchmarks? Any reason you can think why they'd be so slow on this host's devices?
,
Sep 30 2017
It's also happening to this device: https://chromium-swarm.appspot.com/bot?id=build321-m1--device1 What the heck is going on?
,
Sep 30 2017
And this one https://chromium-swarm.appspot.com/bot?id=build379-m4--device2 !!!!
,
Sep 30 2017
,
Sep 30 2017
Given the current swarming outage ( crbug.com/770384 ), I'd pretty much ignore any results from this afternoon.
,
Oct 2 2017
+Juan: let's see if this still happens. If it does, we might need to enable trace uploading to see where time is spent for these tests.
,
Oct 2 2017
Re #11: The problem here is unrelated to the swarming/datastore outage. And it's still happening. Here's the current list of problem bots so far: https://chromium-swarm.appspot.com/bot?id=build379-m4--device2 https://chromium-swarm.appspot.com/bot?id=build317-m1--device7 https://chromium-swarm.appspot.com/bot?id=build321-m1--device1 https://chromium-swarm.appspot.com/bot?id=build317-m1--device6 https://chromium-swarm.appspot.com/bot?id=build441-m1--device5 https://chromium-swarm.appspot.com/bot?id=build382-m4--device3 https://chromium-swarm.appspot.com/bot?id=build381-m4--device3
,
Oct 3 2017
,
Oct 3 2017
And here's some more: https://chromium-swarm.appspot.com/bot?id=build381-m4--device4 https://chromium-swarm.appspot.com/bot?id=build382-m4--device6 https://chromium-swarm.appspot.com/bot?id=build380-m4--device7 This is really proving disruptive and seems to be continuously getting worse.
,
Oct 3 2017
The way to debug this is enable uploading traces to cloud storage, so we can grab the trace from log & look at why the tests are taking so long. Before I do it, just want to make sure that we don't blow our cloud storage budget. A telemetry_perf_unittests has about 50 sh stories under tests, each produce a trace file of size ~10Mb. So that mean 500 Mb of data going to chrome-telemetry-output/ bucket for each CQ run. Does anyone know whether this is too excessive?
,
Oct 3 2017
500 Mb / build is a fair amount, but not horrible; it should be fine.
,
Oct 4 2017
After taking one of the devices offline, running the test manually, and attaching a debugger, I think I've found what the tests are getting hung up on: (logging seems to be a black hole in these tests) https://codesearch.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=086a927c4c2fa0a96e54ccc17879b82d673e6ffd&l=201 Could be that a lot of our devices are running warmer so it takes a while to cool down to that 35 degree threshold. +rnephew@ fyi. If needed, is there any reason not to bump that temperature limit to 40 degrees? That's what a swarming bot quarantines on. I'll keep poking.
,
Oct 4 2017
actual +rnephew@
,
Oct 4 2017
Interesting. Nice find! If this doesn't fix things, Ned, we need to figure out a plan to hunt this down.
,
Oct 4 2017
Nice find. #20: my plan to hunt this down is to enable trace uploading at mentioned in #16. We capture how a test spend it time in the trace. Example: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/browse_chrome_omnibox_2017-10-03_11-54-37_33716.html (See Process 25832)
,
Oct 4 2017
,
Oct 4 2017
I can confirm that removing the wait-for-battery-temp line makes one of the memory benchmarks pass in only 130.5397s. With it in, it takes much longer. (I launched it ~10 minutes ago and it hasn't finished yet.) Looking at the monitoring data, there's been a noticeable increase in battery temperatures for N5Xs in Atlanta: http://shortn/_EetrXhpPti About a week ago, the p999 temp visibly jumped from about 34 Celsius to 37. That correlates pretty well with when these tests started to timeout. It may be that the subset of devices the tests keep failing on are responsible for that jump. Indeed, looking at a few of the devices mentioned here seems to confirm it: http://shortn/_Rs632gDL3v http://shortn/_trnxZRWeSu I'll file a ticket to get the tray fans for some of these devices looked at. Chances are they shut down and need repair/replacement.
,
Oct 4 2017
Yep, logs are full of messages about waiting for temp. Current battery temperature: 357 condition 'cool_device' not met [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?' Current battery temperature: 357 condition 'cool_device' not met [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?' Current battery temperature: 355 condition 'cool_device' not met [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?' Current battery temperature: 352 condition 'cool_device' not met [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01e36de8cc5f34cb shell '( dumpsys battery );echo %$?' Current battery temperature: 350 condition 'cool_device' met The risk in turning it up is that it makes the device more likely to enter thermal throttling. But since this is for the unit tests that shouldn't be an issue. There is also the risk of battery swelling. On the perf waterfall thats more of a risk since we run those devices pretty hard. I'm guessing the risk of bumping it up to 40 is minimal. But it might be a good idea to check on them from time to time for awhile.
,
Oct 4 2017
I think this points to a need for better monitoring of temperature on the devices. We should be able to track how long devices are spending cooling down, and then be able to distinguish "normal" behavior from devices possibly needing repair/replacement.
,
Oct 4 2017
I noted on the trace in #21: Is that wait happening *after* launching the browser, in the middle of the story? Surely we would like to wait for the device to cool down _before_ running the story? Otherwise this is another source of noise for our measurements.
,
Oct 4 2017
#26: I believe it happen before launching the browser (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=672cabd12c680e8c75d0614c59d97479cbdfc47a&l=201). So you're right that we would be missing this kind of event in the trace. I think later we may need to tweak python tracing so that we capture things that happen before the story run.
,
Oct 4 2017
Got it. I was confused by the long ~ 10 seconds gap between "TimelineBasedPageTest.RunNavigateSteps" and the first "ActionRunner.Wait", but that seems to be the legitimate code that is there to interact with the omnibox: https://cs.chromium.org/chromium/src/tools/perf/page_sets/system_health/chrome_stories.py?rcl=ef62bce2f7c336f4a0e8c173c463c2c4695d4841&l=42
,
Oct 5 2017
Re #25: I'm not sure what you mean. We already have monitoring data on device temps and quarantine and alert off it. Our quarantine threshold is 40 C, while the tests here wait to be 35 C. Meanwhile, the hardware repair is taking its time: t/28980895. I'll drain the entire tray for now so at least they're not taking load.
,
Oct 6 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/b95a5319aa54fd47e6461923b10553739da76fdd commit b95a5319aa54fd47e6461923b10553739da76fdd Author: Benjamin Pastene <bpastene@chromium.org> Date: Fri Oct 06 22:04:08 2017
,
Oct 10 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/991015dd41decb117fca949e05a45eda09a27cb2 commit 991015dd41decb117fca949e05a45eda09a27cb2 Author: Benjamin Pastene <bpastene@chromium.org> Date: Tue Oct 10 00:13:33 2017
,
Oct 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/6292824c3c8eadfe72e69fd8204aab3c6ea8637b commit 6292824c3c8eadfe72e69fd8204aab3c6ea8637b Author: Benjamin Pastene <bpastene@chromium.org> Date: Wed Oct 11 21:29:46 2017 swarming: Ignore the how_chill dimension. TBR=maruel@chromium.org Bug: 768485 Change-Id: If4ce3ddfc3803ce86df83fe4855585070e63b746 Reviewed-on: https://chromium-review.googlesource.com/713437 Reviewed-by: Benjamin Pastene <bpastene@chromium.org> Commit-Queue: Benjamin Pastene <bpastene@chromium.org> [modify] https://crrev.com/6292824c3c8eadfe72e69fd8204aab3c6ea8637b/appengine/swarming/swarming_bot/bot_code/bot_main.py [modify] https://crrev.com/6292824c3c8eadfe72e69fd8204aab3c6ea8637b/appengine/swarming/ts_mon_metrics.py
,
Oct 11 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/1373ef1eabde6c1df9c115433c3fd9b1600610cf commit 1373ef1eabde6c1df9c115433c3fd9b1600610cf Author: Benjamin Pastene <bpastene@chromium.org> Date: Wed Oct 11 21:49:39 2017
,
Oct 11 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/6054207e0a3bd88bd2f87907a7056bba8a47aee5 commit 6054207e0a3bd88bd2f87907a7056bba8a47aee5 Author: Benjamin Pastene <bpastene@chromium.org> Date: Wed Oct 11 23:44:58 2017
,
Oct 12 2017
In this try job there seems to have been a problem starting the forwarder: https://ci.chromium.org/buildbot/tryserver.chromium.android/android_n5x_swarming_rel/280375 The following telemetry_perf_unittests failed: * benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.browse:social:facebook_infinite_scroll * benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.load:media:soundcloud * benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_mobile.load:social:twitter * measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails Log excerpt: [18/21] measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails queued [19/21] measurements.thread_times_unittest.ThreadTimesUnitTest.testWithSilkDetails failed unexpectedly 5.1711s: [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb devices [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( pm list packages );echo %$?' Chose browser: PossibleAndroidBrowser(browser_type=android-chromium) Installing /b/swarming/w/ir/out/Release/apks/ChromePublic.apk on device if needed. [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/aapt dump xmltree /b/swarming/w/ir/out/Release/apks/ChromePublic.apk AndroidManifest.xml [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/md5sum_host /b/swarming/w/ir/out/Release/apks/ChromePublic.apk [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( ps | grep -F org.chromium.chrome; echo "PIPESTATUS: ${PIPESTATUS[@]}" );echo %$?' Setting permissions for org.chromium.chrome. [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell '( ( p=org.chromium.chrome;for q in android.permission.ACCESS_COARSE_LOCATION android.permission.ACCESS_FINE_LOCATION android.permission.CAMERA android.permission.GET_ACCOUNTS android.permission.READ_EXTERNAL_STORAGE android.permission.RECORD_AUDIO android.permission.WRITE_EXTERNAL_STORAGE;do pm grant "$p" "$q";echo "~X~$q~X~$?~X~";done )>/data/local/tmp/temp_file-e0d451b177560 2>&1 );echo %$?' [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 pull /data/local/tmp/temp_file-e0d451b177560 /b/swarming/w/it2OhyVw/tmpclSlzF/tmp_ReadFileWithPull [host]> /b/swarming/w/ir/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 01f078c9e9c34d96 shell 'rm -f /data/local/tmp/temp_file-e0d451b177560' Traceback (most recent call last): File "/b/swarming/w/ir/tools/perf/measurements/thread_times_unittest.py", line 50, in testWithSilkDetails results = self.RunMeasurement(measurement, ps, options=self._options) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/testing/page_test_test_case.py", line 71, in RunMeasurement measurement, ps, options, results, metadata=EmptyMetadataForTest()) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 187, in Run test, finder_options.Copy(), story_set) File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 89, in __init__ use_live_traffic=use_live_traffic) File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/core/network_controller.py", line 21, in InitializeIfNeeded self._network_controller_backend.InitializeIfNeeded(use_live_traffic) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py", line 52, in InitializeIfNeeded assert bool(self._ts_proxy_server) == bool(self._forwarder) Is this the same basic problem? Seems different. Should I file it separately?
,
Oct 12 2017
,
Oct 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e7fd7dadbc7f5810503dc47e048e39b67e2652b8 commit e7fd7dadbc7f5810503dc47e048e39b67e2652b8 Author: Benjamin Pastene <bpastene@chromium.org> Date: Thu Oct 12 02:38:02 2017 Trigger telemetry_perf_unittests on N5Xs that are relatively cool. That dimensions splits the fleet in two based on if their batteries are above or below 30 C. This will be the only test that explicitly targets the cooler half. Bug: 768485 Change-Id: I66af03763e27032ab1ca96cb6e1b8ff03dd881c8 Reviewed-on: https://chromium-review.googlesource.com/714448 Reviewed-by: Ned Nguyen <nednguyen@google.com> Commit-Queue: Benjamin Pastene <bpastene@chromium.org> Cr-Commit-Position: refs/heads/master@{#508232} [modify] https://crrev.com/e7fd7dadbc7f5810503dc47e048e39b67e2652b8/testing/buildbot/chromium.android.json
,
Oct 13 2017
After the change in #37, the test has stopped timing out excessively. All recent time outs seem to be due to actual CLs themselves: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&et=1507938120000&f=name%3Atelemetry_perf_unittests&f=state%3ATIMED_OUT&f=device_type%3ANexus%205X%20(bullhead)&l=1000&s=created_ts%3Adesc&st=1507851720000 Going to close this out Re #35: yeah, that looks like another issue. I'd file a new bug if you're still seeing that. |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by martiniss@chromium.org
, Sep 25 2017