
Issue 811525

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-04-04
OS: Mac
Pri: 2
Type: Bug

Blocked on:
issue 818883




AudioOutputUnitStop() can take a loooong time; 10s+ to complete.

Project Member Reported by dalecur...@chromium.org, Feb 12 2018

Issue description

This is the result of the discussion here https://groups.google.com/a/chromium.org/d/msg/blink-dev/amrZuV2eR6Q/wDfRqRxdCAAJ

Reproduction:

./blink/tools/run_layout_tests.py --no-retry-failures http/tests/media/media-src-suspend-after-have-future-* --exit-after-n-failures=1 --timing --verbose --additional-driver-flag=--trace-startup --additional-driver-flag=--trace-startup-duration=60 --additional-driver-flag=--trace-startup-file=/tmp/trace_full.json --repeat-each=30

After step ~14 or 15 (with 1 content shell, ~28 with 2 shells) you'll notice that the timing increases significantly.

$ ./blink/tools/run_layout_tests.py --no-retry-failures http/tests/media/media-src-suspend-after-have-future-* --exit-after-n-failures=1 --repeat-each=100 -f --timing --verbose

[1/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.9929s
[2/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.9942s
[3/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1590s
[4/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1652s
[5/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1743s
[6/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1884s
<...>
[28/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 12.1446s

Startup tracing reveals that AudioOutputUnitStop() is blocking the UI thread for 10s+ when this occurs; this prevents test completion. Note: When using startup tracing, you must wait for your chosen duration to elapse or you will get no trace file. :| :| :|
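
(For reference, a minimal sketch of how one might time the call in isolation - assuming an already-started output AudioUnit handle named output_unit; illustrative only, not the actual Chromium code:)

  #include <AudioToolbox/AudioToolbox.h>
  #include <chrono>
  #include <cstdio>

  // Measures how long AudioOutputUnitStop() blocks the calling thread.
  // |output_unit| is assumed to be a running output AudioUnit.
  void TimeAudioOutputUnitStop(AudioUnit output_unit) {
    const auto start = std::chrono::steady_clock::now();
    const OSStatus status = AudioOutputUnitStop(output_unit);
    const auto elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
    std::printf("AudioOutputUnitStop() returned %d after %lld ms\n",
                static_cast<int>(status), static_cast<long long>(elapsed_ms));
  }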

Switching to a fake audio stream reveals this is the culprit:
Running 1 Content Shell.                                            

[1/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.5970s
[2/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1278s
[3/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1114s
<...>
[28/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1500s
[29/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.0792s
[30/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1098s

For whatever reason, once we cross ~25 calls to stop, it starts slowing down terribly. I suspect we're sometimes hitting this in production, but probably not very frequently since we typically reuse streams when possible.

I think we're likely hitting this on the macOS layout test bots (possibly not if they don't have working audio). I suspect we should add a flag to allow the layout test shell to use fake audio streams to avoid test slowdowns.

+dpranke for layout test commentary and pointers to where we track time to run mac layout tests.
+ossu,olka for CoreAudio expertise
 
trace_moar3.json.gz
4.2 MB Download
It's not super-easy to track test execution time these days. Is there something specific I could look for?
Hmm, I was hoping we might report the average length of the run layout tests step for all mac bots somewhere. Something that we could watch for decreases in if we land a flag to switch to fake output streams for layout tests. Or also to see how it compares to other platforms; I suspect this could cause an order of magnitude difference in run time on the bots depending on how many tests w/ audio each shard gets.
We do record that data, but there is so much data, and the platforms are so different from each other, that it'd be hard to make sense out of it, unless you have *very* specific questions you want to ask, like "how long did this specific test take?".
I think just a high-level answer is fine for an A/B test. E.g., over the last N layout test runs on macOS, what's the average time to complete that step? Where N is roughly equal to a few days of runs.

With that value we can switch to fake audio for a couple days and see if there's any improvement at all.
See go/webkit-layout-tests-time-on-mac10.12-tests-for-the-last-week for a graph.

Bear in mind that the total execution time (across all shards) is ~16,000s, so you'd have to have a *lot* of tests being slow for there to be a significant dent :).
Your query doesn't seem to limit to the layout tests step; limiting it to steps with the name 'webkit_layout_tests on Intel GPU on Mac on Mac-10.12.6' yields somewhere around ~6000 seconds on average.

Thanks in any case! There are probably ~1000 media layout tests and ~50000 overall, so if test selection is random and this bug has fall-off (i.e. stops slowing down after no audio usage for a while, or never slows down without back-to-back audio usage crossing a threshold), there could be no effect. If the bots don't have working audio, there could also be no effect.

In the worst case (either alphabetical test selection or no fall-off), every media layout test past some threshold may cumulatively add ~10 seconds per media test per content shell shard, which would add up to a significant portion of that execution time.
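
(Back of the envelope: ~1000 media tests at ~10s of extra stop time each would be roughly 10,000 extra seconds in the worst case, which is on the same order as the ~6000s average step time quoted above - so it's at least plausible this is a significant contributor.)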
For a (probably not comparable) comparison point, the Windows 10 layout step takes on average ~3400 seconds across all shards.
Project Member

Comment 8 by bugdroid1@chromium.org, Feb 28 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f29eb01290cd36a30177ecf8197f906c01088a0d

commit f29eb01290cd36a30177ecf8197f906c01088a0d
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Wed Feb 28 01:44:39 2018

Add --disable-audio-output switch for disabling audio browser side.

This ensures we still exercise the majority of the audio output path
(good for testing) but replace the actual OS level stream with a
fake stream at the last moment.

The goal here is to try setting this for macOS layout tests to see
if it reduces the run time due to hanging OS calls.

BUG= 811525 
TEST=manual

Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: Id3b95125696a96ef8d14468f1a97dedf9cdf77b4
Reviewed-on: https://chromium-review.googlesource.com/938152
Reviewed-by: Olga Sharonova <olka@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539638}
[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/audio/audio_manager_base.cc
[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/base/media_switches.cc
[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/base/media_switches.h

Cc: gab@chromium.org
Components: Internals>TaskScheduler
Huh, now that I've landed that switch, running the media/media-ended.html test locally I am still seeing 10s+ timeouts on the 11th run consistently. Trace shows most calls eventually stop happening, leaving just giant gaps between when a PostDelayedTask is requested and when it's actually scheduled!

ERR: [64004:775:0228/112957.940975:ERROR:webmediaplayer_impl.cc(399)] Load(0, file:///d/code/chrome/src/third_party/WebKit/LayoutTests/media/content/silence.oga, 0)\n
ERR: [64004:26895:0228/112957.948370:ERROR:audio_renderer_impl.cc(842)] SetPlaybackRate(0)\n
ERR: [64002:33539:0228/112957.949308:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 0.016ms\n
ERR: [64002:33539:0228/112957.949369:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData\n
ERR: [64002:33539:0228/112957.949412:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.024ms\n
ERR: [64002:33539:0228/112957.949448:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 23.074ms\n
ERR: [64004:26895:0228/112957.949489:ERROR:audio_renderer_impl.cc(842)] SetPlaybackRate(1)\n
ERR: [64002:33539:0228/113007.973750:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001.2ms\n <---- !!! 10s gap !!!
ERR: [64002:33539:0228/113007.973877:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData\n
ERR: [64002:33539:0228/113007.973962:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.054ms\n
ERR: [64002:33539:0228/113007.974057:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 5.905ms\n
...

The render thread is the only thread that seems to still be firing regular tasks during this time. AudioThread is a plain base::Thread, but for whatever reason isn't getting any scheduled time despite posting a delayed task for 23.074ms into the future.
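
(The pattern in question is roughly the following - a minimal sketch using base::Thread and PostDelayedTask, with the 23ms delay and the DoRead name taken from the log above; it is not the actual FakeAudioWorker code:)

  #include "base/bind.h"
  #include "base/location.h"
  #include "base/threading/thread.h"
  #include "base/time/time.h"

  void DoRead();  // Would normally produce the next buffer of fake audio.

  void PostNextRead(base::Thread* audio_thread) {
    // The worker reschedules itself ~23ms into the future; in the bad case
    // the task doesn't run until ~10s later even though the thread is idle.
    audio_thread->task_runner()->PostDelayedTask(
        FROM_HERE, base::BindOnce(&DoRead),
        base::TimeDelta::FromMilliseconds(23));
  }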

The fact that the gap is pretty close to precisely 10s is also really weird. It feels like there's a timeout or something happening somewhere. One reference to a 10s timeout is in the burst mode for the sampling profiler, but I'm not sure how that's related.

I'm a bit stumped. It's starting to look like there's something going wrong inside the MessagePump for macOS. I can't think of any other reason why a post delayed task would not be scheduled. +gab from scheduler team.
trace_hang.json.gz
2.3 MB Download
Repro command w/ ToT:

./blink/tools/run_layout_tests.py --no-retry-failures media/media-ended.html --timing --verbose --additional-driver-flag=--disable-audio-output --driver-logging --repeat-each=11

The 11th run always times out for me, even if you add the test to SlowTests.
Grabbed a sample from ActivityMonitor, but it doesn't really show anything useful AFAICT; threads are sitting in mach_msg_trap(), but it's unclear from the trace for how long.
sample-2.txt
107 KB View Download
Cc: rsesek@chromium.org
So at some point mach_msg() used in WaitableEventMac starts taking seconds to return:

ERR: [68978:33795:0228/142346.712934:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 1.45\n
ERR: [68978:33795:0228/142346.713006:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData\n
ERR: [68978:33795:0228/142346.713052:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.031\n
ERR: [68978:33795:0228/142346.713100:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 21.623\n
ERR: [68980:9275:0228/142346.713115:ERROR:audio_renderer_impl.cc(889)] Render delay:0.023219 s prior_frames_skipped:0 frames_requested:1024\n
ERR: [68980:9275:0228/142346.713187:ERROR:audio_renderer_impl.cc(1043)] Render complete...\n
ERR: [68980:25107:0228/142350.562302:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5071.06\n
ERR: [68980:30467:0228/142350.562311:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.76\n
ERR: [68980:10247:0228/142350.562316:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.8\n
ERR: [68980:31235:0228/142350.562326:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.88\n
ERR: [68980:18691:0228/142350.562321:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.88\n
ERR: [68980:25603:0228/142350.562341:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8350.3\n
ERR: [68980:32515:0228/142350.562355:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8350.32\n
ERR: [68980:36111:0228/142351.510343:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1999.96\n
ERR: [68980:36111:0228/142353.511218:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.71\n
ERR: [68978:33283:0228/142354.937406:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40000.3\n
ERR: [68978:37891:0228/142354.940406:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2\n
ERR: [68980:36111:0228/142355.511997:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.67\n
ERR: [68978:33795:0228/142356.735210:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10022.1\n
ERR: [68978:33795:0228/142356.735290:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10000.6\n

+rsesek who worked on this most recently for insight into why this might be...
(units are ms)
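
(The instrumentation is roughly this shape - a sketch of timing a mach_msg() receive that uses MACH_RCV_TIMEOUT, not the actual waitable_event_mac.cc code:)

  #include <mach/mach.h>
  #include <chrono>
  #include <cstdio>

  // Receives a message on |port|, waiting up to |timeout_ms|, and logs if the
  // call took dramatically longer than the requested timeout.
  void TimedReceive(mach_port_t port, mach_msg_timeout_t timeout_ms) {
    mach_msg_empty_rcv_t msg{};
    const auto start = std::chrono::steady_clock::now();
    kern_return_t kr =
        mach_msg(&msg.header, MACH_RCV_MSG | MACH_RCV_TIMEOUT,
                 /*send_size=*/0, sizeof(msg), port, timeout_ms,
                 MACH_PORT_NULL);
    const double elapsed_ms =
        std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
    if (elapsed_ms > timeout_ms + 1000.0)
      std::printf("mach_msg took forever!!! - %.2f, timeout was: %u (kr=%d)\n",
                  elapsed_ms, timeout_ms, kr);
  }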
Now w/ the timeout value:

ERR: [69346:26895:0228/142728.281798:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1215.26, timeout was: 0\n
ERR: [69346:31747:0228/142729.875311:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.45, timeout was: 30000\n
ERR: [69346:32003:0228/142729.875310:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.57, timeout was: 30000\n
ERR: [69346:6407:0228/142729.875304:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5091.63, timeout was: 30000\n
ERR: [69346:31491:0228/142729.875326:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.66, timeout was: 30000\n
ERR: [69346:33283:0228/142729.875333:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.68, timeout was: 30000\n
ERR: [69346:13315:0228/142729.875340:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8354.13, timeout was: 30000\n
ERR: [69346:38147:0228/142729.875359:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8354.17, timeout was: 30000\n
ERR: [69346:26895:0228/142730.283367:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.43, timeout was: 0\n
ERR: [69346:26895:0228/142732.284496:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.97, timeout was: 0\n
ERR: [69344:31747:0228/142734.189675:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000\n
ERR: [69344:38659:0228/142734.199846:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40000.9, timeout was: 40000\n
ERR: [69346:26895:0228/142734.286741:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2002.09, timeout was: 0\n
ERR: [69346:26895:0228/142736.289230:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2002.32, timeout was: 0\n
ERR: [69344:34819:0228/142737.089643:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10023.2, timeout was: 23\n
ERR: [69344:34819:0228/142737.089797:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001\n
ERR: [69344:34819:0228/142737.089856:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData\n
ERR: [69344:34819:0228/142737.089942:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.054\n
ERR: [69344:34819:0228/142737.090094:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 6.072\n
ERR: [69346:34851:0228/142737.090100:ERROR:audio_renderer_impl.cc(889)] Render delay:0.023219 s prior_frames_skipped:0 frames_requested:1024\n
ERR: [69346:34851:0228/142737.090172:ERROR:audio_renderer_impl.cc(1043)] Render complete...\n
ERR: [69346:26895:0228/142738.290566:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1200.3, timeout was: 0\n
ERR: [69346:26895:0228/142740.291644:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.91, timeout was: 0\n
ERR: [69346:26895:0228/142742.295718:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2003.98, timeout was: 0\n
ERR: [69346:26895:0228/142744.299149:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2003.27, timeout was: 0\n
ERR: [69346:26895:0228/142746.303755:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2004.42, timeout was: 0\n
ERR: [69344:34819:0228/142747.097306:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10007.2, timeout was: 7\n

Will print the thread ids next.
ERR: [69507:26639:0228/143232.908400:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1232.7, timeout was: 0: tid=26639\n
ERR: [69507:12803:0228/143234.485225:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.55, timeout was: 30000: tid=12803\n
ERR: [69507:34051:0228/143234.485225:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5070.74, timeout was: 30000: tid=34051\n
ERR: [69507:33539:0228/143234.485235:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.65, timeout was: 30000: tid=33539\n
ERR: [69507:35075:0228/143234.485242:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.7, timeout was: 30000: tid=35075\n
ERR: [69507:32003:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8377.1, timeout was: 30000: tid=32003\n
ERR: [69507:6407:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.73, timeout was: 30000: tid=6407\n
ERR: [69507:37635:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.72, timeout was: 30000: tid=37635\n
ERR: [69507:26639:0228/143234.913135:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1994.64, timeout was: 0: tid=26639\n
ERR: [69507:26639:0228/143236.914781:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.48, timeout was: 0: tid=26639\n
ERR: [69505:31235:0228/143238.790479:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000: tid=31235\n
ERR: [69505:26115:0228/143238.795351:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000: tid=26115\n
ERR: [69507:26639:0228/143238.915973:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.03, timeout was: 0: tid=26639\n
ERR: [69507:26639:0228/143240.917021:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.85, timeout was: 0: tid=26639\n
ERR: [69505:32259:0228/143241.697752:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10022.3, timeout was: 22: tid=32259\n <--- audio thread.
ERR: [69505:32259:0228/143241.697910:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001.4: tid=32259\n
Yes, it still repros with that section disabled.
rsesek@ any advice on figuring out what mach_msg() may be blocked on? I searched for mach_msg debugging and got:

https://robert.sesek.com/2012/1/debugging_mach_ports.html

So you seem like the right person to ask :p
More debugging notes:
- Forcing UseSlowWatchList() to true had no effect.
- Forcing WaitMany() to use PORT_SET or DISPATCH had no effect.
I noticed we mix mach_port_construct and mach_port_deallocate vs mach_port_destruct, but changing that doesn't seem to make anything better.
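
(For context, the pairing in question looks roughly like this - a sketch of the two teardown paths, not the actual WaitableEvent code:)

  #include <mach/mach.h>

  void ConstructAndTearDownPort() {
    mach_port_options_t options = {};
    mach_port_t port = MACH_PORT_NULL;
    mach_port_construct(mach_task_self(), &options, /*context=*/0, &port);

    // Path A: mach_port_destruct() is the mirror of mach_port_construct();
    // it tears down the receive right created above.
    mach_port_destruct(mach_task_self(), port, /*srdelta=*/0, /*guard=*/0);

    // Path B (the "mixed" usage mentioned above): mach_port_deallocate() only
    // drops one user reference, e.g. a send right, and is not a full mirror
    // of mach_port_construct().
    // mach_port_deallocate(mach_task_self(), port);
  }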

Comment 21 by olka@chromium.org, Mar 1 2018

Might that be somehow related to Issue 816483?
Re: #18: Not to be glib, but it's blocked on waiting for a message from calling Signal() :p. Is the WaitableEvent that's blocked part of a MessagePump/TaskScheduler worker implementation, or is it a dedicated synchronization primitive?

Re: #19: Thanks for checking those. The watch list shouldn't have an impact here because it's only for the WaitableEventWatcher, which is used only in a couple of places.

Re: #20: The API usage is correct. The Mach functions are a bit clumsy to use, and the names aren't great.

I'm a bit skeptical of this being an issue in WaitableEvent itself -- we've been shipping this implementation for several milestones. I think it's more likely that this is a synchronization issue with the caller of WaitableEvent instead, or issue 816483 mentioned in #21 sounds plausible as well.
I'm pretty sure it's part of a MessagePump; i.e., the timeout value for this line comes from PostDelayedTask:
ERR: [69344:34819:0228/142737.089643:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10023.2, timeout was: 23\n

I also don't think it's likely an issue with WaitableEvent, but perhaps some other thing either exhausting mach ports or destabilizing the mach system as used by Chrome.

@olka, it's possible, but with --disable-audio-output this isn't a realtime audio thread; the worker thread is just a plain thread that isn't woken up for some reason.
Another debugging note: I tried switching the message pump for the audio thread to UI or IO (since that changes the underlying pump) and didn't see a significant improvement, though the UI one seemed like it might be more responsive. I'll retest that today to confirm.
Thanks for the tip on priority, olka@; I didn't think to check that, but increasing the thread priority to REALTIME_AUDIO for the fake thread resolves the issue. For some reason, as a normal-priority thread, the fake audio worker was getting no scheduled time for up to 10s? Will dig into exactly what normal priority means for our threads now.
Cc: mark@chromium.org
Experimenting a bit further, making PlatformThreadMac::SetRealtimeAudioPriority() a no-op causes the problem to return. Toggling various bits of that function on and off shows that any one of the three thread-policy-setting calls is sufficient to resolve the problem.

Most interesting is just setting the time constraints for the thread; this keeps a 0.5 priority but still ensures the audio thread is scheduled sufficiently.

It doesn't seem reasonable for a normal priority thread to be descheduled for 10s... so I wonder if we should be setting time constraints for all normal threads... +mark as another macOS expert.
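
(For reference, setting a time-constraint policy looks roughly like this - a sketch with made-up period/computation values, not the actual PlatformThreadMac::SetRealtimeAudioPriority() code:)

  #include <mach/mach.h>
  #include <mach/mach_time.h>
  #include <mach/thread_policy.h>
  #include <cstdint>

  // Gives the calling thread a realtime time-constraint policy so the
  // scheduler guarantees it periodic CPU time. Values are illustrative only.
  void SetTimeConstraintPolicyForCurrentThread() {
    mach_timebase_info_data_t timebase;
    mach_timebase_info(&timebase);
    const double ms_to_abs_time =
        (static_cast<double>(timebase.denom) / timebase.numer) * 1e6;

    thread_time_constraint_policy_data_t policy;
    policy.period = static_cast<uint32_t>(10 * ms_to_abs_time);      // ~10ms cycle
    policy.computation = static_cast<uint32_t>(2 * ms_to_abs_time);  // ~2ms of work
    policy.constraint = policy.period;
    policy.preemptible = 1;

    thread_policy_set(mach_thread_self(), THREAD_TIME_CONSTRAINT_POLICY,
                      reinterpret_cast<thread_policy_t>(&policy),
                      THREAD_TIME_CONSTRAINT_POLICY_COUNT);
  }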
Interesting. It does seem odd that the thread is starving for 10s. Is this reproable only on a specific OS version, or are you seeing it across versions? And is it only occurring when there's a high level of parallelism with running tests (and other things on the system), or when it's standalone too?
mac_chromium_rel_ng timed out on this test, so I think it's reproable on the bots too, but it's always possible it's something else there. Locally I'm running 10.13.3 on a 2017 MBP, and that's the only version I have access to.

If you have a local checkout you can try a quick repro with the command line in c#10. 
It's occurring 100% of the time on the 11th run, with nothing else going on in the system.

Comment 30 Deleted

Deleted the last comment since it was wrong, c#26 is correct. I just accidentally left an early return in while testing.
You may want to check out `taskinfo --threads <pid>` which can dump the policies for all the threads for a given PID.
Thanks for that. It seems that at some point the audio thread (along with almost every other thread) is demoted significantly in priority:

https://www.diffchecker.com/Q6a8aRcJ

E.g., even CrBrowserMain drops to priority 4:

good:
	thread ID: 0x1f98a / 129418
	thread name: "CrBrowserMain"
	user/system time: 0.640887 s / 0.219802 s
	CPU usage (over last tick): 5%
	sched mode: timeshare
	policy: POLICY_TIMESHARE
		timeshare max  priority: 63
		timeshare base priority: 46
		timeshare cur  priority: 46
		timeshare depressed: NO, prio -1
	requested policy:

bad:
	thread ID: 0x1f98a / 129418
	thread name: "CrBrowserMain"
	user/system time: 0.888308 s / 0.305418 s
	CPU usage (over last tick): 0%
	sched mode: timeshare
	policy: POLICY_TIMESHARE
		timeshare max  priority: 4
		timeshare base priority: 4
		timeshare cur  priority: 4
		timeshare depressed: NO, prio -1
	requested policy:
early.txt
29.8 KB View Download
stuck.txt
30.0 KB View Download
Despite the QoS policy being user-interactive, the UI thread is shunted to tier 5:

	effective policy:
		eff thread qos: THREAD_QOS_USER_INTERACTIVE
		eff thread qos relprio: 0
		eff promotion qos: THREAD_QOS_USER_INTERACTIVE
		eff latency qos: LATENCY_QOS_TIER_5
		eff thruput qos: THROUGHPUT_QOS_TIER_0
		eff darwin BG: NO
		eff iotier: THROTTLE_LEVEL_TIER1 (STANDARD)
		eff other: bg-newsockets

I can't find information on what bg-newsockets is...
Looks like the culprit is related to the energy saving feature "App Nap":

https://developer.apple.com/library/content/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/AppNap.html

Running:
defaults write NSGlobalDomain NSAppSleepDisabled -bool YES

Resolves the issue on future runs...
Cc: lgrey@chromium.org
Interesting find. I thought we disabled our AppNap implementation, but I could be wrong. +lgrey
Adding SupportsAppNap <false/> to the .plist for content shell doesn't seem to help.
Doesn't seem like AppNap requires any implementation from us; it is applied automatically per the docs: "If an app isn’t performing user-initiated work such as updating content on screen, playing music, or downloading a file, the system may put the app in App Nap."

Seems this might only affect layout tests, which have a hidden content shell.
We did disable renderer AppNap. Browser AppNap *is* on, per comment #37, since we didn't need to do anything magic to make it work there.

I would find it surprising if an application became AppNap eligible with an active audio stream, but they're pretty opaque on how it works, so who knows?

dalecurtis@ there's a toggleable "AppNap" column in Activity Monitor, so that can be a quick way to verify what process enters AppNap when. There's also a NSAppNap$1 DTrace provider.
Disabling AppNap also seems to fix the original issue listed in this bug. So perhaps the only recommendation here is to ensure our macOS bots all disable that...
@lgrey, with the repro case in c#10 it's running with --disable-audio-output, which prevents any active audio streams. That said, the original issue in c#0 plays a stream with audio, though that stream is toggled on and off. Perhaps if no stream is active at a given time, creating a new one is insufficient to take the process out of nap.
Makes sense. My testing with QOS level 5 reveals that 10 seconds is the magic max threshold for timer ticks, as you saw above. In that gap, I don't think there's anything the backgrounded process itself can do. 
@dpranke, do you know if our macOS bots disable AppNap? 
Do we want to disable AppNap on all the bots, though? Presumably we'd lose coverage of the real-world configuration. Perhaps the wrapper script for running layout tests could temporarily disable it, and then reset it to its last value when it's done.
Cc: d...@chromium.org
 +dba ... @dba - any idea if we disable AppNap on the swarming bots?
I think any long-running test that doesn't present UI is going to trigger AppNap, which is pretty much all of our tests. This is all speculation though, so we could try disabling it and see if anything changes; if not, flip it back.

What coverage would we be getting though with it enabled?
I think it matters most on the perf bots, where we're trying to get an accurate picture of performance on user-like systems. I suppose for our test binaries it doesn't matter. lgrey@ WDYT?
Do perf bots actively test switching to other applications? If not, I think it's not really a huge deal.
Can you elaborate lgrey? I'm seeing AppNap kick in even when nothing else but the layout test runner is running.
I mean I think it would be fine for the perf bots. Since you mentioned upthread that this only happens when the content shell is hidden, I would think the presence of a visible window prevents it from kicking in. If not, I think we'd already be seeing some issues in browser test waits.
ping @dba, any update on whether our bots disable app nap?

Comment 53 by d...@chromium.org, Mar 5 2018

#52 - No, it looks like it's not disabled.
Thanks. It seems reasonable to try disabling it for some time to see if we notice improvements in cycle time for build and test.

@dpranke, is there experiment infrastructure for this? Or just flip and graph the results on plx?
> @dpranke, is there experiment infrastructure for this? 
> Or just flip and graph the results on plx?

No, there's no experiment infrastructure for this that I'd recommend. Just flip and watch :).
Any tips on how one might go about executing a CLI command across the Mac fleet? :) (See c#35.) Should I just file an infrastructure ticket? As rsesek@ notes, we probably shouldn't disable it on the Telemetry fleet.
File a Labs ticket (go/infrasys-bug) and ask for them to deal with this via Puppet.
Blockedon: 818883
To close the loop on the fleet change, it didn't seem to have any effect. I'd like to come up with some way to keep folks who run the layout test runner from being affected without their knowledge, but it seems a bit weird to have the layout test runner (all test runners?) execute the command in c#35 before and after - so maybe the best I can do here is a PSA to chromium-dev@.
I wonder if adding:

  [defaults setBool:YES forKey:@"NSAppSleepDisabled"]

to SetDefaultsToLayoutTestValues() https://cs.chromium.org/chromium/src/content/shell/app/blink_test_platform_support_mac.mm?q=nsuserdefaults+file:content&sq=package:chromium&dr=C&l=25 would work instead of having to put it in the global prefs domain.
NextAction: 2018-04-04
Good idea, I'll try that when I have my mac laptop next.
The NextAction date has arrived: 2018-04-04
Project Member

Comment 64 by bugdroid1@chromium.org, Apr 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/de5d56c9ab59e532f61deb09937315b0346be021

commit de5d56c9ab59e532f61deb09937315b0346be021
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Thu Apr 05 15:58:03 2018

Disable AppNap to avoid timeouts when running blink tests.

Layout tests do not always meet the requirements to avoid being
"napped", in which case test timeouts and other odd behavior may
occur; this is most prevalent with large repeat values and a
small number of tests.

This sets NSAppSleepDisabled to YES for blink tests.

BUG= 811525 
TEST=manual

Change-Id: I80ae1aeb0f5577f1977b3685e98655ea1253e3e6
Reviewed-on: https://chromium-review.googlesource.com/996483
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Reviewed-by: Mike West <mkwst@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#548438}
[modify] https://crrev.com/de5d56c9ab59e532f61deb09937315b0346be021/content/shell/app/blink_test_platform_support_mac.mm

Status: Fixed (was: Assigned)
"Fixed" - thanks for the suggestion Robert!

Comment 66 by gab@chromium.org, Apr 6 2018

Nice diagnosis! Could this explain flakiness in layout tests on other platforms as well? Should we bring up a dummy window during those tests to fool OSes that try to be smart about our visibility?
I guess it's possible, but I don't think so. I haven't seen such aggressive throttling on other platforms. I think developers are most at risk for triggering this in local runs; hence the patch we landed.

In issue 818883 we flipped this on for the macOS bots and didn't see any change in cycle time. That said, I didn't look at flakiness rates over that time period.

If it's easy to add it's always worth an experiment though.

Comment 68 by gab@chromium.org, Apr 6 2018

Fair enough, thinking about it some more, I'd assume infra already disables such OS features through policy whenever possible on our test fleet.

Thanks!
