AudioOutputUnitStop() can take a loooong time; 10s+ to complete.
Issue description

This is the result of the discussion here: https://groups.google.com/a/chromium.org/d/msg/blink-dev/amrZuV2eR6Q/wDfRqRxdCAAJ

Reproduction:

./blink/tools/run_layout_tests.py --no-retry-failures http/tests/media/media-src-suspend-after-have-future-* --exit-after-n-failures=1 --timing --verbose --additional-driver-flag=--trace-startup --additional-driver-flag=--trace-startup-duration=60 --additional-driver-flag=--trace-startup-file=/tmp/trace_full.json --repeat-each=30

After run ~14 or 15 (with 1 content shell, ~28 with 2 shells) you'll notice that the timing increases significantly:

$ ./blink/tools/run_layout_tests.py --no-retry-failures http/tests/media/media-src-suspend-after-have-future-* --exit-after-n-failures=1 --repeat-each=100 -f --timing --verbose
[1/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.9929s
[2/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.9942s
[3/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1590s
[4/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1652s
[5/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1743s
[6/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1884s
<...>
[28/100] http/tests/media/media-src-suspend-after-have-future-data.html passed 12.1446s

Startup tracing reveals that AudioOutputUnitStop() is blocking the UI thread for 10s+ when this occurs, which prevents test completion. (Note: when using startup tracing, you must wait for your chosen duration to elapse or you will get no trace file. :| :| :|)

Switching to a fake audio stream reveals this is the culprit; running 1 content shell:

[1/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.5970s
[2/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1278s
[3/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1114s
<...>
[28/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1500s
[29/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.0792s
[30/30] http/tests/media/media-src-suspend-after-have-future-data.html passed 2.1098s

For whatever reason, once we cross more than ~25 calls to AudioOutputUnitStop() it starts slowing down terribly. I suspect we're sometimes hitting this in production, but probably not very frequently since we typically reuse streams when possible. I think we're likely hitting this on the macOS layout test bots (possibly not if they don't have working audio).

I suspect we should add a flag to allow the layout test shell to use fake audio streams to avoid test slowdowns.

+dpranke for layout test commentary and pointers to where we track time to run mac layout tests.
+ossu, olka for CoreAudio expertise.
Feb 13 2018
Hmm, I was hoping we might report the average length of the run-layout-tests step for all Mac bots somewhere: something we could watch for decreases if we land a flag to switch to fake output streams for layout tests, and something to compare against other platforms. I suspect this could cause an order-of-magnitude difference in run time on the bots, depending on how many tests with audio each shard gets.
Feb 13 2018
We do record that data, but there is so much data, and the platforms are so different from each other, that it'd be hard to make sense out of it, unless you have *very* specific questions you want to ask, like "how long did this specific test take?".
Feb 13 2018
I think just a high-level answer is fine for an A/B test. E.g., over the last N layout test runs on macOS, what's the average time to complete that step, where N covers roughly a few days of runs? With that value we can switch to fake audio for a couple of days and see if there's any improvement at all.
Feb 13 2018
See go/webkit-layout-tests-time-on-mac10.12-tests-for-the-last-week for a graph. Bear in mind that the total execution time (across all shards) is ~16,000s, so you'd have to have a *lot* of tests being slow for there to be a significant dent :).
Feb 13 2018
Your query doesn't seem to limit to the layout tests step; limiting it to steps named 'webkit_layout_tests on Intel GPU on Mac on Mac-10.12.6' yields somewhere around ~6000 seconds on average. Thanks in any case! There are probably ~1000 media layout tests and ~50000 overall, so if test selection is random and this bug falls off (i.e., it stops slowing down after a period with no audio usage, or never slows down unless back-to-back audio usage crosses a threshold), there could be no effect. If the bots don't have working audio, there could also be no effect. In the worst case (either alphabetical test selection or no fall-off), every media layout test past some threshold may cumulatively add up to ~10 seconds per media test per content shell shard, which would amount to a significant portion of that execution time.
Feb 13 2018
For a probably not comparable comparison point, the Windows 10 layout step takes on average ~3400 seconds across all shards.
Feb 28 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f29eb01290cd36a30177ecf8197f906c01088a0d

commit f29eb01290cd36a30177ecf8197f906c01088a0d
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Wed Feb 28 01:44:39 2018

    Add --disable-audio-output switch for disabling audio browser side.

    This ensures we still exercise the majority of the audio output path
    (good for testing) but replace the actual OS level stream with a fake
    stream at the last moment. The goal here is to try setting this for
    macOS layout tests to see if it reduces the run time due to hanging
    OS calls.

    BUG=811525
    TEST=manual
    Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
    Change-Id: Id3b95125696a96ef8d14468f1a97dedf9cdf77b4
    Reviewed-on: https://chromium-review.googlesource.com/938152
    Reviewed-by: Olga Sharonova <olka@chromium.org>
    Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#539638}

[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/audio/audio_manager_base.cc
[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/base/media_switches.cc
[modify] https://crrev.com/f29eb01290cd36a30177ecf8197f906c01088a0d/media/base/media_switches.h
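For context, the change amounts to short-circuiting browser-side output stream creation. A minimal sketch of the idea, assuming the switch is checked in AudioManagerBase when an output stream is created; the function signature, includes, and exact placement here are simplified assumptions, not the landed diff:

// Sketch only: substitute a fake stream when --disable-audio-output is set.
// Requires: base/command_line.h, media/audio/fake_audio_output_stream.h,
//           media/base/media_switches.h
AudioOutputStream* AudioManagerBase::MakeAudioOutputStream(
    const AudioParameters& params, const std::string& device_id) {
  if (base::CommandLine::ForCurrentProcess()->HasSwitch(
          switches::kDisableAudioOutput)) {
    // Everything up to this point (mixing, sync reader, renderer IPC) still
    // runs; only the OS-level stream is replaced at the last moment.
    return FakeAudioOutputStream::MakeFakeStream(this, params);
  }
  // ... otherwise fall through to the platform-specific stream creation.
}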
Feb 28 2018
Huh, now that I've landed that switch, running the media/media-ended.html test locally I am still seeing 10s+ timeouts on the 11th run consistently. The trace shows most calls eventually stop happening, leaving giant gaps between when a PostDelayedTask is requested and when it's actually scheduled!

ERR: [64004:775:0228/112957.940975:ERROR:webmediaplayer_impl.cc(399)] Load(0, file:///d/code/chrome/src/third_party/WebKit/LayoutTests/media/content/silence.oga, 0)
ERR: [64004:26895:0228/112957.948370:ERROR:audio_renderer_impl.cc(842)] SetPlaybackRate(0)
ERR: [64002:33539:0228/112957.949308:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 0.016ms
ERR: [64002:33539:0228/112957.949369:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData
ERR: [64002:33539:0228/112957.949412:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.024ms
ERR: [64002:33539:0228/112957.949448:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 23.074ms
ERR: [64004:26895:0228/112957.949489:ERROR:audio_renderer_impl.cc(842)] SetPlaybackRate(1)
ERR: [64002:33539:0228/113007.973750:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001.2ms   <---- !!! 10s gap !!!
ERR: [64002:33539:0228/113007.973877:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData
ERR: [64002:33539:0228/113007.973962:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.054ms
ERR: [64002:33539:0228/113007.974057:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 5.905ms
...

The render thread is the only thread that still seems to fire regular tasks during this time. AudioThread is a plain base::Thread, but for whatever reason isn't getting any scheduled time despite posting a delayed task for 23.074ms into the future. The fact that the gap is pretty close to precisely 10s is also really weird; it feels like there's a timeout happening somewhere. One reference to a 10s timeout is in the burst mode for the sampling profiler, but I'm not sure how that would be related. I'm a bit stumped. It's starting to look like there's something going wrong inside the MessagePump for macOS; I can't think of any other reason why a posted delayed task would not be scheduled. +gab from scheduler team.
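As a side note on what the fake_audio_worker numbers above mean: "time since scheduled" is the drift between when the delayed read task was supposed to run and when it actually ran. A rough sketch of that kind of instrumentation, with illustrative class and member names rather than the real fake_audio_worker.cc code:

// Illustrative sketch: measure PostDelayedTask scheduling drift.
void FakeWorker::ScheduleNextRead(base::TimeDelta delay) {
  expected_run_time_ = base::TimeTicks::Now() + delay;
  task_runner_->PostDelayedTask(
      FROM_HERE,
      base::BindOnce(&FakeWorker::DoRead, weak_factory_.GetWeakPtr()), delay);
}

void FakeWorker::DoRead() {
  // On a healthy thread this drift is a fraction of a millisecond; during the
  // hang it is ~10,000ms even though the requested delay was ~23ms.
  base::TimeDelta drift = base::TimeTicks::Now() - expected_run_time_;
  LOG(ERROR) << "DoRead, time since scheduled: " << drift.InMillisecondsF()
             << "ms";
  // ... deliver the audio callback, then schedule the next read.
}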
Feb 28 2018
Repro command w/ ToT:

./blink/tools/run_layout_tests.py --no-retry-failures media/media-ended.html --timing --verbose --additional-driver-flag=--disable-audio-output --driver-logging --repeat-each=11

The 11th run always times out for me, even if you add the test to SlowTests.
Feb 28 2018
Grabbed a sample from ActivityMonitor, but it doesn't really show anything useful AFAICT; threads are sitting in mach_msg_trap(), but it's unclear from the trace for how long.
Feb 28 2018
So at some point mach_msg() used in WaitableEventMac starts taking seconds to return:

ERR: [68978:33795:0228/142346.712934:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 1.45
ERR: [68978:33795:0228/142346.713006:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData
ERR: [68978:33795:0228/142346.713052:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.031
ERR: [68978:33795:0228/142346.713100:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 21.623
ERR: [68980:9275:0228/142346.713115:ERROR:audio_renderer_impl.cc(889)] Render delay:0.023219 s prior_frames_skipped:0 frames_requested:1024
ERR: [68980:9275:0228/142346.713187:ERROR:audio_renderer_impl.cc(1043)] Render complete...
ERR: [68980:25107:0228/142350.562302:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5071.06
ERR: [68980:30467:0228/142350.562311:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.76
ERR: [68980:10247:0228/142350.562316:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.8
ERR: [68980:31235:0228/142350.562326:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.88
ERR: [68980:18691:0228/142350.562321:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8349.88
ERR: [68980:25603:0228/142350.562341:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8350.3
ERR: [68980:32515:0228/142350.562355:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8350.32
ERR: [68980:36111:0228/142351.510343:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1999.96
ERR: [68980:36111:0228/142353.511218:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.71
ERR: [68978:33283:0228/142354.937406:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40000.3
ERR: [68978:37891:0228/142354.940406:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2
ERR: [68980:36111:0228/142355.511997:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.67
ERR: [68978:33795:0228/142356.735210:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10022.1
ERR: [68978:33795:0228/142356.735290:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10000.6

+rsesek who worked on this most recently for insight into why this might be...
Feb 28 2018
(units are ms)
Feb 28 2018
Now w/ the timeout value:

ERR: [69346:26895:0228/142728.281798:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1215.26, timeout was: 0
ERR: [69346:31747:0228/142729.875311:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.45, timeout was: 30000
ERR: [69346:32003:0228/142729.875310:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.57, timeout was: 30000
ERR: [69346:6407:0228/142729.875304:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5091.63, timeout was: 30000
ERR: [69346:31491:0228/142729.875326:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.66, timeout was: 30000
ERR: [69346:33283:0228/142729.875333:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8353.68, timeout was: 30000
ERR: [69346:13315:0228/142729.875340:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8354.13, timeout was: 30000
ERR: [69346:38147:0228/142729.875359:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8354.17, timeout was: 30000
ERR: [69346:26895:0228/142730.283367:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.43, timeout was: 0
ERR: [69346:26895:0228/142732.284496:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.97, timeout was: 0
ERR: [69344:31747:0228/142734.189675:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000
ERR: [69344:38659:0228/142734.199846:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40000.9, timeout was: 40000
ERR: [69346:26895:0228/142734.286741:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2002.09, timeout was: 0
ERR: [69346:26895:0228/142736.289230:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2002.32, timeout was: 0
ERR: [69344:34819:0228/142737.089643:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10023.2, timeout was: 23
ERR: [69344:34819:0228/142737.089797:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001
ERR: [69344:34819:0228/142737.089856:ERROR:fake_audio_output_stream.cc(67)] CallOnMoreData
ERR: [69344:34819:0228/142737.089942:ERROR:audio_sync_reader.cc(285)] WaitUntilDataIsReady: 0.054
ERR: [69344:34819:0228/142737.090094:ERROR:fake_audio_worker.cc(157)] DoRead, next delay: 6.072
ERR: [69346:34851:0228/142737.090100:ERROR:audio_renderer_impl.cc(889)] Render delay:0.023219 s prior_frames_skipped:0 frames_requested:1024
ERR: [69346:34851:0228/142737.090172:ERROR:audio_renderer_impl.cc(1043)] Render complete...
ERR: [69346:26895:0228/142738.290566:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1200.3, timeout was: 0
ERR: [69346:26895:0228/142740.291644:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.91, timeout was: 0
ERR: [69346:26895:0228/142742.295718:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2003.98, timeout was: 0
ERR: [69346:26895:0228/142744.299149:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2003.27, timeout was: 0
ERR: [69346:26895:0228/142746.303755:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2004.42, timeout was: 0
ERR: [69344:34819:0228/142747.097306:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10007.2, timeout was: 7

Will print the thread ids next.
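A note on where those "mach_msg took forever" lines come from: WaitableEvent on Mac waits by receiving a Mach message, so the debugging here amounts to timing that receive and logging when it overshoots the requested timeout. A rough sketch of that kind of instrumentation for the timed case; the message setup, threshold, and log text are illustrative, not the actual waitable_event_mac.cc change:

// Illustrative sketch: time a Mach message receive and log large overshoots.
#include <mach/mach.h>
#include "base/logging.h"
#include "base/time/time.h"

kern_return_t TimedReceive(mach_port_t port, mach_msg_timeout_t timeout_ms) {
  mach_msg_empty_rcv_t msg{};
  const base::TimeTicks start = base::TimeTicks::Now();
  kern_return_t kr =
      mach_msg(&msg.header, MACH_RCV_MSG | MACH_RCV_TIMEOUT, /*send_size=*/0,
               sizeof(msg), port, timeout_ms, MACH_PORT_NULL);
  const base::TimeDelta elapsed = base::TimeTicks::Now() - start;
  // A 23ms timeout that returns after ~10,000ms is the smoking gun here.
  if (elapsed > base::TimeDelta::FromMilliseconds(timeout_ms) +
                    base::TimeDelta::FromSeconds(1)) {
    LOG(ERROR) << "mach_msg took forever!!! - " << elapsed.InMillisecondsF()
               << ", timeout was: " << timeout_ms;
  }
  return kr;
}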
Feb 28 2018
Can you repro if you #if 0 out the entire impl of this method? https://cs.chromium.org/chromium/src/base/message_loop/message_pump_default.cc?q=message_pump_default.cc&sq=package:chromium&dr&l=89
Feb 28 2018
ERR: [69507:26639:0228/143232.908400:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1232.7, timeout was: 0: tid=26639
ERR: [69507:12803:0228/143234.485225:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.55, timeout was: 30000: tid=12803
ERR: [69507:34051:0228/143234.485225:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 5070.74, timeout was: 30000: tid=34051
ERR: [69507:33539:0228/143234.485235:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.65, timeout was: 30000: tid=33539
ERR: [69507:35075:0228/143234.485242:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.7, timeout was: 30000: tid=35075
ERR: [69507:32003:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8377.1, timeout was: 30000: tid=32003
ERR: [69507:6407:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.73, timeout was: 30000: tid=6407
ERR: [69507:37635:0228/143234.485251:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 8376.72, timeout was: 30000: tid=37635
ERR: [69507:26639:0228/143234.913135:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 1994.64, timeout was: 0: tid=26639
ERR: [69507:26639:0228/143236.914781:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.48, timeout was: 0: tid=26639
ERR: [69505:31235:0228/143238.790479:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000: tid=31235
ERR: [69505:26115:0228/143238.795351:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 40001.2, timeout was: 40000: tid=26115
ERR: [69507:26639:0228/143238.915973:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2001.03, timeout was: 0: tid=26639
ERR: [69507:26639:0228/143240.917021:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 2000.85, timeout was: 0: tid=26639
ERR: [69505:32259:0228/143241.697752:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10022.3, timeout was: 22: tid=32259   <--- audio thread
ERR: [69505:32259:0228/143241.697910:ERROR:fake_audio_worker.cc(138)] DoRead, time since scheduled: 10001.4: tid=32259
Feb 28 2018
Yes, it still repros with that section disabled.
Mar 1 2018
rsesek@: any advice on figuring out what mach_msg() may be blocked on? I searched for mach_msg debugging and got https://robert.sesek.com/2012/1/debugging_mach_ports.html, so you seem like the right person to ask :p
Mar 1 2018
More debugging notes:
- Forcing UseSlowWatchList() to true had no effect.
- Forcing WaitMany() to use PORT_SET or DISPATCH had no effect.
Mar 1 2018
I noticed we pair mach_port_construct() with mach_port_deallocate() rather than mach_port_destruct(), but changing that doesn't seem to make anything better.
Mar 1 2018
Might that be somehow related to Issue 816483?
Mar 1 2018
Re #18: Not to be glib, but it's blocked waiting for a message from a call to Signal() :p. Is the WaitableEvent that's blocked part of a MessagePump/TaskScheduler worker implementation, or is it a dedicated synchronization primitive?

Re #19: Thanks for checking those. The watch list shouldn't have an impact here because it's only for WaitableEventWatcher, which is used in only a couple of places.

Re #20: The API usage is correct. The Mach functions are a bit clumsy to use, and the names aren't great.

I'm a bit skeptical of this being an issue in WaitableEvent itself -- we've been shipping this implementation for several milestones. I think it's more likely a synchronization issue in the caller of WaitableEvent, or Issue 816483 mentioned in #21 sounds plausible as well.
Mar 1 2018
I'm pretty sure it's part of a MessagePump; i.e., the timeout value for this line comes from PostDelayedTask:

ERR: [69344:34819:0228/142737.089643:ERROR:waitable_event_mac.cc(154)] mach_msg took forever!!! - 10023.2, timeout was: 23

I also don't think it's likely an issue with WaitableEvent, but perhaps something else is either exhausting Mach ports or destabilizing the Mach system as used by Chrome.

@olka, it's possible, but with --disable-audio-output this isn't a realtime audio thread; the worker thread is just a plain thread that isn't woken up for some reason.
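To make the MessagePump connection concrete, here is a simplified sketch of how a default pump turns the next delayed task's run time into a timed wait, which on Mac bottoms out in the mach_msg() receive shown above. This is an approximation of the behavior, not the actual message_pump_default.cc source:

// Approximation of a default message pump loop: a task posted with
// PostDelayedTask(FROM_HERE, task, 23ms) ends up as TimedWait(~23ms) on the
// pump's WaitableEvent, i.e. a mach_msg() receive with a ~23ms timeout.
void RunPumpLoop(base::MessagePump::Delegate* delegate,
                 base::WaitableEvent* wakeup_event) {
  for (;;) {
    base::TimeTicks next_run_time;
    bool more_work = delegate->DoWork();                    // immediate tasks
    more_work |= delegate->DoDelayedWork(&next_run_time);   // due delayed tasks
    if (more_work)
      continue;
    if (delegate->DoIdleWork())
      continue;
    if (next_run_time.is_null())
      wakeup_event->Wait();  // sleep until a new task arrives
    else
      wakeup_event->TimedWait(next_run_time - base::TimeTicks::Now());
  }
}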
Mar 1 2018
Another debugging note: I tried switching the message pump for the audio thread to UI or IO (since that changes the underlying pump) and didn't see a significant improvement, though the UI one seemed like it might be more responsive. I'll retest that today to confirm.
Mar 1 2018
Thanks for the tip on priority, olka@; I didn't think to check that. Increasing the thread priority to REALTIME_AUDIO for the fake audio thread resolves the issue. For some reason, as a normal-priority thread the fake audio worker was getting no scheduled time for up to 10s. I'll dig into exactly what normal priority means for our threads now.
Mar 1 2018
Experimenting a bit further, making PlatformThreadMac::SetRealtimeAudioPriority() a no-op causes the problem to return. Toggling various bits of that function on and off shows that any one of the three calls that set up the various thread policies is sufficient to resolve the problem. Most interesting is just setting the time constraints for the thread; this keeps a 0.5 priority but still ensures the audio thread is scheduled sufficiently. It doesn't seem reasonable for a normal-priority thread to be descheduled for 10s... so I wonder if we should be setting time constraints for all normal threads. +mark as another macOS expert.
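For reference, the time-constraint call under discussion looks roughly like the following. The period/computation/constraint values below are illustrative placeholders, not the values Chromium derives from the audio buffer size:

// Illustrative sketch: apply a Mach time-constraint (real-time) policy to the
// current thread so the scheduler guarantees it periodic CPU time.
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <pthread.h>
#include <cstdint>

bool SetTimeConstraintPolicyForCurrentThread() {
  mach_timebase_info_data_t timebase;
  mach_timebase_info(&timebase);
  // Convert milliseconds to Mach absolute time units.
  auto ms_to_abs = [&timebase](double ms) {
    return static_cast<uint32_t>(ms * 1e6 * timebase.denom / timebase.numer);
  };

  thread_time_constraint_policy_data_t policy;
  policy.period = ms_to_abs(10);      // expected interval between wakeups
  policy.computation = ms_to_abs(2);  // CPU time needed each period
  policy.constraint = ms_to_abs(5);   // deadline for that CPU time
  policy.preemptible = 1;

  kern_return_t kr = thread_policy_set(
      pthread_mach_thread_np(pthread_self()), THREAD_TIME_CONSTRAINT_POLICY,
      reinterpret_cast<thread_policy_t>(&policy),
      THREAD_TIME_CONSTRAINT_POLICY_COUNT);
  return kr == KERN_SUCCESS;
}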
Mar 1 2018
Interesting. It does seem odd that the thread is starving for 10s. Is this reproable only on a specific OS version, or are you seeing it across versions? And is it only occurring when there's a high level of parallelism with running tests (and other things on the system), or when it's standalone too?
Mar 1 2018
mac_chromium_rel_ng timed out on this test, so I think it's reproable on the bots too, but it's always possible it's something else there. Locally I'm running 10.13.3 on a 2017 MBP, and that's the only version I have access to. If you have a local checkout you can try a quick repro with the command line in c#10.
Mar 1 2018
It's occurring 100% of the time on the 11th run, nothing else going on in the system.
Mar 1 2018
Deleted the last comment since it was wrong, c#26 is correct. I just accidentally left an early return in while testing.
Mar 1 2018
You may want to check out `taskinfo --threads <pid>` which can dump the policies for all the threads for a given PID.
Mar 1 2018
Thanks for that. It seems that at some point the audio thread (along with almost every other thread) is demoted significantly in priority: https://www.diffchecker.com/Q6a8aRcJ

I.e., even CrBrowserMain drops to a priority of 4.

good:
  thread ID: 0x1f98a / 129418
  thread name: "CrBrowserMain"
  user/system time: 0.640887 s / 0.219802 s
  CPU usage (over last tick): 5%
  sched mode: timeshare
  policy: POLICY_TIMESHARE
    timeshare max priority: 63
    timeshare base priority: 46
    timeshare cur priority: 46
    timeshare depressed: NO, prio -1
  requested policy:

bad:
  thread ID: 0x1f98a / 129418
  thread name: "CrBrowserMain"
  user/system time: 0.888308 s / 0.305418 s
  CPU usage (over last tick): 0%
  sched mode: timeshare
  policy: POLICY_TIMESHARE
    timeshare max priority: 4
    timeshare base priority: 4
    timeshare cur priority: 4
    timeshare depressed: NO, prio -1
  requested policy:
Mar 1 2018
Despite the QoS policy being user-interactive, the UI thread is shunted to latency tier 5:

effective policy:
  eff thread qos: THREAD_QOS_USER_INTERACTIVE
  eff thread qos relprio: 0
  eff promotion qos: THREAD_QOS_USER_INTERACTIVE
  eff latency qos: LATENCY_QOS_TIER_5
  eff thruput qos: THROUGHPUT_QOS_TIER_0
  eff darwin BG: NO
  eff iotier: THROTTLE_LEVEL_TIER1 (STANDARD)
  eff other: bg-newsockets

I can't find information on what bg-newsockets is...
Mar 1 2018
Looks like the culprit is related to the energy-saving feature "App Nap": https://developer.apple.com/library/content/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/AppNap.html

Running:

defaults write NSGlobalDomain NSAppSleepDisabled -bool YES

resolves the issue on future runs...
Mar 1 2018
Interesting find. I thought we disabled our AppNap implementation, but I could be wrong. +lgrey
Mar 1 2018
https://cs.chromium.org/chromium/src/chrome/app/app-Info.plist?type=cs&q=AppNap&sq=package:chromium&l=353 is all I can find for that
Mar 1 2018
Adding SupportsAppNap <false/> to the .plist for content shell doesn't seem to help.
Mar 1 2018
It doesn't seem like App Nap requires any implementation from us; it is applied automatically per the docs: "If an app isn’t performing user-initiated work such as updating content on screen, playing music, or downloading a file, the system may put the app in App Nap." So this might only affect layout tests, which run a hidden content shell.
Mar 1 2018
We did disable renderer AppNap. Browser AppNap *is* on, per comment #37, since we didn't need to do anything magic to make it work there. I would find it surprising if an application became AppNap eligible with an active audio stream, but they're pretty opaque on how it works, so who knows? dalecurtis@ there's a toggleable "AppNap" column in Activity Monitor, so that can be a quick way to verify what process enters AppNap when. There's also a NSAppNap$1 DTrace provider.
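As an aside, if the test binary itself ever needed to opt out programmatically (rather than via a defaults key), the documented route is NSProcessInfo's activity API. A sketch in Objective-C++, purely illustrative and not something this bug landed:

// Illustrative sketch: hold an activity assertion so macOS will not put the
// process into App Nap while tests run.
#import <Foundation/Foundation.h>

static id<NSObject> g_no_nap_activity = nil;

void DisableAppNapForTesting() {
  g_no_nap_activity = [[NSProcessInfo processInfo]
      beginActivityWithOptions:NSActivityUserInitiated |
                               NSActivityLatencyCritical
                        reason:@"Running layout tests"];
}

void ReenableAppNap() {
  if (g_no_nap_activity) {
    [[NSProcessInfo processInfo] endActivity:g_no_nap_activity];
    g_no_nap_activity = nil;
  }
}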
Mar 1 2018
Disabling App Nap also seems to fix the original issue listed in this bug. So perhaps the only recommendation here is to ensure our macOS bots all disable it...
Mar 1 2018
@lgrey, with the repro case in c#10 it's running with --disable-audio-output, which prevents any active audio streams. That said, the original issue in c#0 plays a stream with audio, though that stream is toggled on and off. Perhaps if no stream is active at a given moment, creating a new one is insufficient to take the process out of the nap.
Mar 1 2018
Makes sense. My testing with QOS level 5 reveals that 10 seconds is the magic max threshold for timer ticks, as you saw above. In that gap, I don't think there's anything the backgrounded process itself can do.
Mar 1 2018
@dpranke, do you know if our macOS bots disable AppNap?
Mar 1 2018
Do we want to disable App Nap on all the bots, though? Presumably we'd lose coverage of the real-world configuration. Perhaps the wrapper script for running layout tests could temporarily disable it and then reset it to its previous value when it's done.
Mar 1 2018
+dba ... @dba - any idea if we disable AppNap on the swarming bots?
Mar 1 2018
I think any long-running test that doesn't present UI is going to trigger App Nap, which describes pretty much all of our tests. This is all speculation though, so we could try disabling it, see if anything changes, and flip it back if not. What coverage would we be getting with it enabled, though?
Mar 1 2018
I think it matters most on the perf bots, where we're trying to get an accurate picture of performance on user-like systems. I suppose for our test binaries it doesn't matter. lgrey@ WDYT?
Mar 1 2018
Do perf bots actively test switching to other applications? If not, I think it's not really a huge deal.
Mar 2 2018
Can you elaborate lgrey? I'm seeing AppNap kick in even when nothing else but the layout test runner is running.
Mar 2 2018
I mean I think it would be fine for the perf bots. Since you mentioned upthread that this only happens when the content shell is hidden, I would think the presence of a visible window prevents it from kicking in. If not, I think we'd already be seeing some issues in browser test waits.
Mar 5 2018
ping @dba, any update on whether our bots disable app nap?
Mar 5 2018
#52 - No, it looks like it's not disabled.
Mar 5 2018
Thanks. It seems reasonable to try disabling it for some time to see if we notice improvements in cycle time for build and test. @dpranke, is there experiment infrastructure for this? Or just flip and graph the results on plx?
Mar 5 2018
> @dpranke, is there experiment infrastructure for this?
> Or just flip and graph the results on plx?

No, there's no experiment infrastructure for this that I'd recommend. Just flip and watch :).
Mar 5 2018
Any tips on how one might go about executing a CLI command across the Mac fleet? :) (See c#35.) Should I just file an infrastructure ticket? As rsesek@ notes, we probably shouldn't disable it on the Telemetry fleet.
Mar 5 2018
File a Labs ticket (go/infrasys-bug) and ask for them to deal with this via Puppet.
Mar 5 2018
Mar 30 2018
To close the loop on the fleet change: it didn't seem to have any effect. I'd like to come up with some way to prevent folks who run the layout test runner from being affected without their knowledge, but it seems a bit weird to have the layout test runner (all test runners?) execute the command in c#35 before and after each run, so maybe the best I can do here is a PSA to chromium-dev@.
Mar 30 2018
I wonder if adding:

[defaults setBool:YES forKey:@"NSAppSleepDisabled"]

to SetDefaultsToLayoutTestValues() (https://cs.chromium.org/chromium/src/content/shell/app/blink_test_platform_support_mac.mm?q=nsuserdefaults+file:content&sq=package:chromium&dr=C&l=25) would work instead of having to put it in the global prefs domain.
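Roughly, that suggestion amounts to the following inside the content shell's Mac platform support. The surrounding function body is assumed for illustration; only the quoted setBool: call comes from this thread:

// content/shell/app/blink_test_platform_support_mac.mm (sketch).
void SetDefaultsToLayoutTestValues() {
  NSUserDefaults* defaults = [NSUserDefaults standardUserDefaults];
  // ... existing layout-test default overrides ...

  // Opt this process out of App Nap so hidden content shells keep normal
  // timer and thread scheduling during long test runs.
  [defaults setBool:YES forKey:@"NSAppSleepDisabled"];
}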
Mar 30 2018
Good idea, I'll try that when I have my mac laptop next.
Apr 4 2018
The NextAction date has arrived: 2018-04-04
Apr 4 2018
Nice, that worked! https://chromium-review.googlesource.com/c/chromium/src/+/996483
Apr 5 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/de5d56c9ab59e532f61deb09937315b0346be021

commit de5d56c9ab59e532f61deb09937315b0346be021
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Thu Apr 05 15:58:03 2018

    Disable AppNap to avoid timeouts when running blink tests.

    Layout tests do not always meet the requirements to avoid being
    "napped", in which case test timeouts and other odd behavior may
    occur; this is most prevalent with large repeat values and a small
    number of tests. This sets NSAppSleepDisabled to YES for blink tests.

    BUG=811525
    TEST=manual
    Change-Id: I80ae1aeb0f5577f1977b3685e98655ea1253e3e6
    Reviewed-on: https://chromium-review.googlesource.com/996483
    Reviewed-by: Robert Sesek <rsesek@chromium.org>
    Reviewed-by: Mike West <mkwst@chromium.org>
    Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#548438}

[modify] https://crrev.com/de5d56c9ab59e532f61deb09937315b0346be021/content/shell/app/blink_test_platform_support_mac.mm
Apr 5 2018
"Fixed" - thanks for the suggestion Robert!
Apr 6 2018
Nice diagnosis! Could this explain flakiness in layout tests on other platforms as well? Should we bring up a dummy window during those tests to fool OSes that try to be smart about our visibility?
Apr 6 2018
I guess it's possible, but I don't think so. I haven't seen such aggressive throttling on other platforms. I think developers are most at risk for triggering this in local runs; hence the patch we landed. In issue 818883 we flipped this on for the macOS bots and didn't see any change in cycle time. That said, I didn't look at flakiness rates over that time period. If it's easy to add it's always worth an experiment though.
Apr 6 2018
Fair enough. Thinking about it some more, I'd assume infra already disables such OS features through policy whenever possible on our test fleet. Thanks!