webrtc multiple_peerconnections story is failing on all platforms
Issue description

The webrtc benchmark is failing because of the multiple_peerconnections story. This was previously run as a legacy benchmark, and was recently migrated to use TBMv2. [1] It looks like processing the traces takes a very long time. I tried reducing the test to 20s, but it doesn't seem to help. Should we disable the test in the meantime?

[1] https://chromium-review.googlesource.com/c/506017/
May 23 2017
cc'ed Charlie & Ehsan for the question about trace event limitations.
May 23 2017
I'm OOO today and tomorrow and cannot test the story locally, but I remember that running multiple_peerconnections took minutes, not seconds, to finish. The issue is that when the trace size is bigger than ~260 MB (which was the case for multiple_peerconnections), we have to use a third-party library to parse it instead of JSON.parse, and the library we are using is slow.
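For illustration, a minimal Python sketch of the size-threshold fallback described above. The real importer is JavaScript code in catapult; the ~260 MB cutoff, the ijson streaming parser, and the function name here are stand-ins chosen for the example, not the actual implementation.

import json

import ijson  # third-party streaming JSON parser, used here only as an example

# Traces below the threshold can be parsed in one shot; larger ones are
# streamed event-by-event, which is much slower but avoids size limits.
MAX_ONE_SHOT_PARSE_BYTES = 260 * 1024 * 1024  # assumed ~260M threshold


def load_trace_events(path, size_bytes):
  if size_bytes <= MAX_ONE_SHOT_PARSE_BYTES:
    with open(path) as f:
      return json.load(f)['traceEvents']  # fast path: parse in one shot
  with open(path, 'rb') as f:
    # Slow path: stream each entry of the "traceEvents" array.
    return list(ijson.items(f, 'traceEvents.item'))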
May 23 2017
I'm seeing this line:
(INFO) 2017-05-22 20:59:48,277 trace_data.Serialize:189 Trace sizes in bytes: {'traceEvents': 1100857148, 'telemetry': 39008, 'tabIds': 40}
If I'm reading that right, it's issuing 1.1 *billion* trace events during the 45s trace. Normal trace events vary from about 80 to 120 bytes, which means that the trace is probably about 80-120GB, which is just way too big. We'd have to reduce the runtime to fractions of a second to make the test pass with the current trace event issuance rate.
May 23 2017
Why are there so many trace events in 40s? Are they all top-level message loop tasks? If so, this is a battery-draining problem that should be fixed; otherwise users may be having a really bad time with this, I think :)
May 23 2017
Randy: can you help ehmaldonado@ with disabling multiple_peerconnections test for now?
May 23 2017
Attached is a screenshot of the most-issued trace events in another benchmark that ran for about 45s. You can see that webrtc issued the most events, followed by toplevel, but both still issue significantly fewer events in 45s than we're seeing in the multiple_peerconnections story.

I tried to reproduce this locally but am seeing some really strange behavior. If I run the story as-is on dev channel, the browser runs the benchmark for 45 seconds as expected, hangs for a few seconds as if it's downloading the trace, then continues running (indefinitely?). I remember SSHing into a Windows bot earlier and seeing this test running, and thinking it was strange that we'd allow a single user story to run for over 45 minutes (it was definitely minutes, not seconds). You can see the failure at https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf/633 by searching for "21360s".

In a completely baffling turn of events, when I change this line https://cs.chromium.org/chromium/src/tools/perf/page_sets/webrtc_cases.py?type=cs&q=multiplepeerconnections&sq=package:chromium&l=119 to "action_runner.Wait(2)", the test runs for about 49s and exits normally. Opening up the trace file for this run (attached) takes about 11 seconds, and the trace is about 9MB. I have no idea what's happening the rest of the time - maybe streaming the trace back from Chrome? The trace only seems to cover a period of about 4.5s, 2s of which is waiting, so... it looks like the story is doing the right thing.

My question is: how much of this is Telemetry being slow at streaming the trace back versus trace viewer being slow at importing the trace and computing the metric? I'm going to try to determine that now.
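For reference, a rough sketch of what the multiple_peerconnections story boils down to. The class name, base class, and the startTest() entry point are assumptions, but action_runner.Wait() is the call being changed from 45 to 2 in the experiment above.

from page_sets import webrtc_cases  # assumed location of the WebrtcPage base class


class MultiplePeerConnectionsSketch(webrtc_cases.WebrtcPage):
  """Illustrative only; not the actual story definition."""

  def RunPageInteractions(self, action_runner):
    with action_runner.CreateInteraction('Action_Create_PeerConnections'):
      # Kick off the page's JS test, then keep tracing while it runs.
      action_runner.ExecuteJavaScript('startTest()')
      action_runner.Wait(45)  # dropping this to Wait(2) made the run finish normally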
May 23 2017
Attached is the list of trace events issued in the 4.5s, 9MB trace. It looks like webrtc alone is issuing about 287,000 trace events per second (the 2 seconds while webrtc is active account for the vast majority of the trace events). Assuming that each trace event is about 80-120 bytes, that's about 14MB per second. That means I'd expect a 45-second trace to be about 14 * 45 = 630MB... which is a far cry from the 80 to 120GB that we're seeing. Still not sure what's going on.
May 23 2017
I tried commenting out lines 30 and 38 here https://cs.chromium.org/chromium/src/tools/perf/benchmarks/webrtc.py?l=30 and running the benchmark. It produces about 200 million trace events, finishes after 5 minutes, and then crashes after succeeding (see logs attached). The logs go like this:

[ RUN      ] multiple_peerconnections
...
(INFO) 2017-05-23 21:42:22,015 trace_data.Serialize:189 Trace sizes in bytes: {'traceEvents': 196959210, 'telemetry': 54837, 'tabIds': 40}
[ RUN      ] /tmp/tmpokw0GT.html
[       OK ] /tmp/tmpokw0GT.html (39860 ms)
...
*browser crashes*
...
[       OK ] multiple_peerconnections (291884 ms)
...
[  PASSED  ] 2 tests.
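For context, a hedged sketch of the kind of benchmark configuration being edited here; the exact class, category list, and metric name in webrtc.py may differ, but these are the two knobs (the "webrtc" trace category and the timeline-based metric) that commenting out lines 30 and 38 would remove.

from core import perf_benchmark
from telemetry.timeline import chrome_trace_category_filter
from telemetry.web_perf import timeline_based_measurement


class WebrtcBenchmarkSketch(perf_benchmark.PerfBenchmark):
  """Illustrative TBMv2 setup; not the actual webrtc.py."""

  def CreateTimelineBasedMeasurementOptions(self):
    # Roughly what line ~30 does: enable the 'webrtc' category for tracing.
    category_filter = chrome_trace_category_filter.ChromeTraceCategoryFilter(
        filter_string='toplevel,webrtc')
    options = timeline_based_measurement.Options(category_filter)
    # Roughly what line ~38 does: the metric computed from those events
    # (metric name is an assumption).
    options.SetTimelineBasedMetrics(['webrtcRenderingMetric'])
    return options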
May 23 2017
We could try disabling the webrtc metric instead, and record the "WebMediaPlayerMS::UpdateCurrentFrame" event (the only one we use for the webrtc rendering metric) in another category. We'd only need to update a couple of lines here: https://cs.chromium.org/chromium/src/content/renderer/media/webmediaplayer_ms_compositor.cc?type=cs&q=WebMediaPlayerMS::UpdateCurrentFrame&l=273
May 23 2017
If "WebMediaPlayerMS::UpdateCurrentFrame" event is the problem, you may want to rethink the design of emitting trace event & how to compute the webrtc metrics.
May 23 2017
It doesn't seem to be the problem. It is not emitted too often.
May 23 2017
CL to disable the test is up at https://codereview.chromium.org/2899123002/
May 23 2017
I actually made a mistake in the above calculations: the 80 to 120GB trace that we're seeing is 80 to 120GB uncompressed, whereas the 9MB trace is compressed. That 2s-wait, 9MB trace was actually more like 60MB unzipped. I went and collected a trace with a 4s wait time and broke it apart. It looks like a 4s trace has about 800,000 events and a size of 122MB unzipped. That makes sense and would mean that each trace event averages about 150 bytes, which is about right.

I don't have any insightful help here, unfortunately, other than that for the number of events you're recording, the story is far too long. I've honestly never seen a benchmark that causes Chrome to work as hard as this one: in just 4s of waiting, there are 50,000 MessageLoop::RunTask events. You have a few options:

- Reduce the workload in the benchmark. One possible way is to reduce the number of peer connections.
- Reduce the amount that you're tracing. Getting rid of the "webrtc" category would reduce the number of trace events to about 1/5 of what it currently is.
- Remove some of the "webrtc" trace events from Chromium, or move them to debug.webrtc. I think that, in the long run, this is probably the right thing to do: we really don't like to issue as many trace events for a non-debug category as webrtc currently does.
- Reduce the amount of time that the benchmark is active for. Given the amount of signal that you're getting for these events, it's pretty unlikely that you need a full 45s worth of data.
May 23 2017
Yes, I think removing some of the "webrtc" trace events from Chromium, or moving them to debug.webrtc, would be the best option. I'd prefer to disable the WebRTC rendering metric rather than the page test for now. I'll discuss next steps with my team.
May 23 2017
Thanks for your help. :)
May 23 2017
I'll hold off on committing that CL disabling multiple_peerconnections then, unless you ping me and tell me to land it.
May 23 2017
Another kind of annoying quirk to keep in mind: if your ultimate solution is to decrease the number of trace events that webrtc issues, then you'll have to disable the reference build for a while. The reference build usually lags 1-2 versions behind tip of tree and, although tip of tree will start working, the reference build will be running with the old code that still issues too many trace events and takes far too long to process those trace events.
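For completeness, a minimal sketch of how a Telemetry benchmark was typically kept off the reference build at the time; the decorator-based mechanism and the 'reference' tag are assumptions about the Telemetry API of that era, not the exact change that would be made.

from core import perf_benchmark
from telemetry import benchmark


@benchmark.Disabled('reference')  # assumed tag: skip the lagging reference build
class WebrtcMultiplePeerconnectionsSketch(perf_benchmark.PerfBenchmark):
  """Illustrative only; shows where the disable annotation would live."""

  @classmethod
  def Name(cls):
    return 'webrtc'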
May 23 2017
Randy: please go ahead with disabling the test for now. Once ehmaldonado@ decides to disable the WebRTC rendering metric, we can always revert your disable CL. Right now that test is not producing data, yet it keeps making the waterfall red & makes the sheriffs' shifts harder.
May 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/480e803afdf8b9a1b0623b8af3575c244ddd6f3b

commit 480e803afdf8b9a1b0623b8af3575c244ddd6f3b
Author: rnephew <rnephew@chromium.org>
Date: Wed May 24 02:23:54 2017

[Telemetry] Disable webrtc.multiple_peerconnections story.

It is failing on all platforms.

BUG=725502
Review-Url: https://codereview.chromium.org/2899123002
Cr-Commit-Position: refs/heads/master@{#474129}

[modify] https://crrev.com/480e803afdf8b9a1b0623b8af3575c244ddd6f3b/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/480e803afdf8b9a1b0623b8af3575c244ddd6f3b/tools/perf/page_sets/webrtc_cases.py
May 24 2017
The CL in #21 seems to break 6 builders. Please see issue 725796 for details.
May 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a1db6b7d87922f09904037e82447170cc97c0a2e

commit a1db6b7d87922f09904037e82447170cc97c0a2e
Author: ehmaldonado <ehmaldonado@chromium.org>
Date: Wed May 24 07:39:44 2017

[Telemetry] Disable webrtc.multiple_peerconnections story.

It is failing on all platforms.
The previous CL (https://codereview.chromium.org/2899123002) caused builders to break.

BUG=725502, 725796

# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
TBR=rnephew@chromium.org, nednguyen@google.com, vitaliii@chromium.org

Review-Url: https://codereview.chromium.org/2896253004
Cr-Commit-Position: refs/heads/master@{#474198}

[modify] https://crrev.com/a1db6b7d87922f09904037e82447170cc97c0a2e/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/a1db6b7d87922f09904037e82447170cc97c0a2e/tools/perf/page_sets/webrtc_cases.py
May 25 2017
The NextAction date has arrived: 2017-05-25
May 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e

commit f9fab9349e8f92f5e5280fb483b7c5b04f385a1e
Author: rnephew <rnephew@chromium.org>
Date: Thu May 25 17:15:09 2017

[Telemetry] Change story disabling in webrtc pageset to StoryExpectations.

This requires https://codereview.chromium.org/2903023002/ to land.

BUG=725502
Review-Url: https://codereview.chromium.org/2900343003
Cr-Commit-Position: refs/heads/master@{#474683}

[modify] https://crrev.com/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e/tools/perf/page_sets/webrtc_cases.py
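A hedged sketch of the StoryExpectations mechanism this CL moves to; the class name is made up and the exact API may differ slightly, but the shape (disable a named story, on a set of platform conditions, with a bug reference) is the idea.

from telemetry.story import expectations


class WebrtcExpectationsSketch(expectations.StoryExpectations):
  """Illustrative only; not the actual expectations class."""

  def SetExpectations(self):
    self.DisableStory('multiple_peerconnections',
                      [expectations.ALL],  # disabled on all platforms
                      'crbug.com/725502')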
May 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/09f367096ac4724bd0a14d86a346317e4cc68500

commit 09f367096ac4724bd0a14d86a346317e4cc68500
Author: ehmaldonado <ehmaldonado@chromium.org>
Date: Fri May 26 23:40:17 2017

Re-enable the multiple_peerconnections story and disable webrtc traces and metrics.

BUG=chromium:725502
Review-Url: https://codereview.chromium.org/2909653003
Cr-Commit-Position: refs/heads/master@{#475190}

[modify] https://crrev.com/09f367096ac4724bd0a14d86a346317e4cc68500/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/09f367096ac4724bd0a14d86a346317e4cc68500/tools/perf/page_sets/webrtc_cases.py