
Issue 725502

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-05-25
OS: ----
Pri: 2
Type: Bug




webrtc multiple_peerconnections story is failing on all platforms

Project Member Reported by ehmaldonado@chromium.org, May 23 2017

Issue description

The webrtc benchmark is failing because of the multiple_peerconnections story.

This was previously run as a legacy benchmark, and was recently migrated to use TBMv2. [1]

It looks like processing the traces takes a very long time. I tried reducing the test to 20s, but it doesn't seem to help.

Should we disable the test in the meantime?

[1] https://chromium-review.googlesource.com/c/506017/
 
What if you lower it even more, to like 5 or 10 seconds?

rnephew@: what is a reasonable number of traceEvents that can be handled? We were seeing numbers like 13 million and 210 million events...
Cc: charliea@chromium.org chiniforooshan@chromium.org
Cc'ed Charlie & Ehsan for the question about trace event limitations.
I'm OOO today and tomorrow and cannot test the story locally, but I remember that running multiple_peerconnections took minutes, not seconds, to finish.

The issue is that when the trace size is bigger than ~260MB (which was the case for multiple_peerconnections), we have to use a third-party library to parse it instead of JSON.parse, and the library we are using is slow.
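Purely as an illustration of the pattern being described (and not the actual trace-processing code, which lives in trace-viewer's JavaScript), a Python sketch of the two-path parsing strategy might look like this. The threshold constant and the ijson fallback are assumptions made for the sketch:

  import json
  import os

  # Approximate size above which a single fast parse is abandoned; the ~260MB
  # figure comes from the comment above and is not an exact constant.
  LARGE_TRACE_THRESHOLD_BYTES = 260 * 1024 * 1024

  def load_trace(path):
      """Illustrative two-path loader: fast whole-file parse for small traces,
      slower streaming parse for very large ones."""
      if os.path.getsize(path) < LARGE_TRACE_THRESHOLD_BYTES:
          with open(path) as f:
              return json.load(f)
      # Large traces: a streaming parser avoids holding one giant string in
      # memory, at the cost of being much slower per byte.
      import ijson  # third-party streaming JSON parser, used only as an example
      with open(path, 'rb') as f:
          return {'traceEvents': list(ijson.items(f, 'traceEvents.item'))}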
I'm seeing this line:

(INFO) 2017-05-22 20:59:48,277 trace_data.Serialize:189  Trace sizes in bytes: {'traceEvents': 1100857148, 'telemetry': 39008, 'tabIds': 40}

If I'm reading that right, it's issuing 1.1 *billion* trace events during the 45s trace. Normal trace events vary from about 80 to 120 bytes, which means that the trace is probably about 80-120GB, which is just way too big. We'd have to reduce the runtime to fractions of a second to make the test pass with the current trace event issuance rate.
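Writing out that back-of-the-envelope estimate (this treats the 1,100,857,148 figure as an event count, as the comment does, and assumes 80-120 bytes per event):

  trace_events = 1_100_857_148    # 'traceEvents' figure from the log line above
  low_gb = trace_events * 80 / 1e9
  high_gb = trace_events * 120 / 1e9
  print(f'{low_gb:.0f}-{high_gb:.0f} GB')   # roughly 88-132 GB, the ballpark quoted above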
Why are there so many trace events there in 40s? 

Are they all top-level message loop tasks? If so, this is a case of battery draining that should be fixed; otherwise users may be having a really bad time with this, I think :)
Randy: can you help ehmaldonado@ with disabling multiple_peerconnections test for now?
Attached is a screenshot of the most issued trace events in another benchmark that ran for about 45s. You can see that webrtc issued the most events, followed by toplevel, but both still issue significantly fewer events in 45s than we're seeing in the multiple_peerconnections story.

I tried to reproduce this locally but am seeing some really strange behavior. If I run the story as-is on dev channel, the browser runs the benchmark for 45 seconds as expected, hangs for a few seconds as if it's downloading the trace, then continues running (indefinitely?). I remember SSHing into a Windows bot earlier and seeing this test running, and thinking it was strange that we'd allow a single user story to run for over 45 minutes (it was definitely minutes, not seconds). You can see the failure at https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf/633 by searching for "21360s".

In a completely baffling turn of events, when I change this line https://cs.chromium.org/chromium/src/tools/perf/page_sets/webrtc_cases.py?type=cs&q=multiplepeerconnections&sq=package:chromium&l=119 to "action_runner.Wait(2)", the test runs for about 49s and exits normally. Opening up the trace file for this run (attached) takes about 11 seconds, and the trace is about 9MB. I have no idea what's happening for the rest of the time - maybe streaming the trace back from Chrome? The trace only seems to cover a period of about 4.5s, 2s of which is waiting, so... it looks like the story is doing the right thing.
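For context, roughly what that change looks like in the page set. This is a from-memory sketch of the Telemetry story API, not the literal contents of webrtc_cases.py, and the startTest() hook is hypothetical:

  from telemetry.page import page as page_module


  class MultiplePeerConnectionsPage(page_module.Page):
    """Sketch of the story whose wait duration is being reduced above."""

    def RunPageInteractions(self, action_runner):
      # Hypothetical hook that starts the peer connections on the test page.
      action_runner.ExecuteJavaScript('startTest();')
      # The story originally waited 45 seconds here; the experiment in this
      # comment drops the wait to 2 seconds.
      action_runner.Wait(2)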

My question is: how much of this is Telemetry being slow at streaming the trace back versus trace viewer being slow at importing the metric? I'm going to try to determine that now.
Attachment: TEdUBZqve70.png (100 KB)
Attachment: multiple_peerconnections02017-05-23_15-40-28.html (8.8 MB)
Summary: webrtc multiple_peerconnections story is failing on all platforms (was: webrtc benchmark is failing.)
Attached is the list of trace events issued in the 4.5s, 9MB trace. It looks like webrtc alone is issuing about 287,000 trace events per second (the 2 seconds while webrtc is active have the vast majority of the trace events). Assuming that each trace event is about 80-120 bytes, that's about 14MB per second. That means I'd expect a 45 second trace to be about 14 * 45 = 630MB... which is a far cry from the 80 to 120GB that we're seeing. Still not sure what's going on.
Attachment: trace_events.png (86.5 KB)
I tried commenting out lines 30 and 38 here https://cs.chromium.org/chromium/src/tools/perf/benchmarks/webrtc.py?l=30 and running the benchmark.

It produces about 200 million trace events, finishes after 5 minutes, and then crashes after succeeding (see log attached). The logs go like this:
[ RUN      ] multiple_peerconnections
...
(INFO) 2017-05-23 21:42:22,015 trace_data.Serialize:189  Trace sizes in bytes: {'traceEvents': 196959210, 'telemetry': 54837, 'tabIds': 40}
[ RUN      ] /tmp/tmpokw0GT.html
[       OK ] /tmp/tmpokw0GT.html (39860 ms)
...
*browser crashes*
...
[       OK ] multiple_peerconnections (291884 ms)
...
[  PASSED  ] 2 tests.

Attachment: log.txt (249 KB)
We could try disabling the webrtc metric instead, and recording the "WebMediaPlayerMS::UpdateCurrentFrame" event (the only one we use for the webrtc rendering metric) in another category. We'd only need to update a couple of lines here: https://cs.chromium.org/chromium/src/content/renderer/media/webmediaplayer_ms_compositor.cc?type=cs&q=WebMediaPlayerMS::UpdateCurrentFrame&l=273
If "WebMediaPlayerMS::UpdateCurrentFrame" event is the problem, you may want to rethink the design of emitting trace event & how to compute the webrtc metrics.
It doesn't seem to be the problem. It is not emitted too often.
CL to disable the test is up at https://codereview.chromium.org/2899123002/
I actually made a mistake in the above calculations: the 80 to 120GB trace that we're seeing is 80 to 120GB uncompressed, whereas the 9MB trace is compressed. That 2s-wait, 9MB trace was actually more like 60MB unzipped.

I went and collected a trace with a 4s wait time and broke it apart. It looks like a 4s trace has about 800,000 events and a size of 122MB unzipped. That makes sense and would mean that each trace event averages about 150 bytes, which is about right.
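Written out, that per-event average is just the quotient of those two numbers:

  events = 800_000            # events in the 4s trace described above
  unzipped_bytes = 122e6      # ~122MB unzipped
  print(unzipped_bytes / events)   # ~152 bytes per event, close to the ~150 figure above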

I don't have any insightful help here, unfortunately, other than that for the number of events you're recording, the story is far too long. I've honestly never seen a benchmark that causes Chrome to work as hard as this one: in just 4s of waiting, there are 50,000 MessageLoop::RunTask events. 

You have a few options:

- Reduce the workload in the benchmark. One possible way is to reduce the number of peer connections.
- Reduce the amount that you're tracing. Getting rid of the "webrtc" category would reduce the number of trace events to about 1/5 of what it currently is (see the sketch after this list).
- Remove some of the "webrtc" trace events from Chromium, or move them to debug.webrtc. I think that, in the long run, this is probably the right thing to do: we really don't like to issue as many trace events for a non-debug category as webrtc currently does.
- Reduce the amount of time that the benchmark is active for. Given the amount of signal that you're getting for these events, it's pretty unlikely that you need a full 45s worth of data.
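A rough sketch of what the Telemetry side of the second and fourth options could look like in the benchmark's TBMv2 configuration. The class name, the 'toplevel' filter string, and the cpuTimeMetric choice are illustrative assumptions, not the actual contents of tools/perf/benchmarks/webrtc.py, and the exact override name has varied between CreateTimelineBasedMeasurementOptions and CreateCoreTimelineBasedMeasurementOptions across Telemetry versions:

  from telemetry import benchmark
  from telemetry.timeline import chrome_trace_category_filter
  from telemetry.web_perf import timeline_based_measurement


  class WebrtcMultiplePeerconnections(benchmark.Benchmark):
    """Illustrative TBMv2 benchmark configuration."""

    def CreateCoreTimelineBasedMeasurementOptions(self):
      category_filter = chrome_trace_category_filter.ChromeTraceCategoryFilter(
          filter_string='toplevel')
      # Deliberately NOT adding the noisy 'webrtc' category:
      # category_filter.AddIncludedCategory('webrtc')
      options = timeline_based_measurement.Options(category_filter)
      # Likewise, leave the webrtc rendering metric out of the metric list.
      options.SetTimelineBasedMetrics(['cpuTimeMetric'])
      return options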
NextAction: 2017-05-25
Yes, I think removing some of the "webrtc" trace events from Chromium, or moving them to debug.webrtc, would be the best option.

I'd prefer to disable the WebRTC rendering metric rather than the page test for now. I'll discuss next steps with my team.
Thanks for your help. :)
I'll hold off on committing that CL disabling multiple_peerconnections then, unless you ping me and tell me to land it.
Another kind of annoying quirk to keep in mind: if your ultimate solution is to decrease the number of trace events that webrtc issues, then you'll have to disable the reference build for a while. The reference build usually lags 1-2 versions behind tip of tree and, although tip of tree will start working, the reference build will be running with the old code that still issues too many trace events and takes far too long to process those trace events.
Randy: please go ahead with disabling the test for now. Once ehmaldonado@ decides to disable the WebRTC rendering metric, we can always revert your disable CL. Right now that test is not producing data, yet it keeps making the waterfall red and making sheriffs' shifts harder.
Project Member Comment 21 by bugdroid1@chromium.org, May 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/480e803afdf8b9a1b0623b8af3575c244ddd6f3b

commit 480e803afdf8b9a1b0623b8af3575c244ddd6f3b
Author: rnephew <rnephew@chromium.org>
Date: Wed May 24 02:23:54 2017

[Telemetry] Disable webrtc.multiple_peerconnections story.

It is failing on all platforms.

BUG=725502

Review-Url: https://codereview.chromium.org/2899123002
Cr-Commit-Position: refs/heads/master@{#474129}

[modify] https://crrev.com/480e803afdf8b9a1b0623b8af3575c244ddd6f3b/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/480e803afdf8b9a1b0623b8af3575c244ddd6f3b/tools/perf/page_sets/webrtc_cases.py

The CL in #21 seems to have broken 6 builders. Please see issue 725796 for details.
Project Member Comment 23 by bugdroid1@chromium.org, May 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a1db6b7d87922f09904037e82447170cc97c0a2e

commit a1db6b7d87922f09904037e82447170cc97c0a2e
Author: ehmaldonado <ehmaldonado@chromium.org>
Date: Wed May 24 07:39:44 2017

[Telemetry] Disable webrtc.multiple_peerconnections story.

It is failing on all platforms.
The previous CL (https://codereview.chromium.org/2899123002) caused builders to break.

BUG=725502, 725796 
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
TBR=rnephew@chromium.org, nednguyen@google.com, vitaliii@chromium.org

Review-Url: https://codereview.chromium.org/2896253004
Cr-Commit-Position: refs/heads/master@{#474198}

[modify] https://crrev.com/a1db6b7d87922f09904037e82447170cc97c0a2e/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/a1db6b7d87922f09904037e82447170cc97c0a2e/tools/perf/page_sets/webrtc_cases.py

Cc: -charliea@chromium.org
The NextAction date has arrived: 2017-05-25
Project Member Comment 26 by bugdroid1@chromium.org, May 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e

commit f9fab9349e8f92f5e5280fb483b7c5b04f385a1e
Author: rnephew <rnephew@chromium.org>
Date: Thu May 25 17:15:09 2017

[Telemetry] Change story disabling in webrtc pageset to StoryExpectations.

This requires https://codereview.chromium.org/2903023002/ to land.

BUG=725502

Review-Url: https://codereview.chromium.org/2900343003
Cr-Commit-Position: refs/heads/master@{#474683}

[modify] https://crrev.com/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/f9fab9349e8f92f5e5280fb483b7c5b04f385a1e/tools/perf/page_sets/webrtc_cases.py

Project Member Comment 27 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/09f367096ac4724bd0a14d86a346317e4cc68500

commit 09f367096ac4724bd0a14d86a346317e4cc68500
Author: ehmaldonado <ehmaldonado@chromium.org>
Date: Fri May 26 23:40:17 2017

Re-enable the multiple_peerconnections story and disable webrtc traces and metrics.

BUG=chromium:725502

Review-Url: https://codereview.chromium.org/2909653003
Cr-Commit-Position: refs/heads/master@{#475190}

[modify] https://crrev.com/09f367096ac4724bd0a14d86a346317e4cc68500/tools/perf/benchmarks/webrtc.py
[modify] https://crrev.com/09f367096ac4724bd0a14d86a346317e4cc68500/tools/perf/page_sets/webrtc_cases.py

Components: Test>Telemetry
Components: -Speed>Telemetry
