
Issue 644134

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 3
Type: Feature




Network stack should be optimized to utilize the power of Streams API based fetching

Reported by justinbe...@gmail.com, Sep 5 2016

Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4

Example URL:

Steps to reproduce the problem:
1. Use the fetch Readable Streams API to stream a remote file
2. Observe CPU usage, compare to GET
3. Observe that the CPU usage is much higher than GET
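
The reproduction can be sketched as follows. This is a minimal stand-in, not the reporter's actual test page: the URL is a placeholder, and the pump() helper here only counts bytes so that any CPU cost comes from the streaming machinery itself.

```javascript
// Minimal sketch of the reproduction. pump() reads chunks in a loop and
// only counts bytes; no per-byte work is done on the data.
function pump(reader, state) {
  return reader.read().then(({ done, value }) => {
    if (done) return state.bytes;        // resolve with the total byte count
    state.bytes += value.byteLength;     // just counting, nothing else
    return pump(reader, state);
  });
}

// Browser usage (placeholder URL):
// fetch('https://example.test/large.bin')
//   .then(res => pump(res.body.getReader(), { bytes: 0 }))
//   .then(total => console.log('received', total, 'bytes'));
```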

What is the expected behavior?
Given that ReadableStream with browser fetch is supposed to eliminate memory and disk usage - which it does - I would expect the CPU and resource use to be less than XHR GET; however, it is much higher.

What went wrong?
Measuring over a 10 gigabit link, the maximum speed of readable streams on a 3.1 GHz Intel Core i7 is about 1 gigabit. At that rate the CPU is completely occupied across several Chrome Helper processes and in the kernel. Note that I am just streaming, not doing anything with the data! A profile shows almost all of the CPU time being spent in "(program)" and about 5% in JavaScript calls to code.

By contrast, Chrome XHR GET will get the file - and write it to disk while doing so - as quickly as the disk can write. On this MacBook Pro, that is over 1 gigabyte per second (over 10 gigabit).

Did this work before? No 

Chrome version: <Copy from: 'about:version'>  Channel: n/a
OS Version: OS X 10.11.3
Flash Version: 

The aim of "Readable Streams" is to reduce resource use and allow progressive loading of large files without storing them all to a blob and writing them to disk, as XHR GET does. In theory they should be faster than WebSockets too, since the data is not packed up into frames and so on.

However, my experience is that the current implementation is extremely CPU-heavy. Unfortunately, illustrating this is difficult without a high-speed link and a web server; perhaps a local web server would demonstrate the CPU usage.
 
Components: -Internals>Network Blink>Network>FetchAPI
Labels: -Pri-2 Performance Pri-3
Components: Blink>Network>StreamsAPI
Labels: Needs-Feedback
Thank you for reporting. Could you provide the script you used for measuring?
Hi,
 So I used this code:

https://googlechrome.github.io/samples/fetch-api/fetch-response-stream.html
stripped of any search or DOM update in pump(), fetching a binary from nginx, and measuring the speed the file streamed in.

I tried one fetch, and eight fetches in parallel; the total bandwidth was about the same. Someone else tried Windows (not OS X) and the result was similar. nginx CPU use was minimal; Chrome CPU use was near 100%.

XHR GET, URL get, or curl on the same payload file saturated the link.
Labels: -Needs-Feedback
Unfortunately, the pattern has a performance problem with the existing V8 implementation (unless V8 fixed it recently). Please read issue 477661 and https://github.com/domenic/streams-demo/issues/4. Doesn't your code consume much more memory than expected?
So I gather from issue 477661 that promises have a memory/stack issue.

I can confirm that the memory footprint for the example page on GitHub rises as it searches: starting at less than 50 MB, it is 80 MB after a few minutes of streaming at a slow 10 megabit speed. If it were left to stream, perhaps it would eventually run out of memory? After 1 minute of consumed CPU time it is over 110 MB, which is double where it started.

In addition, CPU use while the tab is in focus is 40% (of one core), which is very high considering that all it is doing is scanning each incoming chunk for a small sequence of numbers, and the streaming rate is a slow 10 megabit.

I ran a JavaScript profile and the majority of CPU time is again spent in (program), not in any code.

thanks.
My point is that the pattern has a performance problem which consumes much more memory and CPU time (due to GC) than necessary. Can you measure the performance without the pattern?

function pump(reader) {
  return reader.read().then(result => {
    if (result.done) {
      // DONE!
      return;
    }
    // GET SOME DATA
    return pump(reader);
  });
}

has the problem while

function pump(reader) {
  function rec() {
    return reader.read().then(result => {
      if (result.done) {
        return;
      }
      // GET SOME DATA
      rec();
    });
  }
  rec();
  return reader.closed.then(() => {
    // DONE!
  });
}

doesn't.
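
For reference, a third shape (sketched here by the editor, not quoted from the thread) avoids both the growing promise chain of the first pump() and the recursion of the second by using an async function. It returns the total byte count so the behavior is observable:

```javascript
// Sketch: an async loop keeps only one pending promise at a time,
// so no chain builds up across reads.
async function pumpLoop(reader) {
  let bytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return bytes;
    bytes += value.byteLength;   // "GET SOME DATA" would go here
  }
}
```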
Also, taking the first example in issue 477661 and rewriting the promises to "remove the chain", my 4 gigabit stream test still blows out instantly to 500 MB of memory, and performance with the CPU flat out caps at about 1200 megabit max.

XHR GET performance is a perfectly reliable 4 gigabit, with some CPU still idle, despite having to write to temporary files and despite storing 120 MB file ranges into blobs.
Sorry our comments crossed.

Yes I re-wrote to remove the chain with the new Promise construct.

memory use: still blew up from 50 MB to 500 MB. CPU usage: still flat out. Speed maybe slightly improved; I did see 1200 megabit with the original promise chain, but for no apparent reason it would sometimes drop to 700 megabit and stay there until I quit and restarted the browser.

Verified back to back with the same browser instance that XHR GET can still run at full speed with idle CPU. Streams cannot.
Interesting. Sorry for asking again, could you share your measurement script, please? With the script we will be able to see where the performance problem comes from.

Fetch API + Streams is currently not very optimized. Feedback like yours is very valuable for optimizing the implementation.

Thanks,
Please wait a day; I'll post a jsfiddle that shows the issue.
thanks.
Hi,
So I've made a test program and discovered that XHR GET runs at a similar speed to readable streams if the XHR content type is not set to application/octet-stream and the response type is not set to blob.

So now I am thinking getReader({mode: 'byob'}) is what I should be pursuing, in case the problem is character-by-character processing or something similar in the default getReader() / ReadableStream.

Is there an obvious way to force a ReadableStream to operate as a byte stream, if that is indeed the issue? I've set the headers on the fetch request to include Content-Type, and cache: 'no-store', but it makes no speed improvement when reading the stream.

I don't want you to waste time looking at a test program if it is missing something important about ReadableStreams that makes them raw byte streams.
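
For context, the BYOB ("bring your own buffer") pattern being asked about looks roughly like the sketch below. This is an editor's illustration, not code from the thread, and as noted later in the thread, Chrome had not implemented mode: 'byob' at the time. The caller supplies the buffer, so each read lands in preallocated memory rather than a fresh allocation per chunk:

```javascript
// Sketch of a BYOB read loop. The reader fills the caller-supplied view;
// value is a view over the same backing buffer, which is recycled for
// the next read instead of allocating again.
function pumpByob(reader, view, onChunk) {
  return reader.read(view).then(({ done, value }) => {
    if (done) return;
    onChunk(value);   // the bytes received by this read
    return pumpByob(reader, new Uint8Array(value.buffer), onChunk);
  });
}

// Browser usage (hypothetical, since BYOB was unimplemented then):
// const reader = response.body.getReader({ mode: 'byob' });
// pumpByob(reader, new Uint8Array(64 * 1024), chunk => { /* use chunk */ });
```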
Thank you for the information.

That depends on what you want to do. If what you want is a blob, xhr + blob is the most efficient way. For some implementation reasons, fetch(url).then(res => res.blob()) is not as fast as xhr + blob right now.

If you want to operate on the bytes, you need to create a FileReader, and loading bytes from a blob takes some time.

I'm not sure if the BYOB reader solves your problem (as I don't know what it is). Right now it is not implemented.

I'm just curious which operations the CPU time is consumed by, given that "Profile shows all the cpu being spent in "(program)" and 5% in javascript calls to code."
It caught my attention that unless XHR returns a blob, it operates at the same speed as a ReadableStream (on my laptop, that is about 1 gigabit).

So I was thinking perhaps ReadableStream takes a code path through character processing similar to the one that afflicts XHR responseText, rather than the path taken by response with type blob.

As to why I want ReadableStream to be fast: using all of an i7 to do while(1) chunk = read(); at 1 gigabit is very inefficient. And yes, the JavaScript profiler is not saying where that time is being spent, just "(program)".

And the memory issue was a surprise as well. I would like to read progressively, use little memory, and no disk i/o, without killing the CPU so there is plenty of time left to do things with the data.

Would you still like the test page? Testing at (say) 10 megabit doesn't show much CPU, maybe 10%. But if it is proportional to speed, you can see why 1 gigabit is a problem.
Cc: hirosh...@chromium.org
So the problem is that fetch/xhr/websocket byte transfer is slow, right? About two years ago hiroshige@ measured the performance of fetch/xhr/websocket and the result was on the order of 100 Mbyte/s, IIRC. At that time we thought it was fast enough.

Regarding the CPU load, I don't have an idea. hiroshige@, do you remember if you saw extremely high CPU usage when loading a big resource in the measurement?

Anyway, the test page is welcome.
In summary:

xhr get - blob - very fast; I think it could run at 10 gigabit if the temporary cache directory were a RAM disk
xhr get - response type not blob - ~100 MB/sec
websocket - ~100 MB/sec
fetch - readableStream - ~100 MB/sec

I'm going to try building Chromium and profiling it; however, I don't see an obvious reason why fetch with readable streams, which returns a sequence of blobs, should not be as fast as XHR GET returning a blob.
Chromium has two kinds of processes: the browser process and the renderer process. Every privileged operation such as networking is done at the browser process. For XHR + text, XHR + arraybuffer, fetch + text, fetch + arraybuffer, fetch + blob, fetch + readable stream, the content is downloaded to the browser process and (progressively) transferred to the renderer process. For XHR + blob, the blob is created on the browser process and only its handle is passed to the renderer process. Thus XHR + blob skips the bytes transfer from the browser process to the renderer process. When you use the bytes stored in the blob, you need to use a FileReader which transfers the blob contents from the browser process to the renderer process.

Yes, I see. Well, after 8 hours of compiling and doing a performance profile, I can see clearly that every little chunk goes through what seems like 100+ layers before it appears as a small blob for use in JavaScript.

So all cores of the CPU are busy simply moving those bytes from the network layer to memory where they can be used by JS.

I guess this bug report is dead in the water: efficiently implementing a readable stream (or websocket stream) fed by a network read appears to be, to put it mildly, non-trivial.

Although, wouldn't performance increase dramatically if the buffer size were increased all the way down the chain? Perhaps currently each packet triggers the whole pile of code from back to front. A selectable buffer size, for instance 64 KB, might dramatically increase performance and decrease CPU and power use.
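
The buffer-size idea can be illustrated at the JavaScript level with a small coalescer (an editor's sketch with illustrative sizes; the commenter's suggestion concerns Chromium's internal buffers, but the per-chunk-overhead argument is the same):

```javascript
// Sketch: accumulate small incoming chunks into one large buffer before
// handing data on, so downstream work runs once per 64 KiB instead of
// once per network packet. The size parameter is illustrative.
function makeCoalescer(size, onBuffer) {
  const buf = new Uint8Array(size);
  let used = 0;
  return {
    push(chunk) {
      let offset = 0;
      while (offset < chunk.length) {
        const n = Math.min(size - used, chunk.length - offset);
        buf.set(chunk.subarray(offset, offset + n), used);
        used += n;
        offset += n;
        if (used === size) { onBuffer(buf.slice()); used = 0; }
      }
    },
    flush() { if (used > 0) { onBuffer(buf.slice(0, used)); used = 0; } }
  };
}
```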

thanks for your patience.
Status: Available (was: Unconfirmed)
Summary: Network stack should be optimized to utilize the power of Streams API based fetching (was: Readable streams are CPU heavy)
Changing the status to Available given that the problem has been confirmed, though it's not yet well determined what we should do.
Labels: -Performance Performance-Network
Project Member

Comment 21 by sheriffbot@chromium.org, Jul 16

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Type-Bug -Hotlist-Recharge-Cold Type-Feature
Status: Available (was: Untriaged)
Changing the Type to "Feature", where the desired feature is "it should be a lot faster".

Someone should re-measure this at some point to keep track of where we are.
