New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 640707 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 15
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature



Sign in to add a comment

Use UTF-8 in DOMStorage

Project Member Reported by dskiba@chromium.org, Aug 24 2016

Issue description

Right now DOMStorage uses NullableString16(), which is just string16 with additional features.

What was behind the decision to use string16 in the first place?

Can we use string (UTF-8) instead? Or maybe we can replace NullableString16 with something that would decide whether to use UTF-8 or UTF-16 depending on the string itself (but still provide UTF-16 get/set methods)?
 

Comment 1 by jsb...@chromium.org, Aug 24 2016

Without looking at the specific code in question:

Strings coming out of script are 16-bit but not necessarily (valid) UTF-16, so can't be represented as (valid) UTF-8.

Example: "\uDC00\uD800" (a swapped surrogate pair) is not a valid UTF-16 string, but a perfectly valid DOMString to pass around and in and out of APIs. Converting that to a sequence of Unicode code points - e.g. during conversion to UTF-8 - should either fail or yield U+FFFD U+FFFD, but it would be an error to get "\uFFFD\uFFFD" back out of e.g. a storage API, leading to bugs (like issue 432746 for example)

(Various blink and chromium code paths will smuggle "invalid utf-16" into "invalid utf-8" and back, but not all of them. *sigh*)

So code handling strings coming out of script tend to use string16.

Comment 2 by dskiba@chromium.org, Aug 24 2016

Thanks for the explanation, jsbell@!

A little more context - I'm primarily concerned about DOMStorage memory usage in the browser process. For example browsing yahoo.com can easily bring DOMStorage to ~5 MiBs.

So I wonder if we can trick from https://codereview.chromium.org/1768063002 on the browser side?

Or maybe we can make DOMStorageArea and friends to save/load opaque byte blobs instead of string16, and prepare (compress) them on the renderer side?

Comment 3 by jsb...@chromium.org, Aug 25 2016

Yeah, supporting "WTF-16" / "WTF-8" is plausible (and I'm supportive). I don't know why horo@ didn't pursue that approach further.

Comment 4 by jsb...@chromium.org, Aug 29 2016

Labels: -Type-Bug Type-Feature
Status: Available (was: Untriaged)
Re: "wtf-8"/"wtf-16" - see https://simonsapin.github.io/wtf-8/ for context.

A few notes:

* The backing store accepts string16s. It writes those as binary blobs to the database (I'm not sure about endianness?). It would save disk space to serialize those as wtf-8 but that would require migration.

* Given the concrete motivation here (browser reduce memory usage), a scoped change would be to have DOMStorageValuesMap hold std::strings but keep APIs generally the same, with base::string16 (wtf-16) <--> std::string (wtf-8) conversions at API boundaries (i.e. DOMStorageArea) and inside the database impls.

Not something I'm going to get to, so moving to "Available"
BTW, as wtf-8 consumes more memory on some strings than wtf-16, maybe it makes sense to first add UMA of some sort to measure "compressability" of DOM storage strings?

I.e. measure average ratio of string.length() / WTF8Length(string).
Yeah, gather data first. :)

Also, we could use a fast, general purpose compressor like third_party/snappy instead. My hunch is that it would be an even bigger savings than wtf-8 in most cases, if what's getting stored looks at all like general web content (e.g. cached HTML or JS data)

Comment 8 by jsb...@chromium.org, Feb 23 2017

Owner: mek@chromium.org
Status: Assigned (was: Available)
mek@ - FYI; this may fall out of the new impl

Comment 9 by mek@chromium.org, Feb 23 2017

Certainly using UTF-8 won't automagically happen. But if we do want to change the storage format doing so at the same time as switching to the new backend probably makes sense, as we would avoid having to do multiple migrations.

On disk of course (at least in the new backend) things will already be compressed by snappy (at least I believe leveldb does that by default?), so the question would really be about memory usage in the browser process (also a little bit about memory usage in the renderer process, but there it's easier to solve since onion-soupyfying would result in just using WTFStrings in the renderer).

Actually, at the mojo level the new implementation just stores opaque byte arrays for keys and values. So what would need to change is how the renderer converts strings to byte arrays, and it shouldn't be too difficult to change that to either convert to UTF-8, or maybe just store keys and values as a one-byte format flag (8-bit vs 16-bit, possibly other formats later) and the raw data as it is stored in WTFString, to avoid as many conversions as possible.
See #c2 for context - the motivating factor is indeed saving browser memory prior to writing.

Thanks for clarifying that the new impl. uses opaque byte arrays. So "lenient" conversion ("WTF-8") in the renderer would work, or a format flag as you suggest.


Cc: mariakho...@chromium.org
Labels: EM
Labels: -EM LowMemory
Project Member

Comment 13 by bugdroid1@chromium.org, Mar 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f8d3cf5cfc015e0cbffc2a1fa65b6b96f85fe353

commit f8d3cf5cfc015e0cbffc2a1fa65b6b96f85fe353
Author: mek <mek@chromium.org>
Date: Wed Mar 22 01:44:42 2017

Future-proof data format used for LocalStorage.

To enable future changes to the way data is stored in localstorage without
requiring a full migration of all existing data (for example adding
compression, or simply storing 8-bit only strings more efficiently), add
a tag to each encoded string to indicate the data format used.

BUG= 640707 

Review-Url: https://codereview.chromium.org/2748023008
Cr-Commit-Position: refs/heads/master@{#458614}

[modify] https://crrev.com/f8d3cf5cfc015e0cbffc2a1fa65b6b96f85fe353/content/renderer/dom_storage/local_storage_cached_area.cc

Comment 15 by ssid@chromium.org, May 12 2017

Cc: ssid@chromium.org
Labels: -Pri-3 Pri-2
Is there any update on timeline of this feature?
Labels: M-61
Wanted to check in on this as well. We'd really like to have this in M-61.

Comment 17 by mek@chromium.org, Jun 2 2017

I hope to be able to land the change that makes the new localstorage implementation the default soon, which would make that M61. Currently that still always stores things as UTF-16, but changing it to be more memory efficient at least in the browser process should be a fairly minor change afterwards (basically change Uint8VectorToString16 and String16ToUint8Vector in //src/content/renderer/dom_storage/local_storage_cached_area.cc to encode data differently). Making things more memory efficient renderer side would be trickier for now, but easier after the localstorage implementation is onion-soupifed, although I have no immediate plans to do that.

Also none of this changes anything yet for the session storage implementation.

Comment 18 by ssid@chromium.org, Jun 8 2017

Ping on this. I guess the new impl is enabled. Let's do the browser side changes first, thanks!
Project Member

Comment 19 by bugdroid1@chromium.org, Jun 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/65ae6f8d391b1eae9b953ba1c4e7c859684e54bf

commit 65ae6f8d391b1eae9b953ba1c4e7c859684e54bf
Author: Marijn Kruisselbrink <mek@chromium.org>
Date: Mon Jun 26 19:41:42 2017

Reduce memory usage in browser when a string is 8bit only.

Bug:  640707 
Change-Id: I05b11f638465f17191b709578dee901fb904495a
Reviewed-on: https://chromium-review.googlesource.com/528023
Commit-Queue: Marijn Kruisselbrink <mek@chromium.org>
Reviewed-by: Dmitry Skiba <dskiba@chromium.org>
Cr-Commit-Position: refs/heads/master@{#482360}
[modify] https://crrev.com/65ae6f8d391b1eae9b953ba1c4e7c859684e54bf/content/renderer/dom_storage/local_storage_cached_area.cc
[modify] https://crrev.com/65ae6f8d391b1eae9b953ba1c4e7c859684e54bf/content/renderer/dom_storage/local_storage_cached_area_unittest.cc

What's left to do? When will the work get done? What's the plan, here?

Comment 21 by mek@chromium.org, Sep 27 2017

What's left to do is session storage. Mojo-ifying session storage would likely give us this as well, but I don't believe anybody is currently planning to work on it.
Labels: -Pri-2 Pri-3
Given the changes in session storage structure to drop values, this is less likely to result in big savings. Moving to P3.
Cc: dmurph@google.com
Status: Fixed (was: Assigned)
I don't think there is anything left to do here, so marking this as fixed. In the renderer everything is cached as WTF::String, so using 8-bit chars when that's all the strings use. The browser process doesn't store values unless needed (and it'll rarely be needed for session storage).

Comment 25 by dmu...@chromium.org, Yesterday (34 hours ago)

Really? Wow. So keys & values are usually utf8?

Comment 26 by jsb...@chromium.org, Yesterday (34 hours ago)

keys and values are usually (guesstimate: by several orders of magnitude) composed of code points U+0000 through U+00FF. Blink special-cases such strings to store them as 8 bits per code point in memory (technically, encoded as Latin-1). Strings with other code points are stored with 16 bits per code unit (UTF-16LE*). Blink doesn't UTF-8 encode string in memory.

(* = not really; JavaScript strings are sequences of 16 bit code units, which aren't necessarily valid UTF-16LE)

The point of this issue was that when holding strings to the browser, strings mostly composed of code points in U+0000 through U+007F will consume less memory if encoded as UTF-8 than if encoded to UTF-16LE. mek@'s last point is that on the Blink side this is moot, per above - there's already a space optimization, although it's *not* related to UTF-8 encoding.

(Not sure that made things any clearer...)

Sign in to add a comment