
Issue 890413

Starred by 9 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocked on:
issue 881517

Blocking:
issue 866722




Various perf/memory regressions from WebView entropy provider change

Project Member Reported by paulmiller@chromium.org, Sep 28

Issue description

 https://crbug.com/887853 
2.9% regression in system_health.memory_mobile at 592894:592936
https://chromeperf.appspot.com/group_report?bug_id=887853

 https://crbug.com/887959 
2.7% regression in system_health.memory_mobile at 592894:592936
https://chromeperf.appspot.com/group_report?bug_id=887959

 https://crbug.com/888522 
5.1% regression in rendering.mobile at 592863:592926
https://chromeperf.appspot.com/group_report?bug_id=888522

 https://crbug.com/888875 
1.1%-2934.2% regression in rendering.mobile/queueing_durations at 592863:592942
https://chromeperf.appspot.com/group_report?bug_id=888875

 https://crbug.com/889059 
4.3%-78.6% regression in rendering.mobile/tasks_per_frame_total_all at 592828:592926
https://chromeperf.appspot.com/group_report?bug_id=889059

 https://crbug.com/889112 
29.6%-104.4% regression in rendering.mobile/queueing_durations at 592937:592970
https://chromeperf.appspot.com/group_report?bug_id=889112

 https://crbug.com/889127 
1.3%-415.9% regression in rendering.mobile/queueing_durations at 592894:592978
https://chromeperf.appspot.com/group_report?bug_id=889127

 https://crbug.com/890205 
3.1%-17.3% regression in system_health.memory_mobile at 592863:592926
https://chromeperf.appspot.com/group_report?bug_id=890205

 https://crbug.com/890212 
3.5% regression in system_health.memory_mobile at 592773:592862
https://chromeperf.appspot.com/group_report?bug_id=890212

All Pinpointed to https://crrev.com/c/1227531
 
Cc: kouhei@chromium.org hpayer@chromium.org jgruber@chromium.org lanwei@chromium.org paulmiller@google.com
 Issue 887853  has been merged into this issue.
 Issue 887959  has been merged into this issue.
 Issue 888522  has been merged into this issue.
 Issue 888875  has been merged into this issue.
 Issue 889059  has been merged into this issue.
 Issue 889112  has been merged into this issue.
 Issue 889127  has been merged into this issue.
 Issue 890205  has been merged into this issue.
 Issue 890212  has been merged into this issue.
Cc: boliu@chromium.org
The permuted entropy provider is known to be slower. On Chrome for Android and iOS we use the caching version of it to avoid paying a big startup cost each time when there are a lot of experiments.

So one thing to check is whether that's the cause. How many experiments are there on WebView? Are those perf results from Chromium builds where the testing config is used?

If that's indeed the cause (and not something else), then I have some ideas on how to speed it up - but they involve reshuffling users - so we've not done it for Chrome yet.
One option would be to apply those ideas to WebView, since we're reshuffling users anyway when switching the entropy source.

(And once we have the new impl, we can look at switching Chrome to it in the future.)

Here's a copy of my thoughts on speeding it up that were previously posted on an internal mailing list:

"On a related note, it occurs to me that we could optimize the permuted entropy algorithm and make it much more efficient. Right now we shuffle an 8000 element array to get the low entropy source mapping.

Instead, we could first shuffle an 80-element array to find a mapping of 100 ranges of entropy sources. Then we just need to shuffle a 100-element array to map an entropy source within that range.

This would of course change results since shuffling will be different - but should still keep the main algorithm the same (i.e. each low entropy source value uniquely maps to another).

There's a question of whether we want to keep the same sub-mapping for each 100-value range or whether we should do additional salting (e.g. with the range index) before shuffling the second array.

Anyway, something to think about. We could look at perf data to see how much this matters and whether it's worth doing."
> On Chrome for Android and iOS we use the caching version of it to avoid paying a big startup cost each time when there are a lot of experiments. So one thing to check is whether that's the cause.

Yep. WebView uses MetricsStateManager::CreateLowEntropyProvider(), which creates a CachingPermutedEntropyProvider. But I don't know how many times the tests start WebView - if only once, they'll show no benefit from caching.

I notice that PermutedEntropyProvider::GetPermutedValue allocates, shuffles, and discards an ~16KB array on every call. Is that a deliberate trade-off between time and constant memory usage?
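For reference, GetPermutedValue conceptually does something like the following (a simplified stand-in using std::mt19937 and std::shuffle, not the actual Chromium code, which has its own Mersenne Twister and shuffle helpers):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

uint16_t PermutedValueSketch(uint32_t randomization_seed,
                             uint16_t low_entropy_source) {
  // Rebuild the whole mapping on every call: 8000 * 2 bytes = ~16 KB.
  std::vector<uint16_t> mapping(8000);
  std::iota(mapping.begin(), mapping.end(), uint16_t{0});  // identity 0..7999
  std::mt19937 gen(randomization_seed);  // seeded with the study-name hash
  std::shuffle(mapping.begin(), mapping.end(), gen);
  return mapping[low_entropy_source];
}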

You mentioned on a finch-users thread that the permuted provider was created because SHA1 on the low entropy source gave insufficiently uniform results. What's the problem with non-uniform values?

I tested the SHA1 and permuted entropy providers using every possible low-entropy source value (0-7999) and "AImageReaderMediaPlayer" as the study name. The resulting histograms are attached (and permuting is indeed much smoother than SHA1).

> Instead, we could first shuffle an 80-element array to find a mapping of 100 ranges of entropy sources. Then we just need to shuffle a 100-element array to map an entropy source within that range.

Like so?

class DoublePermutedEntropyProvider : public variations::PermutedEntropyProvider {
...
 protected:
  uint16_t GetPermutedValue(uint32_t randomization_seed) const override {
    using variations::internal::PermuteMappingUsingRandomizationSeed;

    // Level 1: shuffle the 80 block indices.
    size_t level_1_blocks = 80;
    std::vector<uint16_t> mapping_1(level_1_blocks);
    PermuteMappingUsingRandomizationSeed(randomization_seed, &mapping_1);

    // Level 2: shuffle the positions within a block (8000 / 80 = 100).
    size_t level_2_blocks = low_entropy_source_max_ / level_1_blocks;
    std::vector<uint16_t> mapping_2(level_2_blocks);
    PermuteMappingUsingRandomizationSeed(randomization_seed, &mapping_2);

    // Map the source value to a shuffled block, then to a shuffled offset
    // within that block.
    size_t block_size = low_entropy_source_max_ / level_1_blocks;
    return mapping_1[low_entropy_source_ * level_1_blocks / low_entropy_source_max_] * block_size +
        mapping_2[low_entropy_source_ % block_size];
  }
};

This also produces a smooth-looking distribution, although I wonder if artifacts could show up, since I permuted both arrays with the same randomization_seed (which is the hash of the study name). Since the shuffling algorithm only swaps each element with previous elements, we'd get very similar results in the first 80 elements of each array. I could imagine getting a sort of 2-level fractal pattern where a given input low-entropy source value maps to an offset into a certain block, and then to a similar, scaled-down offset within that block.

We could split the study name (or the hash of the study name) in half, and use one half for each shuffle. Although, some of the study names are only 3 letters long.
entropy-histogram-sha1-AImageReaderMediaPlayer.png (7.8 KB)
entropy-histogram-permuted-AImageReaderMediaPlayer.png (7.1 KB)
"You mentioned on a finch-users thread that the permuted provider was created because SHA1 on the low entropy source gave insufficiently uniform results. What's the problem with non-uniform values?"

Uniformity is required to have percentages of groups be accurate. Non-uniform distributions can result in a 1% getting some other percentage.

"Like so?"

Yes, that looks about right - though I think there's opportunity to make the code a little bit easier to follow. Should also make a unit test to make sure it produces a 1-1 mapping.

I'm thinking this could be set up as a new config param, so that we can recommend it be used for Chrome-targeting experiments too, going forward.
If we're okay with introducing a new algorithm, though, we could do better. What about just doing:

HashName(trial_name) / UINT32_MAX + low_entropy_source_value / low_entropy_source_max

and wrapping around if that's > 1? low_entropy_source_value is generated by base::RandInt, which is already uniform (at least in the rand_util_posix.cc implementation; see attached histogram of 1 million RandInt(0,7999) trials), so this should be equivalent to shuffling. That's the simplest algorithm I can think of.
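To spell that out, here's a minimal sketch of the offset idea (the function name is hypothetical; HashName(trial_name) is assumed to already be a uint32_t):

#include <cstdint>

double GetOffsetEntropy(uint32_t name_hash, uint16_t low_entropy_source,
                        uint16_t low_entropy_source_max) {
  double value = static_cast<double>(name_hash) / UINT32_MAX +
                 static_cast<double>(low_entropy_source) / low_entropy_source_max;
  return value >= 1.0 ? value - 1.0 : value;  // wrap back into [0, 1)
}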

We could also take asvitkine@'s idea further. E.g. we could decompose 8000 into 20^3 (as opposed to 80 * 100).
RandInt-0-7999-1m.png (16.4 KB)
The algorithm needs to avoid crosstalk.

I believe your simple algorithm suggestion will have lots of crosstalk, because you're effectively offsetting the assignments by the name hash.

Let's say for simplicity that Study1's name hash is 0 and Study2's name hash is half of uint32_max; then two 50% groups will look like this before wrapping:

Study1: [AAAA][BBBB]
Study2:       [AAAA][BBBB]

Once you do the wrapping, they both become:
[AAAA][BBBB]
[BBBB][AAAA]

That causes 100% crosstalk (everyone who's in group A in Study1 is in group B in Study2).

I think the 20^3 would work as well, but it becomes more complex, and I'm not sure whether we'd still retain all the same properties when we do it that way. Certainly, it's still 1-1, but it's hard to reason about what other effects it has.

So to me the 2-stage approach as proposed is the best compromise. It's very close to the original algorithm, which has been vetted, and it's simpler to reason about.




I think we can fix that by replacing offset with xor:

(HashName(trial_name) ^ reversed_value) / UINT32_MAX

where reversed_value is the uint32_t representation of low_entropy_source_value, but with the bits reversed; reversed_value's high 13 bits can each be 0 or 1, while the low 19 bits are all 0.

More precisely, the highest 6 bits each have a 50% chance of being 1; the next 2 have a 49.6% chance; the next 5 have a 48.8% chance. If kMaxLowEntropySize were a power of 2, they'd all be 50%, but omitting the highest values 8000-8192 means low_entropy_source_value's more-significant bits get less representation.
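Here's a sketch of how reversed_value could be computed (the function name is mine; the 13-bit width comes from kMaxLowEntropySize = 8000 fitting in 13 bits):

#include <cstdint>

uint32_t ReverseLowEntropyBits(uint16_t low_entropy_source) {
  uint32_t reversed = 0;
  // Mirror bit i of the source into bit (31 - i) of the result.
  for (int bit = 0; bit < 13; ++bit) {
    if (low_entropy_source & (1u << bit))
      reversed |= 1u << (31 - bit);
  }
  return reversed;  // high 13 bits carry the value; low 19 bits stay 0
}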

> I think the 20^3 would work as well, but it becomes more complex, and I'm not sure whether we'd still retain all the same properties when we do it that way. Certainly, it's still 1-1, but it's hard to reason about what other effects it has. So to me the 2-stage approach as proposed is the best compromise. It's very close to the original algorithm, which has been vetted, and it's simpler to reason about.

I'm not sure similarity to a vetted algorithm is a good source of confidence; a small modification could break properties we care about just as well as a large one. What properties are you confident the 2-level shuffle has, but not sure a higher-level shuffle would have?
I'm not against finding something better, but if we want to go that route, we should be careful and make sure the new algorithm indeed meets all our requirements. I recall we tried other things before coming up with the permuted entropy algorithm, so I'd like to see a proper evaluation of a completely new algorithm.

The main factors we care about in terms of functionality:
  1. Uniformity of outputs given input low entropy source values. An algorithm that does 1-1 mapping from entropy source values has this guarantee, whereas others may not.
  2. Minimal cross-talk between different studies (where input is pairs of study names).

And after that, factors such as performance (ops/s) and memory use. We know the current permuted entropy algorithm is poor on this front, so it shouldn't be hard to beat. We can of course compare the different proposed algorithms against each other.

I think if you're happy to look at this, we can definitely consider other algorithms. But I'd like to see a doc proposing the algorithm and evaluating the above two criteria (via some programs to simulate different inputs - e.g. all 8000 low entropy source values and a corpus of example study names to evaluate crosstalk).
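As a rough illustration of the kind of simulation meant here, a sketch that measures crosstalk for one pair of studies, assuming a hypothetical entropy function in [0, 1) (the names are placeholders, not existing code):

#include <cstdint>
#include <string>

// For two studies that each put clients with entropy < 0.5 in their first 50%
// group, count how many of the 8000 clients land in the first group of both.
// With no crosstalk, the count should be close to 8000 * 0.5 * 0.5 = 2000.
int CountBothInFirstHalf(double (*entropy)(const std::string&, uint16_t),
                         const std::string& study_a,
                         const std::string& study_b) {
  int both = 0;
  for (uint16_t source = 0; source < 8000; ++source) {
    if (entropy(study_a, source) < 0.5 && entropy(study_b, source) < 0.5)
      ++both;
  }
  return both;
}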
By the way, I found a test CL I still had open when I did some analysis like this when adding the initial permuted entropy algorithm, in case it's helpful:

https://codereview.chromium.org/11411285/
Did you have any way of quantifying crosstalk in your previous investigation?
Components: Internals>Metrics>Variations
The CL I linked to in comment 10 does the simulation across two experiments. The expectation is uniformity between the two studies. It can be expanded to test multiple pairs together.

We also have the algorithm server-side where we do it as an analysis on study configs. (This means any change to algorithm would need to be reflected there.) That version tries to compute the crosstalk for a given new config across the existing set of studies that apply to the same population.

I can send you a link to that if you're interested.
Project Member

Comment 13 by bugdroid1@chromium.org, Oct 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/bbac4d2c4c896f3bd836f825619c29ff219e9254

commit bbac4d2c4c896f3bd836f825619c29ff219e9254
Author: Paul Miller <paulmiller@chromium.org>
Date: Wed Oct 03 21:39:33 2018

Reland "WebView: Ignore permanent-consistency experiments"

This reverts commit 8bf4bf245ce923a47a49d6448a1ff20cf94e6ead.

Reason for revert: PermutedEntropyProvider regressed perf & memory.

BUG=890413

Original change's description:
> Revert "WebView: Ignore permanent-consistency experiments"
> 
> This reverts commit ef70c85e24ab28f8cc8d67666e6ee42ea62d4ab0.
> 
> Reason for revert: WebView now supports permanent-consistency.
> 
> BUG=866722
> 
> Original change's description:
> > WebView: Ignore permanent-consistency experiments
> > 
> > Android WebView currently treats permanent-consistency experiments
> > (which are intended to retain their group assignments across runs) as
> > session-consistency (which are intended to be randomly assigned on each
> > run). It's safer to not apply permanent-consistency experiments to
> > WebView at all, to avoid mistakes.
> > 
> > BUG=866722
> > 
> > Change-Id: I9d501437718d934c2a59b95335c787302ce758e7
> > Reviewed-on: https://chromium-review.googlesource.com/1148991
> > Reviewed-by: Changwan Ryu <changwan@chromium.org>
> > Reviewed-by: Steven Holte <holte@chromium.org>
> > Commit-Queue: Paul Miller <paulmiller@chromium.org>
> > Cr-Commit-Position: refs/heads/master@{#578070}
> 
> TBR=changwan@chromium.org,holte@chromium.org,paulmiller@chromium.org
> 
> # Not skipping CQ checks because original CL landed > 1 day ago.
> 
> Bug: 866722
> Change-Id: Ifd00450fbba1afc1db619109cf918ef75cbcf84f
> Reviewed-on: https://chromium-review.googlesource.com/1237244
> Reviewed-by: Paul Miller <paulmiller@chromium.org>
> Reviewed-by: Steven Holte <holte@chromium.org>
> Commit-Queue: Paul Miller <paulmiller@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#594491}

TBR=changwan@chromium.org,holte@chromium.org,paulmiller@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 866722
Change-Id: I58435ac0c107214e15b0e54ff8aee422eb98fa06
Reviewed-on: https://chromium-review.googlesource.com/c/1259807
Reviewed-by: Paul Miller <paulmiller@chromium.org>
Commit-Queue: Paul Miller <paulmiller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#596380}
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/android_webview/browser/aw_variations_service_client.cc
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/android_webview/browser/aw_variations_service_client.h
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/client_filterable_state.h
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/service/BUILD.gn
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/service/variations_field_trial_creator.cc
[add] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/service/variations_service_client.cc
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/service/variations_service_client.h
[modify] https://crrev.com/bbac4d2c4c896f3bd836f825619c29ff219e9254/components/variations/study_filtering.cc

Cc: chiniforooshan@chromium.org
Issue 891225 has been merged into this issue.
Issue 891361 has been merged into this issue.
 Issue 892162  has been merged into this issue.
Cc: -paulmiller@google.com
It turns out my xor algorithm has lots of crosstalk. I think one problem is that I'm basically throwing out the less-significant bits of the study name hash; when dividing by UINT32_MAX, the contribution of those bits becomes tiny, and unlikely to influence the group assignment. So hashes which are similar in their more-significant bits (even if the other bits are very different) will have lots of crosstalk. The permuting algorithm makes better use of the entire hash.

We could improve the xor algorithm by using the hash's less-significant bits to scramble its more-significant bits, increasing the influence of the less-significant bits on the outcome. But any algorithm based on xor'ing with the low-entropy source value can only produce 8000 different permutations. Whereas shuffling allows for 8000! permutations, of which at most 2^32 are reachable with our permuting algorithm (since we seed the shuffle with a 32-bit hash).
How about just wrapping "HashName(trial_name) ^ reversed_value" as a seed with an additional randomization function (e.g. xorshift+) before dividing it?
> How about just wrapping "HashName(trial_name) ^ reversed_value" as a seed with an additional randomization function (e.g. xorshift+) before dividing it?

That seems like it would resolve the issue of discounting the less-significant bits, but not the issue of the small number of possible permutations.

Expanding on my concern in comment 4 about how to seed each level of the shuffle -

We don't want to seed the 1st and 2nd levels of the shuffle the same way, because (since the 2 arrays are of similar length, and the shuffling algorithm only swaps each element with previous elements) we'll get almost identical permutations for both levels, greatly reducing the randomness.

We also don't want to shuffle each of the 2nd-level blocks the same way; that would also greatly reduce the randomness.

Idea #1 -

The SHA1 of the study name produces 20 bytes; we currently use only the first 4 bytes to seed the shuffle. That gives us 4 more high-quality seed values. We could divide the 8000-array into only 4 blocks, and then all 5 shuffles (1 shuffle of 4 elements at level 1, then 4 shuffles of 2000 elements at level 2) would get high-quality seeds, but performance at the 2nd level would suffer.

We could ameliorate this somewhat by switching to a hash with a larger digest. E.g. SHA-384 would allow for 10 blocks. We could do 16 blocks with SHA-512 if we re-used the 1st-level seed for just one of the 2nd-level blocks.

However, reducing the number of blocks increases the chance that, with enough studies, we'll end up needing to shuffle every block anyway.

Idea #2 -

We could also shuffle each 2nd-level block serially, using the same Mersenne Twister object. To shuffle a given block, we'd invoke the twister and discard the result for every element in every preceding block, then finally shuffle the current block. This unfortunately has the same time complexity as the current permuting algorithm, but saves on space.

Idea #3 -

The 1st-level shuffle is seeded with the bytes 0-3 of the hash. At the 2nd level, to get the seed for block N, take bytes 4-7 of the hash, then add N. This produces a different (though still closely related) seed for each 2nd-level block.

I tested this algorithm and found the crosstalk to be much better than my xor algorithm, but still worse than the current permuting algorithm. 

Other thoughts about optimization -

The Mersenne Twister we use for the shuffle produces 32-bit values, of which we only use the bottom 13 bits. We could split each value in half and only invoke MT half as many times.

There's a trade-off in whether to cache 1st- or 2nd-level shuffled arrays between assigning each study. (The current permuting algorithm caches nothing, which is best for persistent memory cost but worst for performance.) With enough active experiments, it's highly likely that the same 2nd-level block will be needed multiple times, and also possible that the majority of 2nd-level blocks will be needed at some point.
Scratch that last note about caching. The shuffle is different for every study, so there's nothing to cache between studies in the current algorithm.
Another random thought:

1) Juxtapose HashName(trial_name) and randomization_seed as one seed. (64-bit or 128-bit, fill with 0 at the end if necessary)
2) Generate random number from 1) efficiently. (xorshift+ for 128 bit?)
3) Bucket the output into 8000 (or 8192 to avoid a skew) to make it low-entropy.

This should be
1) consistent: you get the same value from the same trial_name and rand_seed as long as your pseudo-random generator is deterministic
2) uniform: a strong random generator should provide a uniform distribution across different seed values, even when the seeds have a pattern.
3) no crosstalk: there should be no dependency as long as the random generator is strongly random and different hash names or seed numbers give independent results. If it's strongly random, bucketing it into 8192 shouldn't hurt independence, either.

I might be missing something, but permutation is used only to avoid crosstalk, and it isn't a hard requirement, right? There are random number generation algorithms that are more performant and more strongly random than MT, so I wonder if a strong random generator could just solve this problem.
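For concreteness, this is the basic xorshift128+ step being referred to (one published parameter set; how HashName(trial_name) and the low entropy source would be packed into the 128-bit state, and how many rounds to run, are exactly the open questions here):

#include <cstdint>

struct XorShift128PlusState {
  uint64_t s[2];  // must be seeded to something nonzero
};

uint64_t XorShift128Plus(XorShift128PlusState* state) {
  uint64_t x = state->s[0];
  const uint64_t y = state->s[1];
  state->s[0] = y;
  x ^= x << 23;
  state->s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
  return state->s[1] + y;
}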

Comment 21 Deleted

> Juxtapose hashName(trial_name) and randomization_seed as one seed.

What do you mean by randomization_seed? In the current algorithm, randomization_seed is HashName(trial_name).

I ran the "overlap" script asvitkine@ mentioned in #12, on all 1,800 studies (with salts removed), with 3 algorithms: the current permuting one, the 2-level permuting one (seeded using "idea 3" in #18), and the xor one I described in #8.

With the current algorithm, no pair of studies gets 100% crosstalk; with the 2-level algorithm, 100% is found 219 times; with the xor algorithm, it's found 2606 times.
Labels: -Pri-2 Pri-1
Re #22, I mean low_entropy_source.
I experimented with xorshift (specifically xorshift128+, https://en.wikipedia.org/wiki/Xorshift#xorshift+).

There are some variables to tweak:
- how much of the hash to use (SHA1 produces 20 bytes, whereas xorshift128+ accepts 16.)
- how to combine the hash with the low-entropy source value
- how many rounds of xorshift to do

Doing just a few rounds is insufficient; it takes many rounds for the 13 bits of the low-entropy source value to get thoroughly mixed into all 128 bits of the xorshift state.

The uniformity isn't great (although it's better than plain SHA1). Increasing the number of rounds increases uniformity, but only up to a point (which, just eyeballing it, seems like ~20). See attached histogram of all possible output values, given all 8000 low-entropy source values, "AImageReaderMediaPlayer" as the study name, and 1000 rounds of xorshift.

The output data contains 95 values > .99 (rather than the expected 80). So if you did an experiment which put the top 1% of users in a certain group, that group would actually get ~1.2% of users.

2 questions for asvitkine@:
- What level of uniformity is required? Do we need something perfectly even (like the current shuffling algorithm)?
- Are there any notes on what other algorithms were considered before settling on the shuffling algorithm?
xorshift128+1000-0-7999-AImageReaderMediaPlayer.png (6.8 KB)
> 2 questions for asvitkine@:
> - What level of uniformity is required? Do we need something perfectly even (like the current shuffling algorithm)?

I don't think we should regress in level of uniformity. We want perfect uniformity, as otherwise it causes uneven groups which can impact statistical analyses.

> - Are there any notes on what other algorithms were considered before settling on the shuffling algorithm?

I don't think we have formal notes. Originally it was SHA1, which worked poorly for low entropy users. I think we experimented with something simple based on XOR that didn't really work well, and then someone suggested the permuted approach, which worked well, so we went with it.
Thanks for doing the experiment, Paul!

The crosstalk results remain to be seen, but I feel that a little tweaking might actually solve this. Here are more tweaking ideas:

1) Would it be worth trying to put the low-entropy source before HashName?
xorshift+ is known to have some weakness in the low-order bits of its output (according to Wikipedia); even if you applied the correction algorithm, that may still be the case. If so, we may want to consider another algorithm.

2) Is there a 64-bit random number generation algorithm that you could use?
- The current HashName() algorithm already works with shuffling - this may mean that compressing the trial name into 32 bits would be valid even for a new algorithm. We probably don't need to make it longer.
- Presumably, generating wider random numbers would require more time/rounds and give poorer performance than generating narrower ones.

3) By the same token, would it make sense to change HashName() to produce a 16-bit result and run a 32-bit random generation algorithm after combining it with the low entropy source?

4) How would the results differ if we used the default random algorithm or MT?

If we can't regress uniformity at all (and shuffling is 1:1, and so exhibits perfect uniformity), then I don't think any algorithm whose output is simply the output of a hash function or pseudorandom generator is going to work. The chance that any given hash function covers its range with perfect uniformity for any given 8000 inputs is basically zero.

I also think there might be a fundamental limitation with the 2-level shuffle, in that corralling each block of 100 elements together, and only shuffling them with each other, greatly reduces the randomness of the shuffle, which increases crosstalk. When shuffling the whole array, there are 8000! (or ~5*10^27752) possible permutations. With the 2-level shuffle, there are "only" 80! * 100!^80 (or ~3*10^12756). Adding more levels exacerbates things; my 20^3 suggestion would have 20! * 20!^20 * 20!^(20^2) (or ~4*10^7740).

Some new ideas:

1) Improving on idea #2 from comment #18 - it turns out that Mersenne Twister supports "jump ahead", where you advance the generator by N steps, for less cost than generating N more numbers:

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/JUMP/index.html

So, we could:
- Seed the MT the same way we currently do.
- Use the MT to shuffle the 1st-level array of 80 blocks.
- Find which 2nd-level block our low-entropy source value maps into. (Call it block N, N ∈ [0,79]).
- Advance the MT by 100 * N (to simulate shuffling all blocks 0 through N-1).
- Shuffle block N.

This removes the problem of how to seed the MT for each 2nd-level block because, as in the current algorithm, we use the same MT object for all shuffling. (However, this doesn't address the fundamental limitation described above.)
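A sketch of that jump-ahead variant, using std::mt19937::discard as a stand-in (discard advances step by step, whereas the true MT jump-ahead referenced above is cheaper; also, std::shuffle's exact consumption of the generator is implementation-defined, so the 100 * N advance is only illustrative):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

uint16_t TwoLevelPermutedValueSketch(uint32_t randomization_seed,
                                     uint16_t low_entropy_source) {
  constexpr size_t kBlocks = 80, kBlockSize = 100;  // 80 * 100 = 8000
  std::mt19937 gen(randomization_seed);

  // Level 1: shuffle the 80 block indices.
  std::vector<uint16_t> blocks(kBlocks);
  std::iota(blocks.begin(), blocks.end(), uint16_t{0});
  std::shuffle(blocks.begin(), blocks.end(), gen);

  // Level 2: advance the generator as if blocks 0..N-1 had been shuffled,
  // then shuffle only block N.
  const size_t n = low_entropy_source / kBlockSize;
  gen.discard(kBlockSize * n);
  std::vector<uint16_t> block(kBlockSize);
  std::iota(block.begin(), block.end(), uint16_t{0});
  std::shuffle(block.begin(), block.end(), gen);

  return static_cast<uint16_t>(blocks[n] * kBlockSize +
                               block[low_entropy_source % kBlockSize]);
}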

2) We could keep the current shuffle, but substitute a cheaper pseudorandom generator. This would save time, but not space. xorshift is a possible replacement. (xorshift also supports jump ahead, so xorshift could also be used with the 2-level shuffle.)

3) I also explored "un-memoizing" the current shuffling algorithm. We'd never allocate the array; every time we needed the value at a certain index, we'd re-do all the work to compute it. This turns into a pretty complicated recursive function with a maximum stack depth larger than the array itself, so it's not actually better.
4) Use a non-uniform algorithm like SHA1, but then normalize the result, using its position in the sorted list of all possible results.

Like so:

x = hash(study_name + low_entropy_source_value)
x_ordinal = 0
for i in [0,8000)
  if i != low_entropy_source_value
    y = hash(study_name + i)
    if y < x
      x_ordinal += 1
return x_ordinal / 8000.0

The first line is basically the high-entropy provider; everything else is for enforcing uniformity. Notice that we can find x's ordinal position without allocating or sorting the entire array, so this is O(1) space. Time depends on the choice of hash function, since we call it 8000 times.
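Here's a self-contained sketch of the same idea, using std::hash as a stand-in for whichever hash ends up being chosen (so this is just the shape of it, not the eventual implementation):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

double NormalizedEntropySketch(const std::string& study_name,
                               uint16_t low_entropy_source,
                               uint16_t low_entropy_source_max /* 8000 */) {
  auto hash = [&study_name](uint16_t value) {
    return std::hash<std::string>()(study_name + std::to_string(value));
  };
  const std::size_t x = hash(low_entropy_source);
  uint32_t x_ordinal = 0;
  for (uint16_t i = 0; i < low_entropy_source_max; ++i) {
    // Count how many other possible source values hash below ours.
    if (i != low_entropy_source && hash(i) < x)
      ++x_ordinal;
  }
  return static_cast<double>(x_ordinal) / low_entropy_source_max;
}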
Cc: holte@chromium.org
/cc holte@, who had some ideas about how we might want to improve randomization for Chrome – it might be nice to incorporate some of those ideas into WebView as well, while it's still easy to change.
Blocking: 866722
In 4), could you also try x = rng(hash(study_name), low_entropy_source_value) if you end up finding crosstalk? BTW, for (y == x) we may need to add an artificial partial order by checking (i < low_entropy_source_value), as we discussed today.

A naive implementation of my algorithm in #29 is so slow (at least with SHA1 or MD5 as the hash) that I can't even run the overlap script on any meaningful number of studies.

I had the idea of saving the state of the hash object after ingesting study_name, and using a separate copy of that state for each 'i' value. That way, we only do the work of hashing study_name once, and just do a small amount of additional work for each 'i'. With SHA1, that speeds up my algorithm by ~35x.

There are faster hashes, too. I'm experimenting with this one: https://cyan4973.github.io/xxHash/ which makes my naive implementation ~7x faster than with SHA1.
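In code terms the state-reuse trick looks roughly like this, with a hypothetical Hasher type standing in for any incremental hash whose partial state can be copied:

// Hypothetical incremental-hash interface: Update() ingests bytes, Final()
// produces the digest. Copy the state that already ingested the study name,
// then add only the loop counter.
template <typename Hasher>
uint32_t HashValueAfterStudyName(const Hasher& state_after_study_name,
                                 uint16_t value) {
  Hasher h = state_after_study_name;  // cheap copy of the partial state
  h.Update(&value, sizeof(value));
  return h.Final();
}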

It turns out that Mersenne Twister uses quite a lot of internal state; Wikipedia quotes 2.5 KB; sizeof(std::mt19937) returns 4 KB on my Linux machine. So contrary to my statement in #28, switching to a more efficient shuffling algorithm could still save some memory, even if we keep the 15.625 KB array.
I think the normalizing algorithm I described in comment #29 is the way to go. I tested crosstalk and it scores similarly to the current permuting algorithm.

To summarize the algorithms I've considered so far:

- sha1 (the current high-entropy algorithm)
- permuting (the current low-entropy algorithm)
- 2-level permuting (asvitkine@'s idea - shuffle an array of 80, then another of 100, see comment #3)
- 3-level permuting (like 2-level, but with 3 levels of 20 each)
- xorshift permuting (like the current algorithm, but replace Mersenne Twister with xorshift)
- offset (see comment #6)
- reverse-xor (see comment #8)
- xorshift (changwan@'s idea, see comment #25)
- normalized hash (sha1, md5, xxhash...I'd also like to try some checksums e.g. CRC32) (see comments #29, #32)
- un-memoized permuting (see comment #28)

Scrapped for lack of uniformity:

- sha1 (see graph in comment #4)
- xorshift (see graph in comment #25)

Scrapped for excessive crosstalk:

- 2-level permuting (test results in comment #22, lower crosstalk than the next 3 in this list, but still way worse than permuting or normalizing)
- 3-level permuting (didn't bother to test this since it must be worse than 2-level)
- offset (see asvitkine@'s explanation in comment #7)
- reverse-xor (test results in comment #22)

Scrapped for high time/space cost:

- permuting (the cause of this bug :P)
- xorshift permuting (saves on several KB of Mersenne Twister state, but still costs 16 KB of array)
- un-memoized permuting (ended up being worse than permuting)

That leaves only the normalizing algorithms. These have perfect uniformity, very low crosstalk (on par with permuting), and very low memory cost (no arrays necessary - it just needs the state of whatever hash function is used - 48 bytes in the case of xxhash). The only drawback is moderate time cost (currently between permuting and 2-level permuting in performance, but I'm still trying to improve that).
I had a thought here. Can we just have the client send its low entropy source to the Finch server, and then the Finch server can send back a signed seed that actually contains the precomputed bucket that the client would assign itself to?
Then we wouldn't have any time/space cost on the client, and it's trivially amortized for the server.

We already reveal the low-entropy value to most services via headers (that's why it's low-entropy), and we should be able to trust the precomputed assignment since it's part of the seed.
In WebView's case, that would require all apps for a given (device, user) to have the same low entropy source. Is that acceptable? I wonder if that could bias data, if certain sets of apps tend to be installed together.

(As for implementation, any logic about creating or managing the low entropy source would have to be copied to Java, because WebView's seed service is Java-only, and the service would have to store via some mechanism other than prefs.)

Would there be any server-side performance implications for sending different seeds to different clients? I don't actually know how any of that works...is there caching of seed responses at any stage? (because caching would be harder if seeds are not identical.)
There is caching, and we do already serve different copies of them, prefiltered based on the platform/channel/milestone from the client. Since the seeds are not too big, I would be a little surprised if even adding the x8000 source values to that makes it too big to cache reasonably. If necessary, we could just append a separately cached randomization_seed->bucket map to the seed and not pre-filter that bit by the other criteria, since it would be mostly duplicates anyway.

Do the experiments mainly target the apps or WebView? If you are mostly experimenting on WebView, you might actually want them to be consistent. If you have mostly app-specific experiments, then consistency would be less useful, since you would have to worry about crosstalk between those experiments, but you might also want per-app seeds in that case.
I'm concerned about the effect of caching if we do this.

The side map might be more feasible though. The key can be the study name hash which is what controls the assignment anyway, so we don't need to worry about multiple configs for the same study and how to associate with each one.

However, this won't help in a bunch of cases, like first run (no entropy source yet), the testing config (which is what the perf bots likely exercise), and so forth.

We also should discuss with privacy if we want to pursue this.

In short, I think it could work, but it's definitely more involved than just changing the algo. And maybe we still want to start by changing the algo.



> Do the experiments mainly target the apps or WebView?

I'm not aware of any experiments that were designed with certain apps in mind. It's more that different apps exercise WebView differently.
I tested the performance of my algorithm in WebView with various hashes:
https://chromium-review.googlesource.com/c/chromium/src/+/1285710

I tested on a Nexus 5 running Android MRA58Z. I built WebView as official, non-component, with dchecks enabled. The files implementing my algorithm were built with -O3.

I timed EntropyProvider::GetEntropyForTrial using ElapsedTimer. (It's hard to measure at a finer granularity than GetEntropyForTrial because ElapsedTimer itself takes enough time to mess with the performance.)

SHA1 (using the implementation in base/sha1.cc, which doesn't appear to take advantage of ARM's crypto extensions) averaged 420 microseconds per study.

For comparison, the current permuting algorithm averaged 1713 microseconds per study.

SHA1 (using BoringSSL's SHA1 implementation, which does appear to use the crypto extensions) averaged 5855 microseconds per study. I don't yet know why this was so terrible.

xxHash (using this xxHash implementation: https://github.com/RedSpah/xxhash_cpp) averaged 889 microseconds per study. It was oddly bi-modal, with a cluster of times in the 1-1.5 millisecond range.

I didn't expect SHA1 to outperform xxHash in practice. xxHash outperformed SHA1 in a simple test program on my workstation. Since the incremental hashing work at each iteration of the inner loop is small, I would have thought that copying the hash state would be most of the work, and xxHash has smaller state than SHA1.

MD5 (using the implementation in base/md5.cc) averaged 7002 microseconds per study, which is also suspiciously bad. Our MD5 has an awkward API that requires you to put your data in a string before passing it to MD5, so that overhead might be at fault.

Overall, these results are the opposite of what I expected. MD5 and xxHash are both faster than SHA1 on paper, but they were slower in this test. BoringSSL's SHA1 looks like it should be faster than base/sha1.cc, but it was >10x slower.
I done goofed. SHA1, using base/sha1.cc, actually averages 9954 microseconds per study. I forgot to call SecureHashAlgorithm::Final() inside the loop in my original test, which makes it much slower.

It looks like SHA1 doesn't actually do any work until an internal 64-byte buffer fills up. Then it hashes those bytes and clears the buffer. So if the study name is short, no work will actually be done until Final(). Then my optimization about ingesting the study name outside the loop doesn't actually help.

My previous results now make sense (except for MD5): base/sha1.cc SHA1 < BoringSSL SHA1 < xxHash. Unfortunately, that means my algorithm isn't as large a performance win over the current algorithm; it's ~2x faster, as opposed to the erroneous ~4x I got earlier.
The study name optimization does still help with xxHash; if I remove the optimization, xxHash's average goes up to 1207 microseconds.

I also tested MurmurHash3 (using the implementation under third_party/smhasher). It was suspiciously fast: 573 microseconds per study.

https://chromium-review.googlesource.com/c/chromium/src/+/1285710 is updated with the new experiments.

CityHash is another possibility.

I still need to test all these hashes for crosstalk; I was using SHA1 when I found my algorithm to have similar crosstalk to the current algorithm.
I tested MurmurHash3 for crosstalk. It did similarly to, or slightly better than, xxHash and the permuting algorithm.

I tested CityHash32 for performance (using the same test setup with the Nexus 5, and Google's implementation here: https://github.com/google/cityhash), and got 652 microseconds per study.
Changwan proposed using xorshift as the hash in the normalizing algorithm. I tested this with both 5 and 50 rounds of xorshift, and got very high crosstalk in both cases.
I tested CRC32 for crosstalk; it was high.

These results sort of make sense. The "proper" hash algorithms I've tried have had low crosstalk, while xorshift and CRC32 had high crosstalk. There are many more pseudorandom/checksum-generating algorithms I could try, but I'd expect those to also get high crosstalk. Unfortunately, the "proper" hashes are also slower.
Looking at the MurmurHash implementation (https://cs.chromium.org/chromium/src/third_party/smhasher/src/MurmurHash3.cpp), I think MurmurHash would be amenable to the optimization I described in #32 (where we ingest the study name first, then copy the hash state). While SHA1 hashes in 64-byte chunks, MurmurHash3_32 hashes in 4-byte chunks, so it'd be easier to hash the study name separately. Also, it keeps only 4 bytes of state between chunks (the "h1" variable), so copying that state would be cheap.

The MurmurHash implementation doesn't have separate "update" and "final" steps like the other hashes do, so it'll take a little refactoring to enable hashing in parts like this.

However, MurmurHash takes an optional "seed" value, which allows for a similar optimization: hash the study name first, then use that digest as the seed when hashing each 'i' value. This means we're still doing the "finalization" work after hashing the study name, which is unnecessary, but we only hash the study name once. I tested this, and it still has good crosstalk, while reducing the time to 264 microseconds per study.

(Fun fact: std::string::data() doesn't always return an aligned address. I discovered this via SIGBUS when hashing the study names; the MurmurHash implementation assumes 4-byte alignment.)
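Roughly, the seed-based variant looks like this (assuming the smhasher MurmurHash3_x86_32 signature and header path; the actual wiring lives in the CL above):

#include <cstdint>
#include <string>

#include "third_party/smhasher/src/MurmurHash3.h"

// Hash the study name once...
uint32_t StudyNameSeed(const std::string& study_name) {
  uint32_t digest;
  MurmurHash3_x86_32(study_name.data(), static_cast<int>(study_name.size()),
                     /*seed=*/0, &digest);
  return digest;
}

// ...then use that digest as the seed when hashing each candidate value.
uint32_t HashValueWithStudySeed(uint32_t study_seed, uint32_t value) {
  uint32_t out;
  MurmurHash3_x86_32(&value, sizeof(value), study_seed, &out);
  return out;
}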
I re-implemented MurmurHash3_32 to have separate Update/Final steps, to allow for the optimization I described in #32. As a further optimization, I also padded the study name and 'i' values out to multiples of 4 bytes, so that each 'i' can be ingested as a single update step. This brought the average down to 210 microseconds per study. 54 microseconds is a smaller improvement than I was hoping for :P

Updated https://chromium-review.googlesource.com/c/chromium/src/+/1285710 with the new experiments. block_murmur.h is my Murmur implementation.
Another small optimization - change the algorithm from:

x = hash(study_name + low_entropy_source_value)
x_ordinal = 0
for i in [0,8000)
  if i != low_entropy_source_value
    y = hash(study_name + i)
    if y < x
      x_ordinal += 1
return x_ordinal / 8000.0

to:

x = hash(study_name + low_entropy_source_value)
x_ordinal = 0
for i in [0,8000)
  y = hash(study_name + i)
  if y < x
    x_ordinal += 1
return x_ordinal / 8000.0

That is, remove "if i != low_entropy_source_value". This means we'll do one extra iteration where i == low_entropy_source_value, which is unnecessary. But it has no effect on the result, because if i == low_entropy_source_value, then x == y, so x_ordinal is unaffected.

This requires an extra MurmurHash computation, but it saves 8000 conditional branches. I now get 197 microseconds per study.
So that's nearly a 10x improvement over the current one? Nice!

Have you tried tricks with trying to get rid of the other conditional - "y < x" - e.g. perhaps we can do some unconditional math that results in 1 or 0 that can be added to x_ordinal.
e.g. via something like http://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching based on the sign bit when subtracting y from x
Although first thing to try is to just do something like:

x_ordinal += (y < x)

Not actually sure if it makes a difference - but it should be easy to check - depending on what it gets compiled to, it will still avoid the code jumping based on the result. But it might still be the case that the "y < x" comparison is the slow part - then we can try something like what I linked to in comment 49.
I'd assumed the compiler would automatically optimize that, but I haven't actually looked at the assembly...
GetEntropyForTrial is pretty large after all the inlining, so it's difficult to read the assembly to see what's actually going on. But I came up with:

x_ordinal += ((y - x) >> 31) & 1

This grabs the sign bit of the difference (assuming int is 32-bit). Now I get 127 microseconds. Good call!

Harebrained theory as to why eliminating the 2nd condition was more impactful than eliminating the 1st: the 1st condition is almost always true, while the 2nd is basically random, true ~half the time. So a branch predictor might find it easier to predict the 1st condition than the 2nd. Especially if it's the type of predictor that works by remembering the last N results for a given branch.
Actually, that formula suffers from overflow below the line y = x - INT_MAX / 2 and above the line y = x + INT_MAX / 2. That might not actually be a problem, because it still partitions the x,y plane into half 1 and half 0; it's just not the same y = x boundary that the previous (y < x) formula creates.
^ I mean, UINT_MAX / 2...or INT_MAX
I measured various alternatives:

if (y < x) x_ordinal++;
→ 197 microseconds

x_ordinal += (y < x);
→ 112 microseconds

x_ordinal += !(x / y); // x and y are uint32_t
→ 203 microseconds

x_ordinal += ((y - x) >> 31) & 1; // x and y are int32_t
→ 127 microseconds (but overflows)

int64_t x64 = x, y64 = y;
x_ordinal += ((y64 - x64) >> 63) & 1;
→ 155 microseconds

(These measurements are a bit noisy, even after averaging 85 runs, as I did.)

"+= (y < x)" is the winner, unless someone has something cleverer. These optimizations don't change the algorithm's output, though, so we can optimize this more later if we want.
Project Member

Comment 56 by bugdroid1@chromium.org, Nov 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7c0efea0a6b11b1ee2585a737e53b3535864bef8

commit 7c0efea0a6b11b1ee2585a737e53b3535864bef8
Author: Paul Miller <paulmiller@google.com>
Date: Tue Nov 13 23:49:00 2018

Implement NormalizedMurmurHashEntropyProvider

Design document:
https://docs.google.com/document/d/1cPF5PruriWNP2Z5gSkq4MBTm0wSZqLyIJkUO9ekibeo

BUG=890413

Change-Id: Ib372a573b1a0f68467f785ce74ef7821c9d48614
Reviewed-on: https://chromium-review.googlesource.com/c/1322350
Reviewed-by: Grace Kloba <klobag@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Paul Miller <paulmiller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#607816}
[modify] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/BUILD.gn
[modify] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/DEPS
[modify] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/entropy_provider.cc
[modify] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/entropy_provider.h
[modify] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/entropy_provider_unittest.cc
[add] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/variations_murmur_hash.cc
[add] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/variations_murmur_hash.h
[add] https://crrev.com/7c0efea0a6b11b1ee2585a737e53b3535864bef8/components/variations/variations_murmur_hash_unittest.cc

Blockedon: 881517
Project Member

Comment 58 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c4267fbbca666a8a354f01bf905adfdc0d917f1a

commit c4267fbbca666a8a354f01bf905adfdc0d917f1a
Author: Paul Miller <paulmiller@google.com>
Date: Thu Jan 03 03:20:35 2019

Remove (Caching)PermutedEntropyProvider

It's been replaced by NormalizedMurmurHashEntropyProvider; see commits
7c0efea0a6 and 4355667974.

BUG=890413,912368

Change-Id: I962ecdde976f84c0cdadfb6c1a36a3d12246fb0b
Reviewed-on: https://chromium-review.googlesource.com/c/1388287
Reviewed-by: Changwan Ryu <changwan@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Commit-Queue: Paul Miller <paulmiller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#619573}
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/metrics/metrics_service.h
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/metrics/metrics_state_manager.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/metrics/metrics_state_manager_unittest.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/BUILD.gn
[delete] https://crrev.com/4675eb4b376d0f9b5591fabadb2c61b78c30c3f4/components/variations/caching_permuted_entropy_provider.cc
[delete] https://crrev.com/4675eb4b376d0f9b5591fabadb2c61b78c30c3f4/components/variations/caching_permuted_entropy_provider.h
[delete] https://crrev.com/4675eb4b376d0f9b5591fabadb2c61b78c30c3f4/components/variations/caching_permuted_entropy_provider_unittest.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/entropy_provider.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/entropy_provider.h
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/entropy_provider_unittest.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/pref_names.cc
[modify] https://crrev.com/c4267fbbca666a8a354f01bf905adfdc0d917f1a/components/variations/proto/BUILD.gn
[delete] https://crrev.com/4675eb4b376d0f9b5591fabadb2c61b78c30c3f4/components/variations/proto/permuted_entropy_cache.proto
