
Issue 735663

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug

Blocking:
issue 735674




Reactivate HashTable stats features

Project Member Reported by cavalcantii@chromium.org, Jun 21 2017

Issue description

HashTable has a nice feature that allows collecting statistics about its use (e.g. accesses, number of collisions, rehashes, etc.).

I'm talking about: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/wtf/HashTable.h?q=HashTable+package:%5Echromium$&l=35

The issue is that the code is broken and won't compile (it has probably been broken for the last 2-3 years, since the Oilpan move).

The idea is to fix this (and preferably add some unit tests to help us avoid ending up with broken/dead code in HashTable again).
 
Cc: tkent@chromium.org haraken@chromium.org thakis@chromium.org
Owner: cavalcantii@chromium.org
Status: Started (was: Untriaged)
An ugly workaround just to verify whether it still works:
https://gist.github.com/Adenilson/4a07037eafd5827ec8b8e29d6e7ad4e8
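For readers unfamiliar with how such macro-guarded instrumentation typically looks, here is a minimal sketch of the idea (the names `SKETCH_DUMP_HASHTABLE_STATS`, `HashTableStats`, and `RECORD_ACCESS` are illustrative only, not the actual WTF API):

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>

#define SKETCH_DUMP_HASHTABLE_STATS 1

// Global counters updated on every table access; thread-safe via atomics.
struct HashTableStats {
  std::atomic<int> accesses{0};
  std::atomic<int> collisions{0};

  void RecordAccess(int probes) {
    accesses.fetch_add(1, std::memory_order_relaxed);
    if (probes > 1)
      collisions.fetch_add(probes - 1, std::memory_order_relaxed);
  }

  void Dump() const {
    std::printf("%d accesses, %d total collisions\n",
                accesses.load(), collisions.load());
  }
};

#if SKETCH_DUMP_HASHTABLE_STATS
#define RECORD_ACCESS(stats, probes) (stats).RecordAccess(probes)
#else
#define RECORD_ACCESS(stats, probes) ((void)0)  // compiles away when disabled
#endif
```

The downside mentioned later in this thread follows directly from this shape: when the macro is off by default, the instrumented code paths are never compiled, so they silently rot.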
Cc: drott@chromium.org
Labels: OS-All
Blocking: 735674
The output for loading a Wikipedia page (https://en.wikipedia.org/wiki/C%2B%2B):
#### Dumping HashTable stats

WTF::HashTable statistics

29248 accesses
5982 total collisions, average 1.20 probes per access
longest collision chain: 8
  4719 lookups with exactly 1 collisions (12.86% , 16.13% with this many or more)
  957 lookups with exactly 2 collisions (2.56% , 3.27% with this many or more)
  209 lookups with exactly 3 collisions (0.52% , 0.71% with this many or more)
  58 lookups with exactly 4 collisions (0.12% , 0.20% with this many or more)
  24 lookups with exactly 5 collisions (0.04% , 0.08% with this many or more)
  12 lookups with exactly 6 collisions (0.03% , 0.04% with this many or more)
  2 lookups with exactly 7 collisions (0.00% , 0.01% with this many or more)
  1 lookups with exactly 8 collisions (0.00% , 0.00% with this many or more)
16 rehashes
2040 reinserts

I'm dumping the HashMap that is used by the ShapeCache.
Could anyone more familiar with the inner workings of the HashTable class comment on which operations are more costly at run time (reinserts vs. rehashes, etc.)? That would give me a target for selecting a new hash function (https://bugs.chromium.org/p/chromium/issues/detail?id=735674).

The whole idea is that a better-distributed hash may be a bit more expensive to calculate but could avoid costly operations in the HashTable.

Comment 7 by tkent@chromium.org, Jun 22 2017

Cc: -tkent@chromium.org yutak@chromium.org
Components: Blink>Internals>WTF
yutak@ is very familiar with HashTable.

Comment 8 by yutak@chromium.org, Jun 23 2017

I'm actually trying to kill the current HashTable and create a new one, and part
of the reason is that the stats code gets constantly broken.

But it will take some time for that to happen, so yeah, I'm OK with fixing that
code.

However, I already tried to change hash functions and my conclusion at the time
was: better hash functions did NOT reduce collision rates meaningfully. That's
partly because most of our hash tables are small, and we only use a few bits of
hash values in most cases. I suspect our number of probes is already close to
the theoretical lower bound in our use cases.
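The "we only use a few bits" point is worth making concrete: with a power-of-two table, the bucket index depends only on the low log2(capacity) bits of the hash, so a 64-slot table looks at just 6 bits, no matter how well the high bits are mixed. A one-line sketch (hypothetical helper, but the masking idiom matches power-of-two tables like WTF's):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// capacity is assumed to be a power of two; only the low bits of the
// hash survive the mask.
size_t BucketIndex(uint32_t hash, size_t capacity) {
  return hash & (capacity - 1);
}
```

Two hashes that differ only in their high bits land in the same bucket of a small table, which is why a stronger hash function often cannot help much there.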
Dear Yutak

Thanks a lot for the feedback. I wonder if it would make sense to change the way the stats reporting code is implemented, to solve the issue of it being constantly broken.

Basically, the way it is today, it requires defining a macro to enable the code (so it is disabled by default and not easily testable). I wonder if we could instead rewrite it in a way that avoids macros?

I can see at least 2 possibilities:

a) Have a specialization of HashTable that contains the instrumentation code (e.g. StatsHashTable or similar) as a child class of HashTable. One advantage of this design is that it separates the stats code from HashTable (which is considerable clutter today). One disadvantage is that client code (in this case, ShapeCache/HashMap) would have to switch to the new class to actually enable stats collecting.

b) Maybe add a bool template parameter to HashTable to enable the instrumentation? The cost would be an extra pointer inside HashTable to host the object that collects the stats (i.e. 4 bytes on 32-bit, 8 bytes on 64-bit). I'm unsure about the performance implications at runtime (e.g. checks for the pointer?).

Maybe there is a better way to implement this... any thoughts?
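Option (b) could be sketched roughly like this. All names here are hypothetical; with an empty base class for the uninstrumented case, the extra pointer cost mentioned above can even be avoided entirely:

```cpp
#include <cassert>
#include <cstddef>

struct Stats {
  size_t accesses = 0;
  size_t collisions = 0;
};

// When kCollectStats is false this base is empty, so the empty-base
// optimization keeps the table size unchanged and the hook is a no-op.
template <bool kCollectStats>
struct StatsBase {
  void RecordAccess(int) {}
};

template <>
struct StatsBase<true> {
  Stats stats;
  void RecordAccess(int probes) {
    ++stats.accesses;
    if (probes > 1) stats.collisions += probes - 1;
  }
};

template <typename Key, bool kCollectStats = false>
class SketchHashTable : public StatsBase<kCollectStats> {
 public:
  // Stand-in for a real lookup; records how many probes it took.
  void SimulateLookup(int probes) { this->RecordAccess(probes); }
};
```

The instrumentation is selected at compile time, so tables that don't opt in pay nothing at runtime and nothing in size.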

Concerning rewriting the HashTable class, that would be awesome! Last week I found a nice report (http://www.idryman.org/blog/2017/05/03/writing-a-damn-fast-hash-table-with-tiny-memory-footprints/) where the author describes a hash table implementation that behaves better in both speed and memory use (the latter being important on mobile).

Finally, I was able to get fewer collisions loading a Wikipedia article (https://en.wikipedia.org/wiki/C%2B%2B) using CityHash and SipHash (will post the data next).

Questions:

a) How can we avoid rehashes?
b) Is it bound to the initial table size?
c) Maybe we could fine-tune it to the typical number of words in a page?

Loading the page (https://en.wikipedia.org/wiki/C%2B%2B) with CityHash seems to yield 16% fewer collisions:


#### Dumping HashTable stats

WTF::HashTable statistics

20306 accesses
5031 total collisions, average 1.25 probes per access
longest collision chain: 9
  3756 lookups with exactly 1 collisions (14.06% , 18.50% with this many or more)
  900 lookups with exactly 2 collisions (3.17% , 4.43% with this many or more)
  256 lookups with exactly 3 collisions (0.91% , 1.26% with this many or more)
  71 lookups with exactly 4 collisions (0.21% , 0.35% with this many or more)
  29 lookups with exactly 5 collisions (0.09% , 0.14% with this many or more)
  11 lookups with exactly 6 collisions (0.03% , 0.05% with this many or more)
  5 lookups with exactly 7 collisions (0.01% , 0.02% with this many or more)
  2 lookups with exactly 8 collisions (0.00% , 0.01% with this many or more)
  1 lookups with exactly 9 collisions (0.00% , 0.00% with this many or more)
18 rehashes
4088 reinserts

Loading the page (https://en.wikipedia.org/wiki/C%2B%2B) with SipHash seems to yield 24% fewer collisions:
#### Dumping HashTable stats

WTF::HashTable statistics

20092 accesses
4575 total collisions, average 1.23 probes per access
longest collision chain: 11
  3225 lookups with exactly 1 collisions (11.95% , 16.05% with this many or more)
  824 lookups with exactly 2 collisions (2.71% , 4.10% with this many or more)
  279 lookups with exactly 3 collisions (0.83% , 1.39% with this many or more)
  113 lookups with exactly 4 collisions (0.33% , 0.56% with this many or more)
  46 lookups with exactly 5 collisions (0.07% , 0.23% with this many or more)
  32 lookups with exactly 6 collisions (0.05% , 0.16% with this many or more)
  22 lookups with exactly 7 collisions (0.03% , 0.11% with this many or more)
  16 lookups with exactly 8 collisions (0.00% , 0.08% with this many or more)
  16 lookups with exactly 9 collisions (0.07% , 0.08% with this many or more)
  1 lookups with exactly 10 collisions (0.00% , 0.00% with this many or more)
  1 lookups with exactly 11 collisions (0.00% , 0.00% with this many or more)
18 rehashes
4088 reinserts

Comment 12 by yutak@chromium.org, Jun 26 2017

Re comment 9,

What I'm currently thinking is close to the (b), except that the template
parameter is a type defining a set of hook functions, not a bool. In this way,
you can inject your measurement code in a limited place you like, and other
tables' performance is unaffected, as long as default empty hook functions are
inlined.
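A rough sketch of this hooks idea, under the same caveat that all names are hypothetical: the template parameter supplies hook functions, the default hooks are empty and inline away, and measurement code is injected only where wanted.

```cpp
#include <cassert>

struct NoOpHooks {
  static void OnAccess(int /*probes*/) {}  // empty; inlines to nothing
};

template <typename Key, typename Hooks = NoOpHooks>
class HookedHashTable {
 public:
  // Stand-in for a real lookup; reports probe count to the hooks.
  void SimulateLookup(int probes) { Hooks::OnAccess(probes); }
};

// Measurement code lives entirely outside the table class.
struct CountingHooks {
  static int accesses;
  static void OnAccess(int) { ++accesses; }
};
int CountingHooks::accesses = 0;
```

Only the tables explicitly instantiated with `CountingHooks` carry instrumentation; everything else uses the default parameter and is unaffected.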

But I don't plan to do this in the current HashTable code. It's so cluttered
and does too many things in one class, making it pretty hard to add another
template parameter.

Re questions,

(a) and (b) are kinda hard to answer. Supposedly a larger initial table size should
reduce the number of rehashes, but I'm not sure whether the current HashTable
implementation works that way. HashTable does a "shrinking" rehash or even a
"same-size" rehash in cases where there are too few live elements or there are
too many deleted slots. I'm not quite sure whether those routines are invoked
at the correct times (e.g. don't shrink if the default table size is set to a
large number).
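The grow-rehash part of this is easy to model: a rehash triggers when live elements exceed some fraction of capacity, so a larger initial capacity defers rehashes. The threshold below (3/4 full) is illustrative only; WTF's actual constants live in HashTable.h and also account for deleted slots.

```cpp
#include <cassert>
#include <cstddef>

// Tracks only the sizing decisions of a growing hash table, not its contents.
struct TableSizer {
  size_t capacity;
  size_t size = 0;
  int rehashes = 0;

  explicit TableSizer(size_t initial_capacity) : capacity(initial_capacity) {}

  void Insert() {
    ++size;
    // Grow (rehash) when the table becomes more than 3/4 full.
    if (size * 4 > capacity * 3) {
      capacity *= 2;
      ++rehashes;
    }
  }
};
```

Starting at capacity 8, inserting 100 elements costs five grow-rehashes; starting at 1024 costs none, which is the intuition behind question (b).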

(c) is not practical. We use loooooots of HashTables and many of those have
nothing to do with the number of words in the page. Even further, it's almost
impossible to tune parameters of HashTable, because a slightest change in
those parameters would cause chaotic performance changes everywhere. This is
because we inline a lot of functions, and a small change in the code can
result in a butterfly effect, changing the compiler's inlining decisions everywhere.

Re comment 10 and 11,

I would recommend repeating the tests, say, 10 times. AFAIK those results are
pretty flaky, and I was not able to get statistically significant results.
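For anyone repeating the measurements as suggested, summarizing the runs with a mean and sample standard deviation before drawing conclusions is the minimum; a simple pair of helpers suffices:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

double Mean(const std::vector<double>& xs) {
  double sum = 0;
  for (double x : xs) sum += x;
  return sum / xs.size();
}

// Sample (n-1) standard deviation; requires at least two measurements.
double SampleStdDev(const std::vector<double>& xs) {
  double m = Mean(xs);
  double ss = 0;
  for (double x : xs) ss += (x - m) * (x - m);
  return std::sqrt(ss / (xs.size() - 1));
}
```

If the standard deviation across 10 runs is comparable to the difference between hash functions, the difference is noise, which matches the flakiness reported here.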
Project Member Comment 13 by bugdroid1@chromium.org, Jul 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/093e477880015860a43550d9330ed922810ceeb2

commit 093e477880015860a43550d9330ed922810ceeb2
Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com>
Date: Fri Jul 21 22:34:26 2017

Reactivate HashTable stats feature

It is a feature that allows to collect data from HashTables (e.g. rehashes,
reads, number of collisions, etc).

This change does:
- fixes the functionality for general HashTable stats
- also fixes for individual HashTables

And introduces a public member function (guarded behind a macro)
called DumpStats() that allows to dump the stats from HashMap.

Bug:  735663 
Change-Id: I86f826c3d23f02f0d06dba7e0a5a157601bbd4ce
Reviewed-on: https://chromium-review.googlesource.com/556919
Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
Reviewed-by: Yuta Kitamura <yutak@chromium.org>
Cr-Commit-Position: refs/heads/master@{#488777}
[modify] https://crrev.com/093e477880015860a43550d9330ed922810ceeb2/third_party/WebKit/Source/platform/wtf/HashMap.h
[modify] https://crrev.com/093e477880015860a43550d9330ed922810ceeb2/third_party/WebKit/Source/platform/wtf/HashTable.cpp
[modify] https://crrev.com/093e477880015860a43550d9330ed922810ceeb2/third_party/WebKit/Source/platform/wtf/HashTable.h

Status: Verified (was: Started)
