Prefetch SimpleCache stream 1
Issue description:
90% of cache reads are < 32KB. If we prefetch stream 1 when opening an entry, then for cache hits we'd make 90% of the read operations synchronous. There is a slight additional cost to cache misses, but it's just a single read call. This could help bring SimpleCache speeds up to BlockFile. Granted, we could make the same optimization for BlockFile.
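For concreteness, here is a minimal, self-contained sketch of the idea in plain C++. This is not the actual SimpleCache code; the 32 KiB cutoff and all the names are assumptions taken from the description above. On open, a small file is pulled into memory in a single read so that subsequent stream reads become memory copies:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

constexpr std::streamsize kPrefetchThreshold = 32 * 1024;  // assumed 32 KiB cutoff

struct OpenedEntry {
  std::vector<char> prefetched_bytes;  // whole file contents when small enough
  bool fully_prefetched = false;
};

// Open a cache entry's file; for small files, pull everything into memory in
// one read so later stream reads never touch the disk.
OpenedEntry OpenEntryFile(const std::string& path) {
  OpenedEntry entry;
  std::ifstream file(path, std::ios::binary | std::ios::ate);
  const std::streamsize size = file.tellg();
  if (file && size >= 0 && size <= kPrefetchThreshold) {
    entry.prefetched_bytes.resize(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(entry.prefetched_bytes.data(), size);
    entry.fully_prefetched = file.good();
  }
  // Larger files keep the old behavior: read headers/stream 0 now and fetch
  // stream 1 on demand.
  return entry;
}

A real implementation would also have to deal with the entry being rewritten after the prefetch, which is what the later discussion about buffer views is about.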
May 9 2017
Hmm, my take on it would be to read the entire file for files < 32KiB, though then one has to decide whether to memcpy or to overlap the pieces. The trouble with the latter is that you waste memory if the payload is rewritten. Hmm, maybe we can copy in that case, though. Is there a cord-like type or something that might be usable here?
May 9 2017
I am skeptical, but I'd be happy to help review code for this experiment anyway. The reason for skepticism is the following intuition: we have 3 tiers of latency in what the disk cache does:
1. CPU work and IO thread queueing
2. reading incrementally in predictable chunks
3. opening/seeking/mmap-ing files
This optimization seems to be targeting (1), while we spend an order of magnitude (or maybe 2 orders of magnitude) more in (2)+(3) than in (1). It is not seen on synthetic tests because we cannot synthesize disk latency properly. For (2), it seems that all OSes are smart enough to do readahead, and they are limited in the same way by the physics of the storage devices. Maybe ext4 is slightly better here because of its approach to data locality, but it seems unlikely for the difference to be big. For (3), SimpleCache is at a serious disadvantage, since it uses more of these open/mmap/seek/close/unlink calls. If this intuition is correct, then things to help SimpleCache would be crazy stuff like:
1. Research whether there is something we can do to tell those OSes to open files faster (hacks like making antiviruses fail fast at intercepting these file accesses? replacing a hypothetically slow unlink with a faster rename followed by a more async unlink?)
2. Use sqlite/leveldb underneath SimpleCache as a key-value backend store. This would be awkward to justify as an end goal (why didn't we choose sqlite in the first place? maybe to avoid losing the whole cache at once? now with mmap-ed sqlite, does disk corruption kill Chrome? etc.), but if an experiment like that shows good perf, we won't hurt code complexity too much while gaining some understanding.
May 9 2017
Start 100 resources in parallel and the IO thread becomes the problem. I believe that avoiding that extra thread hop can reduce TTFB by several ms.
May 9 2017
Re: I/O latency: did you see my numbers on https://codereview.chromium.org/2842193002/ on Android? I don't really have any intuition on how flash behaves on phones, but that change's numbers (if not based on a mistake, and if in any way transferable to real behavior) make me wonder whether trying to do a single read in small-file cases may be worthwhile. Thoughts? OTOH, you are right that open latency is probably the real #1 priority, yes. The 95th percentile of SimpleCache.Http.DiskOpenLatency on Windows is something like ~65ms[1], and that's very much blocking on doing anything else. I don't see much of an opportunity there, since for a lot of users with MS antivirus it seems to do all the work on the file open (maybe even running some JavaScript, apparently!). The higher-level approach of storing more stuff in the memory index, so that hits which actually can't be used are turned into misses without going to the disk, would likely help a lot, of course. We already do a rename on Windows, BTW, since you can't really get unlink semantics from delete.
[1] I've added a bunch of SimpleCache.Http.Disk*Latency stats recently, and deprecated the old unmarked SimpleCache.DiskOpenLatency.
May 11 2017
Ah, I forgot just how bad the opening time was on Windows at 95th percentile :( I do think that this patch would help, but you're right, it won't bring it up to par with block file cache.
May 11 2017
I am sketching this up, it's at least worth seeing how it looks on various platforms. It's not like any of the other ideas I am aware of look clearly better. (https://codereview.chromium.org/2874833005/)
May 11 2017
OK, that has something that works now, though it needs a bunch of robustification against file format errors and refactoring to avoid code duplication, so I ran it through the bulk read benchmark. See the last 4 tabs of https://docs.google.com/spreadsheets/d/1ri_AF3va-yZWIfo66-ZrQHiDpjOcziRdXoprzlqPeVU/edit#gid=319700794
Summary:
Windows, SSD: nil
OS X, SSD: noticeable speed up on cold reads (median: 1,426ms -> 1,197ms); noticeable speed up on warm reads (median: 701ms -> 631ms)
Android: noticeable speed up on cold reads (median: 13,511ms -> 10,877ms); noticeable slow down on warm reads (median: 2,575ms -> 3,643ms)
Linux, spinning disk: noticeable speed up on cold reads (median: 18,139ms -> 15,014ms); noticeable slow down on warm reads (median: 191ms -> 251ms)
The obvious guess for the warm regression on Android + Linux may be the memcpys? It may be worth trying to avoid them, though it would be a bit of a refcount dance.
May 12 2017
I don't have good ideas on the cause of the 50% regression on Android; memcpy doesn't make for an immediately comfortable explanation. Maybe we could measure the overall time spent in those memcpys before jumping to conclusions? I'd also like to try other devices (I have a few here at my desk), but I'm not sure how many more would sound convincing to me :)
Meta-note: the number of measurements is growing (which is good), and the benchmark is sensitive to the changes being introduced (awesome!), which suggests that we have a few valuable assets for later. Can we make things more reproducible and validate-able for the next generations of SimpleCache hacking? I'd be willing to make a colab notebook that loads raw output from net_perftests and produces graphs, but it would be sad for that to be a one-time investment. Would you be OK using/updating such a colab template for new SimpleCache benchmark studies? I also think we should land the perftest into trunk, since we are going to make more than one conclusion based on it, and hence it had better be reviewed and maintained for future studies.
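One cheap way to test the memcpy theory before drawing conclusions, sketched in plain C++ (in Chromium this would more likely be base::ElapsedTimer feeding a histogram; TimedCopy and the global counter are invented names for illustration):

#include <chrono>
#include <cstddef>
#include <cstring>

// Accumulated wall time spent in the instrumented copies.
inline std::chrono::nanoseconds g_total_memcpy_time{0};

void TimedCopy(void* dst, const void* src, std::size_t n) {
  const auto start = std::chrono::steady_clock::now();
  std::memcpy(dst, src, n);
  g_total_memcpy_time += std::chrono::duration_cast<std::chrono::nanoseconds>(
      std::chrono::steady_clock::now() - start);
}

Dumping g_total_memcpy_time at the end of a warm benchmark run would show whether the copies are anywhere near the observed regression.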
May 12 2017
Hacking out the data memcpy does seem to bring it back into line (and makes the cold numbers even better). It is avoidable, and not even tricky, just more extra code than I would like to add (it needs a little helper keeping track of up to 2 views of a shared buffer, where one of them can own its own data).
re: colab: how would it run stuff?
re: perf tests: well, you did catch a real problem with the write numbers on review, and thinking about it some more, some of my reasons for considering it uncommittable aren't really good: DiskOpenEntryCount can probably be a properly documented test-only method (the reason I wanted it here --- knowing when closes actually finish --- might make it useful for unit tests, too), and to fix the write thing I will probably drop the dependency on the MessageLoopHelper stuff anyway, so I shouldn't need to change it. And key length? Who cares.
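A rough sketch of the kind of helper described above --- up to two views over one shared prefetch buffer, where a view can also own its data outright (say, after a rewrite). The names are invented, and real SimpleCache code would presumably use net::IOBuffer and scoped_refptr rather than std::shared_ptr:

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A stream's bytes are either an offset+length window into the shared
// prefetch buffer, or a buffer this view owns outright.
class StreamDataView {
 public:
  // View into a shared prefetch buffer.
  StreamDataView(std::shared_ptr<std::vector<char>> buffer,
                 std::size_t offset, std::size_t length)
      : buffer_(std::move(buffer)), offset_(offset), length_(length) {}

  // View that owns its own data (e.g. after the stream is rewritten).
  explicit StreamDataView(std::vector<char> own_data)
      : buffer_(std::make_shared<std::vector<char>>(std::move(own_data))),
        offset_(0),
        length_(buffer_->size()) {}

  const char* data() const { return buffer_->data() + offset_; }
  std::size_t length() const { return length_; }

 private:
  std::shared_ptr<std::vector<char>> buffer_;
  std::size_t offset_;
  std::size_t length_;
};

With this, one prefetched read produces a single allocation and stream 0 / stream 1 each become a view into it, so no memcpy is needed to split them apart.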
May 15 2017
Uploaded a version that tries to do it properly; it seems to have mixed results on Linux. Looking at Android now. (Also including headers-only behavior, since this change has the potential to hurt that case, and the numbers between headers-only and headers-and-body do differ considerably on Android.)
May 15 2017
Added a couple more tabs re-testing the new revision (on Android and Linux). It doesn't seem that it avoided the warm-case regression, despite what testing with the memcpy hacked out would indicate. Will need to think on it some.
Aug 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be

commit 2acfa3ba04b5902deefdc2aacdcc1d71e785d9be
Author: morlovich <morlovich@chromium.org>
Date: Tue Aug 29 22:24:46 2017

For files smaller than a configurable threshold, we will only issue a single file read, immediately reading in all of the metadata and the stream 0 and stream 1 content upon entry opening. Files exceeding the threshold keep the old behavior of only reading stream 0 and metadata (with a somewhat more complicated sequence of file ops), and read stream 1 content on demand. The exact threshold will be determined by an experiment, with the code defaulting to 0 (i.e. disabled).

Prefetching stream 1 content like this seems to help cold read performance significantly on Mac (and less so on Android and Linux; I haven't seen much effect on Windows). Doc for experiment: https://goo.gl/RgW942

A few bits and pieces based on jkarlin's 2872943002

BUG=719979
Review-Url: https://codereview.chromium.org/2874833005
Cr-Commit-Position: refs/heads/master@{#498256}

[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/net/disk_cache/entry_unittest.cc
[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/net/disk_cache/simple/simple_entry_impl.cc
[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/net/disk_cache/simple/simple_entry_impl.h
[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/net/disk_cache/simple/simple_synchronous_entry.cc
[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/net/disk_cache/simple/simple_synchronous_entry.h
[modify] https://crrev.com/2acfa3ba04b5902deefdc2aacdcc1d71e785d9be/tools/metrics/histograms/histograms.xml
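Restating the gating logic from the commit message as code --- this is a paraphrase rather than the committed implementation, and the enum and function names are made up:

#include <cstdint>

enum class OpenReadPlan {
  kSingleReadOfWholeFile,   // metadata + stream 0 + stream 1 in one read
  kStream0AndMetadataOnly,  // old behavior; stream 1 is read on demand
};

// The threshold comes from the experiment and defaults to 0, i.e. disabled.
OpenReadPlan ChoosePlan(int64_t file_size, int64_t experiment_threshold) {
  if (experiment_threshold > 0 && file_size <= experiment_threshold)
    return OpenReadPlan::kSingleReadOfWholeFile;
  return OpenReadPlan::kStream0AndMetadataOnly;
}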
Aug 30 2017
Woot!
Sep 11 2017
The launch bug is https://bugs.chromium.org/p/chromium/issues/detail?id=762614; the experiment is running as SimpleCachePrefetchExperiment (some things don't make sense, but I should probably wait for a few more days before drawing any conclusions).
Sep 18 2017
Hmm, there are lots of numeric differences in the experiment, but none are statistically significant after a week. I may have gotten way overambitious with the number of branches; I may want to simplify.
Sep 18 2017
I definitely recommend running just one experiment group.
Sep 18 2017
Any advice on fine-tuning the threshold, then? Maybe waiting for beta?
Sep 18 2017
I think we'll need stable to get solid fine-tuning numbers. In the meantime, let's do 32KB and see if it has any impact.
Sep 18 2017
However, it looks like we're already seeing wins with the smaller groups: https://uma.googleplex.com/p/chrome/variations/?sid=de3c2b9d8338733fc17a7daf7135d0cf shows significant wins in AccessToDone.Used on Android with negligible impact on sent requests.
Sep 18 2017
Hmm, is that looking at both canary and dev at once?
Sep 18 2017
Yes.
Sep 22 2017
This is somewhat encouraging (though not statistically so on Mac --- we probably need at least enough time for the simplified experiment to roll out):
W: https://uma.googleplex.com/p/chrome/variations/?sid=1abc4e12c07c6cf1309bb6228f4a6586
M: https://uma.googleplex.com/p/chrome/variations/?sid=3f73fb05c6369a990b2bec66bee9edf1
A: https://uma.googleplex.com/p/chrome/variations/?sid=3d553e4a1d6e131688f88f246cadaebf
Sep 26 2017
So I am probably getting ahead of things before the numbers are solid, but it feels like a win on Windows and Android, while it is in trouble on OS X: while the Used case is much improved, CantConditionalize and Updated get hurt even more significantly. That's about the opposite of what I expected, platform-wise, but the good part is that CantConditionalize is what the in-memory hints work is supposed to help...
Sep 28 2017
I'm really interested in synchronously handling the Read() call if the data has been prefetched. That should have a nice impact as well and should be part of this experiment, IMHO. Any chance that can land soon?
Sep 28 2017
Quite possible: it would be https://chromium-review.googlesource.com/c/chromium/src/+/682139, which I think I've cleaned up enough to be reviewable. I feel a bit awkward going ahead with it, though, since pasko@ didn't think an earlier version was worthwhile and he is away for a bit --- though I think the presence of throttling was a factor for him.
Sep 28 2017
Well, we'll see if there is a bump in the finch trial. That can help to determine if it's worthwhile or not. Reviewing now.
Sep 28 2017
What stats would you expect to be affected?
Sep 28 2017
Net.HttpTimeToFirstByte, HttpCache.AccessToDone.Used, PageLoad.PaintTiming.NavigationToFirstContentfulPaint
Sep 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f1625c46e1ca3097df9c8a1a49ef91caa064e432

commit f1625c46e1ca3097df9c8a1a49ef91caa064e432
Author: Maks Orlovich <morlovich@chromium.org>
Date: Thu Sep 28 18:57:27 2017

SimpleCache: synchronously reply to reads from idle if the data is in memory. This includes stream 0 all the time, and stream 1 if available from a prefetch. To reduce code duplication, this change does regress some zero-length-read cases from non-idle state that used to be synchronous, which may be worth reconsidering and undoing. This also includes a microbenchmark to measure some of our bookkeeping overhead, which is also improved a bit by this, but it's not the sort of number (on the order of half a microsecond) to get overly excited about.

Bug: 719979
Change-Id: I1079d023fd3c555a87ae4175225154467243e6e3
Reviewed-on: https://chromium-review.googlesource.com/682139
Reviewed-by: Josh Karlin <jkarlin@chromium.org>
Commit-Queue: Maks Orlovich <morlovich@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505106}

[modify] https://crrev.com/f1625c46e1ca3097df9c8a1a49ef91caa064e432/net/disk_cache/disk_cache_perftest.cc
[modify] https://crrev.com/f1625c46e1ca3097df9c8a1a49ef91caa064e432/net/disk_cache/entry_unittest.cc
[modify] https://crrev.com/f1625c46e1ca3097df9c8a1a49ef91caa064e432/net/disk_cache/simple/simple_entry_impl.cc
[modify] https://crrev.com/f1625c46e1ca3097df9c8a1a49ef91caa064e432/net/disk_cache/simple/simple_entry_impl.h
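The fast path this commit describes can be sketched roughly as follows; the types and names here are simplified stand-ins, not the real SimpleEntryImpl API:

#include <algorithm>
#include <cstring>
#include <deque>
#include <vector>

// Simplified stand-in for an entry's in-memory state.
struct InMemoryEntry {
  std::deque<int> pending_operations;  // stand-in for the queued-ops list
  std::vector<char> stream_data[2];    // stream 0 always; stream 1 if prefetched
  bool stream_in_memory[2] = {true, false};
};

// Returns bytes copied, or -1 to signal "fall back to the asynchronous path".
int TryReadSynchronously(const InMemoryEntry& entry, int stream, int offset,
                         char* out, int out_len) {
  if (!entry.pending_operations.empty() || !entry.stream_in_memory[stream])
    return -1;
  const auto& data = entry.stream_data[stream];
  if (offset >= static_cast<int>(data.size()))
    return 0;  // read past the end: nothing to copy
  const int n = std::min(out_len, static_cast<int>(data.size()) - offset);
  std::memcpy(out, data.data() + offset, n);
  return n;  // completed inline: no thread hop, no posted callback
}

When this returns -1, the caller would post the operation to the cache thread as before.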
Oct 20 2017
So... It's hard to be sure, since I screwed up and managed to not extend the main experiment, causing a gap in the data. But what's encouraging is that the last 7 days of data actually show a statistically significant improvement in *median* for HttpCache.AccessToDone.Used over blockfile on Windows (the 95th percentile is still worse, but less so than before), which I haven't seen before, so this synchronous thing may actually be helping.
Oct 20 2017
That's great! Overall the numbers for this experiment seem like a really big win.
Oct 20 2017
By the way, how are you comparing the prefetch experiment to the blockfile cache experiment? Directly comparing median times? Don't forget that only half of the Windows clients in the prefetch experiment group are actually using SimpleCache, so the actual impact should be even greater.
Oct 20 2017
Two experiment filters. These were first grouped by the backend experiment, so I can't tell whether the numbers for the effect of the backend with prefetch on vs. prefetch off are actually statistically different from each other.
Oct 20 2017
Anyway, it feels like it's a win on everything but OS X. So what's the next step --- a beta experiment for Android/ChromeOS/Linux? (I am not sure it's worthwhile to do Windows yet, since it doesn't quite close the gap, and perhaps the memhint stuff will help...)
Oct 20 2017
Is the concern the tail on Windows? I haven't seen the complete stats to know for sure.
Oct 21 2017
Pretty much, yeah. UMA links below, though we probably want a week or so more of data (can't go back much further because of my screw-up).
"Is SimpleCache good enough with this prefetch thing":
Windows: https://uma.googleplex.com/p/chrome/variations/?sid=5900d63fbed6a4076658349a563b77e6 (Actually way closer than I remembered)
OS X: https://uma.googleplex.com/p/chrome/variations/?sid=d7d857cf5c07c55152cc54f2f8601e99 (Further away, but also a bit closer than I remembered)
"Is this prefetch thing helping":
Windows: https://uma.googleplex.com/p/chrome/variations/?sid=682d86b4c8112a5aff98f5d281b5e119 (yep)
OS X: https://uma.googleplex.com/p/chrome/variations/?sid=ffa73554837347f6e1bd452e58599b56 (probably a net loss --- the .Used case is improved, but all the others are on the wrong side)
Android: https://uma.googleplex.com/p/chrome/variations/?sid=f51473c2bde3d3ce90d9c0f24e70077b (yep)
ChromeOS: https://uma.googleplex.com/p/chrome/variations/?sid=2da6e59f8f5d8877354169fa88100b72
Linux: probably should verify on beta, or maybe even stable
Oct 23 2017
> "Is SimpleCache good enough with this prefetch thing": Windows:
> https://uma.googleplex.com/p/chrome/variations/?sid=5900d63fbed6a4076658349a563b77e6
> (Actually way closer than I remembered)
I am having a hard time interpreting the link. Is this Canary or Canary+Dev? If I restrict to only SimpleCacheExperiment (https://uma.googleplex.com/p/chrome/variations/?sid=3e855b3fa773b0d2b66738116b70c77a), there is no significant regression, and all metrics tend to go down with SimpleCache on HttpCache.AccessToDone. Is this expected, or am I doing something wrong? Also, do we understand the population shifts on .Used and .ntfcp? I would say, if this trend reproduces on Stable, and I am not doing something wrong, then we should roll out SimpleCache on Windows. A document with more detailed analysis and conclusions would help.
> OS X:
> https://uma.googleplex.com/p/chrome/variations/?sid=ffa73554837347f6e1bd452e58599b56
> (probably a net loss --- the .Used case is improved,
From this... a wild guess would be that OS X prefetches aggressively on its own?
Oct 23 2017
I don't know how to look at these links. What does it mean in the finch dashboard when you have results from multiple studies?
Oct 23 2017
re: comment #39: yeah, Canary + Dev, to bump up the sample count. I am confused, though: the link you gave is before the change, and there is a significant regression in the .Used portion --- any chance you pasted the wrong one? After this change (flipping the filter Control -> Prefetch32K) it's a lot closer.
WRT population shifts, the population with simple cache seems to have more requests overall, too, not just with the .Used. That seems to be just the #s based on the actives count... which may be a concern.
As for OS X: empirically, no --- if it did, the change would be pretty much a no-op. It seems to do so little prefetch in manual testing that I was expecting this to be a big win primarily there. Actually, I am pretty confused now... My previous impression was that prefetch was helping in cases where the body was useful and hurting otherwise, which resulted in a net loss (.Used got much better, .NotSent got a bit worse, net a bit worse). Now I can't make much sense out of it... the BeforeSend. metrics are a very weird mix, but the .SentRequest portion still looks worse. Weird. Are our canary users on NFS or something, so that disk I/O is slowing down network loading? (In the past: https://uma.googleplex.com/p/chrome/variations/?sid=54a3fd0081627ead756f0473be818d2f)
re: comment #40: it's just applying the second study as a filter, exactly the same way it would apply a filter on platform or channel or whatever. What may be worrying is that the populations are a bit different in size.
Oct 23 2017
Let's move this experiment up to beta to find out more.
For blockfile platforms:
50% - SimpleCache + Prefetch
50% - Control (Blockfile)
For simplecache platforms:
50% - Prefetch
50% - Control
WDYT?
Oct 23 2017
> WRT population shifts, the population with simple cache seems to have more
> requests overall, too, not just with the .Used. That seems to be just the #s
> based on the actives count... which may be a concern.
Yeah, this makes it more difficult to say whether there is a statistically significant change. Specifically, if there is a significant population shift, all the further shifts in quantiles that the Finch dashboard is so helpful to provide should be taken with a huge stone of salt. We should avoid population shifts on metrics as much as possible.
> I am confused, though: the link you gave is before the change,
What do you mean by "before the change"? I think I selected the cases where prefetching is not done and wanted to check whether ExperimentYes (=simplecache) is better than ExperimentControl (=blockfilecache). I think I selected the same date range as you. Annotated some: https://screenshot.googleplex.com/JKu2aJj4UaX
Basically, are these numbers trustworthy? If we trusted them, then we would have launched SimpleCache on Windows even before any prefetching, no?
> and there is a significant regression in the .Used portion --- any chance you
> pasted the wrong one? After this change (flipping the filter Control ->
> Prefetch32K) it's a lot closer.
Oct 23 2017
Egor: The numbers you're highlighting there aren't statistically significant. If you scroll on down to AccessToDone.Used you'll see that SimpleCache is significantly hurting performance but when you turn on prefetch SimpleCache gets closer to blockfile. I still think we need more data though. Let's do beta.
Oct 23 2017
re #42: I am confused by the naming you suggested for the groups. Also, why not check SimpleCache without prefetch on blockfile platforms? It would be good to check that our assumptions about it are still valid. Is it because the population is too small?
Naming I'd suggest:
For blockfile platforms:
BlockfileCacheWithCacheDroppedAtStart (this is the control group for the rest of the arms)
SimpleCacheWithoutPrefetch
SimpleCacheWithPrefetch
For simplecache platforms:
SimpleCacheWithoutPrefetch (this would be the control group)
SimpleCacheWithPrefetch
Seems like we need to reshuffle again to get the counts more equal? Then we'd again need to wait for the reshuffle to 'settle', right?
Oct 23 2017
> If you scroll on down to AccessToDone.Used you'll see that SimpleCache is
> significantly hurting performance
I don't see that. I see a huge increase in COUNT(AccessToDone.Used), which forces me to ignore the other 'significance'. It could be because cache eviction is better, or due to other reasons. If it is because of better eviction with SimpleCache, the performance on AccessToDone will likely be improved, right?
Oct 23 2017
#45: I believe we're suggesting the same thing, the only difference being splitting apart with and without prefetch on blockfile platforms. I wanted to have two groups instead of three to increase sample size, and because I thought we'd already run SimpleCache on beta in the past and rolled it back due to poor performance. But looking through SimpleCacheBeta.json's history, it appears we never ran Windows on beta. So splitting it is fine with me.
#46: Hmm, true, there is a 7% count difference. All the more reason to move on to beta and get more samples.
Oct 23 2017
I think splitting w/ and w/o on blockfile platforms is worth it at least on Mac since I am concerned prefetch is a net regression there --- and it's probably a good idea to have Windows use a consistent setup...
Oct 24 2017
re: skew on cases: some of it may be because some of the caches haven't quite recovered from my backend-experiment-non-renewal mistake... so the Windows/Mac metrics may be less reliable than I would like...
Nov 27 2017
So... It's been 4 weeks since this started on beta, so with 63 stable coming up soonish, it might be a good time to look at things. Warning: this bug report is public, so I'm going to be vague.
SC platforms only, since this is about maybe going ahead with stable:
Android: https://uma.googleplex.com/p/chrome/variations/?sid=9f9f3228e083421886a6dd55b9328aa0 --- about 12% faster on cache hits, and even a tiny improvement at median for overall loading. Disk opens and BeforeSend in non-hit/miss cases are a tiny bit worse, but that doesn't seem like enough to matter.
Android Low Mem: https://uma.googleplex.com/p/chrome/variations/?sid=2317714243621c0fdb10836fc76332d5 --- actually an even bigger improvement on hits, but some of the other cases seem to have a bigger downside. @pasko: anything else I should be looking at that may be affected by the I/O change?
CrOS: https://uma.googleplex.com/p/chrome/variations/?sid=6dd56e68438a7e0a49c334721fe23858 --- also a nice improvement on hits, but also noticeably more of a cost for the useless-open cases.
Linux: not enough samples(?) for anything to be conclusive.
Nov 28 2017
Short answer: your metrics are great, I have nothing to add.
Longer: I prefer 14-day aggregation because the groups ramped from 40% -> 50% in the first week, which does not change much in the conclusion. Agreed that BeforeSend.Updated is not very concerning. Lowmem is fascinating! A 17%-20% improvement on .Used! I don't see downsides (I am aggregating over 14 days: https://uma.googleplex.com/p/chrome/variations/?sid=e6a06ce993e8b075af46c07915576ea9). I also looked at the "Heartbeat" and "Stability" tabs. All is fine.
Rant below: what I find particularly sad is that a 10+% improvement in disk cache speed does not move FCP. One more data point: we are currently experimenting with turning off the preconnect predictor (also in Beta), and it also does not move FCP. I'm beginning to suspect that the metric is not powerful enough to drive page load speed decisions (no ideas how to make it better... hm... okay, making one with the use of UKM is easy, though I'm not sure it's a good idea).
Nov 28 2017
Well, at least Experimental.PaintTiming.NavigationToFirstMeaningfulPaint shows a statistically significant regression in the tail for switching to SimpleCache on Windows and Mac :( ... FCP isn't even statistically significant for that; and of course the gap going backwards in the data available for analysis is huge, too. Hmm, so when should I be applying for the stable launch, anyway? Can I do that now? Hmm, I should probably just ask in the launch bug or something.
Nov 28 2017
When we studied prerender-vs-nostateprefetch recently, FMP was too complicated to make sense of, so I expect FMP to be a bit too experimental to care about... currently. For Stable you need a launch bug and a launch review. Please proceed ASAP, since the process is not particularly fast. +Kenji may be able to help you with the launch; he likes improving speed on the web :) Or maybe you have another PM to help.
Nov 28 2017
FWIW, crbug/762614 is the launch bug I used previously..
Nov 28 2017
As far as I can read the labels, it is launch-approved for M63 Stable!
Nov 28 2017
Yeah, but that was after me asking for beta-approval --- it's based on an e-mail approval, and that didn't mention stable anywhere...
Nov 28 2017
OK, then just ping the LR thread about moving forward to a Stable experiment. If Emily did not intend to put the Stable label on (which is unlikely), you will get new questions/instructions.
Nov 28 2017
Sorry for the slow response, I've been travelling. There appears to be a slight tail regression in the loading metrics, but it's not consistent across platforms. Overall it seems like a win to me.