idea: run zram compression in one worker per core (to preemptively swap faster)
Issue description

Here's the idea:

1. Create a worker for zram, one worker per core in the system.
2. At zram probe time, reserve 2 * MAX_CORES pages of memory. This is our compress queue.
3. When told to compress, zram should do:
   3a) If space is available in the compress queue, copy the page and return immediately. If the workers aren't running, start them.
   3b) If space is not available in the compress queue, sleep until space is available.

---

The above system adds an extra copy per compression, but (IMHO) it's small compared to the cost of compression and ought to be worth it.

===

Why do this? It will allow kswapd to take advantage of all cores while swapping out. It also should be very easy.
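To make the flow concrete, here's a very rough sketch (untested, not real zram code; the zram_cq_* names, zcomp_compress_slot() and zram_cq_claim_slot() are made up for illustration, and locking / error handling are skipped):

    #include <linux/workqueue.h>
    #include <linux/semaphore.h>
    #include <linux/highmem.h>
    #include <linux/string.h>
    #include <linux/types.h>

    struct zram_cq_slot {
        struct work_struct work;
        void *buf;       /* one of the 2 * MAX_CORES reserved pages */
        u32 index;       /* zram block to store the compressed result in */
    };

    static struct workqueue_struct *zram_cq_wq;
    static struct semaphore zram_cq_sem;   /* counts free slots in the compress queue */

    /* Probe time: reserve the queue and set up the workers (steps 1 and 2). */
    static int zram_cq_init(void)
    {
        sema_init(&zram_cq_sem, 2 * num_online_cpus());
        /* max_active roughly caps in-flight compressions at one per core. */
        zram_cq_wq = alloc_workqueue("zram_cq", WQ_UNBOUND, num_online_cpus());
        return zram_cq_wq ? 0 : -ENOMEM;
    }

    /* Worker: runs on whatever core is free and does the actual compression. */
    static void zram_cq_worker(struct work_struct *work)
    {
        struct zram_cq_slot *slot = container_of(work, struct zram_cq_slot, work);

        zcomp_compress_slot(slot);   /* hypothetical: compress slot->buf into slot->index */
        up(&zram_cq_sem);            /* 3b: wake anyone sleeping for a free slot */
    }

    /* Called from the zram write path (step 3). */
    static void zram_cq_submit(struct page *page, u32 index)
    {
        struct zram_cq_slot *slot;
        void *src;

        down(&zram_cq_sem);              /* 3b: sleep here if the queue is full */

        slot = zram_cq_claim_slot();     /* hypothetical: hand out a free reserved slot */
        slot->index = index;

        /* 3a: copy the page so the caller can free it and move on immediately. */
        src = kmap_atomic(page);
        memcpy(slot->buf, src, PAGE_SIZE);
        kunmap_atomic(src);

        INIT_WORK(&slot->work, zram_cq_worker);
        queue_work(zram_cq_wq, &slot->work);
    }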
Jun 27 2017
I like the concept, but I don't understand a couple of things.

1. Why is the extra copy needed? The compress queue should just contain pointers to the pages to be compressed.
2. What's the benefit of passing the buck to a separate worker? Why not let the allocating task do the work itself?
Jun 27 2017
To continue on comment 2, point 2: yes, if there are 4 CPUs and 4 tasks are currently reclaiming (for any value of 4), additional reclaim attempts should probably block and be woken up by the reclaims that complete. Of course, one wonders if that's already happening when the tasks contend for some shared resource in the reclaim path.

What's probably also wrong is that tasks in the reclaim path might call schedule() because in the past they'd have to wait a long time for some disk operations to complete. That lets other tasks into the path and may have a negative impact (or does it really matter?). I am not sure there is any reason for tasks to sleep in that path.

In any case, with the cost of swapping one page out at less than 10 us, it's not clear that we can afford the overhead of passing the buck to another process and back.
Jun 28 2017
@2-@3: Just to be clear, this feature request is a little different than the idea of moving all direct reclaim to different tasks or otherwise messing with direct reclaim. This feature is _only_ about having zram use workers. The goal is to preemptively swap faster so we get into direct reclaim less often. Adjusting the title...

--

With that in mind: the main goal here is specifically to allow kswapd (not direct reclaim) to make use of all CPUs in the system. As I understand it:

A) kswapd is supposed to kick off preemptively to start swapping stuff out when we get a little low.

B) Right now, kswapd only uses one core, so the preemptive swapping is a bit handicapped. Yes, we will use more than one core when we fully run out of memory and each task starts doing direct reclaim, but it seems like we'd want to allow using more than one core (especially on a 6-core system) before that point.

C) Using more than one core for kswapd will not only get things swapped out faster, it should do so with lower power. On big.LITTLE systems some of those other cores are "LITTLE" cores and can work more efficiently (even if more slowly) than their big counterparts. Even on non-big.LITTLE systems, using more than one core should be more power efficient (race to idle).

--

With the idea that we want to make kswapd use more than one core for compression, we could try to instantiate more than one copy of kswapd, but that sounds scary and hard. Instead, I'd prefer to be invisible to kswapd. To be invisible to kswapd, we can use a worker scheme like I propose above. AKA:

Today:
1. kswapd IDs a page to swap out.
2. kswapd tells zram to swap out.
3. zram compresses.
4. zram returns.
5. kswapd can now free the page.
6. Go back to step #1.
...all single core.

Proposed:
1. kswapd IDs a page to swap out.
2. kswapd tells zram to swap out.
3. zram copies the page to its queue.
4. zram returns.
5. kswapd can now free the page.
6. Go back to step #1.
...the queue is serviced by all the other cores.

--

As you can see, the copy is important because it allows kswapd to free the page and move on to the next one.

---

I suppose you could get fancier and modify the above to have space in the queue for "NCPUs - 1". Then if there's no space you could compress the page yourself (rough sketch of that variant below). I'm not 100% sure it's worth it. I'd be really curious to know the speed difference between compressing a page vs. copying one. Looking at the worst memcpy speeds on some old bug on kevin, I see the slowest case at 190.4 MiB/s. For 4K, that would be 1 s / 190.4 MiB => 5 ns per MiB. Assuming I have no math or logical errors, it seems like the copy would be insignificant compared to the 10 us.

---

You mentioned that kswapd can run more than one instance. I thought I looked into that in the past and it wasn't easy, but maybe you can point me at a way to do it easily...
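FWIW, the fancier variant could look something like this (again just an illustration, reusing the hypothetical zram_cq_* bits from the sketch in the description; zram_compress_inline() is a made-up name for today's synchronous compress path):

    /* Variant: don't sleep when the queue is full; compress in the caller instead. */
    static void zram_cq_submit_or_compress(struct page *page, u32 index)
    {
        struct zram_cq_slot *slot;
        void *src;

        if (down_trylock(&zram_cq_sem)) {
            /* No free slot: do the work right here, exactly like today. */
            zram_compress_inline(page, index);
            return;
        }

        slot = zram_cq_claim_slot();
        slot->index = index;

        /* Copy so kswapd can free the page and move on to the next one. */
        src = kmap_atomic(page);
        memcpy(slot->buf, src, PAGE_SIZE);
        kunmap_atomic(src);

        INIT_WORK(&slot->work, zram_cq_worker);
        queue_work(zram_cq_wq, &slot->work);
    }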
Jun 28 2017
#4: Thank you for the clarifications.

About multiple kswapds: my bad, that only happens with multiple NUMA nodes, and likely RAM needs to be partitioned.

Copying speed: your math may be a little off somewhere. 1 s / 200 MiB = 5 ms per MiB, not ns. With zram, we can sustain a rate of 10 us/page, which is 100,000 pages/second, which is 400 MiB/second... which would be faster than plain copying? That cannot be right.
Jun 28 2017
Ah, yes. My math is way off. Thanks. Maybe I just need to test instead of looking at a random old unrelated number.
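Something like this quick userspace hack would give a direct copy-vs-compress number per 4K page (untested sketch; lz4 is just a stand-in for whatever compressor zram is actually configured with; build with "gcc -O2 bench.c -llz4"):

    #include <lz4.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define PAGE_SZ 4096
    #define ITERS   100000

    static double elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        static char src[PAGE_SZ], copy[PAGE_SZ];
        char *dst = malloc(LZ4_compressBound(PAGE_SZ));
        struct timespec t0, t1;
        int i;

        /* Semi-compressible input, very roughly like anonymous memory. */
        for (i = 0; i < PAGE_SZ; i++)
            src[i] = (i % 128) ? 'A' + (i % 23) : rand();

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++) {
            memcpy(copy, src, PAGE_SZ);
            asm volatile("" ::: "memory");  /* keep the copy from being optimized out */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memcpy: %.2f us/page\n", elapsed_us(t0, t1) / ITERS);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++)
            LZ4_compress_default(src, dst, PAGE_SZ, LZ4_compressBound(PAGE_SZ));
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("lz4:    %.2f us/page\n", elapsed_us(t0, t1) / ITERS);

        free(dst);
        return 0;
    }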
Aug 4 2017
I spent a little time trying to prototype this, but I think I need to give up. :( My prototype seemed to perform worse... Basically:

===

I looked at doing this at the zram level, but it ended up getting a bit on the nasty side.

- Once I added a queue of things waiting to compress, I suddenly needed to think about "peeking" into that queue. Said another way: my idea was that the main zram 'write' would just stick requests into a queue and then workers (running on any core) would pull work off the queue and compress it. Things got a bit complex when I started thinking about reads that might happen to the same memory that was already queued up (or additional writes to the same area of memory).

- zram implements a general block device and handles reads / writes of sizes that aren't exactly PAGE_SIZE, so my queue management got a bit more complex.

That's not to say it can't be done, but the complexity was getting above the "quick prototype for benchmarking" level.

--

I also looked at doing this at a higher level. I managed to do it (at least the system didn't crash), but either I missed some important details or my implementation was so inefficient that it didn't help (it actually hurt). I posted the prototype here:

https://chromium-review.googlesource.com/602881 WIP: RFC: mm: vmscan: Parallelize shrink_page_list()

I didn't test tons, but it didn't seem obviously better in the quick tests I did (the tab switching test or launch_balloons). For instance, on Kevin "./launchBalloons.sh 3 1500 4" went from taking 66 - 75 seconds (elapsed) to 160 - 170 seconds (elapsed).

===

In any case, maybe just close this and focus on other things for improving our low memory handling.
Aug 4 2017