Make our squashfs better
|||||||||||||
Issue description

As part of bug #702707 I noticed a whole bunch of processes all blocked waiting for a squashfs cache entry. A bunch of digging found that we have this defined:

  CONFIG_SQUASHFS_DECOMP_SINGLE

It seems like instead we should do:

  CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU

---

The Kconfig says:

choice
	prompt "Decompressor parallelisation options"
	depends on SQUASHFS
	help
	  Squashfs now supports three parallelisation options for
	  decompression. Each one exhibits various trade-offs between
	  decompression performance and CPU and memory usage.

	  If in doubt, select "Single threaded compression"

config SQUASHFS_DECOMP_SINGLE
	bool "Single threaded compression"
	help
	  Traditionally Squashfs has used single-threaded decompression.
	  Only one block (data or metadata) can be decompressed at any
	  one time. This limits CPU and memory usage to a minimum.

config SQUASHFS_DECOMP_MULTI
	bool "Use multiple decompressors for parallel I/O"
	help
	  By default Squashfs uses a single decompressor but it gives
	  poor performance on parallel I/O workloads when using multiple
	  CPU machines due to waiting on decompressor availability.

	  If you have a parallel I/O workload and your system has enough
	  memory, using this option may improve overall I/O performance.

	  This decompressor implementation uses up to two parallel
	  decompressors per core. It dynamically allocates decompressors
	  on a demand basis.

config SQUASHFS_DECOMP_MULTI_PERCPU
	bool "Use percpu multiple decompressors for parallel I/O"
	help
	  By default Squashfs uses a single decompressor but it gives
	  poor performance on parallel I/O workloads when using multiple
	  CPU machines due to waiting on decompressor availability.

	  This decompressor implementation uses a maximum of one
	  decompressor per core. It uses percpu variables to ensure
	  decompression is load-balanced across the cores.

endchoice

---

I don't know offhand, but I'd imagine that the extra memory used is very slight. I'd guess that since squashfs is often used on embedded systems, that's why the default is to save memory...

...but this needs to be checked, and we need to benchmark how much faster this could make us.

---

We'd want to apply this to various different kernels...
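For reference, a quick way to check which of these options a given kernel was built with (a sketch; assumes the kernel exposes its config via /proc/config.gz, i.e. CONFIG_IKCONFIG_PROC=y -- otherwise grep the build tree's .config):

# Confirm the squashfs decompressor option of the running kernel.
zcat /proc/config.gz | grep SQUASHFS_DECOMP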
,
Mar 30 2017
Sorry for the delay. I tested ARC++ boot performance (with the cheets_PerfBootServer autotest) on Kevin PVT + R59-9411.0.0. I chose the cheets_PerfBootServer benchmark because booting ARC++ accesses most of the files in the compressed system image.
Summary:
* The first two results were almost the same. This is probably because the autotest starts measurement when the (test) user signs in to Chrome OS, but while the login screen is shown, /etc/init/arc-ureadahead.conf has already prefetched files in the squashfs image with readahead(2). Because of this, I guess ARC++ didn't really decompress files in the image during the benchmark.
* If we disable arc-ureadahead.conf (the last 2 results), the boot time was slightly faster with CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU.
* The memory usage was about the same with and without MULTI_PERCPU.
So, although I didn't see any significant boot time improvement, I didn't see any negative effect either. I guess we should enable the config if it improves ARC++/Chrome OS behavior under memory pressure.
Detailed results:
----------------------------------------
* Boot time (in ms, # of iterations=10)
With CONFIG_SQUASHFS_DECOMP_SINGLE (original configuration):
mean 6253
median 6225
stddev 153
With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU:
mean 6222
median 6239
stddev 193
With CONFIG_SQUASHFS_DECOMP_SINGLE, without arc-ureadahead.conf:
mean 7180
median 7182
stddev 146
With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU, without arc-ureadahead.conf:
mean 7062
median 7031
stddev 133
----------------------------------------
* Memory usage
Steps:
1. Sign in (as a new user.)
2. Disable app sync.
3. Opt in to ARC++.
4. Once Play Store starts, observe memory usage.
5. Sign out/in.
6. Start Play Store and install Facebook.
7. Start Facebook, observe memory usage.
With CONFIG_SQUASHFS_DECOMP_SINGLE:
at step #4
total used free shared buffers cached
Mem: 3904940 3182580 722360 238804 351764 1859852
-/+ buffers/cache: 970964 2933976
Swap: 3813416 0 3813416
at step #7
total used free shared buffers cached
Mem: 3904940 3687064 217876 324072 383564 2276888
-/+ buffers/cache: 1026612 2878328
Swap: 3813416 4264 3809152
With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU:
at step #4
total used free shared buffers cached
Mem: 3904940 2916504 988436 183264 343416 1525128
-/+ buffers/cache: 1047960 2856980
Swap: 3813416 0 3813416
at step #7
total used free shared buffers cached
Mem: 3904940 3537112 367828 301792 368028 2149932
-/+ buffers/cache: 1019152 2885788
Swap: 3813416 0 3813416
----------------------------------------
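(For anyone re-running this: the snapshots above look like plain "free" output, values in KiB, captured at steps 4 and 7 -- i.e. roughly the following; the file name is just an example.)

free | tee memusage_step4.txt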
,
Apr 3 2017
So putting printouts in the code, this doesn't look so great. It appears that each time we mount a squashfs filesystem we create (NCPUS * 2) buffers. On rk3399 this appears to be 6 * 2 = 12, but perhaps on other systems it could be more. Each buffer appears to be 128 K; at least that's how our squashfs appears to be configured (presumably we passed 128K blocks to mksquashfs).

We appear to mount 5 squashfs filesystems:

/dev/loop1 on /opt/google/containers/android/rootfs/root type squashfs (ro,nosuid,nodev,noexec,relatime,seclabel)
/dev/loop2 on /opt/google/containers/android/rootfs/root/vendor type squashfs (ro,nosuid,nodev,noexec,relatime,seclabel)
/dev/loop3 on /opt/google/containers/arc-removable-media/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)
/dev/loop4 on /opt/google/containers/arc-sdcard/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)
/dev/loop5 on /opt/google/containers/arc-obb-mounter/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)

Thus we're eating about 7680 KB (7864320 bytes) of memory, compared to 1280 KB (1310720 bytes). That's not a trivial amount of memory to eat unless we see a real performance gain somewhere.

---

I checked to see how much time we spent waiting in squashfs_cache_get(). It looks like it varies quite a bit, but overall it doesn't seem to change by leaps and bounds.

---

I'm going to close this as WontFix.
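For reference, the arithmetic behind those totals (a sketch using the figures above: 6 CPUs on rk3399, 128K blocks, 5 mounts):

# List the squashfs mounts counted above:
grep squashfs /proc/mounts
# Per-mount buffers with multiple streams: 6 CPUs * 2 = 12
#   5 mounts * 12 buffers * 128 KB  vs  5 mounts * 2 buffers * 128 KB
echo "$((5 * 12 * 128)) KB vs $((5 * 2 * 128)) KB"   # 7680 KB vs 1280 KB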
,
Apr 19 2017
Taking this again. It looks like it might actually matter. :) Using it as an umbrella for several things to make squashfs work better, especially in the presence of low memory.
,
Apr 19 2017
OK, this is a nice easy change and frees up 1MB. This will also make it much more feasible to enable multiple decompression streams... https://chrome-internal-review.googlesource.com/c/357804/
,
Apr 21 2017
News for the day:
* I think I have some tests that let me analyze the performance of various squashfs settings. I'll post them here when they're a bit cleaner and hopefully also get an autotest written. Basically I'm using "drop caches" and some random accesses to try to simulate things.
* My tests seem to be pointing to issues with loopback devices, specifically when doing random accesses with a cold cache. I can mostly fix this with:

for dev in /sys/devices/virtual/block/loop*; do
  echo 0 > "${dev}/queue/read_ahead_kb"
done
...some of my tests get _worse_ with this, but overall it looks plausible it will be a big win in the low memory case. I still need to dig into exactly what is happening here.
* I'm told that we've actually already moved squashfs to gzip but that we're not 100% sure of the performance impact here. I'll measure that as part of this.
,
Apr 21 2017
Need to take a break from this to work on a very urgent failure on veyron, but for now posting my (ugly, wip) scripts that I've been using to test, plus early results.

Quick summary of my (early!) results:

-- What we had (lzo 128K, one compression stream):

(SIMULATED) fetching patches when low on memory:
Loop 0 - Done in 15.18 seconds
Loop 1 - Done in 15.13 seconds
Loop 2 - Done in 15.34 seconds

(SIMULATED) fetching pages when not too low on memory:
Loop 0 - Done in 1.67 seconds
Loop 1 - Done in 1.67 seconds
Loop 2 - Done in 1.58 seconds

-- What we have now (gzip 128K, one compression stream):

(SIMULATED) fetching patches when low on memory:
Loop 0 - Done in 20.71 seconds
Loop 1 - Done in 20.80 seconds
Loop 2 - Done in 20.67 seconds

(SIMULATED) fetching pages when not too low on memory:
Loop 0 - Done in 3.51 seconds
Loop 1 - Done in 3.50 seconds
Loop 2 - Done in 3.66 seconds

-- What I think might be a sane option (gzip 16K, multiple compression streams, no readahead):
NOTE: grows rootfs by 18MB compared to gzip, but still 13MB smaller than lzo

(SIMULATED) fetching patches when low on memory:
Loop 0 - Done in 6.12 seconds
Loop 1 - Done in 6.07 seconds
Loop 2 - Done in 6.13 seconds

(SIMULATED) fetching pages when not too low on memory:
Loop 0 - Done in 2.64 seconds
Loop 1 - Done in 2.51 seconds
Loop 2 - Done in 2.53 seconds

-- Another option (xz 8K, multiple compression streams, no readahead):
NOTE: grows rootfs by 3 MB compared to gzip

(SIMULATED) fetching patches when low on memory:
Loop 0 - Done in 4.85 seconds
Loop 1 - Done in 4.96 seconds
Loop 2 - Done in 4.97 seconds

(SIMULATED) fetching pages when not too low on memory:
Loop 0 - Done in 2.31 seconds
Loop 1 - Done in 2.31 seconds
Loop 2 - Done in 2.43 seconds

===
Open questions:
* How do we really properly readahead on loopback devices? A udev rule? Some code? Something else? (One idea is sketched below.)
* Some test cases get worse when you disable readahead on loopback devices, but overall it seems like a big win. In theory we should already be doing readahead on the eMMC block device, so it seems silly to have it in two places.
* Is it OK to grow the root filesystem a tad bit compared with gzip?
* Need to test this with veyron (presumably the slowest device w/ ARC++) and confirm that xz performs OK there. Need to turn on xz everywhere.
===
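One idea for the udev-rule question above -- just a sketch, the rule file name is made up and this isn't something we ship today: install a rule so readahead is zeroed whenever a loop device appears, instead of relying on a one-shot script.

cat <<'EOF' | sudo tee /etc/udev/rules.d/99-loop-readahead.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="loop[0-9]*", ATTR{queue/read_ahead_kb}="0"
EOF
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block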
,
Apr 21 2017
> Need to turn on xz everywhere.

Android's mksquashfs currently only supports gzip and lz4. This is a minor point, but we should probably add xz support there too if we switch to xz. ARC developers usually build system.raw.img with squashfs+gzip outside the Chrome OS chroot and push it to the device to test it. I think ARC devs would probably want to use xz for this workflow too.
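For reference, the host-side rebuild would presumably look something like this (paths and block size are placeholders, and it assumes a mksquashfs built with xz support):

# Hypothetical: repack the Android rootfs with xz instead of gzip.
mksquashfs /path/to/android_rootfs system.raw.img -comp xz -b 8K -noappend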
,
Apr 21 2017
,
Apr 21 2017
,
Apr 21 2017
XZ and gzip are both bit-aligned compression formats, and they have a serious effect on the delta payload size (a 4x-8x increase). We have a solution for gzip that reduces it to about 2x, but no solution for XZ at this point.
,
Apr 22 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/overlays/project-cheets-private/+/123c3907a7017438d1a6e55eacf7a2422257a725

commit 123c3907a7017438d1a6e55eacf7a2422257a725
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 22 04:57:49 2017
,
Apr 27 2017
I'm looking at this for Doug -- I'd also like to turn on CONFIG_SQUASHFS_4K_DEVBLK_SIZE
This is the Kconfig description:
config SQUASHFS_4K_DEVBLK_SIZE
	bool "Use 4K device block size?"
	depends on SQUASHFS
	help
	  By default Squashfs sets the dev block size (sb_min_blocksize)
	  to 1K or the smallest block size supported by the block device
	  (if larger). This, because blocks are packed together and
	  unaligned in Squashfs, should reduce latency.

	  This, however, gives poor performance on MTD NAND devices where
	  the optimal I/O size is 4K (even though the devices can support
	  smaller block sizes).

	  Using a 4K device block size may also improve overall I/O
	  performance for some file access patterns (e.g. sequential
	  accesses of files in filesystem order) on all media.

	  Setting this option will force Squashfs to use a 4K device block
	  size by default.

	  If unsure, say N.
I think this will also help us under memory pressure, because we will have far fewer buffer_head objects to manage.
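A simple way to sanity-check that on a device (counts bounce around, so compare before/after under a similar workload):

# Watch how many buffer_head slab objects are live before/after the change.
sudo grep buffer_head /proc/slabinfo
# or interactively:
sudo slabtop -o | grep buffer_head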
,
Apr 27 2017
,
Apr 27 2017
FYI, I ran cheets_PerfBootServer on Caroline with the default and the change in #13, and it looks like it may boot roughly the same or slightly faster (within 1 std deviation) -- the numbers are fairly noisy. I'm not sure exactly which metric is best, but I looked at boot_progress_system_run, boot_progress_start, and boot_progress_enable_screen.

boot_progress_start:
Without SQUASHFS_4K_DEVBLK_SIZE (default):
mean: 1952
median: 1951
std dev: 138
With SQUASHFS_4K_DEVBLK_SIZE:
mean: 1950
median: 1942
std dev: 95

boot_progress_system_run:
Without SQUASHFS_4K_DEVBLK_SIZE (default):
mean: 3430
median: 3383
std dev: 131
With SQUASHFS_4K_DEVBLK_SIZE:
mean: 3341
median: 3314
std dev: 108

boot_progress_enable_screen:
Without SQUASHFS_4K_DEVBLK_SIZE (default):
mean: 5425
median: 5345
std dev: 357
With SQUASHFS_4K_DEVBLK_SIZE:
mean: 5261
median: 5274
std dev: 145
,
Apr 27 2017
Looking at reef R60-9500.0.0 test image:
-rw-r--r-- 1 root root 466333696 Apr 27 07:57 r/opt/google/containers/android/system.raw.img
-rw-r--r-- 1 root root 50577408 Apr 27 07:57 r/opt/google/containers/android/vendor.raw.img
---
Here are options I see.
bytes - method - speed on my tests (lower better) - desc
========= ========= ================================ ============================================
562364416 - lzo 128K - 13.08 / 1.88 - old ToT FS (with multistream, no readahead)
568029184 - gzip 8K - 5.09 / 2.53
547364864 - gzip 16K - 6.11 / 2.56
532475904 - gzip 32K - 7.47 / 2.23
523735040 - xz 8K - 4.93 / 2.35
516911104 - gzip 128K - 12.10 / 1.84 - current ToT (with multistream, no readahead)
492236800 - xz 16K - 6.01 / 2.61
NOTE: "speed" numbers still haven't been proven to fully relate to real world. The first number is supposed to represent how quickly we can page in something from squashfs when memory is low (and we've thrown away all caches). The 2nd number is where some of the page cache is actually getting to take effect and is more representative of what paging things in might look like when memory is not low.
---
Since my short-term proposal is to switch ToT from gzip 128K to gzip 16K, that means we'd lose 29.04 MB.
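Where the 29.04 MB comes from (using the combined system + vendor sizes from the table above):

# (gzip 16K total) - (gzip 128K total), in MiB:
echo "scale=2; (547364864 - 516911104) / 1024 / 1024" | bc   # => 29.04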
---
Helpful scripts:

COMP=...
BLK=...

# Loop-mount the shipped image and repack it with the chosen settings.
NAME=system.raw.img
mkdir -p /tmp/experiment
sudo mount -o loop "r/opt/google/containers/android/${NAME}" /tmp/experiment
sudo rm -f "${NAME}"
sudo mksquashfs /tmp/experiment "${NAME}" -b "${BLK}" -comp "${COMP}"
sudo umount /tmp/experiment

# Same again for the vendor image.
NAME=vendor.raw.img
mkdir -p /tmp/experiment
sudo mount -o loop "r/opt/google/containers/android/${NAME}" /tmp/experiment
sudo rm -f "${NAME}"
sudo mksquashfs /tmp/experiment "${NAME}" -b "${BLK}" -comp "${COMP}"
sudo umount /tmp/experiment

ls -al *.img
---
COMP=gzip
BLK=128K
(sanity check)
-rw-r--r-- 1 root root 466333696 Apr 27 14:28 system.raw.img
-rw-r--r-- 1 root root 50577408 Apr 27 14:28 vendor.raw.img
---
COMP=lzo
BLK=128K
-rw-r--r-- 1 root root 507125760 Apr 27 14:27 system.raw.img
-rw-r--r-- 1 root root 55238656 Apr 27 14:27 vendor.raw.img
---
COMP=gzip
BLK=32K
-rw-r--r-- 1 root root 480292864 Apr 27 14:35 system.raw.img
-rw-r--r-- 1 root root 52183040 Apr 27 14:35 vendor.raw.img
---
COMP=gzip
BLK=16K
-rw-r--r-- 1 root root 493735936 Apr 27 14:25 system.raw.img
-rw-r--r-- 1 root root 53628928 Apr 27 14:25 vendor.raw.img
---
COMP=gzip
BLK=8K
-rw-r--r-- 1 root root 512339968 Apr 27 14:26 system.raw.img
-rw-r--r-- 1 root root 55689216 Apr 27 14:26 vendor.raw.img
---
COMP=xz
BLK=16K
-rw-r--r-- 1 root root 447016960 Apr 27 14:29 system.raw.img
-rw-r--r-- 1 root root 45219840 Apr 27 14:29 vendor.raw.img
---
COMP=xz
BLK=8K
-rw-r--r-- 1 root root 474955776 Apr 27 14:30 system.raw.img
-rw-r--r-- 1 root root 48779264 Apr 27 14:30 vendor.raw.img
,
Apr 28 2017
,
Apr 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/1975a30f67da89238cdb24611e89fb0da676dbb1

commit 1975a30f67da89238cdb24611e89fb0da676dbb1
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Fri Apr 28 03:28:42 2017

    CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

    This changes squashfs to use 4K buffers instead of 1K buffers if the
    device is able, and will reduce some kernel overhead by reducing the
    number of buffer_head objects needed.

    BUG=chromium:706538
    TEST=run cheets_* autotests with slabtop running and observe fewer
    buffer_head objecs

    Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
    Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
    Reviewed-on: https://chromium-review.googlesource.com/487737
    Tested-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Yusuke Sato <yusukes@chromium.org>
    Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/1975a30f67da89238cdb24611e89fb0da676dbb1/chromeos/config/base.config
,
Apr 28 2017
BTW: I ran my tests on minnie too, just to confirm that on the slower CPU things were similar. It looks like they are for the most part. I think the minnie disk is a bit slower, though.
,
May 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4211a5ac70fadfe40f238d13c592726b1bed6528

commit 4211a5ac70fadfe40f238d13c592726b1bed6528
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Mon May 01 19:20:08 2017

    CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

    This changes squashfs to use 4K buffers instead of 1K buffers if the
    device is able, and will reduce some kernel overhead by reducing the
    number of buffer_head objects needed.

    BUG=chromium:706538
    TEST=run cheets_* autotests with slabtop running and observe fewer
    buffer_head objecs

    Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
    Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
    Reviewed-on: https://chromium-review.googlesource.com/487737
    Tested-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Yusuke Sato <yusukes@chromium.org>
    Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    (cherry picked from commit 1975a30f67da89238cdb24611e89fb0da676dbb1)
    Reviewed-on: https://chromium-review.googlesource.com/490834
    Commit-Ready: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/4211a5ac70fadfe40f238d13c592726b1bed6528/chromeos/config/base.config
,
May 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/17524a8874a5cfec8978cb1a356eea9d36e16642

commit 17524a8874a5cfec8978cb1a356eea9d36e16642
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Mon May 01 19:20:07 2017

    CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

    This changes squashfs to use 4K buffers instead of 1K buffers if the
    device is able, and will reduce some kernel overhead by reducing the
    number of buffer_head objects needed.

    BUG=chromium:706538
    TEST=run cheets_* autotests with slabtop running and observe fewer
    buffer_head objecs

    Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
    Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
    Reviewed-on: https://chromium-review.googlesource.com/487737
    Tested-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Yusuke Sato <yusukes@chromium.org>
    Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    (cherry picked from commit 1975a30f67da89238cdb24611e89fb0da676dbb1)
    Reviewed-on: https://chromium-review.googlesource.com/490833
    Commit-Ready: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/17524a8874a5cfec8978cb1a356eea9d36e16642/chromeos/config/base.config
,
May 8 2017
Do you have any update? Did we actually change the block size of the image?
,
May 8 2017
@23: We haven't yet. We're still trying to truly confirm that we're seeing bad performance. At the moment we think the problem may be mitigated for the time being because we bumped up "min_filelist_kbytes" a whole bunch... I'm still trying to poke at it more, though...
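For reference, the knob mentioned above can be read at runtime (path assumes the usual Chrome OS vm sysctl):

# Current low-memory file-list threshold (Chrome OS-specific sysctl).
cat /proc/sys/vm/min_filelist_kbytes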
,
May 26 2017
,
Jul 17 2017
BTW: just so it doesn't get lost, someone pointed to a Doc: https://docs.google.com/a/google.com/document/d/1dWEIn-2KZa9HIqR7LwOmFMLKiSCvULWjM4OUJa0Dh6U/edit?usp=sharing ...where someone had done some general thinking about squashfs.
,
Sep 28
Triage nag: This Chrome OS bug has an owner but no component. Please add a component so that this can be tracked by the relevant team.
,
Nov 8
<UI triage> Bug owners, please add the appropriate component to your bug. Thanks!
,
Nov 9
It still seems like we ought to do something about this, but I'm not actively working on it.
|||||||||||||