
Issue 706538

Starred by 6 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




Make our squashfs better

Project Member Reported by diand...@chromium.org, Mar 29 2017

Issue description

As part of bug #702707, I noticed a whole bunch of processes all blocked waiting for a squashfs cache entry.

A bunch of digging found that we have this defined:

  CONFIG_SQUASHFS_DECOMP_SINGLE

It seems like instead we should do:

  CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU

---

The Kconfig says:

choice
	prompt "Decompressor parallelisation options"
	depends on SQUASHFS
	help
	  Squashfs now supports three parallelisation options for
	  decompression.  Each one exhibits various trade-offs between
	  decompression performance and CPU and memory usage.

	  If in doubt, select "Single threaded compression"

config SQUASHFS_DECOMP_SINGLE
	bool "Single threaded compression"
	help
	  Traditionally Squashfs has used single-threaded decompression.
	  Only one block (data or metadata) can be decompressed at any
	  one time.  This limits CPU and memory usage to a minimum.

config SQUASHFS_DECOMP_MULTI
	bool "Use multiple decompressors for parallel I/O"
	help
	  By default Squashfs uses a single decompressor but it gives
	  poor performance on parallel I/O workloads when using multiple CPU
	  machines due to waiting on decompressor availability.

	  If you have a parallel I/O workload and your system has enough memory,
	  using this option may improve overall I/O performance.

	  This decompressor implementation uses up to two parallel
	  decompressors per core.  It dynamically allocates decompressors
	  on a demand basis.

config SQUASHFS_DECOMP_MULTI_PERCPU
	bool "Use percpu multiple decompressors for parallel I/O"
	help
	  By default Squashfs uses a single decompressor but it gives
	  poor performance on parallel I/O workloads when using multiple CPU
	  machines due to waiting on decompressor availability.

	  This decompressor implementation uses a maximum of one
	  decompressor per core.  It uses percpu variables to ensure
	  decompression is load-balanced across the cores.

endchoice

---

I don't know offhand, but I'd imagine the extra memory used is very slight.  My guess is that the default saves memory because squashfs is often used on embedded systems...

...but this needs to be checked, and we need to benchmark how much faster this could make us.
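For reference, a quick way to double-check which option a given kernel was built with (assuming the kernel exposes /proc/config.gz; otherwise grep the build's config file):

  zcat /proc/config.gz | grep SQUASHFS_DECOMP
  # currently this should show CONFIG_SQUASHFS_DECOMP_SINGLE=y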

---

We'd want to apply this to our various kernels...

 
Here's the change:

  https://chromium-review.googlesource.com/462477 CHROMIUM: config: Switch to percpu squashfs        

I have confirmed that ARC++ still works with this change, but I haven't done any serious benchmarking or checking of memory.
Sorry for the delay. I tested ARC++ boot performance (with the cheets_PerfBootServer autotest) on Kevin PVT + R59-9411.0.0. I chose the cheets_PerfBootServer benchmark because booting ARC++ accesses most of the files in the compressed system image.

Summary:
* The first two results were almost the same. This is probably because the autotest starts measuring when the (test) user signs in to Chrome
  OS, but by the time the login screen is shown, /etc/init/arc-ureadahead.conf has already prefetched the files in the squashfs image with
  readahead(2). Because of this, I suspect ARC++ didn't really decompress files from the image during the benchmark.
* With arc-ureadahead.conf disabled (the last two results), boot was slightly faster with CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU.
* The memory usage was about the same with and without MULTI_PERCPU.

So, although I didn't see any significant boot time improvement, I didn't see any negative effect either. I think we should enable the config if it improves ARC++/Chrome OS behavior under memory pressure.



Detailed results:
----------------------------------------
* Boot time (in ms, # of iterations=10)

With CONFIG_SQUASHFS_DECOMP_SINGLE (original configuration):

mean	6253
median	6225
stddev	 153

With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU:

mean	6222
median	6239
stddev	 193

With CONFIG_SQUASHFS_DECOMP_SINGLE, without arc-ureadahead.conf:

mean	7180
median	7182
stddev	 146

With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU, without arc-ureadahead.conf:

mean	7062
median	7031
stddev	 133

----------------------------------------
* Memory usage

Steps:
1. Sign in (as a new user.)
2. Disable app sync.
3. Opt in to ARC++.
4. Once Play Store starts, observe memory usage.
5. Sign out/in.
6. Start Play Store and install Facebook.
7. Start Facebook, observe memory usage.
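The snapshots below look like plain "free" output (KiB); presumably something like this was run at steps 4 and 7:

  free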

With CONFIG_SQUASHFS_DECOMP_SINGLE:

at step #4
             total       used       free     shared    buffers     cached
Mem:       3904940    3182580     722360     238804     351764    1859852
-/+ buffers/cache:     970964    2933976
Swap:      3813416          0    3813416


at step #7
             total       used       free     shared    buffers     cached
Mem:       3904940    3687064     217876     324072     383564    2276888
-/+ buffers/cache:    1026612    2878328
Swap:      3813416       4264    3809152


With CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU:

at step #4
             total       used       free     shared    buffers     cached
Mem:       3904940    2916504     988436     183264     343416    1525128
-/+ buffers/cache:    1047960    2856980
Swap:      3813416          0    3813416

at step #7
             total       used       free     shared    buffers     cached
Mem:       3904940    3537112     367828     301792     368028    2149932
-/+ buffers/cache:    1019152    2885788
Swap:      3813416          0    3813416
----------------------------------------



Status: WontFix (was: Untriaged)
After putting some printouts in the code, this doesn't look so great.

It appears that each time we mount a squashfs filesystem we create (NCPUS * 2) buffers.  On rk3399 this appears to be 6 * 2 = 12, but perhaps on other systems it could be more.

Each buffer appears to be 128K; at least, that's how our squashfs appears to be configured (presumably we passed 128K blocks to mksquashfs).
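For what it's worth, a quick way to confirm the block size and compressor an image was built with, assuming unsquashfs is available, is to dump the superblock:

  unsquashfs -stat /opt/google/containers/android/system.raw.img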

We appear to mount 5 squashfs filesystems:

/dev/loop1 on /opt/google/containers/android/rootfs/root type squashfs (ro,nosuid,nodev,noexec,relatime,seclabel)
/dev/loop2 on /opt/google/containers/android/rootfs/root/vendor type squashfs (ro,nosuid,nodev,noexec,relatime,seclabel)
/dev/loop3 on /opt/google/containers/arc-removable-media/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)
/dev/loop4 on /opt/google/containers/arc-sdcard/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)
/dev/loop5 on /opt/google/containers/arc-obb-mounter/mountpoints/container-root type squashfs (ro,nosuid,noexec,relatime,seclabel)

Thus we're eating about 7680 KB (7864320 bytes) of memory, compared to 1280 KB (1310720 bytes).  That's not a trivial amount of memory to eat unless we see a real performance gain somewhere.  
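Spelling out that arithmetic (the 2-buffers-per-mount figure for the current config is back-calculated from the 1280 KB number):

  echo $(( 12 * 128 * 1024 * 5 ))   # 7864320 bytes with 12 buffers per mount
  echo $((  2 * 128 * 1024 * 5 ))   # 1310720 bytes with  2 buffers per mount today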

---

I checked to see how much time we spent waiting in squashfs_cache_get().  It looks like it varies quite a bit, but overall it doesn't seem to change by leaps and bounds.

---

I'm going to close this as WontFix.
Cc: sonnyrao@chromium.org adlr@chromium.org bccheng@chromium.org yusukes@chromium.org dtor@chromium.org
Owner: diand...@chromium.org
Status: Assigned (was: WontFix)
Summary: Make our squashfs better (was: Investigate CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU across all ARC++ kernels)
Taking this again.  It looks like it might actually matter.  :)  Using it as an umbrella for several things to make squashfs work better, especially in the presence of low memory.
OK, this is a nice easy change and frees up 1MB.  This will also make it much more feasible to enable multiple decompression streams...

https://chrome-internal-review.googlesource.com/c/357804/
News for the day:

* I think I have some tests that let me analyze the performance of various squashfs settings.  I'll post them here when they're a bit cleaner and hopefully also get an autotest written.  Basically I'm using "drop caches" and some random accesses to try to simulate things (a rough sketch of the idea is after this list).

* My tests seem to be pointing to issues with loopback devices, specifically when doing random accesses with a cold cache.  I can mostly fix this with:
   for dev in /sys/devices/virtual/block/loop*; do
     echo 0 > "${dev}/queue/read_ahead_kb"
   done

...some of my tests get _worse_ with this, but overall it looks plausible it will be a big win in the low memory case.  I still need to dig into exactly what is happening here.

* I'm told that we've actually already moved squashfs to gzip but that we're not 100% sure of the performance impact here.  I'll measure that as part of this.
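For the curious, here's roughly the shape of the test, run as root (this is not the actual experiment.sh / randomaccess.py attached below, which also do random accesses):

   MOUNT=/opt/google/containers/android/rootfs/root   # one of the squashfs mounts
   sync
   echo 3 > /proc/sys/vm/drop_caches                  # throw away page cache, dentries, and inodes
   time find "${MOUNT}" -type f -exec cat {} + > /dev/null   # force cold reads through squashfs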
Need to take a break from this to work on a very urgent failure on veyron, but for now I'm posting my (ugly, WIP) scripts that I've been using to test, plus early results.

Quick summary of my (early!) results:

--

What we had (lzo 128K, one decompression stream):

(SIMULATED) fetching pages when low on memory:

Loop 0 - Done in 15.18 seconds
Loop 1 - Done in 15.13 seconds
Loop 2 - Done in 15.34 seconds

(SIMULATED) fetching pages when not too low on memory:

Loop 0 - Done in 1.67 seconds
Loop 1 - Done in 1.67 seconds
Loop 2 - Done in 1.58 seconds

--

What we have now (gzip 128K, one decompression stream):

(SIMULATED) fetching pages when low on memory:

Loop 0 - Done in 20.71 seconds
Loop 1 - Done in 20.80 seconds
Loop 2 - Done in 20.67 seconds

(SIMULATED) fetching pages when not too low on memory:

Loop 0 - Done in 3.51 seconds
Loop 1 - Done in 3.50 seconds
Loop 2 - Done in 3.66 seconds

--

What I think might be a sane option (gzip 16K, multiple decompression streams, no readahead):
NOTE: grows rootfs by 18MB compared to gzip, but still 13MB smaller than lzo

(SIMULATED) fetching pages when low on memory:

Loop 0 - Done in 6.12 seconds
Loop 1 - Done in 6.07 seconds
Loop 2 - Done in 6.13 seconds

(SIMULATED) fetching pages when not too low on memory:

Loop 0 - Done in 2.64 seconds
Loop 1 - Done in 2.51 seconds
Loop 2 - Done in 2.53 seconds

---

Another option (xz 8K, multiple decompression streams, no readahead):
NOTE: grows rootfs by 3 MB compared to gzip

(SIMULATED) fetching pages when low on memory:

Loop 0 - Done in 4.85 seconds
Loop 1 - Done in 4.96 seconds
Loop 2 - Done in 4.97 seconds

(SIMULATED) fetching pages when not too low on memory:

Loop 0 - Done in 2.31 seconds
Loop 1 - Done in 2.31 seconds
Loop 2 - Done in 2.43 seconds

===

Open questions:

* How do we really properly do readahead on loopback devices?  A udev rule?  Some code?  Something else?  (A hedged udev sketch follows after this list.)

* Some test cases regress when you disable readahead on loopback devices, but overall it seems like a big win.  In theory we should already be doing readahead on the eMMC block device, so it seems silly to have it in two places.

* Is it OK to grow the root filesystem a tad bit compared with gzip?

* Need to test this with veyron (presumably the slowest device w/ ARC++) and confirm that xz performs OK there.  Need to turn on xz everywhere.
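Regarding the readahead question above, a hedged sketch of the udev approach (rule file name and priority are my guesses) would be to zero the sysfs attribute whenever a loop device appears:

  echo 'ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="loop[0-9]*", ATTR{queue/read_ahead_kb}="0"' | sudo tee /etc/udev/rules.d/99-loop-readahead.rules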

==
Attachments:
  experiment.sh (2.3 KB)
  randomaccess.py (3.4 KB)
  log_decompsingle.txt (8.6 KB)
  log_decompmulti.txt (8.6 KB)
  log_noreadahead_decompsingle.txt (9.0 KB)
  log_noreadahead_decompmulti.txt (9.0 KB)
> Need to turn on xz everywhere.

Android's mksquashfs currently only supports gzip and lz4. This is a minor point, but we should probably add xz support there too if we switch to xz.

ARC developers usually build system.raw.img with squashfs+gzip outside Chrome OS chroot, and push it to the device to test it. I think ARC devs probably want to use xz for this workflow too.
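For illustration only (not what the build system does today), repacking an image with xz at 8K blocks, once Android's mksquashfs grows xz support, would look something like this; the input directory is a placeholder:

  mksquashfs android_system_dir/ system.raw.img -comp xz -b 8K -noappend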

Cc: ahass...@chromium.org
Cc: xiaochu@chromium.org
XZ and gzip are both bit-aligned compression formats and have a serious effect on the delta payload size (a 4x-8x increase). We have a solution for gzip that reduces it to about 2x, but no solution for XZ at this point.
Project Member

Comment 12 by bugdroid1@chromium.org, Apr 22 2017

Cc: -dtor@chromium.org gwendal@chromium.org
I'm looking at this for Doug -- I'd also like to turn on CONFIG_SQUASHFS_4K_DEVBLK_SIZE

This is the Kconfig description:
config SQUASHFS_4K_DEVBLK_SIZE
        bool "Use 4K device block size?"
        depends on SQUASHFS
        help
          By default Squashfs sets the dev block size (sb_min_blocksize)
          to 1K or the smallest block size supported by the block device
          (if larger).  This, because blocks are packed together and
          unaligned in Squashfs, should reduce latency.

          This, however, gives poor performance on MTD NAND devices where
          the optimal I/O size is 4K (even though the devices can support
          smaller block sizes).

          Using a 4K device block size may also improve overall I/O
          performance for some file access patterns (e.g. sequential
          accesses of files in filesystem order) on all media.

          Setting this option will force Squashfs to use a 4K device block
          size by default.

          If unsure, say N.

I think this will also help us under memory pressure because we will have far fewer buffer_head objects to manage.
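A hedged way to eyeball the buffer_head savings on a device (same idea as watching slabtop, just greppable; compare the active object count before/after the config change):

  sudo grep buffer_head /proc/slabinfo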

Comment 14 Deleted

Cc: gwendal@chromium.org
FYI, I ran cheets_PerfBootServer on Caroline with the default and with the change in #13, and it looks like it may boot roughly the same or slightly faster (within 1 std deviation) -- the numbers are fairly noisy.

I'm not sure exactly which metric is best, but I looked at boot_progress_system_run, boot_progress_start, and boot_progress_enable_screen.

boot_progress_start:

Without SQUASHFS_4K_DEVBLK_SIZE (default):
  mean:    1952
  median:  1951
  std dev: 138

With SQUASHFS_4K_DEVBLK_SIZE:
  mean:    1950
  median:  1942
  std dev: 95


boot_progress_system_run:

Without SQUASHFS_4K_DEVBLK_SIZE (default):
  mean:    3430
  median:  3383
  std dev: 131

With SQUASHFS_4K_DEVBLK_SIZE:
  mean:    3341
  median:  3314
  std dev: 108

boot_progress_enable_screen:

Without SQUASHFS_4K_DEVBLK_SIZE (default):
  mean:    5425
  median:  5345
  std dev: 357

With SQUASHFS_4K_DEVBLK_SIZE:
  mean:    5261
  median:  5274
  std dev: 145
Looking at reef R60-9500.0.0 test image:

  -rw-r--r-- 1 root root 466333696 Apr 27 07:57 r/opt/google/containers/android/system.raw.img
  -rw-r--r-- 1 root root  50577408 Apr 27 07:57 r/opt/google/containers/android/vendor.raw.img

---

Here are options I see.

bytes     - method    - speed on my tests (lower better) - desc
=========   =========   ================================   ============================================

562364416 - lzo 128K  - 13.08 / 1.88                     - old ToT FS (with multistream, no readahead)
568029184 - gzip 8K   -  5.09 / 2.53
547364864 - gzip 16K  -  6.11 / 2.56
532475904 - gzip 32K  -  7.47 / 2.23
523735040 - xz 8K     -  4.93 / 2.35
516911104 - gzip 128K - 12.10 / 1.84                     - current ToT (with multistream, no readahead)
492236800 - xz 16K    -  6.01 / 2.61


NOTE: "speed" numbers still haven't been proven to fully relate to real world.  The first number is supposed to represent how quickly we can page in something from squashfs when memory is low (and we've thrown away all caches).  The 2nd number is where some of the page cache is actually getting to take effect and is more representative of what paging things in might look like when memory is not low.

---

Since my short-term proposal is to switch ToT from gzip 128K to gzip 16K, that means the images would grow by 29.04 MB.
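That figure is just the delta from the size table above:

  echo $(( 547364864 - 516911104 ))   # 30453760 bytes == ~29.04 MB (gzip 16K total minus gzip 128K total)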

---

Helpful scripts:

COMP=...
BLK=...

# Repack the Android system image with the chosen compressor/block size:
# loopback-mount the original image, then mksquashfs the mounted tree into
# a fresh local image of the same name.
NAME=system.raw.img
mkdir -p /tmp/experiment
sudo mount -o loop "r/opt/google/containers/android/${NAME}" /tmp/experiment
sudo rm -f "${NAME}"
sudo mksquashfs /tmp/experiment "${NAME}" -b "${BLK}" -comp "${COMP}"
sudo umount /tmp/experiment

# Same thing for the vendor image.
NAME=vendor.raw.img
mkdir -p /tmp/experiment
sudo mount -o loop "r/opt/google/containers/android/${NAME}" /tmp/experiment
sudo rm -f "${NAME}"
sudo mksquashfs /tmp/experiment "${NAME}" -b "${BLK}" -comp "${COMP}"
sudo umount /tmp/experiment

ls -al *.img

---

COMP=gzip
BLK=128K
(sanity check)

-rw-r--r-- 1 root root 466333696 Apr 27 14:28 system.raw.img
-rw-r--r-- 1 root root  50577408 Apr 27 14:28 vendor.raw.img

---

COMP=lzo
BLK=128K

-rw-r--r-- 1 root root 507125760 Apr 27 14:27 system.raw.img
-rw-r--r-- 1 root root  55238656 Apr 27 14:27 vendor.raw.img

---

COMP=gzip
BLK=32K

-rw-r--r-- 1 root root 480292864 Apr 27 14:35 system.raw.img
-rw-r--r-- 1 root root  52183040 Apr 27 14:35 vendor.raw.img

---

COMP=gzip
BLK=16K

-rw-r--r-- 1 root root 493735936 Apr 27 14:25 system.raw.img
-rw-r--r-- 1 root root  53628928 Apr 27 14:25 vendor.raw.img

---

COMP=gzip
BLK=8K

-rw-r--r-- 1 root root 512339968 Apr 27 14:26 system.raw.img
-rw-r--r-- 1 root root  55689216 Apr 27 14:26 vendor.raw.img

---

COMP=xz
BLK=16K
-rw-r--r-- 1 root root 447016960 Apr 27 14:29 system.raw.img
-rw-r--r-- 1 root root  45219840 Apr 27 14:29 vendor.raw.img

---

COMP=xz
BLK=8K

-rw-r--r-- 1 root root 474955776 Apr 27 14:30 system.raw.img
-rw-r--r-- 1 root root  48779264 Apr 27 14:30 vendor.raw.img





Cc: elijahtaylor@chromium.org
Project Member

Comment 19 by bugdroid1@chromium.org, Apr 28 2017

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/1975a30f67da89238cdb24611e89fb0da676dbb1

commit 1975a30f67da89238cdb24611e89fb0da676dbb1
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Fri Apr 28 03:28:42 2017

CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

This changes squashfs to use 4K buffers instead of 1K buffers if the
device is able, and will reduce some kernel overhead by reducing the
number of buffer_head objects needed.

BUG=chromium:706538
TEST=run cheets_* autotests with slabtop running and observe fewer
  buffer_head objecs

Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/487737
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/1975a30f67da89238cdb24611e89fb0da676dbb1/chromeos/config/base.config

BTW: I ran my tests on minnie too, just to confirm that on the slower CPU things were similar.  It looks like they are for the most part.  I think the minnie disk is a bit slower, though.
Attachment: log_minnie_170427.txt (11.8 KB)
Project Member

Comment 21 by bugdroid1@chromium.org, May 1 2017

Labels: merge-merged-chromeos-3.14
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4211a5ac70fadfe40f238d13c592726b1bed6528

commit 4211a5ac70fadfe40f238d13c592726b1bed6528
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Mon May 01 19:20:08 2017

CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

This changes squashfs to use 4K buffers instead of 1K buffers if the
device is able, and will reduce some kernel overhead by reducing the
number of buffer_head objects needed.

BUG=chromium:706538
TEST=run cheets_* autotests with slabtop running and observe fewer
  buffer_head objecs

Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/487737
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 1975a30f67da89238cdb24611e89fb0da676dbb1)
Reviewed-on: https://chromium-review.googlesource.com/490834
Commit-Ready: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/4211a5ac70fadfe40f238d13c592726b1bed6528/chromeos/config/base.config

Project Member

Comment 22 by bugdroid1@chromium.org, May 1 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/17524a8874a5cfec8978cb1a356eea9d36e16642

commit 17524a8874a5cfec8978cb1a356eea9d36e16642
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Mon May 01 19:20:07 2017

CHROMIUM: config: enable CONFIG_SQUASHFS_4K_DEVBLK_SIZE

This changes squashfs to use 4K buffers instead of 1K buffers if the
device is able, and will reduce some kernel overhead by reducing the
number of buffer_head objects needed.

BUG=chromium:706538
TEST=run cheets_* autotests with slabtop running and observe fewer
  buffer_head objecs

Change-Id: I1bdeac2a4c0ae77766ea10755528f63ae3d821e2
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/487737
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 1975a30f67da89238cdb24611e89fb0da676dbb1)
Reviewed-on: https://chromium-review.googlesource.com/490833
Commit-Ready: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/17524a8874a5cfec8978cb1a356eea9d36e16642/chromeos/config/base.config

Do you have any update? Did we actually change the block size of the image?
@23: We haven't yet.  We're still trying to truly confirm that we're seeing bad performance.  At the moment we think the problem may be mitigated for the time being because we bumped up "min_filelist_kbytes" quite a bit...

I'm still trying to poke at it more, though...
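For reference, min_filelist_kbytes is a Chrome OS-specific sysctl; assuming it's present on the device, the current value can be read with:

  cat /proc/sys/vm/min_filelist_kbytes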
Cc: semenzato@chromium.org
BTW: just so it doesn't get lost, someone pointed to a Doc:

https://docs.google.com/a/google.com/document/d/1dWEIn-2KZa9HIqR7LwOmFMLKiSCvULWjM4OUJa0Dh6U/edit?usp=sharing

...where someone had done some general thinking about squashfs.
Triage nag: This Chrome OS bug has an owner but no component. Please add a component so that this can be tracked by the relevant team.
<UI triage> Bug owners, please add the appropriate component to your bug. Thanks!
Components: OS>Kernel
Labels: -Pri-2 Pri-3
Owner: ----
Status: Available (was: Assigned)
It still seems like we ought to do something about this, but I'm not actively working on it.

Sign in to add a comment