
Issue 902929

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Switch to BFQ by default on newer kernels

Project Member Reported by dianders@google.com, Nov 7

Issue description

There is a proposal to switch to the BFQ IO scheduler for Chrome OS.  

It seems like it would make sense since we generally care much more about latency over bandwidth.

Presumably we should do some benchmarks to confirm that it's faster in cases we care about.
 
* https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1324615 
  CHROMIUM: config: Switch to IOSCHED_BFQ everywhere

Owner: gwendal@chromium.org
Gwendal: are you the right person to pick the benchmarks to run and run them?  If not, can you suggest someone?

...if nothing else we should probably run bootperf.
Cc: asavery@chromium.org
We came across BFQ while trying to find a good case for NVMe.

TITLE: BFQ, fairness and low latency in storage I/O
SPEAKER: Paolo Valente, Università di Modena e Reggio Emilia, Italy

URL: http://algo.ing.unimo.it/people/paolo/disk_sched/

DEMO: https://youtu.be/ANfqNiJVoVE

[I cannot find a recording of the talk]

For NVMe, due to its large queue depth, we are not using any scheduler. We do allow merging in blk-mq [http://kernel.dk/blk-mq.pdf] by leaving the I/O on the queue for a short time, but we don't try to be smart about ordering.

bootperf is a good first try, but SSD latency should already have been taken care of by ureadahead.

We could reproduce the use case in the demo with an autotest: measuring video playback jankiness while doing filesystem operations.
Hi,
in case some of the recipients don't know me, I'm the author of BFQ and of the demo that you report here. I'm also the author of many other demos, showing the same problems with other kinds of storage (HDDs, SSDs, ...) and systems:
https://www.youtube.com/user/valentepaolo/

In particular, the problems shown in those demos occur also with NVMe. I can point you to plenty of results in this respect, produced by my team or by third parties.

The best tool to measure system-level latencies, i.e., the latencies perceived by users, is the
S benchmark suite: https://github.com/Algodev-github/S

By just invoking one command, the scripts in that suite measure all the figures of merit shown in the demo (start-up time of applications, drop rate in video playback), and more. Should it be more practical for you, you can find the core of the suite also in the Phoronix benchmark suite, in mmtests and now also in the OpenEmbedded master branch.

I'd be glad to repeat these tests, and measure performance on your systems of interest, with and without BFQ. Considering that I'm in Linaro, is there a way I can get, for testing, the machine you care about most? If it is feasible, and makes sense in this case, I'm also OK with remote access to a machine. Or I can guide some of you through these tests. In short, I'm OK with any sensible solution that works for you :)
Cc: mike.hol...@linaro.org
Hi all,
for some reason this thread stopped after my last comment. So this is basically a reboot attempt :)

In this respect, I have news: Mike (Holmes) is trying to get me a Chromebook. If he succeeds, I'll use it, first, to show you the responsiveness problems it suffers from.

Any comment or update from your side is welcome!
@6: I think we did need some help actually turning this on.  The CL <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1324615/> enables the config, but as per the comments in the CL it won't switch over to it.  I think folks here are mostly busy on other stuff, but if you give us some extra hints on how we'd actually enable BFQ, it'd help save us the research / investigation.  ;-)
We'll send a Chromebook to Paolo to do some testing for us.

BFQ has to be enabled by writing "bfq" into /sys/block/<device>/queue/scheduler. Also, blk-mq needs to be enabled (CONFIG_SCSI_MQ_DEFAULT=y).
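
A minimal sketch of that sysfs write, in Python, for anyone who wants to script it. The helper names `parse_schedulers` and `set_scheduler` and the `sysfs_root` parameter are made up for illustration; the only things assumed from the kernel are the /sys/block/<device>/queue/scheduler path quoted above and the usual file format, where the active scheduler is shown in square brackets (e.g. "[mq-deadline] kyber bfq none"). Writing to the real file requires root.

```python
import os


def parse_schedulers(text):
    """Parse the contents of a queue/scheduler file.

    Returns (available, active), where the active scheduler is the
    one the kernel prints inside square brackets.
    """
    available = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Select an I/O scheduler for `device` (hypothetical helper).

    Raises ValueError if the kernel does not list `name`, which is
    what one would expect when BFQ is not built in or blk-mq is off.
    """
    path = os.path.join(sysfs_root, device, "queue", "scheduler")
    with open(path) as f:
        available, _ = parse_schedulers(f.read())
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    with open(path, "w") as f:
        f.write(name)


# Example (the device name is an assumption; needs root on a real system):
# set_scheduler("mmcblk0", "bfq")
```

The check before the write is a design choice of this sketch: it turns a missing scheduler into a readable error instead of a bare write failure.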



I don't know whether the sending of the Chromebook already means that, but, as I already wrote, I'm willing even to reverse the collaboration scheme from "I help you in what you cannot do with bfq" to "I do it for you, and you help me only on what I cannot do with a Chromium". In this respect, I hope that a little guidance will be enough for me to be able to put bfq in your OS.

As for your attempt to test bfq on your own, pay attention to the following facts:
1) in 4.14 the whole blk-mq is still definitely immature, and may be rather low performing (mmc support for blk-mq is at its early stage too)
2) in 4.19 both blk-mq and bfq are in good shape, but, as you may know, as of 4.19 blk-mq has a scary bug in it: it corrupts storage data. Core block-layer devs have made a fix that is supposed to solve the problem, but the fix will be available only in some of the next stable versions of 4.19.
Sorry, I forgot to answer a question you might have: "what do we do then with 4.14?". A simple option is to just add the bfq commits I made after 4.14. Further details once I have made some progress with the Chromebook you're sending me.
The blk-mq data corruption problem has already been fixed in chromeos-4.19.

It should be possible to install chromeos-4.19 onto the Chromebook you are getting. It would be optimal if we can get some data with chromeos-4.19; this would help us decide if it makes sense to backport the bfq changes to chromeos-4.14.

I've started to play with the Chromebook you sent me. As you might expect, I'm new to everything but the kernel. So I have a few trivial initial questions to ask, so that I can start the actual work much more quickly.

I don't know whether this issue is the right place for asking. If it is, could one of you please give me an ack? Otherwise, if possible, could you give me a contact (maybe one of you guys?) to whom I can send my bunch of questions?
@Paolo: Just ask me (groeck@chromium.org) directly. I can always get others involved if needed.

Hi guys,
after receiving Guenter's blessing on the first use case I've chosen for
my tests, I'd like to share my first results with you too.

My results are in the form of a short video of what happens to the
Chromebook if I
- emulate a download of a 2GB file on a high-speed network; in this
respect, the speed reachable by the Wireless network card of the
device, namely an Intel Dual Band Wireless-AC 7265, is several times
higher than the 20 MB/s max write speed of the storage, so I could
emulate the download with just the writing of a 2GB file with dd
- try to start the Facebook app while the download is in progress
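
For readers who want to reproduce the write load without dd, here is a rough Python stand-in. It is only a sketch: the function name `emulate_download`, its parameters and the output path are hypothetical, and it simply writes zero-filled chunks sequentially and syncs at the end, which approximates but is not identical to the dd run used for the video.

```python
import os
import time


def emulate_download(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write `total_bytes` of zeros to `path` in sequential chunks.

    Returns (bytes_written, elapsed_seconds). The final fsync makes
    the elapsed time include flushing to storage, not just filling
    the page cache.
    """
    chunk = bytes(chunk_bytes)  # zero-filled buffer, reused for every write
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())
    return written, time.monotonic() - start


# Example matching the 2GB case above (the path is an assumption):
# emulate_download("/tmp/fake_download.bin", 2 * 1024**3)
```

While this runs, one would start the application under test by hand and observe how long it takes to come up.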

Unfortunately I had to shoot this video with my phone (sorry for the
poor quality), because it was impossible to also record the screen
with the emulated download in progress (which, in itself, is already a
test result :) ).  The video is here:
https://photos.app.goo.gl/CM3c18AaofPEqTZD7

As you will see, Facebook won't start at all until I stop the writing
of the file.  Actually, even if it is hard for you to notice it, the
whole system is unresponsive at times.  I stopped the emulated
download relatively soon, because responsiveness becomes worse and
worse as time passes, and, if I had waited too long, I would have been
obliged to learn how to force a power down of the device :)

Two notes, and then next steps:
- a download causes the lightest I/O workload among all use cases
  where background I/O is involved (app updates, file copies, ...),
  because a download causes only writes; so responsiveness is
  even worse in the other use cases
- 2GB is purposely an extremely large size, to make the problem last
  at will.  There is still loss of responsiveness with smaller files; it
  will simply last less than infinity :)

Unless any of you have concerns about this use case and these results, I will
now proceed with the hardest task for me: installing a new kernel,
with bfq.  Then, if everything goes well, I'll show you that the above
problem disappears with bfq.
re #14 -- thanks for the video illustrating the problem.  Just for the record, which Chromebook is this (eve?) and how much memory does it have?
#15: 'sand' if I recall correctly.

I don't know the acronyms you are using, but in system info I see, e.g.,

CHROME_OS_FIRMWARE_VERSION		Google_Sand.9042.306.0

or

HWID		SAND D4A-E2A-I3F-I9V-A98

So I guess it's 'sand'.

As for memory, if you meant RAM, it's 4GB, if you meant flash storage it's 30GB.

If this device is a slow one, and you are rightly wondering whether this responsiveness/livelock problem goes away on a faster device, then the answer is unfortunately no. If needed, I'm willing to guide you through the very easy steps for proving it on any faster device (or to do it for you).

I'll wait a little bit more for possible further feedback on the relevance of the use case and of the issue in my video, then I'll undertake my journey through changing the kernel.
Paolo: Note that we sent you a device with slow mmc on purpose, to see the maximum impact. I think we can already conclude that the current scheduler is, to put it politely, not the best choice. The open question now is which scheduler is the best one to select instead. Assuming that is BFQ, the next question would be whether there is an alternative, other than BFQ, for older Chromebooks (running chromeos-3.18 and older).

As for the relevance of your test case, don't bother. You already made your point. Arguing about the validity of test cases would, at this point, just be bike shedding.

If you have other test cases available and can run those easily, it would of course help to have data from more than one test case for comparison. However, at this point, it would be most interesting to see how the system behaves with the same test case and BFQ active.

Status: Assigned (was: Untriaged)
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.
Guenter:

- Yes, I guessed you sent me a slow device purposely, and rightly

- For older Chromebooks, the answer depends on your degree of freedom.
(a) If you can at least add a module, then it is technically feasible to have BFQ in kernels as old as 3.18 (BFQ came to life on a 2.6.32!). The only remaining problem is your willingness to use my out-of-tree BFQ for legacy blk (identical to the mainline version of BFQ for blk-mq).
(b) If you cannot even add a module, then I'm sorry but there is no chance for you to improve anything: legacy I/O schedulers are all equally bad in terms of responsiveness and latency

- Thanks for your feedback on my test case. I'll try to install a new kernel as next step.
