From: Rob Landley <rob@landley.net>
To: Mel Gorman <mgorman@suse.de>
Cc: Jan Kara <jack@suse.cz>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
"Theodore Ts'o" <tytso@mit.edu>,
"Artem S. Tashkinov" <t.artem@lycos.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Tue, 19 Nov 2013 11:17:03 -0600
Message-ID: <1384881423.1974.277@driftwood>
In-Reply-To: <20131030120152.GM2400@suse.de> (from mgorman@suse.de on Wed Oct 30 07:01:52 2013)
On 10/30/2013 07:01:52 AM, Mel Gorman wrote:
> We talked about this a few months ago but I still suspect that we will
> have to bite the bullet and tune based on "do not dirty more data than
> it takes N seconds to writeback" using per-bdi writeback estimations.
> It's just not that trivial to implement as the writeback speeds can
> change for a variety of reasons (multiple IO sources, random vs
> sequential etc).
Record "block writes finished this second" into an 8 entry ring buffer,
with a flag saying "device was partly idle this period" so you can
ignore those entries. Keep a high water mark, which should converge to
the device's linear write capacity.
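
A minimal sketch of that bookkeeping in C, to make the moving parts concrete. Everything named here (struct wb_estimate, wb_account, wb_mark_idle, wb_tick) is hypothetical and not an existing kernel interface, and leaving partly idle periods out of the high water mark is one reading of the description, not the only possible one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SLOTS 8  /* the "8 entry ring buffer" */

/* Hypothetical per-device state; these names are illustrative only. */
struct wb_estimate {
    uint32_t done[WB_RING_SLOTS];  /* blocks written in each one-second period */
    bool     idle[WB_RING_SLOTS];  /* device was partly idle in that period */
    unsigned head;                 /* slot currently being accumulated */
    uint32_t high_water;           /* best busy-period throughput seen so far */
};

/* Call when a write of nr_blocks completes. */
static void wb_account(struct wb_estimate *e, uint32_t nr_blocks)
{
    e->done[e->head] += nr_blocks;
}

/* Call when the device queue drains, i.e. it ran out of work. */
static void wb_mark_idle(struct wb_estimate *e)
{
    e->idle[e->head] = true;
}

/* Call once per second to close the current period and open the next. */
static void wb_tick(struct wb_estimate *e)
{
    assert(e->head < WB_RING_SLOTS);

    /* Only a period with no idle time is taken as a capacity measurement. */
    if (!e->idle[e->head] && e->done[e->head] > e->high_water)
        e->high_water = e->done[e->head];

    e->head = (e->head + 1) % WB_RING_SLOTS;
    e->done[e->head] = 0;
    e->idle[e->head] = false;
}
```

The completion path calls wb_account(), the queue-empty path calls wb_mark_idle(), and a one-second timer calls wb_tick().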
This gives you recent thrashing speed and max capacity, and some
weighted average of the two lets you avoid queuing up 10 minutes of
writes all at once, the way 3.0 would to a terabyte USB2 disk. (And
then vim calls sync() and hangs...)
The first tricky bit is the high water mark, but it's not too bad. If
the device reads and writes at the same rate you can populate it from
the read rate, but even starting it with just one block should converge
really fast because A) the round trip time should be well under a
second, and B) if you're submitting more than one period's worth of
data (say you can dirty enough to keep the disk busy for 2 seconds),
then it'll queue up 2 blocks at a time, then 4, then 8, increasing
exponentially until you hit the high water mark. (Which is measured, so
it won't overshoot.)
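
To put a number on "really fast", here is a toy model of the ramp-up in C. It assumes a made-up helper, wb_periods_to_converge(), a device with a fixed capacity in blocks per period, and a submitter that always queues two periods' worth of the current mark. It is pessimistic in that it allows only one doubling per period, ignoring point A above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, not kernel code: each period we queue twice the current
 * high water mark and the device completes at most `capacity` blocks.
 * Returns how many periods pass before the mark reaches capacity.
 */
static unsigned wb_periods_to_converge(uint32_t capacity)
{
    uint32_t high_water = 1;  /* start from a single block */
    unsigned periods = 0;

    assert(capacity >= 1);
    while (high_water < capacity) {
        uint64_t queued = (uint64_t)high_water * 2;
        uint32_t completed = queued < capacity ? (uint32_t)queued : capacity;

        /* The mark is a measurement, so it can never exceed capacity. */
        if (completed > high_water)
            high_water = completed;
        periods++;
    }
    return periods;
}
```

Under these assumptions the number of periods is the base-2 logarithm of the capacity, rounded up, so even a device that can retire 100000 blocks per period is measured within 17 periods.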
The second tricky bit is weighting the average, but presumably counting
the high water mark as one, then adding in all the "device did not
actually go idle during this period" entries, and dividing by the
number of entries considered... Reasonable first guess?
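
That first guess can be sketched in C as follows. The function name wb_estimate_rate() and its array arguments are assumptions made for illustration; the weighting itself (high water mark counted once, idle periods skipped) is taken from the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Average the high water mark with every period in which the device
 * stayed busy. done[] and idle[] are hypothetical ring-buffer arrays
 * holding n completed periods.
 */
static uint32_t wb_estimate_rate(const uint32_t *done, const bool *idle,
                                 size_t n, uint32_t high_water)
{
    uint64_t sum = high_water;  /* the high water mark counts as one entry */
    uint32_t count = 1;

    assert(n == 0 || (done && idle));
    for (size_t i = 0; i < n; i++) {
        if (idle[i])
            continue;           /* partly idle periods are ignored */
        sum += done[i];
        count++;
    }
    return (uint32_t)(sum / count);  /* count >= 1, so this never divides by zero */
}
```

With busy periods of 100 and 200 blocks, one idle period, and a high water mark of 300, the estimate is (300 + 100 + 200) / 3 = 200 blocks per second. Multiplying that by N seconds gives the dirty limit Mel describes.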
Obvious optimizations: instead of recording the "disk went idle" flag
in the ring buffer, just don't advance the ring buffer at the end of
that second, but zero out the entry and re-accumulate it. That way the
ring buffer should always have 7 seconds of measured activity (the
eighth entry being the one still accumulating), even if it's not
necessarily recent. And of course you don't have to wake anything up
when there was no I/O, so it's nicely quiescent when the system is...
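
A sketch of that variant in C, assuming a hypothetical struct wb_ring that keeps a single went_idle flag for the period in progress instead of one flag per slot. The names are illustrative, not existing kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SLOTS 8

/* Hypothetical state: one flag for the current period, none per slot. */
struct wb_ring {
    uint32_t done[WB_RING_SLOTS];  /* blocks written per busy period */
    unsigned head;                 /* slot currently being accumulated */
    bool     went_idle;            /* queue drained during the current period */
};

/*
 * End-of-period hook. A busy period is kept and the ring advances;
 * a period that went idle is discarded and its slot is reused.
 * Returns true if the period was kept.
 */
static bool wb_end_period(struct wb_ring *r)
{
    bool keep = !r->went_idle;

    assert(r->head < WB_RING_SLOTS);
    if (keep)
        r->head = (r->head + 1) % WB_RING_SLOTS;

    /* Either way, the slot at head starts the next period from zero. */
    r->done[r->head] = 0;
    r->went_idle = false;
    return keep;
}
```

Because idle periods never consume a slot, the seven slots behind head always hold busy-period measurements, however old they are.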
Lowering the high water mark after a transient spurious reading (maybe
clock skew during suspend, or a virtualization glitch, or some such) is
fun, since a glitch could give you a 4 billion block bad reading. But
if you always decrement the high water mark by 25% (x-=(x>>2)) each
second the disk didn't go idle (rounding up) and then queue up more
than one period's worth of data (but no more than say 8 seconds worth),
such glitches should fix themselves and it'll work its way back up or
down to a reasonably accurate value. (Keep in mind you're averaging the
high water mark back down with 7 seconds of measured data from the ring
buffer. Maybe you can cap the high water mark at the sum of all the
measured values in the ring buffer as an extra check? You're already
calculating it to do the average, so...)
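
The decay and the optional cap can be sketched in C like this. wb_decay_high_water() is a hypothetical helper, meant to run once per busy second before the new measurement is compared against the mark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decay the high water mark by 25% and cap it at the sum of the
 * measured periods. done[] holds the n measured ring-buffer entries.
 */
static uint32_t wb_decay_high_water(uint32_t high_water,
                                    const uint32_t *done, size_t n)
{
    uint64_t sum = 0;

    assert(n == 0 || done != NULL);
    for (size_t i = 0; i < n; i++)
        sum += done[i];

    /* x -= x >> 2: the shift truncates, so the result rounds up. */
    high_water -= high_water >> 2;

    /* Extra check: no single second can beat the total of all of them. */
    if (high_water > sum)
        high_water = (uint32_t)sum;
    return high_water;
}
```

A bogus reading of 4000000000 blocks against seven real periods of 100 blocks each collapses to 700 in a single step. Two things to note about this sketch: marks below 4 never decay, because the shift yields zero, and an empty ring caps the mark at zero, so a real implementation would probably want a floor of one block (that floor is my assumption, not part of the proposal).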
This is assuming your hard drive _itself_ doesn't have bufferbloat, but
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and
tagged command queueing lets you see through that anyway so your
"actually committed" numbers could presumably still be accurate if the
manufacturers aren't totally lying.
Given how far behind I am on my email, I assume somebody's already
suggested this by now. :)
Rob