From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, Andreas Dilger <adilger@dilger.ca>,
"Artem S. Tashkinov" <t.artem@lycos.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>, linux-mm <linux-mm@kvack.org>
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Mon, 11 Nov 2013 20:31:47 +0100
Message-ID: <20131111193147.GC24867@quack.suse.cz>
In-Reply-To: <20131111032211.GT6188@dastard>

On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here. Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> > I disagree here. The writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite a few
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, the linker, ...) while the files still get
> > written out in one large linear sweep. That is actually the reason why SLES
> > (and I believe RHEL as well) tunes dirty_limit even higher than the default
> > value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> would depend very much on how random the patterns are...
I agree the usefulness of writeback caching for random IO very much depends
on the working set size vs the cache size, how random the accesses really
are, and the HW characteristics. I just wanted to point out that there are
fairly common workloads & setups where writeback caching for semi-random IO
really helps (because you seemed to suggest that random IO implies we should
disable the writeback cache).
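As an aside, the userspace half of the fadvise(RANDOM) hint already
exists; today POSIX_FADV_RANDOM only disables readahead, so switching
the inode to write-through caching would be new kernel-side behaviour.
A minimal sketch of the application side:

#include <fcntl.h>

/*
 * Declare a random access pattern for the whole file (offset 0 and
 * len 0 cover the entire file). Currently this only tunes readahead;
 * under the proposal above it could also flip the inode to
 * write-through caching.
 */
static int hint_random(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}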
> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
Sorry, I was imprecise here. I really meant that IMO it doesn't make
sense to allow too much dirty data to accumulate for sequentially written
files.
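Note that a streaming writer can already bound its own dirty footprint
from userspace; what we are discussing is the kernel doing something
similar automatically. A rough sketch (the 8 MB window is an arbitrary
example value, not a recommendation):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Kick off asynchronous writeback for each completed window so dirty
 * data never accumulates much beyond one window.
 */
#define WINDOW (8L * 1024 * 1024)

static void window_done(int fd, off_t start)
{
	sync_file_range(fd, start, WINDOW, SYNC_FILE_RANGE_WRITE);
}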
> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second, given we might only write a file once a second, is likely
> to result in pipeline stalls...
I guess this begs for real data. We agree in principle but differ in
constants :).
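For reference, the chunking I mentioned lives in fs/fs-writeback.c;
stripped of rounding and of the WB_SYNC_ALL case, it is roughly
(paraphrased, not verbatim):

/*
 * Paraphrase of writeback_chunk_size(): background writeback asks a
 * single inode for at most half the BDI's measured write bandwidth
 * per pass, so two streaming writers interleave at about that
 * granularity.
 */
static long chunk_pages(long bdi_avg_write_bandwidth, long work_nr_pages)
{
	long pages = bdi_avg_write_bandwidth / 2;

	return pages < work_nr_pages ? pages : work_nr_pages;
}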
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
Agreed. As a primary goal we want to maximise throughput. But we want
to maintain sane latency as well (e.g. because of the "promise" of
dirty_writeback_centisecs, we have to cycle through dirty inodes
reasonably frequently).
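For reference, dirty_writeback_centisecs defaults to 500, i.e. the
flusher threads wake every 5 seconds. A trivial way to check this on a
running system:

#include <stdio.h>

/* Read the flusher wakeup interval behind that "promise". */
int main(void)
{
	FILE *f = fopen("/proc/sys/vm/dirty_writeback_centisecs", "r");
	int cs;

	if (f && fscanf(f, "%d", &cs) == 1)
		printf("flusher wakeup interval: %d centisecs\n", cs);
	if (f)
		fclose(f);
	return 0;
}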
> > Agreed. But the ability to limit the amount of dirty pages outstanding
> > against a particular BDI seems sane to me. It's not as flexible
> > and automatic as the approach you suggested, but it's much simpler and
> > solves most of the problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue, meaning the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow the dirty cache of a BDI to grow over 5s of measured
> >    writeback speed
> >
> > wouldn't go a long way toward solving our current problems without too
> > much complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
Agreed.
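For concreteness, point 2) above amounts to something like the
following (the names and the floor value are purely illustrative):

/*
 * Illustrative sketch only: cap a BDI's dirty pages at ~5 seconds of
 * its measured writeback bandwidth, with a floor so slow devices
 * still get to batch writes.
 */
#define BDI_MIN_DIRTY_PAGES	1024L

static long bdi_dirty_cap(long avg_write_bandwidth_pages_per_sec)
{
	long cap = avg_write_bandwidth_pages_per_sec * 5;

	return cap > BDI_MIN_DIRTY_PAGES ? cap : BDI_MIN_DIRTY_PAGES;
}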
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR