From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, Andreas Dilger <adilger@dilger.ca>,
"Artem S. Tashkinov" <t.artem@lycos.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>, linux-mm <linux-mm@kvack.org>
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Mon, 11 Nov 2013 20:31:47 +0100 [thread overview]
Message-ID: <20131111193147.GC24867@quack.suse.cz> (raw)
In-Reply-To: <20131111032211.GT6188@dastard>
On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the rigth caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here. Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> > I disagree here. Writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite some
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, linker, ...) but the files are written out in
> > one large linear sweep. That is actually the reason why SLES (and I believe
> > RHEL as well) tune dirty_limit even higher than what's the default value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> woul depend very much on how random the patterns are...
I agree usefulness of writeback caching for random IO very much depends
on the working set size vs cache size, how random the accesses really are,
and HW characteristics. I just wanted to point out there are fairly common
workloads & setups where writeback caching for semi-random IO really helps
(because you seemed to suggest that random IO implies we should disable
writeback cache).
> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
Sorry, I was imprecise here. I really meant that IMO it doesn't make
sense to allow too much dirty data for sequentially written files.
> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second given we might only write a file once a second is likely
> going to result in pipeline stalls...
I guess this begs for real data. We agree in principle but differ in
constants :).
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
Agreed. As a primary goal we want to maximise throughput. But we want
to maintain sane latency as well (e.g. because we have a "promise" of
"dirty_writeback_centisecs" we have to cycle through dirty inodes
reasonably frequently).
> > Agreed. But the ability to limit amount of dirty pages outstanding
> > against a particular BDI seems as a sane one to me. It's not as flexible
> > and automatic as the approach you suggested but it's much simpler and
> > solves most of problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue meaning that the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
> > speed
> >
> > won't go a long way into solving our current problems without too much
> > complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
Agreed.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, Andreas Dilger <adilger@dilger.ca>,
"Artem S. Tashkinov" <t.artem@lycos.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>, linux-mm <linux-mm@kvack.org>
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Mon, 11 Nov 2013 20:31:47 +0100 [thread overview]
Message-ID: <20131111193147.GC24867@quack.suse.cz> (raw)
In-Reply-To: <20131111032211.GT6188@dastard>
On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the rigth caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here. Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> > I disagree here. Writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite some
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, linker, ...) but the files are written out in
> > one large linear sweep. That is actually the reason why SLES (and I believe
> > RHEL as well) tune dirty_limit even higher than what's the default value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> woul depend very much on how random the patterns are...
I agree usefulness of writeback caching for random IO very much depends
on the working set size vs cache size, how random the accesses really are,
and HW characteristics. I just wanted to point out there are fairly common
workloads & setups where writeback caching for semi-random IO really helps
(because you seemed to suggest that random IO implies we should disable
writeback cache).
> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
Sorry, I was imprecise here. I really meant that IMO it doesn't make
sense to allow too much dirty data for sequentially written files.
> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second given we might only write a file once a second is likely
> going to result in pipeline stalls...
I guess this begs for real data. We agree in principle but differ in
constants :).
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
Agreed. As a primary goal we want to maximise throughput. But we want
to maintain sane latency as well (e.g. because we have a "promise" of
"dirty_writeback_centisecs" we have to cycle through dirty inodes
reasonably frequently).
> > Agreed. But the ability to limit amount of dirty pages outstanding
> > against a particular BDI seems as a sane one to me. It's not as flexible
> > and automatic as the approach you suggested but it's much simpler and
> > solves most of problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue meaning that the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
> > speed
> >
> > won't go a long way into solving our current problems without too much
> > complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
Agreed.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2013-11-11 19:31 UTC|newest]
Thread overview: 83+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-25 7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov
2013-10-25 7:25 ` Artem S. Tashkinov
2013-10-25 8:18 ` Linus Torvalds
2013-10-25 8:18 ` Linus Torvalds
2013-10-25 8:30 ` Artem S. Tashkinov
2013-10-25 8:43 ` Linus Torvalds
2013-10-25 9:15 ` Karl Kiniger
2013-10-29 20:30 ` Jan Kara
2013-10-29 20:43 ` Andrew Morton
2013-10-29 21:30 ` Jan Kara
2013-10-29 21:36 ` Linus Torvalds
2013-10-31 14:26 ` Karl Kiniger
2013-11-01 14:25 ` Maxim Patlasov
2013-11-01 14:31 ` [PATCH] mm: add strictlimit knob Maxim Patlasov
2013-11-01 14:31 ` Maxim Patlasov
2013-11-04 22:01 ` Andrew Morton
2013-11-04 22:01 ` Andrew Morton
2013-11-06 14:30 ` Maxim Patlasov
2013-11-06 14:30 ` Maxim Patlasov
2013-11-06 15:05 ` [PATCH] mm: add strictlimit knob -v2 Maxim Patlasov
2013-11-06 15:05 ` Maxim Patlasov
2013-11-07 12:26 ` Henrique de Moraes Holschuh
2013-11-07 12:26 ` Henrique de Moraes Holschuh
2013-11-22 23:45 ` Andrew Morton
2013-11-22 23:45 ` Andrew Morton
2013-10-25 11:28 ` Disabling in-memory write cache for x86-64 in Linux II David Lang
2013-10-25 9:18 ` Theodore Ts'o
2013-10-25 9:29 ` Andrew Morton
2013-10-25 9:32 ` Linus Torvalds
2013-10-26 11:32 ` Pavel Machek
2013-10-26 20:03 ` Linus Torvalds
2013-10-29 20:57 ` Jan Kara
2013-10-29 21:33 ` Linus Torvalds
2013-10-29 22:13 ` Jan Kara
2013-10-29 22:42 ` Linus Torvalds
2013-11-01 17:22 ` Fengguang Wu
2013-11-04 12:19 ` Pavel Machek
2013-11-04 12:26 ` Pavel Machek
2013-10-30 12:01 ` Mel Gorman
2013-11-19 17:17 ` Rob Landley
2013-11-20 20:52 ` One Thousand Gnomes
2013-10-25 22:37 ` Fengguang Wu
2013-10-25 23:05 ` Fengguang Wu
2013-10-25 23:37 ` Theodore Ts'o
2013-10-29 20:40 ` Jan Kara
2013-10-30 10:07 ` Artem S. Tashkinov
2013-10-30 15:12 ` Jan Kara
2013-11-05 0:50 ` Andreas Dilger
2013-11-05 0:50 ` Andreas Dilger
2013-11-05 4:12 ` Dave Chinner
2013-11-05 4:12 ` Dave Chinner
2013-11-05 4:12 ` Dave Chinner
2013-11-07 13:48 ` Jan Kara
2013-11-07 13:48 ` Jan Kara
2013-11-07 13:48 ` Jan Kara
2013-11-11 3:22 ` Dave Chinner
2013-11-11 3:22 ` Dave Chinner
2013-11-11 3:22 ` Dave Chinner
2013-11-11 19:31 ` Jan Kara [this message]
2013-11-11 19:31 ` Jan Kara
2013-11-05 6:32 ` Figo.zhang
2013-10-25 10:49 ` NeilBrown
2013-10-25 11:26 ` David Lang
2013-10-25 11:26 ` David Lang
2013-10-25 18:26 ` Artem S. Tashkinov
2013-10-25 18:26 ` Artem S. Tashkinov
2013-10-25 19:40 ` Diego Calleja
2013-10-25 19:40 ` Diego Calleja
2013-10-25 23:32 ` Fengguang Wu
2013-10-25 23:32 ` Fengguang Wu
2013-10-25 23:32 ` Fengguang Wu
2013-11-15 15:48 ` Diego Calleja
2013-11-15 15:48 ` Diego Calleja
2013-10-25 20:43 ` NeilBrown
2013-10-25 21:03 ` Artem S. Tashkinov
2013-10-25 21:03 ` Artem S. Tashkinov
2013-10-25 22:11 ` NeilBrown
2013-11-05 1:40 ` Figo.zhang
2013-11-05 1:47 ` David Lang
2013-11-05 1:47 ` David Lang
2013-11-05 2:08 ` NeilBrown
2013-10-29 20:49 ` Jan Kara
2013-10-29 20:49 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131111193147.GC24867@quack.suse.cz \
--to=jack@suse.cz \
--cc=adilger@dilger.ca \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=david@fromorbit.com \
--cc=fengguang.wu@intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=t.artem@lycos.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.