All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wu Fengguang <fengguang.wu@intel.com>
To: Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Artem Bityutskiy <dedekind1@gmail.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org,
	akpm@linux-foundation.org, jack@suse.cz,
	"Theodore Ts'o" <tytso@mit.edu>
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Wed, 9 Sep 2009 17:29:01 +0800	[thread overview]
Message-ID: <20090909092901.GA24185@localhost> (raw)
In-Reply-To: <20090908162936.GA2975@think>

On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote:
> On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> > > Hi,
> > > 
> > > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > > > From: Theodore Ts'o<tytso@mit.edu>
> > > >
> > > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > > > concern of not holding I_SYNC for too long.  (At least, that was the
> > > > comment previously.)  This doesn't make sense now because the only
> > > > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > > > that case we need to write out all of the data anyway.  Previously
> > > > there may have been other code paths that waited on I_SYNC, but not
> > > > any more.
> > > >
> > > > According to Christoph, the current writeback size is way too small,
> > > > and XFS had a hack that bumped out nr_to_write to four times the value
> > > > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > > > value was also problematic for ext4 as well, as it caused large files
> > > > to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> > > > the nr_to_write by a factor of two).
> > > >
> > > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > > > max_writeback_mb, and set it to a default value of 128 megabytes.
> > > >
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> > > >
> > > > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > > > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> > > 
> > > Would be nice to update doc files like
> > > 
> > > Documentation/sysctl/vm.txt
> > > Documentation/filesystems/proc.txt
> > 
> > I'm still not convinced this knob is worth the patch and I'm inclined to
> > flat out NAK it..
> > 
> > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> > dirty stats again and not write out too much.
> 
> The problem is that 'too much' is a very abstract thing.  When a process
> is stuck in balance_dirty_pages, we want them to do the minimal amount
> of work (or waiting) required to get them safely back inside file_write().

It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).

So I feel that we could just increase MAX_WRITEBACK_PAGES.  It won't
lead to bumpy throttled writes. It does affect fairness of background
writes, ie. small files will have to wait more time for large files.
But I'm fine with MAX_WRITEBACK_PAGES=64MB, which means for desktop
that a large file may only delay others for 1 second, which is small
enough comparing to the 30 second dirty expire time.

On the other hand, I find that the (ratelimit_pages + ratelimit_pages / 2)
used for balance_dirty_pages() may fall below the real number of
dirtied pages, which is not safe if some filesystem choose to dirty
2 * ratelimit_pages before calling balance_dirty_pages_ratelimited_nr().

So, how about this patch?

Thanks,
Fengguang
---

writeback: balance_dirty_pages() shall write more than dirtied pages

Some filesystem may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine number to write based on real number of dirtied pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-09 16:53:15.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-09 17:05:14.000000000 +0800
@@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -481,7 +481,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -489,7 +490,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -634,9 +634,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();

WARNING: multiple messages have this Message-ID (diff)
From: Wu Fengguang <fengguang.wu@intel.com>
To: Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Artem Bityutskiy <dedekind1@gmail.com>,
	Jens Axboe <jens.axboe@oracle.com>, linux-kernel@vger.kernel.org,
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Wed, 9 Sep 2009 17:29:01 +0800	[thread overview]
Message-ID: <20090909092901.GA24185@localhost> (raw)
In-Reply-To: <20090908162936.GA2975@think>

On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote:
> On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> > > Hi,
> > > 
> > > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > > > From: Theodore Ts'o<tytso@mit.edu>
> > > >
> > > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > > > concern of not holding I_SYNC for too long.  (At least, that was the
> > > > comment previously.)  This doesn't make sense now because the only
> > > > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > > > that case we need to write out all of the data anyway.  Previously
> > > > there may have been other code paths that waited on I_SYNC, but not
> > > > any more.
> > > >
> > > > According to Christoph, the current writeback size is way too small,
> > > > and XFS had a hack that bumped out nr_to_write to four times the value
> > > > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > > > value was also problematic for ext4 as well, as it caused large files
> > > > to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> > > > the nr_to_write by a factor of two).
> > > >
> > > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > > > max_writeback_mb, and set it to a default value of 128 megabytes.
> > > >
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> > > >
> > > > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > > > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> > > 
> > > Would be nice to update doc files like
> > > 
> > > Documentation/sysctl/vm.txt
> > > Documentation/filesystems/proc.txt
> > 
> > I'm still not convinced this knob is worth the patch and I'm inclined to
> > flat out NAK it..
> > 
> > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> > dirty stats again and not write out too much.
> 
> The problem is that 'too much' is a very abstract thing.  When a process
> is stuck in balance_dirty_pages, we want them to do the minimal amount
> of work (or waiting) required to get them safely back inside file_write().

It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).

So I feel that we could just increase MAX_WRITEBACK_PAGES.  It won't
lead to bumpy throttled writes. It does affect fairness of background
writes, ie. small files will have to wait more time for large files.
But I'm fine with MAX_WRITEBACK_PAGES=64MB, which means for desktop
that a large file may only delay others for 1 second, which is small
enough comparing to the 30 second dirty expire time.

On the other hand, I find that the (ratelimit_pages + ratelimit_pages / 2)
used for balance_dirty_pages() may fall below the real number of
dirtied pages, which is not safe if some filesystem choose to dirty
2 * ratelimit_pages before calling balance_dirty_pages_ratelimited_nr().

So, how about this patch?

Thanks,
Fengguang
---

writeback: balance_dirty_pages() shall write more than dirtied pages

Some filesystem may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine number to write based on real number of dirtied pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-09 16:53:15.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-09 17:05:14.000000000 +0800
@@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -481,7 +481,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -489,7 +490,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -634,9 +634,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();

  parent reply	other threads:[~2009-09-09  9:29 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
2009-09-08 10:27   ` Artem Bityutskiy
2009-09-08 10:27     ` Artem Bityutskiy
2009-09-08 10:41     ` Jens Axboe
2009-09-08 10:52       ` Artem Bityutskiy
2009-09-08 10:57         ` Jens Axboe
2009-09-08 11:01           ` Artem Bityutskiy
2009-09-08 11:01             ` Artem Bityutskiy
2009-09-08 11:05             ` Jens Axboe
2009-09-08 11:31               ` Artem Bityutskiy
2009-09-08 11:31                 ` Artem Bityutskiy
2009-09-08  9:23 ` [PATCH 2/8] writeback: move dirty inodes from super_block to backing_dev_info Jens Axboe
2009-09-08  9:23 ` [PATCH 3/8] writeback: switch to per-bdi threads for flushing data Jens Axboe
2009-09-08 13:46   ` Daniel Walker
2009-09-08 14:21     ` Jens Axboe
2009-09-08  9:23 ` [PATCH 4/8] writeback: get rid of pdflush completely Jens Axboe
2009-09-08  9:23 ` [PATCH 5/8] writeback: add some debug inode list counters to bdi stats Jens Axboe
2009-09-08  9:23 ` [PATCH 6/8] writeback: add name to backing_dev_info Jens Axboe
2009-09-08  9:23 ` [PATCH 7/8] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
2009-09-08  9:23 ` [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb Jens Axboe
2009-09-08 10:37   ` Artem Bityutskiy
2009-09-08 10:37     ` Artem Bityutskiy
2009-09-08 16:06     ` Peter Zijlstra
2009-09-08 16:29       ` Chris Mason
2009-09-08 16:56         ` Peter Zijlstra
2009-09-08 17:28           ` Chris Mason
2009-09-08 17:46             ` Peter Zijlstra
2009-09-08 17:55               ` Peter Zijlstra
2009-09-08 18:32                 ` Peter Zijlstra
2009-09-09 14:23                   ` Jan Kara
2009-09-09 14:37                     ` Wu Fengguang
2009-09-10 15:49                     ` Peter Zijlstra
2009-09-14 11:17                       ` Jan Kara
2009-09-24  8:33                         ` Wu Fengguang
2009-09-24 15:38                           ` Peter Zijlstra
2009-09-25  1:33                             ` Wu Fengguang
2009-09-29 17:35                           ` Jan Kara
2009-09-30  1:24                             ` Wu Fengguang
2009-09-30 11:55                               ` Jan Kara
2009-09-30 12:10                                 ` Jens Axboe
2009-10-01 15:17                                   ` Wu Fengguang
2009-10-01 13:36                                 ` Wu Fengguang
2009-10-01 14:22                                   ` Jan Kara
2009-10-01 14:54                                     ` Wu Fengguang
2009-10-01 21:35                                       ` Jan Kara
2009-10-02  2:25                                         ` Wu Fengguang
2009-10-02  9:54                                           ` Jan Kara
2009-10-02 10:34                                             ` Wu Fengguang
2009-09-08 18:35                 ` Chris Mason
2009-09-08 17:57               ` Chris Mason
2009-09-08 18:28                 ` Peter Zijlstra
2009-09-09  1:53           ` Dave Chinner
2009-09-09  3:52             ` Wu Fengguang
2009-09-08 18:06         ` Theodore Tso
2009-09-08 18:06           ` Theodore Tso
2009-09-08 18:19           ` Christoph Hellwig
2009-09-08 19:34             ` Theodore Tso
2009-09-09  9:29         ` Wu Fengguang [this message]
2009-09-09  9:29           ` Wu Fengguang
2009-09-09 12:28           ` Christoph Hellwig
2009-09-09 12:32             ` Wu Fengguang
2009-09-09 12:36               ` Artem Bityutskiy
2009-09-09 12:36                 ` Artem Bityutskiy
2009-09-09 12:37               ` Jens Axboe
2009-09-09 12:43                 ` Christoph Hellwig
2009-09-09 12:44                   ` Jens Axboe
2009-09-09 12:51                     ` Christoph Hellwig
2009-09-09 12:57                 ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2009-09-04  7:46 [PATCH 0/8] Per-bdi writeback flusher threads v18 Jens Axboe
2009-09-04  7:46 ` [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb Jens Axboe
2009-09-04 15:28   ` Richard Kennedy
2009-09-05 13:26     ` Jamie Lokier
2009-09-05 16:18       ` Richard Kennedy
2009-09-05 16:46     ` Theodore Tso
2009-09-07 19:09   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090909092901.GA24185@localhost \
    --to=fengguang.wu@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dedekind1@gmail.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jens.axboe@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.