From: Wu Fengguang <fengguang.wu@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Chris Mason <chris.mason@oracle.com>,
Artem Bityutskiy <dedekind1@gmail.com>,
Jens Axboe <jens.axboe@oracle.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"hch@infradead.org" <hch@infradead.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"jack@suse.cz" <jack@suse.cz>, Theodore Ts'o <tytso@mit.edu>
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Wed, 9 Sep 2009 11:52:48 +0800
Message-ID: <20090909035248.GA21494@localhost>
In-Reply-To: <20090909015359.GC7146@discord.disaster>
On Wed, Sep 09, 2009 at 09:53:59AM +0800, Dave Chinner wrote:
> On Tue, Sep 08, 2009 at 06:56:23PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 12:29 -0400, Chris Mason wrote:
> > > Either way, if pdflush or the bdi thread or whoever ends up switching to
> > > another file during a big streaming write, the end result is that we
> > > fragment. We may fragment the file (ext4) or we may fragment the
> > > writeback (xfs), but the end result isn't good.
> >
> > OK, so what we want is for a way to re-enter the whole
> > writeback_inodes() path onto the same file, right?
>
> No, that would take us back to the Bad Old Days where one large
> file write can starve out the other 10,000 small files that need to
> be written. The old writeback code used to end up in this way
> because it didn't rotate large files to the back of the dirty inode
> queue once wbc->nr_to_write was exhausted. This could cause files
> not to be written back for tens of minutes....
The problem is that there is no per-file writeback quota.
Here is a quick demo of an idea: continue writeback of the last file
as long as its quota has not been exceeded. It also fixes the problem
of writeback being aborted prematurely on congestion. The end result is
that writeback of big files is no longer broken into small chunks by
intermixed small files or by congestion.
Thanks,
Fengguang
---
writeback: ensure big files are written in MAX_WRITEBACK_PAGES chunks
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 39 ++++++++++++++++++++++++++++++++++--
include/linux/writeback.h | 11 ++++++++++
mm/page-writeback.c | 9 --------
3 files changed, 48 insertions(+), 11 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-09-09 10:02:30.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-09-09 11:42:19.000000000 +0800
@@ -218,6 +218,19 @@ static void requeue_io(struct inode *ino
list_move(&inode->i_list, &inode->i_sb->s_more_io);
}
+/*
+ * continue io on this inode on next writeback if
+ * it has not accumulated large enough writeback io chunk
+ */
+static void requeue_partial_io(struct writeback_control *wbc, struct inode *inode)
+{
+ if (wbc->last_file_written == 0 ||
+ wbc->last_file_written >= MAX_WRITEBACK_PAGES)
+ return requeue_io(inode);
+
+ list_move_tail(&inode->i_list, &inode->i_sb->s_io);
+}
+
static void inode_sync_complete(struct inode *inode)
{
/*
@@ -311,6 +324,8 @@ writeback_single_inode(struct inode *ino
{
struct address_space *mapping = inode->i_mapping;
int wait = wbc->sync_mode == WB_SYNC_ALL;
+ long last_file_written;
+ long nr_to_write;
unsigned dirty;
int ret;
@@ -348,8 +363,21 @@ writeback_single_inode(struct inode *ino
spin_unlock(&inode_lock);
+ if (wbc->last_file != inode->i_ino)
+ last_file_written = 0;
+ else
+ last_file_written = wbc->last_file_written;
+ wbc->nr_to_write -= last_file_written;
+ nr_to_write = wbc->nr_to_write;
+
ret = do_writepages(mapping, wbc);
+ if (wbc->last_file != inode->i_ino) {
+ wbc->last_file = inode->i_ino;
+ wbc->last_file_written = nr_to_write - wbc->nr_to_write;
+ } else
+ wbc->last_file_written += nr_to_write - wbc->nr_to_write;
+
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wait);
@@ -378,11 +406,16 @@ writeback_single_inode(struct inode *ino
* sometimes bales out without doing anything.
*/
inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
+ if (wbc->encountered_congestion) {
+ /*
+ * keep and retry after congestion
+ */
+ requeue_partial_io(wbc, inode);
+ } else if (wbc->nr_to_write <= 0) {
/*
* slice used up: queue for next turn
*/
- requeue_io(inode);
+ requeue_partial_io(wbc, inode);
} else {
/*
* somehow blocked: retry later
@@ -402,6 +435,8 @@ writeback_single_inode(struct inode *ino
}
}
inode_sync_complete(inode);
+ wbc->nr_to_write += last_file_written;
+
return ret;
}
--- linux.orig/include/linux/writeback.h 2009-09-09 11:13:43.000000000 +0800
+++ linux/include/linux/writeback.h 2009-09-09 11:41:40.000000000 +0800
@@ -25,6 +25,15 @@ static inline int task_is_pdflush(struct
#define current_is_pdflush() task_is_pdflush(current)
/*
+ * The maximum number of pages to writeout in a single bdflush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
@@ -45,6 +54,8 @@ struct writeback_control {
older than this */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
+ unsigned long last_file; /* Inode number of last written file */
+ long last_file_written; /* Total pages written for last file */
long pages_skipped; /* Pages which were not written */
/*
--- linux.orig/mm/page-writeback.c 2009-09-09 10:05:02.000000000 +0800
+++ linux/mm/page-writeback.c 2009-09-09 11:41:01.000000000 +0800
@@ -36,15 +36,6 @@
#include <linux/pagevec.h>
/*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
-/*
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
* will look to see if it needs to force writeback or throttling.
*/