[PATCH] Improve buffered streaming write ordering

From: Chris Mason <chris.mason@oracle.com>
To: linux-kernel <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: [PATCH] Improve buffered streaming write ordering
Date: Wed, 01 Oct 2008 14:40:51 -0400
Message-ID: <1222886451.9158.34.camel@think.oraclecorp.com>

Hello everyone,

write_cache_pages can use the address space writeback_index field to
try and pick up where it left off between calls.  pdflush and
balance_dirty_pages both enable this mode in hopes of having writeback
evenly walk down the file instead of just servicing pages at the
start of the address space.

However, there is no locking around this field, and concurrent callers
of write_cache_pages on the same inode can get some very strange
results.  pdflush uses the writeback_acquire function to make sure that
only one pdflush process is servicing a given backing device, but
balance_dirty_pages does not.

When there are only a few dirty inodes in the system,
balance_dirty_pages is likely to run in parallel with pdflush on one or
two of them, leading to somewhat random updates of the writeback_index
field in struct address_space.
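
A minimal userspace sketch of the lost update described above. It is a
toy model only: every toy_* name is hypothetical, nothing here is kernel
code, and it replays a single fixed interleaving rather than a real
race.  Each pass latches the shared index when it starts and stores its
own final position when it finishes, with no lock in between.

```c
/* Hypothetical stand-in for the one field of interest in the mapping. */
struct toy_mapping {
	unsigned long writeback_index;	/* shared resume point, unlocked */
};

/* Per-caller state for one writeback pass. */
struct toy_pass {
	unsigned long index;		/* private cursor for this pass */
};

/* Start of a pass: read the shared resume point. */
static void toy_pass_begin(struct toy_pass *p, const struct toy_mapping *m)
{
	p->index = m->writeback_index;
}

/* Pretend the pass wrote nr_pages pages and moved its cursor forward. */
static void toy_pass_advance(struct toy_pass *p, unsigned long nr_pages)
{
	p->index += nr_pages;
}

/* End of a pass: store the private cursor back, last writer wins. */
static void toy_pass_end(const struct toy_pass *p, struct toy_mapping *m)
{
	m->writeback_index = p->index;
}

/* Replay one interleaving of two passes and return the stored index. */
static unsigned long toy_replay_interleaving(void)
{
	struct toy_mapping m = { 0 };
	struct toy_pass a, b;

	toy_pass_begin(&a, &m);		/* A latches 0 */
	toy_pass_begin(&b, &m);		/* B latches 0 as well */

	toy_pass_advance(&b, 8);
	toy_pass_end(&b, &m);		/* shared index becomes 8 */

	toy_pass_advance(&a, 4);
	toy_pass_end(&a, &m);		/* shared index drops back to 4 */

	return m.writeback_index;
}
```

In this interleaving the stored index moves from 8 back to 4, so the
next cyclic pass resumes behind the point the other caller had already
reached.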

The end result is very seeky writeback during streaming IO.  A 4-drive
hardware RAID0 array here can do 317MB/s streaming O_DIRECT writes on
ext4.  The test creates a new file, so O_DIRECT is really just a way to
bypass write_cache_pages.

If I do buffered writes instead, XFS does 205MB/s, and ext4 clocks in at
81.7MB/s.  Looking at the buffered IO traces for each one, we can see a
lot of seeks.

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-nopatch.png

http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-nopatch.png

The patch below changes write_cache_pages to only use writeback_index
when current_is_pdflush().  The basic idea is that pdflush is the only
caller with concurrency control against the bdi, so it is the only one
that can safely use and update writeback_index.
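
The decision rule can be modelled in userspace as follows.  This is a
sketch under assumptions: struct toy_wbc and both toy_* functions are
hypothetical stand-ins, the is_pdflush flag replaces the
current_is_pdflush() check, and range_start_page is assumed to be
already converted from a byte offset to a page index.

```c
#include <stdbool.h>

/* Hypothetical subset of the writeback control fields used here. */
struct toy_wbc {
	bool range_cyclic;
	bool range_whole;
	long nr_to_write;
	unsigned long range_start_page;	/* assumed already a page index */
};

/* Where a pass starts scanning the address space. */
static unsigned long toy_start_index(const struct toy_wbc *wbc,
				     unsigned long writeback_index,
				     bool is_pdflush)
{
	if (!wbc->range_cyclic)
		return wbc->range_start_page;

	/* Only the serialized caller resumes from the saved offset. */
	return is_pdflush ? writeback_index : 0;
}

/* Whether a pass may store its final position back into the mapping. */
static bool toy_may_publish_index(const struct toy_wbc *wbc, bool is_pdflush)
{
	return is_pdflush &&
	       (wbc->range_cyclic ||
		(wbc->range_whole && wbc->nr_to_write > 0));
}
```

With range_cyclic set and a saved index of 100, the model starts the
serialized caller at 100 and every other caller at 0, and only the
serialized caller is allowed to store a new index.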

The performance changes quite a bit:

        patched        unpatched
XFS     247MB/s        205MB/s
Ext4    246MB/s        81.7MB/s
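
To put the table in relative terms, a small helper (hypothetical name,
not part of the patch) that turns two throughput figures into a ratio:

```c
/* Ratio of two throughput figures given in the same unit (MB/s here). */
static double toy_throughput_ratio(double numerator_mbps,
				   double denominator_mbps)
{
	return numerator_mbps / denominator_mbps;
}
```

Fed with the numbers above, patched over unpatched comes out at about
1.20 for XFS (247 / 205) and about 3.01 for ext4 (246 / 81.7).  Against
the 317MB/s O_DIRECT figure, both patched buffered runs land at roughly
0.78.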

The graphs after the patch:

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-patched.png

http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-patched.png

The ext4 graph really does look strange.  What's happening there is the
lazy inode table init has dirtied a whole bunch of pages on the block
device inode.  I don't have much of an answer for why my patch makes all
of this writeback happen up front, other than that writeback_index is no
longer bouncing all over the address space.

It is also worth noting that before the patch, filefrag shows ext4 using
about 4000 extents on the file.  After the patch it is around 400.  XFS
uses 2 extents both patched and unpatched.

This is just one benchmark, and I'm not convinced this patch is right.
The ordering of pdflush vs balance_dirty_pages is very tricky, so I
definitely think we need more thought on this one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24de8b6..d799f03 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -884,7 +884,11 @@ int write_cache_pages(struct address_space *mapping,

 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
+		/* start from previous offset done by pdflush */
+		if (current_is_pdflush())
+			index = mapping->writeback_index;
+		else
+			index = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
@@ -958,7 +962,8 @@ retry:
 		index = 0;
 		goto retry;
 	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+	if (current_is_pdflush() &&
+	    (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)))
 		mapping->writeback_index = index;

 	if (wbc->range_cont)
