linux-fsdevel.vger.kernel.org archive mirror
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	ext4 <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH] Improve buffered streaming write ordering
Date: Thu, 2 Oct 2008 23:38:21 +0530
Message-ID: <20081002180821.GA29613@skywalker>
In-Reply-To: <1222886451.9158.34.camel@think.oraclecorp.com>

On Wed, Oct 01, 2008 at 02:40:51PM -0400, Chris Mason wrote:
> Hello everyone,
> 
> write_cache_pages can use the address space writeback_index field to
> try and pick up where it left off between calls.  pdflush and
> balance_dirty_pages both enable this mode in hopes of having writeback
> evenly walk down the file instead of just servicing pages at the
> start of the address space.
> 
> But, there is no locking around this field, and concurrent callers of
> write_cache_pages on the same inode can get some very strange results.
> pdflush uses the writeback_acquire function to make sure that only one
> pdflush process is servicing a given backing device, but
> balance_dirty_pages does not.
> 
> When there are a small number of dirty inodes in the system,
> balance_dirty_pages is likely to run in parallel with pdflush on one or
> two of them, leading to somewhat random updates of the writeback_index
> field in struct address_space.
> 
> The end result is very seeky writeback during streaming IO.  A 4 drive
> hardware raid0 array here can do 317MB/s streaming O_DIRECT writes on
> ext4.  This is creating a new file, so O_DIRECT is really just a way to
> bypass write_cache_pages.
> 
> If I do buffered writes instead, XFS does 205MB/s, and ext4 clocks in at
> 81.7MB/s.  Looking at the buffered IO traces for each one, we can see a
> lot of seeks.
> 
> http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-nopatch.png
> 
> http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-nopatch.png
> 
> The patch below changes write_cache_pages to only use writeback_index
> when current_is_pdflush().  The basic idea is that pdflush is the only
> one who has concurrency control against the bdi, so it is the only one
> who can safely use and update writeback_index.
> 
> The performance changes quite a bit:
> 
>         patched        unpatched
> XFS     247MB/s        205MB/s
> Ext4    246MB/s        81.7MB/s


That is nice.

> 
> The graphs after the patch:
> 
> http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-patched.png
> 
> http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-patched.png
> 
> The ext4 graph really does look strange.  What's happening there is the
> lazy inode table init has dirtied a whole bunch of pages on the block
> device inode.  I don't have much of an answer for why my patch makes all
> of this writeback happen up front, other than writeback_index is no
> longer bouncing all over the address space.
> 
> It is also worth noting that before the patch, filefrag shows ext4 using
> about 4000 extents on the file.  After the patch it is around 400.  XFS
> uses 2 extents both patched and unpatched.
> 

Ext4 does block allocation in ext4_da_writepages. So if we feed the
block allocator highly bouncing index values, we may end up with a
larger number of extents. That said, the new mballoc block allocator
should do better here, because it reserves space based on the logical
block number in the file.

-aneesh

Thread overview: 24+ messages
2008-10-01 18:40 [PATCH] Improve buffered streaming write ordering Chris Mason
2008-10-02  4:52 ` Andrew Morton
2008-10-02 12:20   ` Chris Mason
2008-10-02 16:12     ` Chris Mason
2008-10-02 18:18     ` Aneesh Kumar K.V
2008-10-02 19:44       ` Andrew Morton
2008-10-02 23:43       ` Dave Chinner
2008-10-03 19:45         ` Chris Mason
2008-10-06 10:16           ` Aneesh Kumar K.V
2008-10-06 14:21             ` Chris Mason
2008-10-07  8:45               ` Aneesh Kumar K.V
2008-10-07  9:05                 ` Christoph Hellwig
2008-10-07 10:02                   ` Aneesh Kumar K.V
2008-10-07 13:29                     ` Theodore Tso
2008-10-07 13:36                       ` Christoph Hellwig
2008-10-07 14:46                         ` Nick Piggin
2008-10-07 13:55                     ` Peter Staubach
2008-10-07 14:38                       ` Chuck Lever
2008-10-09 15:11         ` Chris Mason
2008-10-10  5:13           ` Dave Chinner
2008-10-03  1:11       ` Chris Mason
2008-10-03  2:43         ` Nick Piggin
2008-10-03 12:07           ` Chris Mason
2008-10-02 18:08 ` Aneesh Kumar K.V [this message]