From: Andrew Morton <akpm@linux-foundation.org>
To: Chris Mason <chris.mason@oracle.com>
Cc: Chuck Ebbert <cebbert@redhat.com>, linux-kernel@vger.kernel.org
Subject: Re: filesystem benchmarking fun
Date: Tue, 22 May 2007 11:21:20 -0700
Message-ID: <20070522112120.4a5c6a5d.akpm@linux-foundation.org>
In-Reply-To: <20070522163511.GB6138@think.oraclecorp.com>

On Tue, 22 May 2007 12:35:11 -0400
Chris Mason <chris.mason@oracle.com> wrote:

> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > On Wed, 16 May 2007 16:14:14 -0400
> > Chris Mason <chris.mason@oracle.com> wrote:
> > 
> > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > The good news is that if you let it run long enough, the times
> > > > > stabilize.  The bad news is:
> > > > > 
> > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > > > 
> > > > well hang on.  Doesn't this just mean that the first few runs were writing
> > > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > > > 
> > > > Or do you have a sync in there?
> > > > 
> > > There's no sync, but if you watch vmstat you can clearly see the log
> > > flushes, even when the overall create times are 11MB/s.  vmstat goes
> > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
> > 
> > How do you know that it is a log flush rather than, say, pdflush
> > hitting the blockdev inode and doing a big seeky write?
> 
> Ok, I did some more work to split out the two cases (block device inode
> writeback and log flushing).
> 
> I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> write in a radix tree, then send them all down in order at the end.

Side note: we already have all of that capability in the kernel:
sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
blockdev.

It could be that as a quick diddle, running sync_inode() in
do-block-on-queue-congestion mode prior to doing the checkpoint would have
some benefit.
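
To make the ordering argument concrete, here is a minimal userspace sketch in C. It is not jbd or kernel code: `sort_for_submit()` and `seek_distance()` are hypothetical helpers, `qsort()` stands in for the radix tree used in the patch described above, and "head travel" is simply the sum of absolute block-number differences, which is only a rough proxy for real seek cost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparison callback: ascending block number. */
int cmp_block(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: total head travel, in blocks, if blocks[] is
 * issued in array order starting with the head at block `start`.
 */
uint64_t seek_distance(const uint64_t *blocks, size_t n, uint64_t start)
{
    uint64_t pos = start, total = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total += blocks[i] > pos ? blocks[i] - pos : pos - blocks[i];
        pos = blocks[i];
    }
    return total;
}

/*
 * Hypothetical helper: reorder the collected blocks so they can be
 * sent down in ascending order in one pass.
 */
void sort_for_submit(uint64_t *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);
    qsort(blocks, n, sizeof(*blocks), cmp_block);
}
```

Calling `seek_distance()` on a scattered list before and after `sort_for_submit()` shows how much travel the ascending pass saves under this simplified model.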

> The elevator should be helping here, but jbd is sending down 2,000
> to 3,000 blocks during the checkpoint and upping nr_requests alone
> didn't seem to be doing the trick.
> 
> Unpatched ext3 would break down into seeks after 8 kernel trees are
> created (222MB each).  With the radix sorting, the first 15 kernel trees
> are created quickly, and then we slow down.
> 
> So I waited until around the 25th kernel tree was created, hit ctrl-c
> and ran sync.  vmstat showed writes going at 2MB/s, and sysrq-w showed
> sync was running the block device inode for most of the 2MB/s period.
> 
> It looks as though the dirty pages on the block device inode are spread
> out far enough that we're not getting good streaming writes.  Mark
> Fasheh ran on a bigger raid array, where performance was consistently
> good for the whole run.  I'm assuming the larger write cache on the
> array was able to group the data writes with the metadata on disk, while
> my poor little sata drive wasn't.  Dave Chinner hinted that xfs is
> probably suffering a similar problem, which is usually fixed by backing
> the FS with stripes and big raid.
> 
> My vaporware FS is able to maintain speed through the run because the
> allocator tries to keep data and metadata grouped into 256MB chunks,
> and so they don't end up mingling on disk until things get full.
> 
> At any rate, it may be worth putzing with the writeback routines to try
> and find dirty pages close by in the block dev inode when doing data
> writeback.  My guess is that ext3 should be going 1.5x to 2x faster for
> this particular run, but that's a huge amount of complexity added so I'm
> not convinced it is a great idea.

Yes, this is a distinct disadvantage of the whole per-address-space
writeback scheme - we're leaving IO scheduling optimisations on the floor,
especially wrt the blockdev inode, but probably also wrt regular-file
versus regular-file.  Even if one makes the request queue tremendously
huge, that won't help if there's dirty data close to the disk head which
hasn't even been put into the queue yet.
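
A toy model of that cost, as a sketch only: two address spaces (say a regular file and the blockdev inode) each hold dirty blocks that interleave on disk. `per_mapping_travel()` flushes one mapping completely and then the other, which is roughly what per-address-space writeback does; `merged_travel()` issues both sets in a single ascending pass. Both function names are hypothetical, both input arrays are assumed to be sorted ascending, and head travel is again just the sum of block-number differences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Head travel for issuing blocks[] in array order, continuing from *pos. */
uint64_t travel(const uint64_t *blocks, size_t n, uint64_t *pos)
{
    uint64_t total = 0;

    assert(pos != NULL);
    for (size_t i = 0; i < n; i++) {
        total += blocks[i] > *pos ? blocks[i] - *pos : *pos - blocks[i];
        *pos = blocks[i];
    }
    return total;
}

/* Per-address-space order: all of a[], then all of b[], head starts at 0. */
uint64_t per_mapping_travel(const uint64_t *a, size_t na,
                            const uint64_t *b, size_t nb)
{
    uint64_t pos = 0;
    uint64_t total = travel(a, na, &pos);

    return total + travel(b, nb, &pos);
}

/* One ascending pass over both mappings (inputs must be sorted ascending). */
uint64_t merged_travel(const uint64_t *a, size_t na,
                       const uint64_t *b, size_t nb)
{
    uint64_t pos = 0, total = 0;
    size_t i = 0, j = 0;

    while (i < na || j < nb) {
        uint64_t next;

        /* Take the lower of the two next blocks, as a merge sort would. */
        if (j >= nb || (i < na && a[i] <= b[j]))
            next = a[i++];
        else
            next = b[j++];
        total += next > pos ? next - pos : pos - next;
        pos = next;
    }
    return total;
}
```

In this model a bigger request queue corresponds to nothing at all: the second mapping's blocks are not visible until the first mapping has been flushed, so the extra travel in `per_mapping_travel()` remains however deep the queue is.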

