From: Wu Fengguang <fengguang.wu@intel.com>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: "Li, Shaohua" <shaohua.li@intel.com>,
	lkml <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chris Mason <chris.mason@oracle.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>
Subject: Re: [RFC] page-writeback: move indoes from one superblock together
Date: Thu, 24 Sep 2009 21:46:25 +0800	[thread overview]
Message-ID: <20090924134625.GA2507@localhost> (raw)
In-Reply-To: <20090924132949.GH23126@kernel.dk>

On Thu, Sep 24, 2009 at 09:29:50PM +0800, Jens Axboe wrote:
> On Thu, Sep 24 2009, Wu Fengguang wrote:
> > On Thu, Sep 24, 2009 at 08:35:19PM +0800, Jens Axboe wrote:
> > > On Thu, Sep 24 2009, Wu Fengguang wrote:
> > > > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > > > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a
> > > > > disk has several partitions, writeback can keep the spindle moving between
> > > > > partitions. To reduce that movement, it is better to write a big chunk of
> > > > > one partition and then move on to another. Inodes from one fs are usually
> > > > > in one partition, so ideally moving inodes from one fs together should
> > > > > reduce spindle movement. This patch tries to address that. Before per-bdi
> > > > > writeback was added, the behavior was to write inodes from one fs first and
> > > > > then another, so the patch restores the previous behavior. The loop in the
> > > > > patch is a bit ugly; should we add a dirty list for each superblock in
> > > > > bdi_writeback?
> > > > > 
> > > > > A test on a two-partition disk with the attached fio script shows about
> > > > > 3% ~ 6% improvement.
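
To make the idea concrete for readers new to this code, it is roughly the
sketch below. This is a simplified illustration only, not Shaohua's actual
patch: the function name is made up, and it assumes the 2.6.31-era i_list,
i_sb and dirtied_when fields of struct inode.

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/jiffies.h>

/*
 * Sketch: drain the expired inodes into a temporary list, then splice
 * them out to the dispatch queue one superblock at a time, so that
 * inodes of the same fs (and thus usually the same partition) are
 * written back together.
 */
static void move_expired_inodes_by_sb(struct list_head *delaying_queue,
				      struct list_head *dispatch_queue,
				      unsigned long *older_than_this)
{
	LIST_HEAD(tmp);
	struct inode *inode, *next;
	struct super_block *sb;

	/* 1) pull out every inode that has expired */
	while (!list_empty(delaying_queue)) {
		inode = list_entry(delaying_queue->prev, struct inode, i_list);
		if (older_than_this &&
		    time_after(inode->dirtied_when, *older_than_this))
			break;
		list_move(&inode->i_list, &tmp);
	}

	/* 2) repeatedly pick a superblock and move all its inodes together */
	while (!list_empty(&tmp)) {
		sb = list_first_entry(&tmp, struct inode, i_list)->i_sb;
		list_for_each_entry_safe(inode, next, &tmp, i_list) {
			if (inode->i_sb == sb)
				list_move_tail(&inode->i_list, dispatch_queue);
		}
	}
}
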
> > > > 
> > > > A side note: given the noticeable performance gain, I wonder whether it
> > > > is worth generalizing the idea to whole-disk, location-ordered writeback.
> > > > That should benefit many small-file workloads by more than 10%, because
> > > > this patch only sorts 2 partitions' inodes within a 5s time window, while
> > > > the patch below roughly divides the disk into 5 areas and sorts inodes
> > > > within a larger 25s time window.
> > > > 
> > > >         http://lkml.org/lkml/2007/8/27/45
> > > > 
> > > > Judging from this old patch, the complexity cost would be about 250
> > > > lines of code (it needs an rbtree).
> > > 
> > > First of all, nice patch, I'll add it to the current tree. I too was
> > 
> > You mean Shaohua's patch? It should be a good addition for 2.6.32.
> 
> Yes indeed, the parent patch.
> 
> > In the long term, move_expired_inodes() needs some rework, because it
> > could be time consuming to move around all the inodes on a large
> > system, and thus hold inode_lock for too long (and this patch scales
> > up the locked time).
> 
> It does. As mentioned in my reply, for 100 inodes or less, it will still
> be faster than e.g. using an rbtree. But the more "reliable" runtime of
> an rbtree-based solution is appealing. It's not hugely critical, though.

Agreed. Desktops are not a big worry, and servers rarely use many
partitions per disk.

> > So we would need to split the list moves into smaller pieces in the
> > future, or change the data structure.
> 
> Yes, those are the two options.
> 
> > > pondering using an rbtree for sb+dirty_time insertion and extraction.

Note that dirty_time may not be unique, so some workaround is needed.  And
the resulting rbtree implementation may not be more efficient than several
list traversals, even for a very large list (as long as the number of
superblocks is low).
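
In case it helps, something like the sketch below would do; it is purely
illustrative and not a real patch. Since struct inode has no rb_node, the
sketch hangs each dirty inode off a small made-up wrapper, and it keys the
tree on (sb, dirtied_when).

#include <linux/fs.h>
#include <linux/rbtree.h>
#include <linux/jiffies.h>

/* Sketch only: struct inode has no rb_node, so use a tiny wrapper. */
struct dirty_node {
	struct rb_node	 rb;
	struct inode	*inode;
};

static int dirty_key_cmp(struct inode *a, struct inode *b)
{
	if (a->i_sb != b->i_sb)
		return a->i_sb < b->i_sb ? -1 : 1;
	if (a->dirtied_when == b->dirtied_when)
		return 0;
	return time_before(a->dirtied_when, b->dirtied_when) ? -1 : 1;
}

/*
 * Non-unique keys are harmless here: equal keys are consistently sent
 * to the right, and writeback only ever walks the tree in order, it
 * never looks an inode up by exact key.
 */
static void dirty_tree_insert(struct rb_root *root, struct dirty_node *new)
{
	struct rb_node **p = &root->rb_node;
	struct rb_node *parent = NULL;

	while (*p) {
		struct dirty_node *entry = rb_entry(*p, struct dirty_node, rb);

		parent = *p;
		if (dirty_key_cmp(new->inode, entry->inode) < 0)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;	/* equal keys go right */
	}
	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
}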

The good side is that once an sb+dirty_time rbtree is implemented, it
should be trivial to switch the key to sb+inode_number (which may also not
be unique) and do location-ordered writeback ;)
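
Concretely, only the key comparison in the sketch above would change,
roughly like this (using the inode number as a crude proxy for on-disk
location):

/* Same wrapper and insert as above, different key. */
static int location_key_cmp(struct inode *a, struct inode *b)
{
	if (a->i_sb != b->i_sb)
		return a->i_sb < b->i_sb ? -1 : 1;
	if (a->i_ino == b->i_ino)
		return 0;
	return a->i_ino < b->i_ino ? -1 : 1;
}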

Thanks,
Fengguang

> > FYI Michael Rubin did some work on a rbtree implementation, just
> > in case you are interested:
> > 
> >         http://lkml.org/lkml/2008/1/15/25
> 
> Thanks, I'll take a look.
> 
> -- 
> Jens Axboe


Thread overview: 15+ messages
     [not found] <1253775260.10618.10.camel@sli10-desk.sh.intel.com>
2009-09-24  7:14 ` [RFC] page-writeback: move indoes from one superblock together Wu Fengguang
2009-09-24  7:29   ` Arjan van de Ven
2009-09-24  7:36     ` Wu Fengguang
2009-09-24  7:44   ` Shaohua Li
2009-09-24 13:17     ` Jens Axboe
2009-09-24 13:29       ` Wu Fengguang
2009-09-24 10:01 ` Wu Fengguang
2009-09-24 12:35   ` Jens Axboe
2009-09-24 13:22     ` Wu Fengguang
2009-09-24 13:29       ` Jens Axboe
2009-09-24 13:46         ` Wu Fengguang [this message]
2009-09-24 13:52           ` Arjan van de Ven
2009-09-24 14:09             ` Wu Fengguang
2009-09-25  4:16               ` Dave Chinner
2009-09-25  5:09                 ` Wu Fengguang
