From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Chris Mason <chris.mason@oracle.com>
Cc: David Chinner <dgc@sgi.com>, Andrew Morton <akpm@osdl.org>,
	Ken Chen <kenchen@google.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3
Date: Fri, 24 Aug 2007 20:56:43 +0800	[thread overview]
Message-ID: <387960203.03917@ustc.edu.cn> (raw)
Message-ID: <20070824125643.GB7933@mail.ustc.edu.cn> (raw)
In-Reply-To: <20070823081341.27807ad0@think.oraclecorp.com>

On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <dgc@sgi.com> wrote:
> 
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers.  We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning.  But perhaps that's
> > > what you're saying...
> > > 
> > > At any rate, we've got two types of lists now.  One keeps track of
> > > age and the other two keep track of what is currently being
> > > written.  I would try two things:
> > > 
> > > 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode).  Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > > 
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started.  Any nearby inodes in s_io are
> > > also sent down.
> > 
> > The problem with this approach is that it only looks at inode
> > locality. Data locality is ignored completely here and the data for
> > all the inodes that are close together could be splattered all over
> > the drive. In that case, clustering by inode location is exactly the
> > wrong thing to do.
> 
> Usually it won't be more wrong than clustering by time.
> 
> > 
> > For example, XFS changes allocation strategy at 1TB for 32-bit inode
> > filesystems which makes the data get placed way away from the inodes.
> > i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
> > by inode number for data writeback is mostly useless in the >1TB
> > case.
> 
> I agree we'll want a way to let the FS provide the clustering key.  But
> for the first cut on the patch, I would suggest keeping it simple.
> 
> > 
> > The inode32 for <1TB and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG) so clustering by inode number
> > might work better here.
> > 
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering. This would help
> > the generic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
> 
> Yes, also a good idea after things are working.
> 
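The "closeness" mask suggested above could look roughly like the following user-space sketch. Everything here is an assumption made for illustration: `struct cluster_hint`, `same_cluster()` and `count_cluster_mates()` are hypothetical names, not existing kernel interfaces, and the idea that a filesystem would express closeness as a bit mask over the inode number is only one possible reading of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filesystem hint: the bits of an inode number that
 * identify its cluster.  Not an existing kernel structure. */
struct cluster_hint {
    unsigned long mask;
};

/* Two inodes are "close" if their masked inode numbers are equal. */
static int same_cluster(const struct cluster_hint *hint,
                        unsigned long first_ino, unsigned long ino)
{
    assert(hint != NULL);
    return (first_ino & hint->mask) == (ino & hint->mask);
}

/* Count how many candidates fall into the same cluster as first_ino. */
static size_t count_cluster_mates(const struct cluster_hint *hint,
                                  unsigned long first_ino,
                                  const unsigned long *inos, size_t n)
{
    size_t mates = 0;

    for (size_t i = 0; i < n; i++)
        if (same_cluster(hint, first_ino, inos[i]))
            mates++;
    return mates;
}
```

With a mask of `~0x3FUL` (an assumed cluster size of 64 inodes), generic code would cluster writes only among inodes whose numbers share the same 64-aligned range as the first inode.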
> > 
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > >     filesystems other than ext2/3/4. Or parent dir?
> > > 
> > > In general, it is a better assumption than sorting by time.  It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> > 
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
> 
> So, my feature creep idea would have been more data clustering.  I'm
> mainly trying to solve this graph:
> 
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
> 
> Where background writing of the block device inode is making ext3 do
> seeky writes while creating directory trees.  My simple idea was to kick
> off a 'I've just written block X' call back to the FS, where it may
> decide to send down dirty chunks of the block device inode that also
> happen to be dirty.
> 
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky.  So, I wasn't going to
> suggest it until the basic machinery was working.
> 
> Fengguang, this isn't a small project ;)  But, lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.
As for writeback clustering, inode/data localities can be different.
But I'll follow your suggestion to start simple first and give the
idea a spin on ext3.
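
To make the "start simple" variant concrete, here is a rough user-space sketch of the idea discussed above: keep a FIFO ordered by dirty time, pick the inodes older than a chosen start time, and send down any dirty inodes with nearby inode numbers along with them. It is only an illustration under stated assumptions. `struct dirty_inode` and `pick_writeback_batch()` are hypothetical names, a fixed `window` on the inode number stands in for whatever clustering key is eventually used, and a linear scan stands in for the radix tree lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dirty inode; not the kernel's struct inode. */
struct dirty_inode {
    unsigned long ino;          /* inode number, used as the clustering key */
    unsigned long dirtied_when; /* time the inode was first dirtied */
    int queued;                 /* nonzero once picked for writeback */
};

/*
 * fifo[] is ordered oldest-first, like s_dirty.  Pick every inode dirtied
 * at or before older_than, and with each one also pick any inode whose
 * number lies within window of it, regardless of age.  The inner linear
 * scan stands in for a radix tree lookup.
 * Returns the number of inode numbers stored in out[].
 */
static size_t pick_writeback_batch(struct dirty_inode *fifo, size_t n,
                                   unsigned long older_than,
                                   unsigned long window,
                                   unsigned long *out, size_t out_cap)
{
    size_t count = 0;

    assert(fifo != NULL && out != NULL);

    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (fifo[i].dirtied_when > older_than)
            break;              /* FIFO order: everything after is newer */
        if (fifo[i].queued)
            continue;           /* already sent down as a neighbor */

        fifo[i].queued = 1;
        out[count++] = fifo[i].ino;

        for (size_t j = 0; j < n && count < out_cap; j++) {
            unsigned long a = fifo[i].ino, b = fifo[j].ino;
            unsigned long dist = a > b ? a - b : b - a;

            if (!fifo[j].queued && dist <= window) {
                fifo[j].queued = 1;
                out[count++] = fifo[j].ino;
            }
        }
    }
    return count;
}
```

The scan stops at the first inode newer than the start time, so inodes dirtied during the pass do not extend it. Younger inodes are written early only when they sit next to an expired one.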

-fengguang

