linux-fsdevel.vger.kernel.org archive mirror
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Chris Mason <chris.mason@oracle.com>
Cc: David Chinner <dgc@sgi.com>, Andrew Morton <akpm@osdl.org>,
	Ken Chen <kenchen@google.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3
Date: Fri, 24 Aug 2007 20:56:43 +0800
Message-ID: <387960203.03917@ustc.edu.cn>
Message-ID: <20070824125643.GB7933@mail.ustc.edu.cn>
In-Reply-To: <20070823081341.27807ad0@think.oraclecorp.com>

On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <dgc@sgi.com> wrote:
> 
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers.  We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning.  But perhaps that's
> > > what you're saying...
> > > 
> > > At any rate, we've got two types of lists now.  One keeps track of
> > > age and the other two keep track of what is currently being
> > > written.  I would try two things:
> > > 
> > > 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode).  Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > > 
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started.  Any nearby inodes in s_io are
> > > also sent down.
> > 
> > The problem with this approach is that it only looks at inode
> > locality. Data locality is ignored completely here and the data for
> > all the inodes that are close together could be splattered all over
> > the drive. In that case, clustering by inode location is exactly the
> > wrong thing to do.
> 
> Usually it won't be less wrong than clustering by time.
> 
> > 
> > For example, XFS changes allocation strategy at 1TB for 32bit inode
> > filesystems which makes the data get placed way away from the inodes.
> > i.e. inodes in AGs below 1TB, all data in AGs > 1TB. Clustering
> > by inode number for data writeback is mostly useless in the >1TB
> > case.
> 
> I agree we'll want a way to let the FS provide the clustering key.  But
> for the first cut on the patch, I would suggest keeping it simple.
> 
> > 
> > The inode32 for <1TB and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG) so clustering by inode number
> > might work better here.
> > 
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering. This would help
> > the generic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
> 
> Yes, also a good idea after things are working.
> 
> > 
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > >     filesystems other than ext2/3/4. Or parent dir?
> > > 
> > > In general, it is a better assumption than sorting by time.  It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> > 
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
> 
> So, my feature creep idea would have been more data clustering.  I'm
> mainly trying to solve this graph:
> 
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
> 
> Where background writing of the block device inode is making ext3 do
> seeky writes while creating directory trees.  My simple idea was to kick
> off a 'I've just written block X' callback to the FS, where it may
> decide to send down chunks of the block device inode that also
> happen to be dirty.
> 
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky.  So, I wasn't going to
> suggest it until the basic machinery was working.
> 
> Fengguang, this isn't a small project ;)  But, lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.
As for writeback clustering, inode locality and data locality can differ.
But I'll follow your suggestion to start simple first and give the
idea a spin on ext3.
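
To make the "start simple" idea above concrete, here is a rough
user-space toy model in Python. It is not kernel code and not a
proposed implementation. It keeps an age-ordered FIFO (standing in for
s_dirty) plus an index sorted by a clustering key (standing in for the
radix tree suggested for s_io), and when the oldest expired inode is
picked it also sends down any dirty inodes whose key is close to it.
The names WritebackQueue, cluster_key and closeness are hypothetical;
cluster_key defaults to the inode number and assumes integer keys, and
a filesystem-supplied hint could be plugged in instead.

```python
from bisect import bisect_left, insort
from collections import deque


class WritebackQueue:
    """Toy model: age-ordered dirty FIFO plus an index sorted by cluster key."""

    def __init__(self, cluster_key=lambda ino: ino, closeness=8):
        self.cluster_key = cluster_key  # hypothetical FS-supplied hint (integer)
        self.closeness = closeness      # hypothetical max key distance to cluster
        self.fifo = deque()             # (dirtied_when, ino), oldest first
        self.index = []                 # sorted list of (key, ino)
        self.dirty = {}                 # ino -> dirtied_when

    def mark_dirty(self, ino, now):
        """Record an inode as dirty; an already-dirty inode keeps its old time."""
        if ino in self.dirty:
            return
        self.dirty[ino] = now
        self.fifo.append((now, ino))
        insort(self.index, (self.cluster_key(ino), ino))

    def next_batch(self, older_than):
        """Return the oldest expired inode together with its dirty neighbours."""
        # Drop FIFO entries whose inode was already written as a neighbour
        # (or written and dirtied again at a later time).
        while self.fifo and self.dirty.get(self.fifo[0][1]) != self.fifo[0][0]:
            self.fifo.popleft()
        if not self.fifo or self.fifo[0][0] > older_than:
            return []
        _, first = self.fifo.popleft()
        key = self.cluster_key(first)
        # Slice of the index whose keys lie in [key - closeness, key + closeness].
        # Inode numbers are assumed non-negative, so -1 sorts before all of them.
        lo = bisect_left(self.index, (key - self.closeness, -1))
        hi = bisect_left(self.index, (key + self.closeness + 1, -1))
        batch = [ino for _, ino in self.index[lo:hi]]
        del self.index[lo:hi]
        for ino in batch:
            del self.dirty[ino]
        return batch
```

Note that neighbours are sent down regardless of their own age, which
is exactly where the kupdate max dirty time and congestion limits
become tricky. A mask-style "closeness" hint can be modelled by passing
something like cluster_key=lambda ino: ino >> 5 with closeness=0, so
that only inodes in the same 32-inode cluster are batched together.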

-fengguang



Thread overview: 34+ messages
     [not found] <20070812091120.189651872@mail.ustc.edu.cn>
2007-08-12  9:11 ` [PATCH 0/6] writeback time order/delay fixes take 3 Fengguang Wu
2007-08-12  9:11 ` Fengguang Wu
2007-08-22  0:23   ` Chris Mason
     [not found]     ` <20070822011841.GA8090@mail.ustc.edu.cn>
2007-08-22  1:18       ` Fengguang Wu
2007-08-22  1:18       ` Fengguang Wu
2007-08-22 12:42         ` Chris Mason
2007-08-23  2:47           ` David Chinner
2007-08-23 12:13             ` Chris Mason
     [not found]               ` <20070824125643.GB7933@mail.ustc.edu.cn>
2007-08-24 12:56                 ` Fengguang Wu [this message]
2007-08-24 12:56                 ` Fengguang Wu
     [not found]           ` <20070824132458.GC7933@mail.ustc.edu.cn>
2007-08-24 13:24             ` Fengguang Wu
2007-08-24 14:36               ` Chris Mason
2007-08-24 13:24             ` Fengguang Wu
2007-08-23  2:33       ` David Chinner
     [not found]         ` <20070824135504.GA9029@mail.ustc.edu.cn>
2007-08-24 13:55           ` Fengguang Wu
2007-08-24 13:55           ` Fengguang Wu
     [not found]           ` <20070828145530.GD61154114@sgi.com>
     [not found]             ` <20070828110820.542bbd67@think.oraclecorp.com>
     [not found]               ` <20070828163308.GE61154114@sgi.com>
     [not found]                 ` <20070829075330.GA5960@mail.ustc.edu.cn>
2007-08-29  7:53                   ` Fengguang Wu
2007-08-29  7:53                   ` Fengguang Wu
     [not found] ` <20070812092052.558804846@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8 Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
     [not found] ` <20070812092052.704326603@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes() Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
     [not found] ` <20070812092052.848213359@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-13  1:03   ` David Chinner
     [not found]     ` <20070813103000.GA8520@mail.ustc.edu.cn>
2007-08-13 10:30       ` Fengguang Wu
2007-08-13 10:30       ` Fengguang Wu
     [not found]       ` <20070817071317.GA8965@mail.ustc.edu.cn>
2007-08-17  7:13         ` Fengguang Wu
     [not found] ` <20070812092052.983296733@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 4/6] check dirty inode list Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
     [not found] ` <20070812092053.113127445@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 5/6] prevent time-ordering warnings Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
     [not found] ` <20070812092053.242474484@mail.ustc.edu.cn>
2007-08-12  9:11   ` [PATCH 6/6] track redirty_tail() calls Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
