From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Chris Mason <chris.mason@oracle.com>
Cc: David Chinner <dgc@sgi.com>, Andrew Morton <akpm@osdl.org>,
Ken Chen <kenchen@google.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3
Date: Fri, 24 Aug 2007 20:56:43 +0800 [thread overview]
Message-ID: <387960203.03917@ustc.edu.cn> (raw)
Message-ID: <20070824125643.GB7933@mail.ustc.edu.cn> (raw)
In-Reply-To: <20070823081341.27807ad0@think.oraclecorp.com>
On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <dgc@sgi.com> wrote:
>
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers. We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning. But perhaps that's
> > > what you're saying...
> > >
> > > At any rate, we've got two types of lists now. One keeps track of
> > > age and the other two keep track of what is currently being
> > > written. I would try two things:
> > >
> > > 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode). Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > >
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started. Any nearby inodes in s_io are
> > > also sent down.
> >
> > the problem with this approach is that it only looks at inode
> > locality. Data locality is ignored completely here and the data for
> > all the inodes that are close together could be splattered all over
> > the drive. In that case, clustering by inode location is exactly the
> > wrong thing to do.
>
> Usually it won't be less wrong than clustering by time.
>
> >
> > For example, XFs changes allocation strategy at 1TB for 32bit inode
> > filesystems which makes the data get placed way away from the inodes.
> > i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
> > by inode number for data writeback is mostly useless in the >1TB
> > case.
>
> I agree we'll want a way to let the FS provide the clustering key. But
> for the first cut on the patch, I would suggest keeping it simple.
>
> >
> > The inode32 for <1Tb and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG) so clustering by inode number
> > might work better here.
> >
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering. This would help
> > the gernic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
>
> Yes, also a good idea after things are working.
>
> >
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > In general, it is a better assumption than sorting by time. It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> >
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
>
> So, my feature creep idea would have been more data clustering. I'm
> mainly trying to solve this graph:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
>
> Where background writing of the block device inode is making ext3 do
> seeky writes while directory trees. My simple idea was to kick
> off a 'I've just written block X' call back to the FS, where it may
> decide to send down dirty chunks of the block device inode that also
> happen to be dirty.
>
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky. So, I wasn't going to
> suggest it until the basic machinery was working.
>
> Fengguang, this isn't a small project ;) But, lots of people will be
> interested in the results.
Exactly, the current writeback logics are unsatisfactory in many ways.
As for writeback clustering, inode/data localities can be different.
But I'll follow your suggestion to start simple first and give the
idea a spin on ext3.
-fengguang
next prev parent reply other threads:[~2007-08-24 12:56 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20070812091120.189651872@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 0/6] writeback time order/delay fixes take 3 Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
2007-08-22 0:23 ` Chris Mason
[not found] ` <20070822011841.GA8090@mail.ustc.edu.cn>
2007-08-22 1:18 ` Fengguang Wu
2007-08-22 1:18 ` Fengguang Wu
2007-08-22 12:42 ` Chris Mason
2007-08-23 2:47 ` David Chinner
2007-08-23 12:13 ` Chris Mason
[not found] ` <20070824125643.GB7933@mail.ustc.edu.cn>
2007-08-24 12:56 ` Fengguang Wu [this message]
2007-08-24 12:56 ` Fengguang Wu
[not found] ` <20070824132458.GC7933@mail.ustc.edu.cn>
2007-08-24 13:24 ` Fengguang Wu
2007-08-24 14:36 ` Chris Mason
2007-08-24 13:24 ` Fengguang Wu
2007-08-23 2:33 ` David Chinner
[not found] ` <20070824135504.GA9029@mail.ustc.edu.cn>
2007-08-24 13:55 ` Fengguang Wu
2007-08-24 13:55 ` Fengguang Wu
[not found] ` <20070828145530.GD61154114@sgi.com>
[not found] ` <20070828110820.542bbd67@think.oraclecorp.com>
[not found] ` <20070828163308.GE61154114@sgi.com>
[not found] ` <20070829075330.GA5960@mail.ustc.edu.cn>
2007-08-29 7:53 ` Fengguang Wu
2007-08-29 7:53 ` Fengguang Wu
[not found] ` <20070812092052.558804846@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8 Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.704326603@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes() Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.848213359@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
2007-08-13 1:03 ` David Chinner
[not found] ` <20070813103000.GA8520@mail.ustc.edu.cn>
2007-08-13 10:30 ` Fengguang Wu
2007-08-13 10:30 ` Fengguang Wu
[not found] ` <20070817071317.GA8965@mail.ustc.edu.cn>
2007-08-17 7:13 ` Fengguang Wu
[not found] ` <20070812092052.983296733@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 4/6] check dirty inode list Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092053.113127445@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 5/6] prevent time-ordering warnings Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092053.242474484@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 6/6] track redirty_tail() calls Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=387960203.03917@ustc.edu.cn \
--to=wfg@mail.ustc.edu.cn \
--cc=akpm@osdl.org \
--cc=chris.mason@oracle.com \
--cc=dgc@sgi.com \
--cc=jens.axboe@oracle.com \
--cc=kenchen@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).