linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Andrew Morton <akpm@osdl.org>, Ken Chen <kenchen@google.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3
Date: Wed, 22 Aug 2007 08:42:01 -0400
Message-ID: <20070822084201.2c4eceb6@think.oraclecorp.com>
In-Reply-To: <387745522.02814@ustc.edu.cn>

On Wed, 22 Aug 2007 09:18:41 +0800
Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > On Sun, 12 Aug 2007 17:11:20 +0800
> > Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > 
> > > Andrew and Ken,
> > > 
> > > Here are some more experiments on the writeback stuff.
> > > Comments are highly welcome~ 
> > 
> > I've been doing benchmarks lately to try and trigger fragmentation,
> > and one of them is a simulation of make -j N.  It takes a list of
> > all the .o files in the kernel tree, randomly sorts them and then
> > creates bogus files with the same names and sizes in clean kernel
> > trees.
> > 
> > This is basically creating a whole bunch of files in random order
> > in a whole bunch of subdirectories.
> > 
> > The results aren't pretty:
> > 
> > http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
> > 
> > The top graph shows one dot for each write over time.  It shows that
> > ext3 is basically writing all over the place the whole time.  But,
> > ext3 actually wins the read phase, so the layout isn't horrible.
> > My guess is that if we introduce some write clustering by sending a
> > group of inodes down at the same time, it'll go much much better.
> > 
> > Andrew has mentioned bringing a few radix trees into the writeback
> > paths before, it seems like file servers and other general uses
> > will benefit from better clustering here.
> > 
> > I'm hoping to talk you into trying it out ;)
> 
> Thank you for the description of problem. So far I have a similar one
> in mind: if we are to delay writeback of atime-dirty-only inodes to
> above 1 hour, some grouping/piggy-backing scenario would be
> beneficial.  (Which I guess does not deserve the complexity now that
> we have Ingo's make-reltime-default patch.)

Good clustering would definitely help any delayed atime writeback
scheme.

> 
> My vague idea is to
> - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
>   queue.
> - convert s_dirty to some radix-tree/rbtree based data structure.
>   It would have dual functions: delayed-writeback and
>   clustered-writeback.
>
> clustered-writeback:
> - Use inode number as clue of locality, hence the key for the sorted
>   tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
>   but do it in the ascending order of inode number instead of
>   ->dirtied_when. 
> 
> delayed-writeback:
> - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
>   dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers.  We want to be able to pick a start
time (right now) and find all the inodes older than that start time.
New things will come in while we're scanning.  But perhaps that's what
you're saying...
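
As a rough illustration of the "pick a start time" idea, here is a
user-space Python sketch (not kernel code). It assumes s_dirty is a FIFO
of (inode number, dirtied_when) pairs, oldest first; the helper name
expired_inodes and the sample timestamps are made up for the example.

```python
from collections import deque

def expired_inodes(s_dirty, start_time):
    """Pop inodes dirtied before start_time from the front of an age-ordered FIFO.

    Hypothetical helper: s_dirty holds (inode_number, dirtied_when) pairs,
    oldest first, so the scan can stop at the first entry that is too new.
    """
    batch = []
    while s_dirty and s_dirty[0][1] < start_time:
        ino, _dirtied_when = s_dirty.popleft()
        batch.append(ino)
    return batch

s_dirty = deque([(12, 100), (7, 105), (30, 110)])
start_time = 107              # "right now", fixed when the scan begins
s_dirty.append((3, 120))      # a concurrent writer dirties another inode
batch = expired_inodes(s_dirty, start_time)
```

Because the cutoff is fixed up front, inodes dirtied during the scan
never extend it; they simply wait for the next pass.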

At any rate, we've got two types of lists now.  s_dirty keeps track of
age, and the other two (s_io and s_more_io) keep track of what is
currently being written.  I would try two things:

1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
indexes by inode number (or some arbitrary field the FS can set in the
inode).  Radix tree tags are used to indicate which things in s_io are
already in progress or are pending (hand waving because I'm not sure
exactly).

Inodes are pulled off s_dirty and the corresponding slot in s_io is
tagged to indicate IO has started.  Any nearby inodes in s_io are also
sent down.
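
One possible reading of option 1, as a user-space Python sketch rather
than kernel code.  The assumptions are loud ones: every dirty inode sits
in both structures, a sorted list stands in for the radix tree, a set
stands in for the "in progress" tag, and "nearby" means within a fixed
inode-number window.  The class and method names are hypothetical.

```python
import bisect
from collections import deque

class ClusteredWriteback:
    """Toy model: age-ordered FIFO plus an inode-number index."""

    def __init__(self, window=8):
        self.s_dirty = deque()    # inode numbers, oldest first
        self.s_io = []            # sorted inode numbers (radix tree stand-in)
        self.in_progress = set()  # stand-in for a radix tree tag
        self.window = window      # assumed meaning of "nearby"

    def mark_dirty(self, ino):
        i = bisect.bisect_left(self.s_io, ino)
        if i < len(self.s_io) and self.s_io[i] == ino:
            return                # already tracked
        self.s_io.insert(i, ino)
        self.s_dirty.append(ino)

    def write_next(self):
        """Start IO on the oldest dirty inode and on its untagged neighbours."""
        # Skip inodes that already went down as someone else's neighbour.
        while self.s_dirty and self.s_dirty[0] in self.in_progress:
            self.s_dirty.popleft()
        if not self.s_dirty:
            return []
        ino = self.s_dirty.popleft()
        lo = bisect.bisect_left(self.s_io, ino - self.window)
        hi = bisect.bisect_right(self.s_io, ino + self.window)
        batch = [i for i in self.s_io[lo:hi] if i not in self.in_progress]
        self.in_progress.update(batch)
        return batch

wb = ClusteredWriteback(window=2)
for ino in [50, 10, 51, 11, 49, 90]:
    wb.mark_dirty(ino)
batches = [wb.write_next() for _ in range(4)]
```

Age still decides which inode goes first, but each batch comes out
sorted by inode number.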

2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
sequence number that corresponds to age.  It is treated as a big
circular indexed list that can wrap around over time.  Radix tree tags
are used both on s_dirty and s_io to flag which inodes are in progress.
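
The wrap-around part of option 2 can be sketched with modular sequence
comparison, similar in spirit to how the kernel compares jiffies values.
This is a user-space Python illustration only; the 32-bit width and the
names seq_before and age_order are assumptions, and the comparison is
only meaningful while live entries span less than half the sequence
space.

```python
SEQ_BITS = 32                      # assumed width of the sequence number
SEQ_MASK = (1 << SEQ_BITS) - 1
HALF = 1 << (SEQ_BITS - 1)

def seq_before(a, b):
    """True if sequence number a was handed out before b, even across a wrap."""
    return a != b and ((b - a) & SEQ_MASK) < HALF

def age_order(seqs, oldest):
    """Sort sequence numbers by age, relative to the oldest live one."""
    return sorted(seqs, key=lambda s: (s - oldest) & SEQ_MASK)

old = 0xFFFFFFF0                   # issued just before the counter wrapped
new = 5                            # issued just after the wrap
wrapped_ok = seq_before(old, new)
ordered = age_order([3, 0xFFFFFFFE, 0, 0xFFFFFFFF], oldest=0xFFFFFFFE)
```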

> 
> Notes:
> (1) I'm not sure inode number is correlated to disk location in
>     filesystems other than ext2/3/4. Or parent dir?

In general, sorting by inode number is a better bet for locality than
sorting by time.  It may make sense to one day let the FS provide a
clustering hint (corresponding to the first block in the file?), but
for starters it makes sense to just go with the inode number.
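
A small Python sketch of what an optional clustering hint could look
like, purely as an illustration.  The Inode fields, the first_block hint
and the writeback_order helper are hypothetical; nothing here is an
existing kernel interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Inode:
    ino: int
    first_block: Optional[int] = None   # hypothetical FS-provided location

def writeback_order(inodes: List[Inode],
                    hint: Optional[Callable[[Inode], int]] = None) -> List[Inode]:
    """Sort by the filesystem's hint if it supplies one, else by inode number."""
    key = hint if hint is not None else (lambda inode: inode.ino)
    return sorted(inodes, key=key)

inodes = [Inode(3, 800), Inode(1, 900), Inode(2, 100)]
by_ino = [i.ino for i in writeback_order(inodes)]
by_hint = [i.ino for i in writeback_order(inodes, hint=lambda i: i.first_block)]
```

The hint is per filesystem here, so keys from two different spaces are
never mixed in one sort.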

> (2) It duplicates some function of elevators. Why is it necessary?
>     Maybe we have no clue on the exact data location at this time?

The elevator can only sort the pending IO, and we send down a
relatively small window of all the dirty pages at a time.  If we sent
down all the dirty pages and let the elevator sort it out, we wouldn't
need this clustering at all.

But, that has other issues ;)
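
To make the window argument concrete, here is a toy Python model (an
assumption-laden sketch, not how any real elevator works): requests
arrive in submission order, only `window` of them are pending at once,
and the elevator sorts each window by block number.  The block numbers
are invented.

```python
def seek_distance(blocks):
    """Total head movement when servicing blocks in the given order."""
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))

def elevator(blocks, window):
    """Sort each consecutive window of requests, as if only that much IO were pending."""
    out = []
    for i in range(0, len(blocks), window):
        out.extend(sorted(blocks[i:i + window]))
    return out

dirty = [900, 10, 500, 20, 910, 510, 30, 920]   # made-up submission order
small = seek_distance(elevator(dirty, 2))          # tiny pending window
full = seek_distance(elevator(dirty, len(dirty)))  # everything sent down at once
```

With the small window the head still bounces across the disk; sorting
before submission is what the clustering is meant to buy.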

-chris


