* [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-12 9:11 UTC
To: Andrew Morton
Cc: Ken Chen, linux-kernel, linux-fsdevel

Andrew and Ken,

Here are some more experiments on the writeback stuff.
Comments are highly welcome~

writeback fixes:

[PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
[PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
[PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

debug code:

[PATCH 4/6] check dirty inode list
[PATCH 5/6] prevent time-ordering warnings
[PATCH 6/6] track redirty_tail() calls

Thank you,
Fengguang
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Chris Mason <chris.mason@oracle.com>
Date: 2007-08-22 0:23 UTC
To: Fengguang Wu
Cc: Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel

On Sun, 12 Aug 2007 17:11:20 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> Andrew and Ken,
>
> Here are some more experiments on the writeback stuff.
> Comments are highly welcome~

I've been doing benchmarks lately to try and trigger fragmentation, and
one of them is a simulation of make -j N. It takes a list of all the .o
files in the kernel tree, randomly sorts them and then creates bogus
files with the same names and sizes in clean kernel trees.

This is basically creating a whole bunch of files in random order in a
whole bunch of subdirectories.

The results aren't pretty:

http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png

The top graph shows one dot for each write over time. It shows that
ext3 is basically writing all over the place the whole time. But, ext3
actually wins the read phase, so the layout isn't horrible. My guess is
that if we introduce some write clustering by sending a group of inodes
down at the same time, it'll go much much better.

Andrew has mentioned bringing a few radix trees into the writeback paths
before; it seems like file servers and other general uses will benefit
from better clustering here.

I'm hoping to talk you into trying it out ;)

-chris
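For reference, the simulation Chris describes amounts to: take a recorded
list of (path, size) pairs, shuffle it, and recreate each file with dummy
contents of the recorded size. A rough standalone sketch of that workload
(illustrative only, not the actual compilebench code; the file list here
is made up, and the directory tree is assumed to exist already):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct entry { const char *path; long size; };

int main(void)
{
	/* in the real benchmark this list comes from the .o files of a
	 * kernel tree; hardcoded here for illustration */
	struct entry list[] = {
		{ "fs/inode.o", 34000 }, { "mm/filemap.o", 52000 },
		{ "fs/buffer.o", 61000 }, { "kernel/fork.o", 47000 },
	};
	int n = sizeof(list) / sizeof(list[0]);
	char buf[4096];

	memset(buf, 0xaa, sizeof(buf));
	srand((unsigned)time(NULL));

	/* Fisher-Yates shuffle: create the files in random order */
	for (int i = n - 1; i > 0; i--) {
		int j = rand() % (i + 1);
		struct entry tmp = list[i];
		list[i] = list[j];
		list[j] = tmp;
	}

	for (int i = 0; i < n; i++) {
		FILE *f = fopen(list[i].path, "w");
		if (!f)
			continue;	/* parent directory missing */
		for (long left = list[i].size; left > 0; left -= 4096) {
			long chunk = left < 4096 ? left : 4096;
			fwrite(buf, 1, (size_t)chunk, f);
		}
		fclose(f);
	}
	return 0;
}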
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-22 1:18 UTC
To: Chris Mason
Cc: Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> [...]
> The top graph shows one dot for each write over time. It shows that
> ext3 is basically writing all over the place the whole time. But, ext3
> actually wins the read phase, so the layout isn't horrible. My guess is
> that if we introduce some write clustering by sending a group of inodes
> down at the same time, it'll go much much better.
>
> Andrew has mentioned bringing a few radix trees into the writeback paths
> before; it seems like file servers and other general uses will benefit
> from better clustering here.
>
> I'm hoping to talk you into trying it out ;)

Thank you for the description of the problem. So far I have a similar one
in mind: if we are to delay writeback of atime-dirty-only inodes to more
than 1 hour, some grouping/piggy-backing scheme would be beneficial.
(Which I guess no longer deserves the complexity now that we have Ingo's
make-reltime-default patch.)

My vague idea is to
- keep s_io/s_more_io as a FIFO/cyclic writeback dispatching queue;
- convert s_dirty to some radix-tree/rbtree based data structure, with
  the dual functions of delayed writeback and clustered writeback.

clustered writeback:
- Use the inode number as a clue to locality, hence as the key for the
  sorted tree.
- Drain some more s_dirty inodes into s_io on every kupdate wakeup, but
  do it in ascending order of inode number instead of ->dirtied_when.

delayed writeback:
- Make sure that a full scan of the s_dirty tree takes <= 30s, i.e.
  dirty_expire_interval.

Notes:
(1) I'm not sure the inode number is correlated with disk location in
    filesystems other than ext2/3/4. Or the parent dir?
(2) This duplicates some function of the elevators. Why is it necessary?
    Maybe because we have no clue about the exact data location at this
    stage?

Fengguang
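To make the clustered-writeback drain concrete, here is a minimal
userspace sketch of the policy described above (toy types, not kernel
code; a real implementation would keep an rbtree or radix tree instead of
sorting an array). Dirty "inodes" are kept ordered by inode number, and
each simulated kupdate wakeup dispatches only the expired ones, in
ascending key order:

#include <stdio.h>
#include <stdlib.h>

struct toy_inode {
	unsigned long ino;		/* locality key, assumed to track disk location */
	unsigned long dirtied_when;	/* "jiffies" when the inode was dirtied */
};

static int cmp_ino(const void *a, const void *b)
{
	const struct toy_inode *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

/* Move expired inodes to s_io in ascending inode-number order. */
static void queue_io_clustered(struct toy_inode *s_dirty, int n,
			       unsigned long now, unsigned long expire)
{
	/* the sorted tree would keep this order for free */
	qsort(s_dirty, n, sizeof(*s_dirty), cmp_ino);
	for (int i = 0; i < n; i++)
		if (now - s_dirty[i].dirtied_when >= expire)
			printf("dispatch ino %lu (age %lu)\n",
			       s_dirty[i].ino, now - s_dirty[i].dirtied_when);
}

int main(void)
{
	struct toy_inode s_dirty[] = {
		{ 900, 100 }, { 12, 250 }, { 305, 90 }, { 17, 400 },
	};
	/* now=500, expire=200: dispatches inos 12, 305, 900 in order;
	 * ino 17 (age 100) is spared until it expires */
	queue_io_clustered(s_dirty, 4, 500, 200);
	return 0;
}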
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Chris Mason <chris.mason@oracle.com>
Date: 2007-08-22 12:42 UTC
To: Fengguang Wu
Cc: Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Wed, 22 Aug 2007 09:18:41 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> Thank you for the description of the problem. So far I have a similar
> one in mind: if we are to delay writeback of atime-dirty-only inodes to
> more than 1 hour, some grouping/piggy-backing scheme would be
> beneficial. (Which I guess no longer deserves the complexity now that
> we have Ingo's make-reltime-default patch.)

Good clustering would definitely help some delayed atime writeback
scheme.

> My vague idea is to
> - keep s_io/s_more_io as a FIFO/cyclic writeback dispatching queue;
> - convert s_dirty to some radix-tree/rbtree based data structure, with
>   the dual functions of delayed writeback and clustered writeback.
>
> clustered writeback:
> - Use the inode number as a clue to locality, hence as the key for the
>   sorted tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup, but
>   do it in ascending order of inode number instead of ->dirtied_when.
>
> delayed writeback:
> - Make sure that a full scan of the s_dirty tree takes <= 30s, i.e.
>   dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers. We want to be able to pick a start time
(right now) and find all the inodes older than that start time. New
things will come in while we're scanning. But perhaps that's what you're
saying...

At any rate, we've got two types of lists now. One keeps track of age
and the other two keep track of what is currently being written. I would
try two things:

1) s_dirty stays a list for FIFO. s_io becomes a radix tree that indexes
by inode number (or some arbitrary field the FS can set in the inode).
Radix tree tags are used to indicate which things in s_io are already in
progress or are pending (hand waving because I'm not sure exactly).

inodes are pulled off s_dirty and the corresponding slot in s_io is
tagged to indicate IO has started. Any nearby inodes in s_io are also
sent down.

2) s_dirty and s_io both become radix trees. s_dirty is indexed by a
sequence number that corresponds to age. It is treated as a big circular
indexed list that can wrap around over time. Radix tree tags are used
both on s_dirty and s_io to flag which inodes are in progress.

> Notes:
> (1) I'm not sure the inode number is correlated with disk location in
>     filesystems other than ext2/3/4. Or the parent dir?

In general, it is a better assumption than sorting by time. It may make
sense to one day let the FS provide a clustering hint (corresponding to
the first block in the file?), but for starters it makes sense to just
go with the inode number.

> (2) This duplicates some function of the elevators. Why is it
>     necessary? Maybe because we have no clue about the exact data
>     location at this stage?

The elevator can only sort the pending IO, and we send down a relatively
small window of all the dirty pages at a time. If we sent down all the
dirty pages and let the elevator sort it out, we wouldn't need this
clustering at all. But, that has other issues ;)

-chris
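Chris's option 1) can be pictured with a toy model: an array indexed by
inode number stands in for the radix tree, and a parallel boolean array
stands in for the "in progress" tag. When IO starts on one inode, its
dirty neighbours within an assumed closeness window are sent down too
(illustrative userspace code, not the kernel radix tree API; CLUSTER is
an invented parameter):

#include <stdio.h>
#include <stdbool.h>

#define NR_INODES 64
#define CLUSTER    8	/* assumed "closeness" window, in inode numbers */

static bool dirty[NR_INODES];		/* slot set: inode sits in s_io */
static bool in_progress[NR_INODES];	/* tag: IO already started */

/* Start IO on @ino and opportunistically send down its neighbours. */
static void writeback_cluster(unsigned ino)
{
	unsigned base = ino - (ino % CLUSTER);

	for (unsigned i = base; i < base + CLUSTER && i < NR_INODES; i++) {
		if (dirty[i] && !in_progress[i]) {
			in_progress[i] = true;
			printf("writeback ino %u (clustered with %u)\n",
			       i, ino);
		}
	}
}

int main(void)
{
	dirty[10] = dirty[12] = dirty[13] = dirty[40] = true;
	writeback_cluster(12);	/* pulls in 10 and 13, leaves 40 alone */
	return 0;
}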
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: David Chinner <dgc@sgi.com>
Date: 2007-08-23 2:47 UTC
To: Chris Mason
Cc: Fengguang Wu, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> I think we should assume a full scan of s_dirty is impossible in the
> presence of concurrent writers. We want to be able to pick a start time
> (right now) and find all the inodes older than that start time. New
> things will come in while we're scanning. But perhaps that's what
> you're saying...
>
> At any rate, we've got two types of lists now. One keeps track of age
> and the other two keep track of what is currently being written. I
> would try two things:
>
> 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> indexes by inode number (or some arbitrary field the FS can set in the
> inode). Radix tree tags are used to indicate which things in s_io are
> already in progress or are pending (hand waving because I'm not sure
> exactly).
>
> inodes are pulled off s_dirty and the corresponding slot in s_io is
> tagged to indicate IO has started. Any nearby inodes in s_io are also
> sent down.

The problem with this approach is that it only looks at inode locality.
Data locality is ignored completely here, and the data for all the
inodes that are close together could be splattered all over the drive.
In that case, clustering by inode location is exactly the wrong thing to
do.

For example, XFS changes allocation strategy at 1TB for 32-bit-inode
filesystems, which makes the data get placed way away from the inodes,
i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering by
inode number for data writeback is mostly useless in the >1TB case.

The inode32 (for <1TB) and inode64 allocators both try to keep data
close to the inode (i.e. in the same AG), so clustering by inode number
might work better here.

Also, it might be worthwhile allowing the filesystem to supply a hint or
mask for "closeness" for inode clustering. This would help the generic
code only try to cluster inode writes to inodes that fall into the same
cluster as the first inode....

> > Notes:
> > (1) I'm not sure the inode number is correlated with disk location
> >     in filesystems other than ext2/3/4. Or the parent dir?
>
> In general, it is a better assumption than sorting by time. It may
> make sense to one day let the FS provide a clustering hint
> (corresponding to the first block in the file?), but for starters it
> makes sense to just go with the inode number.

Perhaps multiple hints are needed - one for data locality and one for
inode cluster locality.

Cheers,
Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
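David's mask idea can be stated in one line: two inodes belong to the
same write cluster iff their inode numbers agree outside the
filesystem-supplied low bits. A hedged sketch (the helper name and mask
value are invented for illustration, not an existing kernel interface):

#include <stdio.h>

/* @mask: low bits that may differ within one cluster, e.g. 0x1f means
 * clusters of 32 consecutive inode numbers. */
static int same_inode_cluster(unsigned long a, unsigned long b,
			      unsigned long mask)
{
	return ((a ^ b) & ~mask) == 0;
}

int main(void)
{
	printf("%d\n", same_inode_cluster(40, 60, 0x1f)); /* 1: both in 32..63 */
	printf("%d\n", same_inode_cluster(40, 70, 0x1f)); /* 0: different cluster */
	return 0;
}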
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Chris Mason <chris.mason@oracle.com>
Date: 2007-08-23 12:13 UTC
To: David Chinner
Cc: Fengguang Wu, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Thu, 23 Aug 2007 12:47:23 +1000 David Chinner <dgc@sgi.com> wrote:

> The problem with this approach is that it only looks at inode locality.
> Data locality is ignored completely here, and the data for all the
> inodes that are close together could be splattered all over the drive.
> In that case, clustering by inode location is exactly the wrong thing
> to do.

Usually it won't be less wrong than clustering by time.

> For example, XFS changes allocation strategy at 1TB for 32-bit-inode
> filesystems, which makes the data get placed way away from the inodes,
> i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering by
> inode number for data writeback is mostly useless in the >1TB case.

I agree we'll want a way to let the FS provide the clustering key. But
for the first cut of the patch, I would suggest keeping it simple.

> The inode32 (for <1TB) and inode64 allocators both try to keep data
> close to the inode (i.e. in the same AG), so clustering by inode number
> might work better here.
>
> Also, it might be worthwhile allowing the filesystem to supply a hint
> or mask for "closeness" for inode clustering. This would help the
> generic code only try to cluster inode writes to inodes that fall into
> the same cluster as the first inode....

Yes, also a good idea once things are working.

> Perhaps multiple hints are needed - one for data locality and one for
> inode cluster locality.

So, my feature creep idea would have been more data clustering. I'm
mainly trying to solve this graph:

http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png

Where background writing of the block device inode is making ext3 do
seeky writes while creating directory trees. My simple idea was to kick
off an "I've just written block X" callback to the FS, where it may
decide to send down chunks of the block device inode that also happen to
be dirty.

But, maintaining the kupdate max dirty time and congestion limits in the
face of all this clustering gets tricky. So, I wasn't going to suggest
it until the basic machinery was working.

Fengguang, this isn't a small project ;) But lots of people will be
interested in the results.

-chris
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-24 12:56 UTC
To: Chris Mason
Cc: David Chinner, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> [...]
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky. So, I wasn't going to
> suggest it until the basic machinery was working.
>
> Fengguang, this isn't a small project ;) But lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways. As
for writeback clustering, inode/data localities can be different. But
I'll follow your suggestion to start simple first and give the idea a
spin on ext3.

-fengguang
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-24 13:24 UTC
To: Chris Mason
Cc: Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> I think we should assume a full scan of s_dirty is impossible in the
> presence of concurrent writers. We want to be able to pick a start time
> (right now) and find all the inodes older than that start time. New
> things will come in while we're scanning. But perhaps that's what
> you're saying...

Yeah, I was thinking about elevators :) Or call it sweeping based on an
address hint (the inode number).

> At any rate, we've got two types of lists now. One keeps track of age
> and the other two keep track of what is currently being written. I
> would try two things:
>
> 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> indexes by inode number [...]
>
> 2) s_dirty and s_io both become radix trees. s_dirty is indexed by a
> sequence number that corresponds to age. It is treated as a big
> circular indexed list that can wrap around over time. Radix tree tags
> are used both on s_dirty and s_io to flag which inodes are in progress.

It's meaningless to convert s_io to a radix tree, because inodes on s_io
will normally be sent to the block layer elevators at the same time.

Also, s_dirty holds 30 seconds of inodes, while s_io holds only 5
seconds. The more inodes, the more chances of good clustering. That's
the general rule.

s_dirty is the right place to do address clustering. As for the
dirty_expire_interval parameter on dirty age, we can apply a simple
rule: do one full scan/sweep over the fs address space every 30s,
syncing all inodes encountered, and sparing those newly dirtied less
than 5s ago. With that rule, any inode will get synced after being
dirtied for 5-35 seconds.

-fengguang
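The 5-35s window follows directly from the sweep geometry: one full pass
takes 30s, and an inode younger than 5s when the sweep arrives is spared
until the next pass, 30s later. A small illustrative model (not kernel
code) that exercises both extremes:

#include <stdio.h>

#define SWEEP_PERIOD 30.0	/* one full pass over the address space, seconds */
#define MIN_AGE       5.0	/* spare inodes dirtied less than 5s ago */

/* When an inode gets synced, given when it was dirtied and when the
 * sweep next reaches its address. */
static double sync_time(double dirtied, double sweep_arrival)
{
	while (sweep_arrival < dirtied + MIN_AGE)
		sweep_arrival += SWEEP_PERIOD;	/* too young: next pass */
	return sweep_arrival;
}

int main(void)
{
	/* worst case: sweep just missed it, age approaches 5 + 30 = 35s */
	printf("age %.1fs\n", sync_time(0.0, 4.999));	/* just under 35s */
	/* best case: sweep arrives exactly at age 5s */
	printf("age %.1fs\n", sync_time(0.0, 5.0));	/* 5.0s */
	return 0;
}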
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Chris Mason <chris.mason@oracle.com>
Date: 2007-08-24 14:36 UTC
To: Fengguang Wu
Cc: Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Fri, 24 Aug 2007 21:24:58 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> It's meaningless to convert s_io to a radix tree, because inodes on
> s_io will normally be sent to the block layer elevators at the same
> time.

Not entirely; using a radix tree instead lets you tag things rather than
doing the current backflips across three lists.

> Also, s_dirty holds 30 seconds of inodes, while s_io holds only 5
> seconds. The more inodes, the more chances of good clustering. That's
> the general rule.
>
> s_dirty is the right place to do address clustering. As for the
> dirty_expire_interval parameter on dirty age, we can apply a simple
> rule: do one full scan/sweep over the fs address space every 30s,
> syncing all inodes encountered, and sparing those newly dirtied less
> than 5s ago. With that rule, any inode will get synced after being
> dirtied for 5-35 seconds.

This gives you O(inodes dirty) behavior instead of the current O(old
inodes). It might not matter, but walking the radix tree is more
expensive than walking a list.

But, I look forward to your patches; we can tune from there.

-chris
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: David Chinner <dgc@sgi.com>
Date: 2007-08-23 2:33 UTC
To: Fengguang Wu
Cc: Chris Mason, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:

> Notes:
> (1) I'm not sure the inode number is correlated with disk location in
>     filesystems other than ext2/3/4. Or the parent dir?

They correspond to the exact location on disk on XFS. But XFS has its
own inode clustering (see xfs_iflush), and it can't be moved up into the
generic layers because of locking and integration into the transaction
subsystem.

> (2) This duplicates some function of the elevators. Why is it
>     necessary?

The elevators have no clue as to how the filesystem might treat adjacent
inodes. In XFS, inode clustering is a fundamental feature of the inode
reading and writing, and that is something no elevator can hope to
achieve....

Cheers,
Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-24 13:55 UTC
To: David Chinner
Cc: Chris Mason, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> They correspond to the exact location on disk on XFS. But XFS has its
> own inode clustering (see xfs_iflush), and it can't be moved up into
> the generic layers because of locking and integration into the
> transaction subsystem.
>
> The elevators have no clue as to how the filesystem might treat
> adjacent inodes. In XFS, inode clustering is a fundamental feature of
> the inode reading and writing, and that is something no elevator can
> hope to achieve....

Thank you. That explains the linear write curve (perfect!) in Chris's
graph.

I wonder if XFS can benefit any more from the generic writeback
clustering. How large would a typical XFS cluster be?

-fengguang
* Re: [PATCH 0/6] writeback time order/delay fixes take 3
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-29 7:53 UTC
To: David Chinner
Cc: Chris Mason, Andrew Morton, Ken Chen, linux-kernel, linux-fsdevel, Jens Axboe

On Wed, Aug 29, 2007 at 02:33:08AM +1000, David Chinner wrote:
> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > On Wed, 29 Aug 2007 00:55:30 +1000 David Chinner <dgc@sgi.com> wrote:
> > > On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > > > [...]
> > > > I wonder if XFS can benefit any more from the generic writeback
> > > > clustering. How large would a typical XFS cluster be?
> > >
> > > Depends on inode size. Typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible)
> > > to make it more generic.
> >
> > When I talked to hch about this, he said the order file data pages
> > got written in XFS was still dictated by the order the higher layers
> > sent things down.
>
> Sure, that's file data. I was talking about the inode writeback, not
> the data writeback.
>
> > Shouldn't the clustering still help to have delalloc done in inode
> > order instead of in whatever random order pdflush sends things down
> > now?
>
> Depends on how things are being allocated. If you've got inode32
> allocation and a >1TB filesystem, then data is nowhere near the inodes.
> If you've got large allocation groups, then data is typically nowhere
> near the inodes, either. If you've got full AGs, data will be nowhere
> near the inodes. If you've got large files and lots of data to write,
> then clustering multiple files together for writing is not needed. So
> in many cases, clustering delalloc writes by inode number doesn't
> provide any better I/O patterns than not clustering...
>
> The only difference we may see is that if we flush all the data on
> inodes in a single cluster, we can get away with a single inode cluster
> write for all of the inodes....

So we end up with two major cases:

- small files: the inode and its data are expected to be close enough,
  hence clustering can help both I_DIRTY_SYNC and I_DIRTY_PAGES;

- big files: the inode and its data may or may not be separated:
  - I_DIRTY_SYNC: could be improved;
  - I_DIRTY_PAGES: no better, no worse (it's big I/O, so the seek cost
    is not relevant in any case).

Conclusion: _inode_ writeback clustering is enough. Isn't it simple? ;-)

Fengguang
* [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
Date: 2007-08-12 9:11 UTC
To: Andrew Morton
Cc: Ken Chen, Andrew Morton, linux-kernel, linux-fsdevel

[-- Attachment #1: inode-dirty-time-ordering-fix.patch --]
[-- Type: text/plain, Size: 5386 bytes --]

Fix the time ordering bug re-introduced by
writeback-fix-periodic-superblock-dirty-inode-flushing.patch.

The old logic moves not-yet-expired dirty inodes from s_dirty to s_io,
*only to* move them back. The move-inodes-back-and-forth thing is a
mess, which is eliminated by this patch.

Note that the line

	list_splice_init(&sb->s_more_io, &sb->s_io);

is also moved to queue_io(). Otherwise, when there are big dirtied
files, s_io never becomes empty, preventing new expired inodes from
getting in.

Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   67 +++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 28 deletions(-)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -118,7 +118,7 @@ void __mark_inode_dirty(struct inode *in
 		goto out;
 
 	/*
-	 * If the inode was already on s_dirty or s_io, don't
+	 * If the inode was already on s_dirty/s_io/s_more_io, don't
 	 * reposition it (that would break s_dirty time-ordering).
 	 */
 	if (!was_dirty) {
@@ -172,6 +172,34 @@ static void requeue_io(struct inode *ino
 }
 
 /*
+ * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
+ */
+static void move_expired_inodes(struct list_head *delaying_queue,
+				struct list_head *dispatch_queue,
+				unsigned long *older_than_this)
+{
+	while (!list_empty(delaying_queue)) {
+		struct inode *inode = list_entry(delaying_queue->prev,
+						struct inode, i_list);
+		if (older_than_this &&
+		    time_after(inode->dirtied_when, *older_than_this))
+			break;
+		list_move(&inode->i_list, dispatch_queue);
+	}
+}
+
+/*
+ * Queue all expired dirty inodes for io, eldest first.
+ */
+static void queue_io(struct super_block *sb,
+		     unsigned long *older_than_this)
+{
+	if (list_empty(&sb->s_io))
+		list_splice_init(&sb->s_more_io, &sb->s_io);
+	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+}
+
+/*
  * Write a single inode's dirty pages and inode data out to disk.
  * If `wait' is set, wait on the writeout.
  *
@@ -221,7 +249,7 @@ __sync_single_inode(struct inode *inode,
 		/*
 		 * We didn't write back all the pages. nfs_writepages()
 		 * sometimes bales out without doing anything. Redirty
-		 * the inode. It is moved from s_io onto s_dirty.
+		 * the inode; Move it from s_io onto s_more_io/s_dirty.
 		 */
 		/*
 		 * akpm: if the caller was the kupdate function we put
@@ -234,10 +262,9 @@ __sync_single_inode(struct inode *inode,
 		 */
 		if (wbc->for_kupdate) {
 			/*
-			 * For the kupdate function we leave the inode
-			 * at the head of sb_dirty so it will get more
-			 * writeout as soon as the queue becomes
-			 * uncongested.
+			 * For the kupdate function we move the inode
+			 * to s_more_io so it will get more writeout as
+			 * soon as the queue becomes uncongested.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
 			requeue_io(inode);
@@ -295,10 +322,10 @@ __writeback_single_inode(struct inode *i
 		/*
 		 * We're skipping this inode because it's locked, and we're not
-		 * doing writeback-for-data-integrity. Move it to the head of
-		 * s_dirty so that writeback can proceed with the other inodes
-		 * on s_io. We'll have another go at writing back this inode
-		 * when the s_dirty iodes get moved back onto s_io.
+		 * doing writeback-for-data-integrity. Move it to s_more_io so
+		 * that writeback can proceed with the other inodes on s_io.
+		 * We'll have another go at writing back this inode when we
+		 * completed a full scan of s_io.
 		 */
 		requeue_io(inode);
@@ -362,10 +389,8 @@ __writeback_single_inode(struct inode *i
 static void
 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
 {
-	const unsigned long start = jiffies;	/* livelock avoidance */
-
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
-		list_splice_init(&sb->s_dirty, &sb->s_io);
+	queue_io(sb, wbc->older_than_this);
+
 	while (!list_empty(&sb->s_io)) {
 		struct inode *inode = list_entry(sb->s_io.prev,
@@ -406,17 +431,6 @@ sync_sb_inodes(struct super_block *sb, s
 			continue;	/* blockdev has wrong queue */
 		}
 
-		/* Was this inode dirtied after sync_sb_inodes was called? */
-		if (time_after(inode->dirtied_when, start))
-			break;
-
-		/* Was this inode dirtied too recently? */
-		if (wbc->older_than_this && time_after(inode->dirtied_when,
-						*wbc->older_than_this)) {
-			list_splice_init(&sb->s_io, sb->s_dirty.prev);
-			break;
-		}
-
 		/* Is another pdflush already flushing this queue? */
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
@@ -446,9 +460,6 @@ sync_sb_inodes(struct super_block *sb, s
 			break;
 	}
 
-	if (list_empty(&sb->s_io))
-		list_splice_init(&sb->s_more_io, &sb->s_io);
-
 	return;		/* Leave any unwritten inodes on s_io */
 }
 
@@ -458,7 +469,7 @@ sync_sb_inodes(struct super_block *sb, s
  * Note:
  * We don't need to grab a reference to superblock here. If it has non-empty
  * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until both the ->s_dirty and ->s_io lists are
+ * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
  * empty. Since __sync_single_inode() regains inode_lock before it finally moves
  * inode from superblock lists we are OK.
 *
* [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
       [not found] ` <20070812092052.704326603@mail.ustc.edu.cn>
@ 2007-08-12  9:11 ` Fengguang Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-12  9:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Ken Chen, Anton Altaparmakov, Andrew Morton, linux-kernel, linux-fsdevel

[-- Attachment #1: nfs-dirty-inodes.patch --]
[-- Type: text/plain, Size: 2520 bytes --]

NTFS's if-condition on dirty inodes is incomplete: with the three-queue
scheme a dirty inode may be sitting on s_io or s_more_io rather than
s_dirty, so testing only s_dirty can miss it. Fix it with the new helper
sb_has_dirty_inodes(), which checks all three lists.

Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c  |   10 +++++++++-
 fs/ntfs/super.c    |    4 ++--
 include/linux/fs.h |    1 +
 3 files changed, 12 insertions(+), 3 deletions(-)

--- linux-2.6.23-rc2-mm2.orig/fs/ntfs/super.c
+++ linux-2.6.23-rc2-mm2/fs/ntfs/super.c
@@ -2381,14 +2381,14 @@ static void ntfs_put_super(struct super_
	 */
	ntfs_commit_inode(vol->mft_ino);
	write_inode_now(vol->mft_ino, 1);
-	if (!list_empty(&sb->s_dirty)) {
+	if (sb_has_dirty_inodes(sb)) {
		const char *s1, *s2;

		mutex_lock(&vol->mft_ino->i_mutex);
		truncate_inode_pages(vol->mft_ino->i_mapping, 0);
		mutex_unlock(&vol->mft_ino->i_mutex);
		write_inode_now(vol->mft_ino, 1);
-		if (!list_empty(&sb->s_dirty)) {
+		if (sb_has_dirty_inodes(sb)) {
			static const char *_s1 = "inodes";
			static const char *_s2 = "";
			s1 = _s1;
--- linux-2.6.23-rc2-mm2.orig/include/linux/fs.h
+++ linux-2.6.23-rc2-mm2/include/linux/fs.h
@@ -1712,6 +1712,7 @@ extern int bdev_read_only(struct block_d
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
+extern int sb_has_dirty_inodes(struct super_block *);

 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -199,6 +199,14 @@ static void queue_io(struct super_block
	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
 }

+int sb_has_dirty_inodes(struct super_block *sb)
+{
+	return !list_empty(&sb->s_dirty) ||
+	       !list_empty(&sb->s_io) ||
+	       !list_empty(&sb->s_more_io);
+}
+EXPORT_SYMBOL(sb_has_dirty_inodes);
+
 /*
  * Write a single inode's dirty pages and inode data out to disk.
  * If `wait' is set, wait on the writeout.
@@ -492,7 +500,7 @@ writeback_inodes(struct writeback_contro
 restart:
	sb = sb_entry(super_blocks.prev);
	for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
-		if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
+		if (sb_has_dirty_inodes(sb)) {
			/* we're making our own get_super here */
			sb->s_count++;
			spin_unlock(&sb_lock);

--
^ permalink raw reply	[flat|nested] 34+ messages in thread
* [PATCH 4/6] check dirty inode list
       [not found] ` <20070812092052.983296733@mail.ustc.edu.cn>
@ 2007-08-12  9:11 ` Fengguang Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-12  9:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Ken Chen, Mike Waychison, Andrew Morton, linux-kernel, linux-fsdevel

[-- Attachment #1: check_dirty_inode_list.patch --]
[-- Type: text/plain, Size: 6004 bytes --]

From: Andrew Morton <akpm@linux-foundation.org>

The per-superblock dirty-inode list super_block.s_dirty is supposed to be
sorted in reverse order of each inode's time-of-first-dirtying. This is so
that the kupdate function can avoid having to walk all the dirty inodes on
the list: it terminates the search as soon as it finds an inode which was
dirtied less than 30 seconds ago (dirty_expire_centisecs).

We have a bunch of several-year-old bugs which cause that list to not be
in the correct reverse-time-order. The result of this is that under
certain obscure circumstances, inodes get stuck and basically never get
written back. It has been reported a couple of times, but nobody really
cared much because most people use ordered-mode journalling filesystems,
which take care of the writeback independently. Plus we will _eventually_
get onto these inodes even when the list is out of order, and a /bin/sync
will still work OK. However this is a pretty important data-integrity
issue for filesystems such as ext2.

As preparation for fixing these bugs, this patch adds a pile of
fantastically expensive debugging code which checks the sanity of the
s_dirty list all over the place, so we find out as soon as it goes bad.

The debugging code is controlled by /proc/sys/fs/inode_debug, which
defaults to off. The debugging will disable itself whenever it detects a
misordering, to avoid log spew. We can remove all this code later.

Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/fs-writeback.c         |   77 ++++++++++++++++++++++++++++++++++++
 include/linux/writeback.h |    1
 kernel/sysctl.c           |    8 +++
 3 files changed, 86 insertions(+)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -24,6 +24,75 @@
 #include <linux/buffer_head.h>
 #include "internal.h"

+int sysctl_inode_debug __read_mostly;
+
+static int __check(struct list_head *head, int print_stuff)
+{
+	struct list_head *cursor = head;
+	unsigned long dirtied_when = 0;
+
+	while ((cursor = cursor->prev) != head) {
+		struct inode *inode = list_entry(cursor, struct inode, i_list);
+		if (print_stuff) {
+			printk("%p:%lu\n", inode, inode->dirtied_when);
+		} else {
+			if (dirtied_when &&
+			    time_before(inode->dirtied_when, dirtied_when))
+				return 1;
+			dirtied_when = inode->dirtied_when;
+		}
+	}
+	return 0;
+}
+
+static void __check_dirty_inode_list(struct super_block *sb,
+		struct inode *inode, const char *file, int line)
+{
+	if (!sysctl_inode_debug)
+		return;
+
+	if (__check(&sb->s_dirty, 0)) {
+		sysctl_inode_debug = 0;
+		if (inode)
+			printk("%s:%d: s_dirty got screwed up. inode=%p:%lu\n",
+				file, line, inode, inode->dirtied_when);
+		else
+			printk("%s:%d: s_dirty got screwed up\n", file, line);
+		__check(&sb->s_dirty, 1);
+	}
+	if (__check(&sb->s_io, 0)) {
+		sysctl_inode_debug = 0;
+		if (inode)
+			printk("%s:%d: s_io got screwed up. inode=%p:%lu\n",
+				file, line, inode, inode->dirtied_when);
+		else
+			printk("%s:%d: s_io got screwed up\n", file, line);
+		__check(&sb->s_io, 1);
+	}
+	if (__check(&sb->s_more_io, 0)) {
+		sysctl_inode_debug = 0;
+		if (inode)
+			printk("%s:%d: s_more_io got screwed up. inode=%p:%lu\n",
+				file, line, inode, inode->dirtied_when);
+		else
+			printk("%s:%d: s_more_io got screwed up\n", file, line);
+		__check(&sb->s_more_io, 1);
+	}
+}
+
+#define check_dirty_inode_list(sb)					\
+	do {								\
+		if (unlikely(sysctl_inode_debug))			\
+			__check_dirty_inode_list(sb, NULL, __FILE__, __LINE__); \
+	} while (0)
+
+#define check_dirty_inode(inode)					\
+	do {								\
+		if (unlikely(sysctl_inode_debug))			\
+			__check_dirty_inode_list(inode->i_sb, inode,	\
+						__FILE__, __LINE__);	\
+	} while (0)
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -122,8 +191,10 @@ void __mark_inode_dirty(struct inode *in
	 * reposition it (that would break s_dirty time-ordering).
	 */
	if (!was_dirty) {
+		check_dirty_inode(inode);
		inode->dirtied_when = jiffies;
		list_move(&inode->i_list, &sb->s_dirty);
+		check_dirty_inode(inode);
	}
 }
 out:
@@ -152,6 +223,7 @@ static void redirty_tail(struct inode *i
 {
	struct super_block *sb = inode->i_sb;

+	check_dirty_inode(inode);
	if (!list_empty(&sb->s_dirty)) {
		struct inode *tail_inode;

@@ -161,6 +233,7 @@ static void redirty_tail(struct inode *i
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_list, &sb->s_dirty);
+	check_dirty_inode(inode);
 }

 /*
@@ -168,7 +241,9 @@ static void redirty_tail(struct inode *i
  */
 static void requeue_io(struct inode *inode)
 {
+	check_dirty_inode(inode);
	list_move(&inode->i_list, &inode->i_sb->s_more_io);
+	check_dirty_inode(inode);
 }

 static void inode_sync_complete(struct inode *inode)
@@ -463,8 +538,10 @@ int generic_sync_sb_inodes(struct super_
		if (!ret)
			ret = err;
		if (wbc->sync_mode == WB_SYNC_HOLD) {
+			check_dirty_inode(inode);
			inode->dirtied_when = jiffies;
			list_move(&inode->i_list, &sb->s_dirty);
+			check_dirty_inode(inode);
		}
		if (current_is_pdflush())
			writeback_release(bdi);
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -140,5 +140,6 @@ void writeback_set_ratelimit(void);

 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
				   read-only. */
+extern int sysctl_inode_debug;

 #endif		/* WRITEBACK_H */
--- linux-2.6.23-rc2-mm2.orig/kernel/sysctl.c
+++ linux-2.6.23-rc2-mm2/kernel/sysctl.c
@@ -1238,6 +1238,14 @@ static struct ctl_table fs_table[] = {
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "inode_debug",
+		.data		= &sysctl_inode_debug,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
	{
		.ctl_name	= CTL_UNNUMBERED,

--
^ permalink raw reply	[flat|nested] 34+ messages in thread
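One subtlety in __check() above: dirtied_when stamps are compared with
time_before() rather than a plain '<', because jiffies wraps around. The
standalone sketch below re-derives that wraparound-safe idiom as
my_time_before(), a simplified stand-in for the kernel macro (which
additionally typechecks its arguments):

#include <stdio.h>

/* the classic wraparound-safe comparison behind time_before():
 * cast the difference to a signed type and test its sign */
#define my_time_before(a, b)	((long)((a) - (b)) < 0)

int main(void)
{
	unsigned long just_before_wrap = (unsigned long)-10; /* ULONG_MAX - 9 */
	unsigned long just_after_wrap = 5;

	/* a plain '<' gets the order backwards across the wrap point */
	printf("naive: %d\n", just_before_wrap < just_after_wrap);	/* 0 */
	printf("safe:  %d\n", my_time_before(just_before_wrap,
					     just_after_wrap));		/* 1 */
	return 0;
}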
* [PATCH 5/6] prevent time-ordering warnings
       [not found] ` <20070812092053.113127445@mail.ustc.edu.cn>
@ 2007-08-12  9:11 ` Fengguang Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-12  9:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, linux-kernel, linux-fsdevel

[-- Attachment #1: dirty-order-fix.patch --]
[-- Type: text/plain, Size: 814 bytes --]

Unconditionally refresh dirtied_when in redirty_tail(). This is just to
keep the inode-list time-ordering check logic happy; it is otherwise
meaningless.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -224,14 +224,7 @@ static void redirty_tail(struct inode *i
	struct super_block *sb = inode->i_sb;

	check_dirty_inode(inode);
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
-
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				   tail_inode->dirtied_when))
-			inode->dirtied_when = jiffies;
-	}
+	inode->dirtied_when = jiffies;
	list_move(&inode->i_list, &sb->s_dirty);
	check_dirty_inode(inode);
 }

--
^ permalink raw reply	[flat|nested] 34+ messages in thread
* [PATCH 6/6] track redirty_tail() calls
       [not found] ` <20070812092053.242474484@mail.ustc.edu.cn>
@ 2007-08-12  9:11 ` Fengguang Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-12  9:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, linux-kernel, linux-fsdevel

[-- Attachment #1: redirty-debug.patch --]
[-- Type: text/plain, Size: 1209 bytes --]

It helps a lot to know from where redirty_tail() gets called.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -210,6 +210,11 @@ static int write_inode(struct inode *ino
	return 0;
 }

+#define redirty_tail(inode)				\
+	do {						\
+		__redirty_tail(inode, __LINE__);	\
+	} while (0)
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -219,11 +224,14 @@ static int write_inode(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void __redirty_tail(struct inode *inode, int line)
 {
	struct super_block *sb = inode->i_sb;

	check_dirty_inode(inode);
+	if (unlikely(sysctl_inode_debug))
+		printk(KERN_DEBUG "redirtied inode %lu line %d\n",
+				inode->i_ino, line);
	inode->dirtied_when = jiffies;
	list_move(&inode->i_list, &sb->s_dirty);
	check_dirty_inode(inode);

--
^ permalink raw reply	[flat|nested] 34+ messages in thread
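The wrapper macro above, which shadows the function name so that every
existing redirty_tail() call site silently passes its own __LINE__, is a
reusable debugging pattern. A standalone toy showing the same technique,
independent of the kernel:

#include <stdio.h>

static void __do_work(int arg, const char *file, int line)
{
	printf("do_work(%d) called from %s:%d\n", arg, file, line);
}

/* shadow the public name: every existing do_work() call now reports
 * its own call site without being edited */
#define do_work(arg)	__do_work(arg, __FILE__, __LINE__)

int main(void)
{
	do_work(1);
	do_work(2);	/* prints a different line number than the call above */
	return 0;
}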
* [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
       [not found] ` <20070812092052.848213359@mail.ustc.edu.cn>
@ 2007-08-12  9:11 ` Fengguang Wu
  2007-08-13  1:03 ` David Chinner
  2 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-12  9:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ken Chen, Andrew Morton, linux-kernel, linux-fsdevel

[-- Attachment #1: no-skipped.patch --]
[-- Type: text/plain, Size: 6253 bytes --]

Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be reproduced with the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097	<== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069	<== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799	<== -1023
[...]
nr_dirty 36781
nr_dirty 35758	<== -1023
[...]
nr_dirty 34708
nr_dirty 33672	<== -1024
[...]
nr_dirty 33692
nr_dirty 32669	<== -1023

% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824	# generated by a debug printk
[  529.263184] redirtied inode 847824 line 548
[  564.250872] redirtied inode 847824 line 548
[  594.272797] redirtied inode 847824 line 548
[  629.231330] redirtied inode 847824 line 548
[  659.224674] redirtied inode 847824 line 548
[  689.219890] redirtied inode 847824 line 548
[  724.226655] redirtied inode 847824 line 548
[  759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:

543                 if (wbc->pages_skipped != pages_skipped) {
544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */
548                         redirty_tail(inode);
549                 }

More debugging shows that __block_write_full_page() never gets the chance
to call submit_bh() for that big dirty file: the buffer head is *clean*.
So basically no page io is issued by __block_write_full_page(), hence
pages_skipped goes up.

This patch fixes the bug. I'm not quite sure about it, but at least the
comment in generic_sync_sb_inodes():

544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */

and the comment in __block_write_full_page():

1713                 /*
1714                  * The page was marked dirty, but the buffers were
1715                  * clean.  Someone wrote them back by hand with
1716                  * ll_rw_block/submit_bh.  A rare case.
1717                  */

do not quite agree with each other: the page writeback is skipped not
because of a 'locked buffer', but because of a 'clean buffer'.

This is the new behavior after the patch:

# bin/test-writeback.sh
nr_dirty 60
847824 /var/x
nr_dirty 60
nr_dirty 31139
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.55338 seconds, 135 MB/s
nr_dirty 47137
nr_dirty 46147
nr_dirty 46147
nr_dirty 46147
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46109
nr_dirty 46109
nr_dirty 46109
nr_dirty 46113
nr_dirty 46113
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46089
nr_dirty 46089
nr_dirty 46090
nr_dirty 46093
nr_dirty 46093
nr_dirty 15
nr_dirty 15
nr_dirty 15
nr_dirty 15

Those are pretty numbers: wait 30s, then write it ALL :)
But another run is not so good:

# sh bin/test-writeback.sh
mount: proc already mounted
nr_dirty 223
nr_dirty 223
nr_dirty 23664
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.51092 seconds, 139 MB/s
nr_dirty 47299
nr_dirty 47271
nr_dirty 47260
nr_dirty 47260
nr_dirty 47267
nr_dirty 47267
nr_dirty 47329
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47606
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47480
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 46470
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 45428
nr_dirty 45435
nr_dirty 45436
nr_dirty 45436
nr_dirty 45436
nr_dirty 257
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16

Basically the writeback amounts are:
- during the dd: ~16M
- after 30s:     ~4M
- after 5s:      ~4M
- after 5s:      ~176M

The box has 2G memory.

Question 1:
How come the 5s delays? I ran 4 tests in total, 2 of which show such
5s delays.

Question 2:
__block_write_full_page() is virtually doing nothing for the whole dirty
file. Isn't that abnormal? Who did the actual writeback for us? The
journal? How to fix it? Any suggestions?

Thank you.

Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/buffer.c |    1 -
 1 file changed, 1 deletion(-)

--- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc2-mm2/fs/buffer.c
@@ -1713,7 +1713,6 @@ done:
		 * The page and buffer_heads can be released at any time from
		 * here on.
		 */
-		wbc->pages_skipped++;	/* We didn't write this page */
	}
	return err;

--
^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
       [not found] ` <20070812092052.848213359@mail.ustc.edu.cn>
@ 2007-08-13  1:03 ` David Chinner
       [not found] ` <20070813103000.GA8520@mail.ustc.edu.cn>
  2 siblings, 1 reply; 34+ messages in thread
From: David Chinner @ 2007-08-13  1:03 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andrew Morton, Ken Chen, Andrew Morton, linux-kernel, linux-fsdevel

On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:
> Basically the writeback amounts are:
> - during the dd: ~16M
> - after 30s:     ~4M
> - after 5s:      ~4M
> - after 5s:      ~176M
>
> The box has 2G memory.
>
> Question 1:
> How come the 5s delays? I ran 4 tests in total, 2 of which show such
> 5s delays.

pdflush runs every five seconds, so that is indicative of the inode being
written once for 1024 pages, and then delayed to the next pdflush run 5s
later. Perhaps the inodes aren't moving between the lists exactly the way
you think they are...

> --- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
> +++ linux-2.6.23-rc2-mm2/fs/buffer.c
> @@ -1713,7 +1713,6 @@ done:
>  		 * The page and buffer_heads can be released at any time from
>  		 * here on.
>  		 */
> -		wbc->pages_skipped++;	/* We didn't write this page */
>  	}
>  	return err;

Hmmmm - I suspect XFS is going to need a similar fix as well. I'm moving
house so I'm not going to get a chance to look at this for a week...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
       [not found] ` <20070813103000.GA8520@mail.ustc.edu.cn>
@ 2007-08-13 10:30 ` Fengguang Wu
       [not found] ` <20070817071317.GA8965@mail.ustc.edu.cn>
  2 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2007-08-13 10:30 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, Ken Chen, Andrew Morton, linux-kernel, linux-fsdevel

On Mon, Aug 13, 2007 at 11:03:21AM +1000, David Chinner wrote:
> > --- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
> > +++ linux-2.6.23-rc2-mm2/fs/buffer.c
> > @@ -1713,7 +1713,6 @@ done:
> >  		 * The page and buffer_heads can be released at any time from
> >  		 * here on.
> >  		 */
> > -		wbc->pages_skipped++;	/* We didn't write this page */
> >  	}
> >  	return err;
>
> Hmmmm - I suspect XFS is going to need a similar fix as well. I'm moving
> house so I'm not going to get a chance to look at this for a week...

I guess as long as the skipped page no longer has the dirty bit set both
as a page flag and as a radix tree tag (i.e. it has been put to io by
someone else), it is OK not to increase wbc->pages_skipped.

> On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:
> > Basically the writeback amounts are:
> > - during the dd: ~16M
> > - after 30s:     ~4M
> > - after 5s:      ~4M
> > - after 5s:      ~176M
> >
> > The box has 2G memory.
> >
> > Question 1:
> > How come the 5s delays? I ran 4 tests in total, 2 of which show such
> > 5s delays.
>
> pdflush runs every five seconds, so that is indicative of the inode being
> written once for 1024 pages, and then delayed to the next pdflush run 5s
> later. Perhaps the inodes aren't moving between the lists exactly the way
> you think they are...

Now I have figured out the exact situation. When the scan of s_io finishes
with some small inodes, nr_to_write will be positive, fooling kupdate into
quitting prematurely. But in fact the big dirty file is on s_more_io
waiting for more io... The attached patch fixes it.

Fengguang
===

Subject: writeback: introduce writeback_control.more_io to indicate more io

After making dirty a 100M file, the normal behavior is to start the
writeback for all data after 30s delays. But sometimes the following
happens instead:

	- after 30s:	~4M
	- after 5s:	~4M
	- after 5s:	all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:

		s_io		s_more_io
		-------------------------
	1)	100M,1K		0
	2)	1K		96M
	3)	0		96M

	1) initial state with a 100M file and a 1K file
	2) 4M written, nr_to_write <= 0, so write more
	3) 1K written, nr_to_write > 0, no more writes (BUG)

nr_to_write > 0 in 3) fools the upper layer into thinking that all data
have been written out. But the big dirty file is still sitting in
s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty. It is also not an option to
draw inodes from both s_more_io and s_dirty, and let the loop go on: this
might lead to live locks, and might also starve other superblocks in sync
time (well, kupdate may still starve some superblocks, but that's another
bug).

So we have to return when a full scan of s_io completes. Hence
nr_to_write > 0 does not necessarily mean that "all data are written".
This patch introduces a flag writeback_control.more_io to indicate this
situation. With it the big dirty file no longer has to wait for the next
kupdate invocation 5s later.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c         |    2 ++
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |    9 ++++++---
 3 files changed, 9 insertions(+), 3 deletions(-)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
		if (wbc->nr_to_write <= 0)
			break;
	}
+	if (!list_empty(&sb->s_more_io))
+		wbc->more_io = 1;
	spin_unlock(&inode_lock);
	return ret;		/* Leave any unwritten inodes on s_io */
 }
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -61,6 +61,7 @@ struct writeback_control {
	unsigned for_reclaim:1;		/* Invoked from the page allocator */
	unsigned for_writepages:1;	/* This is a writepages() call */
	unsigned range_cyclic:1;	/* range_start is cyclic */
+	unsigned more_io:1;		/* more io to be dispatched */

	void *fs_private;		/* For use by ->writepages() */
 };
--- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
@@ -382,6 +382,7 @@ static void background_writeout(unsigned
				global_page_state(NR_UNSTABLE_NFS) < background_thresh
				&& min_pages <= 0)
			break;
+		wbc.more_io = 0;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		wbc.pages_skipped = 0;
@@ -389,8 +390,9 @@ static void background_writeout(unsigned
		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
			/* Wrote less than expected */
-			congestion_wait(WRITE, HZ/10);
-			if (!wbc.encountered_congestion)
+			if (wbc.encountered_congestion)
+				congestion_wait(WRITE, HZ/10);
+			else if (!wbc.more_io)
				break;
		}
	}
@@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
			global_page_state(NR_UNSTABLE_NFS) +
			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
	while (nr_to_write > 0) {
+		wbc.more_io = 0;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion)
				congestion_wait(WRITE, HZ/10);
-			else
+			else if (!wbc.more_io)
				break;	/* All the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;

^ permalink raw reply	[flat|nested] 34+ messages in thread
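To make the control-flow change in wb_kupdate() concrete, here is a toy
model of the retry loop with the 100M + 1K scenario from the changelog
above. The queueing is deliberately simplified to two array slots instead
of inode lists, so it only illustrates why the old exit test (a positive
nr_to_write read as "all old data written") went wrong once s_more_io
existed:

#include <stdio.h>

#define MAX_WRITEBACK_PAGES 1024

/* toy queues: dirty pages per file; slot 0 is a 100MB file (25600 4K
 * pages), slot 1 a 1KB file; 0 means nothing queued in that slot */
static long s_io[2]      = { 25600, 1 };
static long s_more_io[2] = { 0, 0 };

static int empty(const long *q)
{
	return q[0] == 0 && q[1] == 0;
}

/* refill s_io from s_more_io only after a full scan of s_io finished */
static void queue_io(void)
{
	int i;

	if (!empty(s_io))
		return;
	for (i = 0; i < 2; i++) {
		s_io[i] = s_more_io[i];
		s_more_io[i] = 0;
	}
}

/* one writeback_inodes() call: write from s_io until the quota runs out,
 * park unfinished files on s_more_io, report whether more io is pending */
static long writeback_inodes(long nr_to_write, int *more_io)
{
	int i;

	queue_io();
	for (i = 0; i < 2 && nr_to_write > 0; i++) {
		long chunk = s_io[i] < nr_to_write ? s_io[i] : nr_to_write;

		s_io[i] -= chunk;
		nr_to_write -= chunk;
		if (s_io[i] > 0) {	/* quota exhausted: requeue */
			s_more_io[i] = s_io[i];
			s_io[i] = 0;
		}
	}
	*more_io = !empty(s_more_io);
	return nr_to_write;
}

int main(void)
{
	int more_io, rounds = 0;
	long left;

	do {
		left = writeback_inodes(MAX_WRITEBACK_PAGES, &more_io);
		rounds++;
		if (left > 0 && more_io)
			printf("round %d: nr_to_write=%ld left but more_io "
			       "set; the old exit test would stop here\n",
			       rounds, left);
	} while (left <= 0 || more_io);	/* final idle round confirms clean */

	printf("all clean after %d back-to-back rounds\n", rounds);
	return 0;
}

In round 2 the toy reaches exactly the BUG state from the changelog: the
1K file is written, nr_to_write stays positive, yet ~96M of the big file
sit parked on s_more_io. The old test would break out there, stranding
that data until the next kupdate run; the more_io flag keeps the loop
going until everything is flushed.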
* Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() [not found] ` <20070817071317.GA8965@mail.ustc.edu.cn> @ 2007-08-17 7:13 ` Fengguang Wu 0 siblings, 0 replies; 34+ messages in thread From: Fengguang Wu @ 2007-08-17 7:13 UTC (permalink / raw) To: David Chinner, Andrew Morton, Ken Chen, Andrew Morton, linux-kernel, linux-fsdevel On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote: > > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote: > > > Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug: > > > Basicly they are > > > - during the dd: ~16M > > > - after 30s: ~4M > > > - after 5s: ~4M > > > - after 5s: ~176M > > > > > > The box has 2G memory. > > > > > > Question 1: > > > How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays. > > > > pdflush runs every five seconds, so that is indicative of the inode being > > written once for 1024 pages, and then delayed to the next pdflush run 5s later. > > perhaps the inodes aren't moving between the lists exactly the way you > > think they are... > > Now I figured out the exact situation. When the scan of s_io finishes > with some small inodes, nr_to_write will be positive, fooling kupdate > to quit prematurely. But in fact the big dirty file is on s_more_io > waiting for more io... The attached patch fixes it. > > Fengguang > === > > Subject: writeback: introduce writeback_control.more_io to indicate more io > > After making dirty a 100M file, the normal behavior is to > start the writeback for all data after 30s delays. But > sometimes the following happens instead: > > - after 30s: ~4M > - after 5s: ~4M > - after 5s: all remaining 92M > > Some analyze shows that the internal io dispatch queues goes like this: > > s_io s_more_io > ------------------------- > 1) 100M,1K 0 > 2) 1K 96M > 3) 0 96M > > 1) initial state with a 100M file and a 1K file > 2) 4M written, nr_to_write <= 0, so write more > 3) 1K written, nr_to_write > 0, no more writes(BUG) > > nr_to_write > 0 in 3) fools the upper layer to think that data have all been > written out. Bug the big dirty file is still sitting in s_more_io. We cannot > simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the > loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes > in s_dirty. It is also not an option to draw inodes from both s_more_io and > s_dirty, an let the loop go on: this might lead to live locks, and might also > starve other superblocks in sync time(well kupdate may still starve some > superblocks, that's another bug). > > So we have to return when a full scan of s_io completes. So nr_to_write > 0 does > not necessarily mean that "all data are written". This patch introduces a flag > writeback_control.more_io to indicate this situation. With it the big dirty file > no longer has to wait for the next kupdate invocation 5s later. Sorry, this patch is found to be dangerous. It locks up my desktop on heavy I/O: kupdate *immediately* returns to push the file in s_more_io for writeback, but it *could* still not able to make progress(locks etc.). Now kupdate ends up *busy looping*. That could be fixed by wait for somehow 100ms and retry the io. Should we do it?(or: Is 5s interval considered too long a wait?) 
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  fs/fs-writeback.c         |    2 ++
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 9 insertions(+), 3 deletions(-)
>
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
>  		if (wbc->nr_to_write <= 0)
>  			break;
>  	}
> +	if (!list_empty(&sb->s_more_io))
> +		wbc->more_io = 1;
>  	spin_unlock(&inode_lock);
>  	return ret;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
> +++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
> @@ -61,6 +61,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned for_writepages:1;	/* This is a writepages() call */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
>
>  	void *fs_private;		/* For use by ->writepages() */
>  };
> --- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
>  		    global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  				&& min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
> -			congestion_wait(WRITE, HZ/10);
> -			if (!wbc.encountered_congestion)
> +			if (wbc.encountered_congestion)
> +				congestion_wait(WRITE, HZ/10);
> +			else if (!wbc.more_io)
>  				break;
>  		}
>  	}
> @@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
>  			global_page_state(NR_UNSTABLE_NFS) +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
>  	while (nr_to_write > 0) {
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		writeback_inodes(&wbc);
>  		if (wbc.nr_to_write > 0) {
>  			if (wbc.encountered_congestion)
>  				congestion_wait(WRITE, HZ/10);
> -			else
> +			else if (!wbc.more_io)
>  				break;	/* All the old data is written */
>  		}
>  		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;

^ permalink raw reply	[flat|nested] 34+ messages in thread
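[For what it's worth, the "wait ~100ms and retry" idea raised in the
message above could take a form like the following in the wb_kupdate()
loop. This is an untested sketch on top of the patch quoted above, not a
proposed commit; it reuses congestion_wait(WRITE, HZ/10) as the backoff
primitive only because it already appears in this loop, and whether that
primitive and that interval are right is exactly the open question. The
"@@ ... @@" hunk header is a placeholder.]

--- linux-2.6.23-rc2-mm2/mm/page-writeback.c
+++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
@@ ... @@ static void wb_kupdate(unsigned long arg)
 		if (wbc.nr_to_write > 0) {
 			if (wbc.encountered_congestion)
 				congestion_wait(WRITE, HZ/10);
-			else if (!wbc.more_io)
-				break;	/* All the old data is written */
+			else if (wbc.more_io)
+				/*
+				 * More io is pending but this pass made no
+				 * progress (inodes locked etc.): back off
+				 * for ~100ms instead of busy looping.
+				 */
+				congestion_wait(WRITE, HZ/10);
+			else
+				break;	/* All the old data is written */
 		}

[With something like this the worst case degrades to a periodic retry
rather than a lockup, at the cost of wb_kupdate() possibly staying in the
loop until the blocked inode becomes writable again.]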
Thread overview: 34+ messages
-- links below jump to the message on this page --
[not found] <20070812091120.189651872@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 0/6] writeback time order/delay fixes take 3 Fengguang Wu
2007-08-22 0:23 ` Chris Mason
[not found] ` <20070822011841.GA8090@mail.ustc.edu.cn>
2007-08-22 1:18 ` Fengguang Wu
2007-08-22 12:42 ` Chris Mason
2007-08-23 2:47 ` David Chinner
2007-08-23 12:13 ` Chris Mason
[not found] ` <20070824125643.GB7933@mail.ustc.edu.cn>
2007-08-24 12:56 ` Fengguang Wu
2007-08-24 12:56 ` Fengguang Wu
[not found] ` <20070824132458.GC7933@mail.ustc.edu.cn>
2007-08-24 13:24 ` Fengguang Wu
2007-08-24 13:24 ` Fengguang Wu
2007-08-24 14:36 ` Chris Mason
2007-08-22 1:18 ` Fengguang Wu
2007-08-23 2:33 ` David Chinner
[not found] ` <20070824135504.GA9029@mail.ustc.edu.cn>
2007-08-24 13:55 ` Fengguang Wu
2007-08-24 13:55 ` Fengguang Wu
[not found] ` <20070828145530.GD61154114@sgi.com>
[not found] ` <20070828110820.542bbd67@think.oraclecorp.com>
[not found] ` <20070828163308.GE61154114@sgi.com>
[not found] ` <20070829075330.GA5960@mail.ustc.edu.cn>
2007-08-29 7:53 ` Fengguang Wu
2007-08-29 7:53 ` Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.558804846@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8 Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.704326603@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes() Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.983296733@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 4/6] check dirty inode list Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092053.113127445@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 5/6] prevent time-ordering warnings Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092053.242474484@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 6/6] track redirty_tail() calls Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
[not found] ` <20070812092052.848213359@mail.ustc.edu.cn>
2007-08-12 9:11 ` [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() Fengguang Wu
2007-08-12 9:11 ` Fengguang Wu
2007-08-13 1:03 ` David Chinner
[not found] ` <20070813103000.GA8520@mail.ustc.edu.cn>
2007-08-13 10:30 ` Fengguang Wu
2007-08-13 10:30 ` Fengguang Wu
[not found] ` <20070817071317.GA8965@mail.ustc.edu.cn>
2007-08-17 7:13 ` Fengguang Wu