[PATCH] Fix ext2 error reporting on fsync

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Fix ext2 error reporting on fsync
@ 2006-01-11 17:43 Jan Kara
  2006-01-12  4:24 ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-11 17:43 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 380 bytes --]

  Hello,

  attached patch (from Chris Mason) fixes possible bug in error
reporting during fsync on ext2 filesystem. If some previous metadata
async write failed with EIO, it just set AS_EIO in the device's mapping.
We need to check this in ext2_file_write() and act accordingly. The
patch is against 2.6.15. Please apply.

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: ext2-sync_check-2.6.15.diff --]
[-- Type: text/plain, Size: 1831 bytes --]

From: Chris Mason <mason@suse.com>
Subject: ext2 should force the FS readonly for metadata write errors
References: 65718

During fsync we should check for write errors to the block device in order to make
sure all metadata writes have been properly written to the disk.  Without this check
writes that happen through the normal async mechanisms might hit errors without
reporting them back to the application.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.15/fs/ext2/fsync.c linux-2.6.15-1-ext2_sync_check/fs/ext2/fsync.c
--- linux-2.6.15/fs/ext2/fsync.c	2004-10-18 23:55:24.000000000 +0200
+++ linux-2.6.15-1-ext2_sync_check/fs/ext2/fsync.c	2006-01-13 22:56:43.000000000 +0100
@@ -25,6 +25,7 @@
 #include "ext2.h"
 #include <linux/smp_lock.h>
 #include <linux/buffer_head.h>		/* for fsync_inode_buffers() */
+#include <linux/pagemap.h>
 
 
 /*
@@ -35,10 +36,26 @@
 int ext2_sync_file(struct file *file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
 	int err;
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
+
+	/* it might make more sense to ext2_error on -EIO from
+	 * sync_mapping_buffers as well, but those errors are isolated to just
+	 * this file. We can safely return -EIO to fsync and let the app know
+	 * they have a problem.  
+	 *
+	 * AS_EIO indicates a failure to write a metadata page, but we have no 
+	 * way of knowing which one.  It's best to force readonly and let fsck 
+	 * figure it all out.
+	 */
+	if (test_and_clear_bit(AS_EIO, &sb->s_bdev->bd_inode->i_mapping->flags)) {
+		ext2_error(sb, "ext2_sync_file", "metadata io error");
+		if (!ret)
+			ret = -EIO;
+	}
 	if (!(inode->i_state & I_DIRTY))
 		return ret;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-11 17:43 [PATCH] Fix ext2 error reporting on fsync Jan Kara
@ 2006-01-12  4:24 ` Andrew Morton
  2006-01-12 10:20   ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-01-12  4:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Chris Mason

Jan Kara <jack@suse.cz> wrote:
>
>   Hello,
> 
>   attached patch (from Chris Mason) fixes possible bug in error
> reporting during fsync on ext2 filesystem. If some previous metadata
> async write failed with EIO, it just set AS_EIO in the device's mapping.
> We need to check this in ext2_file_write() and act accordingly. The
> patch is against 2.6.15. Please apply.
> 
> ...
> @@ -35,10 +36,26 @@
>  int ext2_sync_file(struct file *file, struct dentry *dentry, int datasync)
>  {
>  	struct inode *inode = dentry->d_inode;
> +	struct super_block *sb = inode->i_sb;
>  	int err;
>  	int ret;
>  
>  	ret = sync_mapping_buffers(inode->i_mapping);
> +
> +	/* it might make more sense to ext2_error on -EIO from
> +	 * sync_mapping_buffers as well, but those errors are isolated to just
> +	 * this file. We can safely return -EIO to fsync and let the app know
> +	 * they have a problem.  
> +	 *
> +	 * AS_EIO indicates a failure to write a metadata page, but we have no 
> +	 * way of knowing which one.  It's best to force readonly and let fsck 
> +	 * figure it all out.
> +	 */
> +	if (test_and_clear_bit(AS_EIO, &sb->s_bdev->bd_inode->i_mapping->flags)) {
> +		ext2_error(sb, "ext2_sync_file", "metadata io error");
> +		if (!ret)
> +			ret = -EIO;
> +	}
>  	if (!(inode->i_state & I_DIRTY))
>  		return ret;
>  	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> 

This is desperation stuff, isn't it?

It would be better to stick a BH_WriteError into buffer_head.b_state, set
that in the I/O completion handler, check it in fsync_buffers_list()?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12  4:24 ` Andrew Morton
@ 2006-01-12 10:20   ` Jan Kara
  2006-01-12 11:52     ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-12 10:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Chris Mason

> Jan Kara <jack@suse.cz> wrote:
> >
  <snip>
> > +	 */
> > +	if (test_and_clear_bit(AS_EIO, &sb->s_bdev->bd_inode->i_mapping->flags)) {
> > +		ext2_error(sb, "ext2_sync_file", "metadata io error");
> > +		if (!ret)
> > +			ret = -EIO;
> > +	}
> >  	if (!(inode->i_state & I_DIRTY))
> >  		return ret;
> >  	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > 
> 
> This is desperation stuff, isn't it?
> 
> It would be better to stick a BH_WriteError into buffer_head.b_state, set
> that in the I/O completion handler, check it in fsync_buffers_list()?
  It actually depends on the desired behaviour of fsync() in case of
IO errors. Consider the following scenario:
  * process writes some data
    - async writing starts in background and gets an IO error on some
      metadata buffer
  * process calls fsync()
    - all the remaining buffers are written without problems

  If I understand your solution correctly, it need not report any error
in this case as buffer with IO error could be already removed from
mapping->private_list by try_to_free_buffers(). But it would be nice to
report to the user that we were not able to write all the data...

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 10:20   ` Jan Kara
@ 2006-01-12 11:52     ` Andrew Morton
  2006-01-12 13:58       ` Chris Mason
  2006-01-12 14:26       ` Jan Kara
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2006-01-12 11:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, mason

Jan Kara <jack@suse.cz> wrote:
>
> > Jan Kara <jack@suse.cz> wrote:
> > >
>   <snip>
> > > +	 */
> > > +	if (test_and_clear_bit(AS_EIO, &sb->s_bdev->bd_inode->i_mapping->flags)) {
> > > +		ext2_error(sb, "ext2_sync_file", "metadata io error");
> > > +		if (!ret)
> > > +			ret = -EIO;
> > > +	}
> > >  	if (!(inode->i_state & I_DIRTY))
> > >  		return ret;
> > >  	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > 
> > 
> > This is desperation stuff, isn't it?
> > 
> > It would be better to stick a BH_WriteError into buffer_head.b_state, set
> > that in the I/O completion handler, check it in fsync_buffers_list()?
>   It actually depends on the desired behaviour of fsync() in case of
> IO errors. Consider the following scenario:
>   * process writes some data
>     - async writing starts in background and gets an IO error on some
>       metadata buffer
>   * process calls fsync()
>     - all the remaining buffers are written without problems
> 
>   If I understand your solution correctly, it need not report any error
> in this case as buffer with IO error could be already removed from
> mapping->private_list by try_to_free_buffers(). But it would be nice to
> report to the user that we were not able to write all the data...
> 

hm, yes, try_to_free_buffers() against a blockdev page is a problem.  But I
think it's fixable.

In the second loop of drop_buffers(), if we find a bh which had a write
error we need to set AS_EIO on the address_space which is interested in
this buffer.

That address_space is pinned by a) the fact that it has buffers at
->private_list and b) we hold the blockdev mapping's private_lock.

The only problem is actually _finding_ the address_space which is
interested in this buffer_head.  Looks like we'd need a backpointer in the
buffer_head.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 11:52     ` Andrew Morton
@ 2006-01-12 13:58       ` Chris Mason
  2006-01-12 14:21         ` Jan Kara
  2006-01-12 14:26       ` Jan Kara
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Mason @ 2006-01-12 13:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel

On Thursday 12 January 2006 06:52, Andrew Morton wrote:

> hm, yes, try_to_free_buffers() against a blockdev page is a problem.  But I
> think it's fixable.
>
> In the second loop of drop_buffers(), if we find a bh which had a write
> error we need to set AS_EIO on the address_space which is interested in
> this buffer.
>
> That address_space is pinned by a) the fact that it has buffers at
> ->private_list and b) we hold the blockdev mapping's private_lock.
>
> The only problem is actually _finding_ the address_space which is
> interested in this buffer_head.  Looks like we'd need a backpointer in the
> buffer_head.

Hmmm, how does this work with two concurrent fsyncs on different files?

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 13:58       ` Chris Mason
@ 2006-01-12 14:21         ` Jan Kara
  2006-01-12 15:36           ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-12 14:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, Jan Kara, linux-fsdevel

> On Thursday 12 January 2006 06:52, Andrew Morton wrote:
> 
> > hm, yes, try_to_free_buffers() against a blockdev page is a problem.  But I
> > think it's fixable.
> >
> > In the second loop of drop_buffers(), if we find a bh which had a write
> > error we need to set AS_EIO on the address_space which is interested in
> > this buffer.
> >
> > That address_space is pinned by a) the fact that it has buffers at
> > ->private_list and b) we hold the blockdev mapping's private_lock.
> >
> > The only problem is actually _finding_ the address_space which is
> > interested in this buffer_head.  Looks like we'd need a backpointer in the
> > buffer_head.
> 
> Hmmm, how does this work with two concurrent fsyncs on different files?
  Where is the problem? Different files will have different buffers in
its private_list lists, different address_space structures etc. So they
should not interact in any way. Or am I missing something?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 11:52     ` Andrew Morton
  2006-01-12 13:58       ` Chris Mason
@ 2006-01-12 14:26       ` Jan Kara
  2006-01-12 20:47         ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-12 14:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mason

> Jan Kara <jack@suse.cz> wrote:
> 
> hm, yes, try_to_free_buffers() against a blockdev page is a problem.  But I
> think it's fixable.
> 
> In the second loop of drop_buffers(), if we find a bh which had a write
> error we need to set AS_EIO on the address_space which is interested in
> this buffer.
> 
> That address_space is pinned by a) the fact that it has buffers at
> ->private_list and b) we hold the blockdev mapping's private_lock.
> 
> The only problem is actually _finding_ the address_space which is
> interested in this buffer_head.  Looks like we'd need a backpointer in the
> buffer_head.
  Yes, the pointer seems to be inevitable in this solution. If you think
the improvement is worth adding 4 bytes to each buffer_head, then I can
write the patch.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 14:21         ` Jan Kara
@ 2006-01-12 15:36           ` Chris Mason
  2006-01-12 16:32             ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2006-01-12 15:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel

On Thursday 12 January 2006 09:21, Jan Kara wrote:
>
> > Hmmm, how does this work with two concurrent fsyncs on different files?
>
>   Where is the problem? Different files will have different buffers in
> its private_list lists, different address_space structures etc. So they
> should not interact in any way. Or am I missing something?
>

I thought ext2 was using mark_buffer_dirty_inode for the bitmap blocks as 
well.  Checking again, it's only using it for the direct/indirect blocks.  
So, Andrew's idea will work fine.

I suppose it's up for debate if a failure writing other metadata should fail 
the fsync.  I would say yes, but there are valid reasons to say no.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 15:36           ` Chris Mason
@ 2006-01-12 16:32             ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2006-01-12 16:32 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, linux-fsdevel

> On Thursday 12 January 2006 09:21, Jan Kara wrote:
> >
> > > Hmmm, how does this work with two concurrent fsyncs on different files?
> >
> >   Where is the problem? Different files will have different buffers in
> > its private_list lists, different address_space structures etc. So they
> > should not interact in any way. Or am I missing something?
> >
> 
> I thought ext2 was using mark_buffer_dirty_inode for the bitmap blocks as 
> well.  Checking again, it's only using it for the direct/indirect blocks.  
> So, Andrew's idea will work fine.
> 
> I suppose it's up for debate if a failure writing other metadata should fail 
> the fsync.  I would say yes, but there are valid reasons to say no.
  Hmm, the case of bitmaps is really questionable. In case of inode it
would be good to return EIO. And you're right that that is not covered
by checking the private_list.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 14:26       ` Jan Kara
@ 2006-01-12 20:47         ` Andrew Morton
  2006-01-13  2:08           ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-01-12 20:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, mason

Jan Kara <jack@suse.cz> wrote:
>
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > hm, yes, try_to_free_buffers() against a blockdev page is a problem.  But I
> > think it's fixable.
> > 
> > In the second loop of drop_buffers(), if we find a bh which had a write
> > error we need to set AS_EIO on the address_space which is interested in
> > this buffer.
> > 
> > That address_space is pinned by a) the fact that it has buffers at
> > ->private_list and b) we hold the blockdev mapping's private_lock.
> > 
> > The only problem is actually _finding_ the address_space which is
> > interested in this buffer_head.  Looks like we'd need a backpointer in the
> > buffer_head.
>   Yes, the pointer seems to be inevitable in this solution. If you think
> the improvement is worth adding 4 bytes to each buffer_head, then I can
> write the patch.

hm.  It only affects what are now rarely-used filesystems like ext2, minix,
etc.  Not sure about reiser3.  But it is a strict correctness issue.  I
suppose we should do it.  It'd be nice to find a way to avoid increasing
the bh size..

It'll be a hard patch to test.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-12 20:47         ` Andrew Morton
@ 2006-01-13  2:08           ` Chris Mason
  2006-01-13  2:16             ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2006-01-13  2:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel

On Thursday 12 January 2006 15:47, Andrew Morton wrote:
> Jan Kara <jack@suse.cz> wrote:
> > > Jan Kara <jack@suse.cz> wrote:
> > >
> > > hm, yes, try_to_free_buffers() against a blockdev page is a problem. 
> > > But I think it's fixable.
> > >
> > > In the second loop of drop_buffers(), if we find a bh which had a write
> > > error we need to set AS_EIO on the address_space which is interested in
> > > this buffer.
> > >
> > > That address_space is pinned by a) the fact that it has buffers at
> > > ->private_list and b) we hold the blockdev mapping's private_lock.
> > >
> > > The only problem is actually _finding_ the address_space which is
> > > interested in this buffer_head.  Looks like we'd need a backpointer in
> > > the buffer_head.
> >
> >   Yes, the pointer seems to be inevitable in this solution. If you think
> > the improvement is worth adding 4 bytes to each buffer_head, then I can
> > write the patch.
>
> hm.  It only affects what are now rarely-used filesystems like ext2, minix,
> etc.  Not sure about reiser3.  But it is a strict correctness issue.  I
> suppose we should do it.  It'd be nice to find a way to avoid increasing
> the bh size..
>

I know my patch is pretty nasty, but is it really worth adding complexity to 
the base code for this corner case?

> It'll be a hard patch to test.

I'll dig up $partner.  They had a nasty test suite for that kind of thing.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-13  2:08           ` Chris Mason
@ 2006-01-13  2:16             ` Andrew Morton
  2006-01-18 22:46               ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-01-13  2:16 UTC (permalink / raw)
  To: Chris Mason; +Cc: jack, linux-fsdevel

Chris Mason <mason@suse.com> wrote:
>
> On Thursday 12 January 2006 15:47, Andrew Morton wrote:
> > Jan Kara <jack@suse.cz> wrote:
> > > > Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > hm, yes, try_to_free_buffers() against a blockdev page is a problem. 
> > > > But I think it's fixable.
> > > >
> > > > In the second loop of drop_buffers(), if we find a bh which had a write
> > > > error we need to set AS_EIO on the address_space which is interested in
> > > > this buffer.
> > > >
> > > > That address_space is pinned by a) the fact that it has buffers at
> > > > ->private_list and b) we hold the blockdev mapping's private_lock.
> > > >
> > > > The only problem is actually _finding_ the address_space which is
> > > > interested in this buffer_head.  Looks like we'd need a backpointer in
> > > > the buffer_head.
> > >
> > >   Yes, the pointer seems to be inevitable in this solution. If you think
> > > the improvement is worth adding 4 bytes to each buffer_head, then I can
> > > write the patch.
> >
> > hm.  It only affects what are now rarely-used filesystems like ext2, minix,
> > etc.  Not sure about reiser3.  But it is a strict correctness issue.  I
> > suppose we should do it.  It'd be nice to find a way to avoid increasing
> > the bh size..
> >
> 
> I know my patch is pretty nasty, but is it really worth adding complexity to 
> the base code for this corner case?
> 

It's not much complexity: a few set_bits and test_and_clear_bits in three
places.  The main drawback is a larger buffer_head.

Yeah, it's a pita, but it is a data-integrity correctness thing.

> > It'll be a hard patch to test.
> 
> I'll dig up $partner.  They had a nasty test suite for that kind of thing.
> 

That'd be neat.  They do IO error simulations?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-13  2:16             ` Andrew Morton
@ 2006-01-18 22:46               ` Jan Kara
  2006-01-18 23:06                 ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-18 22:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chris Mason, linux-fsdevel

> Chris Mason <mason@suse.com> wrote:
> >
> > On Thursday 12 January 2006 15:47, Andrew Morton wrote:
> > > Jan Kara <jack@suse.cz> wrote:
> > > > > Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > hm, yes, try_to_free_buffers() against a blockdev page is a problem. 
> > > > > But I think it's fixable.
> > > > >
> > > > > In the second loop of drop_buffers(), if we find a bh which had a write
> > > > > error we need to set AS_EIO on the address_space which is interested in
> > > > > this buffer.
> > > > >
> > > > > That address_space is pinned by a) the fact that it has buffers at
> > > > > ->private_list and b) we hold the blockdev mapping's private_lock.
> > > > >
> > > > > The only problem is actually _finding_ the address_space which is
> > > > > interested in this buffer_head.  Looks like we'd need a backpointer in
> > > > > the buffer_head.
> > > >
> > > >   Yes, the pointer seems to be inevitable in this solution. If you think
> > > > the improvement is worth adding 4 bytes to each buffer_head, then I can
> > > > write the patch.
> > >
> > > hm.  It only affects what are now rarely-used filesystems like ext2, minix,
> > > etc.  Not sure about reiser3.  But it is a strict correctness issue.  I
> > > suppose we should do it.  It'd be nice to find a way to avoid increasing
> > > the bh size..
> > >
> > 
> > I know my patch is pretty nasty, but is it really worth adding complexity to 
> > the base code for this corner case?
> > 
> 
> It's not much complexity: a few set_bits and test_and_clear_bits in three
> places.  The main drawback is a larger buffer_head.
> 
> Yeah, it's a pita, but it is a data-integrity correctness thing.
  There's a way to avoid the extra pointer in buffer_head: we could
change the circular linked list b_assoc_buffers to a non-circular one
(probably use hlists). Then if we find a buffer with IO error, we could
find a list head and from it compute the pointer to the mapping... It
would not be fast but on IO error I think we could afford it. Do you think
it's a good idea?

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-18 22:46               ` Jan Kara
@ 2006-01-18 23:06                 ` Andrew Morton
  2006-01-19 12:21                   ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-01-18 23:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: mason, linux-fsdevel

Jan Kara <jack@suse.cz> wrote:
>
> > It's not much complexity: a few set_bits and test_and_clear_bits in three
>  > places.  The main drawback is a larger buffer_head.
>  > 
>  > Yeah, it's a pita, but it is a data-integrity correctness thing.
>    There's a way to avoid the extra pointer in buffer_head: we could
>  change the circular linked list b_assoc_buffers to a non-circular one
>  (probably use hlists). Then if we find a buffer with IO error, we could
>  find a list head and from it compute the pointer to the mapping... It
>  would not be fast but on IO error I think we could afford it. Do you think
>  it's a good idea?

I don't understand.  We have a pointer to a buffer_head against the
blockdev mapping and we want to find, from that, the S_ISREG address_space
which has connected that buffer to itself.  Precisely how are you proposing
that we do that?

(If you're saying that the hlist space saving would allow us to add an
address_space* pointer to the bh at no space cost then yes).

(If you're proposing that we walk the buffer_head list until we somehow
find the list_head or hlist_head which is embedded within the desired
address_space then yes2, but how do we know where to terminate the search?)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-18 23:06                 ` Andrew Morton
@ 2006-01-19 12:21                   ` Jan Kara
  2006-01-19 21:16                     ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-19 12:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mason, linux-fsdevel

> Jan Kara <jack@suse.cz> wrote:
> >
> > > It's not much complexity: a few set_bits and test_and_clear_bits in three
> >  > places.  The main drawback is a larger buffer_head.
> >  > 
> >  > Yeah, it's a pita, but it is a data-integrity correctness thing.
> >    There's a way to avoid the extra pointer in buffer_head: we could
> >  change the circular linked list b_assoc_buffers to a non-circular one
> >  (probably use hlists). Then if we find a buffer with IO error, we could
> >  find a list head and from it compute the pointer to the mapping... It
> >  would not be fast but on IO error I think we could afford it. Do you think
> >  it's a good idea?
> 
> I don't understand.  We have a pointer to a buffer_head against the
> blockdev mapping and we want to find, from that, the S_ISREG address_space
> which has connected that buffer to itself.  Precisely how are you proposing
> that we do that?
  OK, my explanation was probably too brief. So once more :)

> (If you're saying that the hlist space saving would allow us to add an
> address_space* pointer to the bh at no space cost then yes).
  list_head has the same size as hlist_head so I don't think we'll save
any space. Anyway I meant what you write below.

> (If you're proposing that we walk the buffer_head list until we somehow
> find the list_head or hlist_head which is embedded within the desired
> address_space then yes2, but how do we know where to terminate the search?)
  Yes, that's it. If the list is not circular, then it's easy to find
it's head. So what I suggest is to change private_list from a circular
list of list_head structures to the non-circular one. And because
non-circular list is already implemented in hlist macros I'd simply use
those. The only minor hack is that mapping would not contain hlist_head
but hlist_node with pprev set to NULL so that we easily recognize it when
traversing backwards. Is it clearer now?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-19 12:21                   ` Jan Kara
@ 2006-01-19 21:16                     ` Andrew Morton
  2006-01-20 13:41                       ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-01-19 21:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: mason, linux-fsdevel

Jan Kara <jack@suse.cz> wrote:
>
> > (If you're proposing that we walk the buffer_head list until we somehow
>  > find the list_head or hlist_head which is embedded within the desired
>  > address_space then yes2, but how do we know where to terminate the search?)
>    Yes, that's it. If the list is not circular, then it's easy to find
>  it's head. So what I suggest is to change private_list from a circular
>  list of list_head structures to the non-circular one. And because
>  non-circular list is already implemented in hlist macros I'd simply use
>  those. The only minor hack is that mapping would not contain hlist_head
>  but hlist_node with pprev set to NULL so that we easily recognize it when
>  traversing backwards. Is it clearer now?

Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
4GB file will have up to 1000 buffers on that list so we're looking at
potentially O(1000000) operations.

Plan B is to stick with a doubly-linked list, but mark the address_space's
list_head by setting bit 0 of its list_head.next to 1.

The challenge is to implement this in a manner which Linus doesn't notice ;)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-19 21:16                     ` Andrew Morton
@ 2006-01-20 13:41                       ` Jan Kara
  2006-01-20 21:24                         ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-20 13:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mason, linux-fsdevel

> Jan Kara <jack@suse.cz> wrote:
> >
> > > (If you're proposing that we walk the buffer_head list until we somehow
> >  > find the list_head or hlist_head which is embedded within the desired
> >  > address_space then yes2, but how do we know where to terminate the search?)
> >    Yes, that's it. If the list is not circular, then it's easy to find
> >  it's head. So what I suggest is to change private_list from a circular
> >  list of list_head structures to the non-circular one. And because
> >  non-circular list is already implemented in hlist macros I'd simply use
> >  those. The only minor hack is that mapping would not contain hlist_head
> >  but hlist_node with pprev set to NULL so that we easily recognize it when
> >  traversing backwards. Is it clearer now?
> 
> Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
> 4GB file will have up to 1000 buffers on that list so we're looking at
> potentially O(1000000) operations.
> 
> Plan B is to stick with a doubly-linked list, but mark the address_space's
> list_head by setting bit 0 of its list_head.next to 1.
> 
> The challenge is to implement this in a manner which Linus doesn't notice ;)
  In the mean time I though of an improvement of our cunning plan ;) We
could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
buffer is going to be thrown out from private_list (and has BH_Write_EIO
or BH_Neighbor_EIO set) we find another victim (previous buffer on the
list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
we can set error on the mapping. By this we get rid of going through the
long link list.
  I guess I like this solution enough so that I start coding ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 13:41                       ` Jan Kara
@ 2006-01-20 21:24                         ` Andrew Morton
  2006-01-20 21:31                           ` Andrew Morton
                                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Andrew Morton @ 2006-01-20 21:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: mason, linux-fsdevel

Jan Kara <jack@suse.cz> wrote:
>
> > Jan Kara <jack@suse.cz> wrote:
> > >
> > > > (If you're proposing that we walk the buffer_head list until we somehow
> > >  > find the list_head or hlist_head which is embedded within the desired
> > >  > address_space then yes2, but how do we know where to terminate the search?)
> > >    Yes, that's it. If the list is not circular, then it's easy to find
> > >  it's head. So what I suggest is to change private_list from a circular
> > >  list of list_head structures to the non-circular one. And because
> > >  non-circular list is already implemented in hlist macros I'd simply use
> > >  those. The only minor hack is that mapping would not contain hlist_head
> > >  but hlist_node with pprev set to NULL so that we easily recognize it when
> > >  traversing backwards. Is it clearer now?
> > 
> > Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
> > 4GB file will have up to 1000 buffers on that list so we're looking at
> > potentially O(1000000) operations.
> > 
> > Plan B is to stick with a doubly-linked list, but mark the address_space's
> > list_head by setting bit 0 of its list_head.next to 1.
> > 
> > The challenge is to implement this in a manner which Linus doesn't notice ;)
>   In the mean time I though of an improvement of our cunning plan ;) We
> could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
> buffer is going to be thrown out from private_list (and has BH_Write_EIO
> or BH_Neighbor_EIO set) we find another victim (previous buffer on the
> list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
> we can set error on the mapping. By this we get rid of going through the
> long link list.
>   I guess I like this solution enough so that I start coding ;).

hm.

Another approach would be to allocate a new buffer_head which "belongs" to
the address_space.  So instead of doing:

		bh<->bh<->address_space<->bh<->bh

we do:

                     address_space
                           ^
                           |
                           v
		bh<->bh<->bh'<->bh<->bh

and set a magic bit in bh' which says "this is not a real bh; your
address_space is at *b_private".

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 21:24                         ` Andrew Morton
@ 2006-01-20 21:31                           ` Andrew Morton
  2006-01-20 21:33                             ` Andrew Morton
  2006-01-22 22:55                             ` Jan Kara
  2006-01-22 22:32                           ` Jan Kara
  2006-01-22 23:31                           ` Jan Kara
  2 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2006-01-20 21:31 UTC (permalink / raw)
  To: jack, mason, linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:
>
>  Another approach would be to allocate a new buffer_head which "belongs" to
>  the address_space.  So instead of doing:
> 
>  		bh<->bh<->address_space<->bh<->bh
> 
>  we do:
> 
>                       address_space
>                             ^
>                             |
>                             v
>  		bh<->bh<->bh'<->bh<->bh
> 
>  and set a magic bit in bh' which says "this is not a real bh; your
>  address_space is at *b_private".

And if we do that, we can actually shrink the address_space by 2*sizeof(long):

- private_list can become `struct buffer_head *metadata_buffers'

- and if we're going to abandon the address_space.private* genericity, we
  can remove address_space.assoc_mapping and use a new
  address_space->metadata_buffers->b_bdev->metadata_lock.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 21:31                           ` Andrew Morton
@ 2006-01-20 21:33                             ` Andrew Morton
  2006-01-22 22:55                             ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2006-01-20 21:33 UTC (permalink / raw)
  To: jack, mason, linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:
>
> Andrew Morton <akpm@osdl.org> wrote:
> >
> >  Another approach would be to allocate a new buffer_head which "belongs" to
> >  the address_space.  So instead of doing:
> > 
> >  		bh<->bh<->address_space<->bh<->bh
> > 
> >  we do:
> > 
> >                       address_space
> >                             ^
> >                             |
> >                             v
> >  		bh<->bh<->bh'<->bh<->bh
> > 
> >  and set a magic bit in bh' which says "this is not a real bh; your
> >  address_space is at *b_private".
> 
> And if we do that, we can actually shrink the address_space by 2*sizeof(long):
> 

3*sizeof(long)!

> - private_list can become `struct buffer_head *metadata_buffers'
> 
> - and if we're going to abandon the address_space.private* genericity, we
>   can remove address_space.assoc_mapping and use a new
>   address_space->metadata_buffers->b_bdev->metadata_lock.

- nuke address_space.assoc_mapping, too.


Of course the cost is that written-to address_spaces will have an
associated extra bh.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 21:24                         ` Andrew Morton
  2006-01-20 21:31                           ` Andrew Morton
@ 2006-01-22 22:32                           ` Jan Kara
  2006-01-22 23:31                           ` Jan Kara
  2 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2006-01-22 22:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mason, linux-fsdevel

> Jan Kara <jack@suse.cz> wrote:
> >
> > > Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > > (If you're proposing that we walk the buffer_head list until we somehow
> > > >  > find the list_head or hlist_head which is embedded within the desired
> > > >  > address_space then yes2, but how do we know where to terminate the search?)
> > > >    Yes, that's it. If the list is not circular, then it's easy to find
> > > >  it's head. So what I suggest is to change private_list from a circular
> > > >  list of list_head structures to the non-circular one. And because
> > > >  non-circular list is already implemented in hlist macros I'd simply use
> > > >  those. The only minor hack is that mapping would not contain hlist_head
> > > >  but hlist_node with pprev set to NULL so that we easily recognize it when
> > > >  traversing backwards. Is it clearer now?
> > > 
> > > Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
> > > 4GB file will have up to 1000 buffers on that list so we're looking at
> > > potentially O(1000000) operations.
> > > 
> > > Plan B is to stick with a doubly-linked list, but mark the address_space's
> > > list_head by setting bit 0 of its list_head.next to 1.
> > > 
> > > The challenge is to implement this in a manner which Linus doesn't notice ;)
> >   In the mean time I though of an improvement of our cunning plan ;) We
> > could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
> > buffer is going to be thrown out from private_list (and has BH_Write_EIO
> > or BH_Neighbor_EIO set) we find another victim (previous buffer on the
> > list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
> > we can set error on the mapping. By this we get rid of going through the
> > long link list.
> >   I guess I like this solution enough so that I start coding ;).
> 
> hm.
> 
> Another approach would be to allocate a new buffer_head which "belongs" to
> the address_space.  So instead of doing:
> 
> 		bh<->bh<->address_space<->bh<->bh
> 
> we do:
> 
>                      address_space
>                            ^
>                            |
>                            v
> 		bh<->bh<->bh'<->bh<->bh
> 
> and set a magic bit in bh' which says "this is not a real bh; your
> address_space is at *b_private".
  Yes, that's a good idea. It simplifies the coding. Actually we don't
need the pointer to the mapping and detection of special buffer head if
we use my bit propagation as I wrote in my last email. We just
eventually set bit BH_Neighbor_EIO in the fake buffer head and we check
it in the fsync_buffers_list() and return EIO in that case.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 21:31                           ` Andrew Morton
  2006-01-20 21:33                             ` Andrew Morton
@ 2006-01-22 22:55                             ` Jan Kara
  2006-01-23  0:06                               ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2006-01-22 22:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mason, linux-fsdevel

> Andrew Morton <akpm@osdl.org> wrote:
> >
> >  Another approach would be to allocate a new buffer_head which "belongs" to
> >  the address_space.  So instead of doing:
> > 
> >  		bh<->bh<->address_space<->bh<->bh
> > 
> >  we do:
> > 
> >                       address_space
> >                             ^
> >                             |
> >                             v
> >  		bh<->bh<->bh'<->bh<->bh
> > 
> >  and set a magic bit in bh' which says "this is not a real bh; your
> >  address_space is at *b_private".
> 
> And if we do that, we can actually shrink the address_space by 2*sizeof(long):
> 
> - private_list can become `struct buffer_head *metadata_buffers'
  Ack.

> - and if we're going to abandon the address_space.private* genericity, we
>   can remove address_space.assoc_mapping and use a new
>   address_space->metadata_buffers->b_bdev->metadata_lock.
  Umm, I'm not sure about this locking change. As I see the code,
private_lock is used for serializing several players accessing the
mapping (__find_get_block_slow, try_to_free_buffers, fsync_buffers_list,
mark_buffer_dirty_inode, __set_page_dirty_buffers,
invalidate_inode_buffer, remove_inode_buffers, grow_dev_page,
create_empty_buffers, bforget). So I'm now a bit lost which of the
private_locks belong to the mapping of the block device and which to the
mappings of the inodes. Or should all those use device's mapping's
private_lock and you indend to convert all those users to metadata_lock?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-20 21:24                         ` Andrew Morton
  2006-01-20 21:31                           ` Andrew Morton
  2006-01-22 22:32                           ` Jan Kara
@ 2006-01-22 23:31                           ` Jan Kara
  2 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2006-01-22 23:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mason, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2471 bytes --]

> Jan Kara <jack@suse.cz> wrote:
> >
> > > Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > > (If you're proposing that we walk the buffer_head list until we somehow
> > > >  > find the list_head or hlist_head which is embedded within the desired
> > > >  > address_space then yes2, but how do we know where to terminate the search?)
> > > >    Yes, that's it. If the list is not circular, then it's easy to find
> > > >  it's head. So what I suggest is to change private_list from a circular
> > > >  list of list_head structures to the non-circular one. And because
> > > >  non-circular list is already implemented in hlist macros I'd simply use
> > > >  those. The only minor hack is that mapping would not contain hlist_head
> > > >  but hlist_node with pprev set to NULL so that we easily recognize it when
> > > >  traversing backwards. Is it clearer now?
> > > 
> > > Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
> > > 4GB file will have up to 1000 buffers on that list so we're looking at
> > > potentially O(1000000) operations.
> > > 
> > > Plan B is to stick with a doubly-linked list, but mark the address_space's
> > > list_head by setting bit 0 of its list_head.next to 1.
> > > 
> > > The challenge is to implement this in a manner which Linus doesn't notice ;)
> >   In the mean time I though of an improvement of our cunning plan ;) We
> > could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
> > buffer is going to be thrown out from private_list (and has BH_Write_EIO
> > or BH_Neighbor_EIO set) we find another victim (previous buffer on the
> > list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
> > we can set error on the mapping. By this we get rid of going through the
> > long link list.
> >   I guess I like this solution enough so that I start coding ;).
> 
> hm.
> 
> Another approach would be to allocate a new buffer_head which "belongs" to
> the address_space.  So instead of doing:
> 
> 		bh<->bh<->address_space<->bh<->bh
> 
> we do:
> 
>                      address_space
>                            ^
>                            |
>                            v
> 		bh<->bh<->bh'<->bh<->bh
> 
> and set a magic bit in bh' which says "this is not a real bh; your
> address_space is at *b_private".
  Attached is the first patch combining your and my ideas. It's just for
illustration - it's completely untested.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: vfs-2.6.16-rc1-1-private_list.diff --]
[-- Type: text/plain, Size: 9957 bytes --]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/buffer.c linux-2.6.16-rc1-1-private_list/fs/buffer.c
--- linux-2.6.16-rc1/fs/buffer.c	2006-01-17 21:44:01.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/buffer.c	2006-01-23 00:43:16.000000000 +0100
@@ -663,26 +663,26 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  *
  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->private_list.
+ * management of a list of dependent buffers at ->i_mapping.metadata_buffers.
  *
- * Locking is a little subtle: try_to_free_buffers() will remove buffers
- * from their controlling inode's queue when they are being freed.  But
- * try_to_free_buffers() will be operating against the *blockdev* mapping
- * at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for private_list is via the private_lock in the address_space
- * which backs the buffers.  Which is different from the address_space 
- * against which the buffers are listed.  So for a particular address_space,
- * mapping->private_lock does *not* protect mapping->private_list!  In fact,
- * mapping->private_list will always be protected by the backing blockdev's
- * ->private_lock.
+ * Locking is a little subtle: try_to_free_buffers() will remove buffers from
+ * their controlling inode's queue when they are being freed.  But
+ * try_to_free_buffers() will be operating against the *blockdev* mapping at
+ * the time, not against the S_ISREG file which depends on those buffers.  So
+ * the locking for metadata_buffers is via the private_lock in the
+ * address_space which backs the buffers.  Which is different from the
+ * address_space against which the buffers are listed.  So for a particular
+ * address_space, mapping->private_lock does *not* protect
+ * mapping.metadata_buffers! In fact, mapping.metadata_buffers will always be
+ * protected by the backing blockdev's ->private_lock.
  *
  * Which introduces a requirement: all buffers on an address_space's
- * ->private_list must be from the same address_space: the blockdev's.
+ * metadata_buffers must be from the same address_space: the blockdev's.
  *
- * address_spaces which do not place buffers at ->private_list via these
- * utility functions are free to use private_lock and private_list for
- * whatever they want.  The only requirement is that list_empty(private_list)
- * be true at clear_inode() time.
+ * address_spaces which do not place buffers at metadata_buffers via these
+ * utility functions are free to use private_lock and metadata_buffers for
+ * whatever they want.  The only requirement is that
+ * list_empty(metadata_buffers) be true at clear_inode() time.
  *
  * FIXME: clear_inode should not call invalidate_inode_buffers().  The
  * filesystems should do that.  invalidate_inode_buffers() should just go
@@ -713,7 +713,7 @@ static inline void __remove_assoc_queue(
 
 int inode_has_buffers(struct inode *inode)
 {
-	return !list_empty(&inode->i_data.private_list);
+	return !list_empty(&inode->i_data.metadata_buffers.b_assoc_buffers);
 }
 
 /*
@@ -756,7 +756,7 @@ repeat:
  *                        buffers
  * @mapping: the mapping which wants those buffers written
  *
- * Starts I/O against the buffers at mapping->private_list, and waits upon
+ * Starts I/O against the buffers at mapping->metadata_buffers, and waits upon
  * that I/O.
  *
  * Basically, this is a convenience function for fsync().
@@ -767,11 +767,12 @@ int sync_mapping_buffers(struct address_
 {
 	struct address_space *buffer_mapping = mapping->assoc_mapping;
 
-	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
+	if (buffer_mapping == NULL ||
+		list_empty(&mapping->metadata_buffers.b_assoc_buffers))
 		return 0;
 
 	return fsync_buffers_list(&buffer_mapping->private_lock,
-					&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 }
 EXPORT_SYMBOL(sync_mapping_buffers);
 
@@ -807,7 +808,7 @@ void mark_buffer_dirty_inode(struct buff
 	if (list_empty(&bh->b_assoc_buffers)) {
 		spin_lock(&buffer_mapping->private_lock);
 		list_move_tail(&bh->b_assoc_buffers,
-				&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 		spin_unlock(&buffer_mapping->private_lock);
 	}
 }
@@ -899,9 +900,18 @@ static int fsync_buffers_list(spinlock_t
 	INIT_LIST_HEAD(&tmp);
 
 	spin_lock(lock);
+	/* Was there IO error on some already evicted buffer? */
+	if (buffer_file_io_error(BH_ENTRY(list))) {
+		clear_buffer_file_io_error(BH_ENTRY(list));
+		err = -EIO;
+	}
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
-		list_del_init(&bh->b_assoc_buffers);
+		__remove_assoc_queue(bh);
+		if (buffer_write_io_error(bh) || buffer_file_io_error(bh)) {
+			clear_buffer_file_io_error(BH_ENTRY(list));
+			err = -EIO;
+		}
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
 			if (buffer_dirty(bh)) {
@@ -953,7 +963,8 @@ void invalidate_inode_buffers(struct ino
 {
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -975,7 +986,8 @@ int remove_inode_buffers(struct inode *i
 
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -2957,8 +2969,15 @@ drop_buffers(struct page *page, struct b
 	do {
 		struct buffer_head *next = bh->b_this_page;
 
-		if (!list_empty(&bh->b_assoc_buffers))
+		if (!list_empty(&bh->b_assoc_buffers)) {
+			if (buffer_write_io_error(bh) ||
+			    buffer_file_io_error(bh)) {
+				struct buffer_head *prev = BH_ENTRY(bh->
+					b_assoc_buffers.prev);
+				set_buffer_file_io_error(prev);
+			}
 			__remove_assoc_queue(bh);
+		}
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/fs-writeback.c linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c
--- linux-2.6.16-rc1/fs/fs-writeback.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c	2006-01-22 23:32:37.000000000 +0100
@@ -609,7 +609,7 @@ EXPORT_SYMBOL(sync_inode);
  * written and waited upon.
  *
  *    OSYNC_DATA:     i_mapping's dirty data
- *    OSYNC_METADATA: the buffers at i_mapping->private_list
+ *    OSYNC_METADATA: the buffers at i_mapping->metadata_buffers
  *    OSYNC_INODE:    the inode itself
  */
 
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/inode.c linux-2.6.16-rc1-1-private_list/fs/inode.c
--- linux-2.6.16-rc1/fs/inode.c	2006-01-23 00:46:45.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/inode.c	2006-01-23 00:48:26.000000000 +0100
@@ -199,7 +199,7 @@ void inode_init_once(struct inode *inode
 	rwlock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	spin_lock_init(&inode->i_data.private_lock);
-	INIT_LIST_HEAD(&inode->i_data.private_list);
+	INIT_LIST_HEAD(&inode->i_data.metadata_buffers.b_assoc_buffers);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
 	spin_lock_init(&inode->i_lock);
@@ -413,7 +413,7 @@ static int can_unuse(struct inode *inode
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
+ * If the inode has metadata buffers attached to mapping->metadata_buffers then
  * try to remove them.
  */
 static void prune_icache(int nr_to_scan)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/buffer_head.h linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h
--- linux-2.6.16-rc1/include/linux/buffer_head.h	2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h	2006-01-22 21:40:09.000000000 +0100
@@ -32,6 +32,7 @@ enum bh_state_bits {
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
+	BH_File_EIO,	/* Some other buffer in the file had IO error */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -119,6 +120,7 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
+BUFFER_FNS(File_EIO, file_io_error)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/fs.h linux-2.6.16-rc1-1-private_list/include/linux/fs.h
--- linux-2.6.16-rc1/include/linux/fs.h	2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/fs.h	2006-01-23 00:51:10.000000000 +0100
@@ -381,8 +381,10 @@ struct address_space {
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
 	spinlock_t		private_lock;	/* for use by the address_space */
-	struct list_head	private_list;	/* ditto */
-	struct address_space	*assoc_mapping;	/* ditto */
+	/* Fake buffer head in the list of metadata buffers belonging to block
+	   device address space */
+	struct buffer_head	metadata_buffers;	
+	struct address_space	*assoc_mapping;	/* for use by the address_space */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fix ext2 error reporting on fsync
  2006-01-22 22:55                             ` Jan Kara
@ 2006-01-23  0:06                               ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2006-01-23  0:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: mason, linux-fsdevel

Jan Kara <jack@suse.cz> wrote:
>
> > Andrew Morton <akpm@osdl.org> wrote:
> > >
> > >  Another approach would be to allocate a new buffer_head which "belongs" to
> > >  the address_space.  So instead of doing:
> > > 
> > >  		bh<->bh<->address_space<->bh<->bh
> > > 
> > >  we do:
> > > 
> > >                       address_space
> > >                             ^
> > >                             |
> > >                             v
> > >  		bh<->bh<->bh'<->bh<->bh
> > > 
> > >  and set a magic bit in bh' which says "this is not a real bh; your
> > >  address_space is at *b_private".
> > 
> > And if we do that, we can actually shrink the address_space by 2*sizeof(long):
> > 
> > - private_list can become `struct buffer_head *metadata_buffers'
>   Ack.
> 
> > - and if we're going to abandon the address_space.private* genericity, we
> >   can remove address_space.assoc_mapping and use a new
> >   address_space->metadata_buffers->b_bdev->metadata_lock.
>   Umm, I'm not sure about this locking change. As I see the code,
> private_lock is used for serializing several players accessing the
> mapping (__find_get_block_slow, try_to_free_buffers, fsync_buffers_list,
> mark_buffer_dirty_inode, __set_page_dirty_buffers,
> invalidate_inode_buffer, remove_inode_buffers, grow_dev_page,
> create_empty_buffers, bforget). So I'm now a bit lost which of the
> private_locks belong to the mapping of the block device and which to the
> mappings of the inodes. Or should all those use device's mapping's
> private_lock and you indend to convert all those users to metadata_lock?
> 

Each inode has a list of buffer_heads at its private_list.  It so happens
that all of these buffers are against the blockdev mapping.  Hence we
always use the blockdev mapping's private_lock to protect each inode's
private_list.

Each inode records, in ->assoc_mapping, which address_space really owns the
buffers which that inode has on its private_list.  It so happens that in
all cases, inode->assoc_mapping points at the inode's backing blockdev's
address_space.

I did it this way during the lock contention reduction work so that
non-block-backed address_spaces could use the storage at private_list,
private_lock and assoc_mapping for something else.  But I don't think
anyone did that.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-01-23  0:06 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-01-11 17:43 [PATCH] Fix ext2 error reporting on fsync Jan Kara
2006-01-12  4:24 ` Andrew Morton
2006-01-12 10:20   ` Jan Kara
2006-01-12 11:52     ` Andrew Morton
2006-01-12 13:58       ` Chris Mason
2006-01-12 14:21         ` Jan Kara
2006-01-12 15:36           ` Chris Mason
2006-01-12 16:32             ` Jan Kara
2006-01-12 14:26       ` Jan Kara
2006-01-12 20:47         ` Andrew Morton
2006-01-13  2:08           ` Chris Mason
2006-01-13  2:16             ` Andrew Morton
2006-01-18 22:46               ` Jan Kara
2006-01-18 23:06                 ` Andrew Morton
2006-01-19 12:21                   ` Jan Kara
2006-01-19 21:16                     ` Andrew Morton
2006-01-20 13:41                       ` Jan Kara
2006-01-20 21:24                         ` Andrew Morton
2006-01-20 21:31                           ` Andrew Morton
2006-01-20 21:33                             ` Andrew Morton
2006-01-22 22:55                             ` Jan Kara
2006-01-23  0:06                               ` Andrew Morton
2006-01-22 22:32                           ` Jan Kara
2006-01-22 23:31                           ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).