Re: [PATCH] Fix ext2 error reporting on fsync

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@osdl.org>
Cc: mason@suse.com, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] Fix ext2 error reporting on fsync
Date: Mon, 23 Jan 2006 00:31:29 +0100	[thread overview]
Message-ID: <20060122233126.GC28667@atrey.karlin.mff.cuni.cz> (raw)
In-Reply-To: <20060120132402.51ee5151.akpm@osdl.org>

[-- Attachment #1: Type: text/plain, Size: 2471 bytes --]

> Jan Kara <jack@suse.cz> wrote:
> >
> > > Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > > (If you're proposing that we walk the buffer_head list until we somehow
> > > >  > find the list_head or hlist_head which is embedded within the desired
> > > >  > address_space then yes2, but how do we know where to terminate the search?)
> > > >    Yes, that's it. If the list is not circular, then it's easy to find
> > > >  it's head. So what I suggest is to change private_list from a circular
> > > >  list of list_head structures to the non-circular one. And because
> > > >  non-circular list is already implemented in hlist macros I'd simply use
> > > >  those. The only minor hack is that mapping would not contain hlist_head
> > > >  but hlist_node with pprev set to NULL so that we easily recognize it when
> > > >  traversing backwards. Is it clearer now?
> > > 
> > > Ah, yes.  It could get pretty inefficient in some corner cases, I guess.  A
> > > 4GB file will have up to 1000 buffers on that list so we're looking at
> > > potentially O(1000000) operations.
> > > 
> > > Plan B is to stick with a doubly-linked list, but mark the address_space's
> > > list_head by setting bit 0 of its list_head.next to 1.
> > > 
> > > The challenge is to implement this in a manner which Linus doesn't notice ;)
> >   In the mean time I though of an improvement of our cunning plan ;) We
> > could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
> > buffer is going to be thrown out from private_list (and has BH_Write_EIO
> > or BH_Neighbor_EIO set) we find another victim (previous buffer on the
> > list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
> > we can set error on the mapping. By this we get rid of going through the
> > long link list.
> >   I guess I like this solution enough so that I start coding ;).
> 
> hm.
> 
> Another approach would be to allocate a new buffer_head which "belongs" to
> the address_space.  So instead of doing:
> 
> 		bh<->bh<->address_space<->bh<->bh
> 
> we do:
> 
>                      address_space
>                            ^
>                            |
>                            v
> 		bh<->bh<->bh'<->bh<->bh
> 
> and set a magic bit in bh' which says "this is not a real bh; your
> address_space is at *b_private".
  Attached is the first patch combining your and my ideas. It's just for
illustration - it's completely untested.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: vfs-2.6.16-rc1-1-private_list.diff --]
[-- Type: text/plain, Size: 9957 bytes --]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/buffer.c linux-2.6.16-rc1-1-private_list/fs/buffer.c
--- linux-2.6.16-rc1/fs/buffer.c	2006-01-17 21:44:01.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/buffer.c	2006-01-23 00:43:16.000000000 +0100
@@ -663,26 +663,26 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  *
  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->private_list.
+ * management of a list of dependent buffers at ->i_mapping.metadata_buffers.
  *
- * Locking is a little subtle: try_to_free_buffers() will remove buffers
- * from their controlling inode's queue when they are being freed.  But
- * try_to_free_buffers() will be operating against the *blockdev* mapping
- * at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for private_list is via the private_lock in the address_space
- * which backs the buffers.  Which is different from the address_space 
- * against which the buffers are listed.  So for a particular address_space,
- * mapping->private_lock does *not* protect mapping->private_list!  In fact,
- * mapping->private_list will always be protected by the backing blockdev's
- * ->private_lock.
+ * Locking is a little subtle: try_to_free_buffers() will remove buffers from
+ * their controlling inode's queue when they are being freed.  But
+ * try_to_free_buffers() will be operating against the *blockdev* mapping at
+ * the time, not against the S_ISREG file which depends on those buffers.  So
+ * the locking for metadata_buffers is via the private_lock in the
+ * address_space which backs the buffers.  Which is different from the
+ * address_space against which the buffers are listed.  So for a particular
+ * address_space, mapping->private_lock does *not* protect
+ * mapping.metadata_buffers! In fact, mapping.metadata_buffers will always be
+ * protected by the backing blockdev's ->private_lock.
  *
  * Which introduces a requirement: all buffers on an address_space's
- * ->private_list must be from the same address_space: the blockdev's.
+ * metadata_buffers must be from the same address_space: the blockdev's.
  *
- * address_spaces which do not place buffers at ->private_list via these
- * utility functions are free to use private_lock and private_list for
- * whatever they want.  The only requirement is that list_empty(private_list)
- * be true at clear_inode() time.
+ * address_spaces which do not place buffers at metadata_buffers via these
+ * utility functions are free to use private_lock and metadata_buffers for
+ * whatever they want.  The only requirement is that
+ * list_empty(metadata_buffers) be true at clear_inode() time.
  *
  * FIXME: clear_inode should not call invalidate_inode_buffers().  The
  * filesystems should do that.  invalidate_inode_buffers() should just go
@@ -713,7 +713,7 @@ static inline void __remove_assoc_queue(
 
 int inode_has_buffers(struct inode *inode)
 {
-	return !list_empty(&inode->i_data.private_list);
+	return !list_empty(&inode->i_data.metadata_buffers.b_assoc_buffers);
 }
 
 /*
@@ -756,7 +756,7 @@ repeat:
  *                        buffers
  * @mapping: the mapping which wants those buffers written
  *
- * Starts I/O against the buffers at mapping->private_list, and waits upon
+ * Starts I/O against the buffers at mapping->metadata_buffers, and waits upon
  * that I/O.
  *
  * Basically, this is a convenience function for fsync().
@@ -767,11 +767,12 @@ int sync_mapping_buffers(struct address_
 {
 	struct address_space *buffer_mapping = mapping->assoc_mapping;
 
-	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
+	if (buffer_mapping == NULL ||
+		list_empty(&mapping->metadata_buffers.b_assoc_buffers))
 		return 0;
 
 	return fsync_buffers_list(&buffer_mapping->private_lock,
-					&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 }
 EXPORT_SYMBOL(sync_mapping_buffers);
 
@@ -807,7 +808,7 @@ void mark_buffer_dirty_inode(struct buff
 	if (list_empty(&bh->b_assoc_buffers)) {
 		spin_lock(&buffer_mapping->private_lock);
 		list_move_tail(&bh->b_assoc_buffers,
-				&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 		spin_unlock(&buffer_mapping->private_lock);
 	}
 }
@@ -899,9 +900,18 @@ static int fsync_buffers_list(spinlock_t
 	INIT_LIST_HEAD(&tmp);
 
 	spin_lock(lock);
+	/* Was there IO error on some already evicted buffer? */
+	if (buffer_file_io_error(BH_ENTRY(list))) {
+		clear_buffer_file_io_error(BH_ENTRY(list));
+		err = -EIO;
+	}
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
-		list_del_init(&bh->b_assoc_buffers);
+		__remove_assoc_queue(bh);
+		if (buffer_write_io_error(bh) || buffer_file_io_error(bh)) {
+			clear_buffer_file_io_error(BH_ENTRY(list));
+			err = -EIO;
+		}
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
 			if (buffer_dirty(bh)) {
@@ -953,7 +963,8 @@ void invalidate_inode_buffers(struct ino
 {
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -975,7 +986,8 @@ int remove_inode_buffers(struct inode *i
 
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -2957,8 +2969,15 @@ drop_buffers(struct page *page, struct b
 	do {
 		struct buffer_head *next = bh->b_this_page;
 
-		if (!list_empty(&bh->b_assoc_buffers))
+		if (!list_empty(&bh->b_assoc_buffers)) {
+			if (buffer_write_io_error(bh) ||
+			    buffer_file_io_error(bh)) {
+				struct buffer_head *prev = BH_ENTRY(bh->
+					b_assoc_buffers.prev);
+				set_buffer_file_io_error(prev);
+			}
 			__remove_assoc_queue(bh);
+		}
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/fs-writeback.c linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c
--- linux-2.6.16-rc1/fs/fs-writeback.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c	2006-01-22 23:32:37.000000000 +0100
@@ -609,7 +609,7 @@ EXPORT_SYMBOL(sync_inode);
  * written and waited upon.
  *
  *    OSYNC_DATA:     i_mapping's dirty data
- *    OSYNC_METADATA: the buffers at i_mapping->private_list
+ *    OSYNC_METADATA: the buffers at i_mapping->metadata_buffers
  *    OSYNC_INODE:    the inode itself
  */
 
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/inode.c linux-2.6.16-rc1-1-private_list/fs/inode.c
--- linux-2.6.16-rc1/fs/inode.c	2006-01-23 00:46:45.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/inode.c	2006-01-23 00:48:26.000000000 +0100
@@ -199,7 +199,7 @@ void inode_init_once(struct inode *inode
 	rwlock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	spin_lock_init(&inode->i_data.private_lock);
-	INIT_LIST_HEAD(&inode->i_data.private_list);
+	INIT_LIST_HEAD(&inode->i_data.metadata_buffers.b_assoc_buffers);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
 	spin_lock_init(&inode->i_lock);
@@ -413,7 +413,7 @@ static int can_unuse(struct inode *inode
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
+ * If the inode has metadata buffers attached to mapping->metadata_buffers then
  * try to remove them.
  */
 static void prune_icache(int nr_to_scan)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/buffer_head.h linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h
--- linux-2.6.16-rc1/include/linux/buffer_head.h	2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h	2006-01-22 21:40:09.000000000 +0100
@@ -32,6 +32,7 @@ enum bh_state_bits {
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
+	BH_File_EIO,	/* Some other buffer in the file had IO error */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -119,6 +120,7 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
+BUFFER_FNS(File_EIO, file_io_error)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/fs.h linux-2.6.16-rc1-1-private_list/include/linux/fs.h
--- linux-2.6.16-rc1/include/linux/fs.h	2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/fs.h	2006-01-23 00:51:10.000000000 +0100
@@ -381,8 +381,10 @@ struct address_space {
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
 	spinlock_t		private_lock;	/* for use by the address_space */
-	struct list_head	private_list;	/* ditto */
-	struct address_space	*assoc_mapping;	/* ditto */
+	/* Fake buffer head in the list of metadata buffers belonging to block
+	   device address space */
+	struct buffer_head	metadata_buffers;	
+	struct address_space	*assoc_mapping;	/* for use by the address_space */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but

     prev parent reply	other threads:[~2006-01-22 23:31 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-01-11 17:43 [PATCH] Fix ext2 error reporting on fsync Jan Kara
2006-01-12  4:24 ` Andrew Morton
2006-01-12 10:20   ` Jan Kara
2006-01-12 11:52     ` Andrew Morton
2006-01-12 13:58       ` Chris Mason
2006-01-12 14:21         ` Jan Kara
2006-01-12 15:36           ` Chris Mason
2006-01-12 16:32             ` Jan Kara
2006-01-12 14:26       ` Jan Kara
2006-01-12 20:47         ` Andrew Morton
2006-01-13  2:08           ` Chris Mason
2006-01-13  2:16             ` Andrew Morton
2006-01-18 22:46               ` Jan Kara
2006-01-18 23:06                 ` Andrew Morton
2006-01-19 12:21                   ` Jan Kara
2006-01-19 21:16                     ` Andrew Morton
2006-01-20 13:41                       ` Jan Kara
2006-01-20 21:24                         ` Andrew Morton
2006-01-20 21:31                           ` Andrew Morton
2006-01-20 21:33                             ` Andrew Morton
2006-01-22 22:55                             ` Jan Kara
2006-01-23  0:06                               ` Andrew Morton
2006-01-22 22:32                           ` Jan Kara
2006-01-22 23:31                           ` Jan Kara [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060122233126.GC28667@atrey.karlin.mff.cuni.cz \
    --to=jack@suse.cz \
    --cc=akpm@osdl.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mason@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).