From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH] Fix ext2 error reporting on fsync Date: Mon, 23 Jan 2006 00:31:29 +0100 Message-ID: <20060122233126.GC28667@atrey.karlin.mff.cuni.cz> References: <20060112142656.GB14235@atrey.karlin.mff.cuni.cz> <20060112124717.6e242802.akpm@osdl.org> <200601122108.55103.mason@suse.com> <20060112181628.63c4bf39.akpm@osdl.org> <20060118224652.GA6434@atrey.karlin.mff.cuni.cz> <20060118150646.7514ba5f.akpm@osdl.org> <20060119122125.GA12563@atrey.karlin.mff.cuni.cz> <20060119131648.68d7c6eb.akpm@osdl.org> <20060120134118.GK668@atrey.karlin.mff.cuni.cz> <20060120132402.51ee5151.akpm@osdl.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="W/nzBZO5zC0uMSeA" Cc: mason@suse.com, linux-fsdevel@vger.kernel.org Return-path: Received: from atrey.karlin.mff.cuni.cz ([195.113.31.123]:6329 "EHLO atrey.karlin.mff.cuni.cz") by vger.kernel.org with ESMTP id S1751340AbWAVXbc (ORCPT ); Sun, 22 Jan 2006 18:31:32 -0500 To: Andrew Morton Content-Disposition: inline In-Reply-To: <20060120132402.51ee5151.akpm@osdl.org> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org --W/nzBZO5zC0uMSeA Content-Type: text/plain; charset=us-ascii Content-Disposition: inline > Jan Kara wrote: > > > > > Jan Kara wrote: > > > > > > > > > (If you're proposing that we walk the buffer_head list until we somehow > > > > > find the list_head or hlist_head which is embedded within the desired > > > > > address_space then yes2, but how do we know where to terminate the search?) > > > > Yes, that's it. If the list is not circular, then it's easy to find > > > > it's head. So what I suggest is to change private_list from a circular > > > > list of list_head structures to the non-circular one. And because > > > > non-circular list is already implemented in hlist macros I'd simply use > > > > those. The only minor hack is that mapping would not contain hlist_head > > > > but hlist_node with pprev set to NULL so that we easily recognize it when > > > > traversing backwards. Is it clearer now? > > > > > > Ah, yes. It could get pretty inefficient in some corner cases, I guess. A > > > 4GB file will have up to 1000 buffers on that list so we're looking at > > > potentially O(1000000) operations. > > > > > > Plan B is to stick with a doubly-linked list, but mark the address_space's > > > list_head by setting bit 0 of its list_head.next to 1. > > > > > > The challenge is to implement this in a manner which Linus doesn't notice ;) > > In the mean time I though of an improvement of our cunning plan ;) We > > could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If > > buffer is going to be thrown out from private_list (and has BH_Write_EIO > > or BH_Neighbor_EIO set) we find another victim (previous buffer on the > > list) and mark it by BH_Neighbor_EIO. If the buffer was the first one, > > we can set error on the mapping. By this we get rid of going through the > > long link list. > > I guess I like this solution enough so that I start coding ;). > > hm. > > Another approach would be to allocate a new buffer_head which "belongs" to > the address_space. So instead of doing: > > bh<->bh<->address_space<->bh<->bh > > we do: > > address_space > ^ > | > v > bh<->bh<->bh'<->bh<->bh > > and set a magic bit in bh' which says "this is not a real bh; your > address_space is at *b_private". Attached is the first patch combining your and my ideas. It's just for illustration - it's completely untested. Honza -- Jan Kara SuSE CR Labs --W/nzBZO5zC0uMSeA Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="vfs-2.6.16-rc1-1-private_list.diff" diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/buffer.c linux-2.6.16-rc1-1-private_list/fs/buffer.c --- linux-2.6.16-rc1/fs/buffer.c 2006-01-17 21:44:01.000000000 +0100 +++ linux-2.6.16-rc1-1-private_list/fs/buffer.c 2006-01-23 00:43:16.000000000 +0100 @@ -663,26 +663,26 @@ EXPORT_SYMBOL(mark_buffer_async_write); * * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(), * inode_has_buffers() and invalidate_inode_buffers() are provided for the - * management of a list of dependent buffers at ->i_mapping->private_list. + * management of a list of dependent buffers at ->i_mapping.metadata_buffers. * - * Locking is a little subtle: try_to_free_buffers() will remove buffers - * from their controlling inode's queue when they are being freed. But - * try_to_free_buffers() will be operating against the *blockdev* mapping - * at the time, not against the S_ISREG file which depends on those buffers. - * So the locking for private_list is via the private_lock in the address_space - * which backs the buffers. Which is different from the address_space - * against which the buffers are listed. So for a particular address_space, - * mapping->private_lock does *not* protect mapping->private_list! In fact, - * mapping->private_list will always be protected by the backing blockdev's - * ->private_lock. + * Locking is a little subtle: try_to_free_buffers() will remove buffers from + * their controlling inode's queue when they are being freed. But + * try_to_free_buffers() will be operating against the *blockdev* mapping at + * the time, not against the S_ISREG file which depends on those buffers. So + * the locking for metadata_buffers is via the private_lock in the + * address_space which backs the buffers. Which is different from the + * address_space against which the buffers are listed. So for a particular + * address_space, mapping->private_lock does *not* protect + * mapping.metadata_buffers! In fact, mapping.metadata_buffers will always be + * protected by the backing blockdev's ->private_lock. * * Which introduces a requirement: all buffers on an address_space's - * ->private_list must be from the same address_space: the blockdev's. + * metadata_buffers must be from the same address_space: the blockdev's. * - * address_spaces which do not place buffers at ->private_list via these - * utility functions are free to use private_lock and private_list for - * whatever they want. The only requirement is that list_empty(private_list) - * be true at clear_inode() time. + * address_spaces which do not place buffers at metadata_buffers via these + * utility functions are free to use private_lock and metadata_buffers for + * whatever they want. The only requirement is that + * list_empty(metadata_buffers) be true at clear_inode() time. * * FIXME: clear_inode should not call invalidate_inode_buffers(). The * filesystems should do that. invalidate_inode_buffers() should just go @@ -713,7 +713,7 @@ static inline void __remove_assoc_queue( int inode_has_buffers(struct inode *inode) { - return !list_empty(&inode->i_data.private_list); + return !list_empty(&inode->i_data.metadata_buffers.b_assoc_buffers); } /* @@ -756,7 +756,7 @@ repeat: * buffers * @mapping: the mapping which wants those buffers written * - * Starts I/O against the buffers at mapping->private_list, and waits upon + * Starts I/O against the buffers at mapping->metadata_buffers, and waits upon * that I/O. * * Basically, this is a convenience function for fsync(). @@ -767,11 +767,12 @@ int sync_mapping_buffers(struct address_ { struct address_space *buffer_mapping = mapping->assoc_mapping; - if (buffer_mapping == NULL || list_empty(&mapping->private_list)) + if (buffer_mapping == NULL || + list_empty(&mapping->metadata_buffers.b_assoc_buffers)) return 0; return fsync_buffers_list(&buffer_mapping->private_lock, - &mapping->private_list); + &mapping->metadata_buffers.b_assoc_buffers); } EXPORT_SYMBOL(sync_mapping_buffers); @@ -807,7 +808,7 @@ void mark_buffer_dirty_inode(struct buff if (list_empty(&bh->b_assoc_buffers)) { spin_lock(&buffer_mapping->private_lock); list_move_tail(&bh->b_assoc_buffers, - &mapping->private_list); + &mapping->metadata_buffers.b_assoc_buffers); spin_unlock(&buffer_mapping->private_lock); } } @@ -899,9 +900,18 @@ static int fsync_buffers_list(spinlock_t INIT_LIST_HEAD(&tmp); spin_lock(lock); + /* Was there IO error on some already evicted buffer? */ + if (buffer_file_io_error(BH_ENTRY(list))) { + clear_buffer_file_io_error(BH_ENTRY(list)); + err = -EIO; + } while (!list_empty(list)) { bh = BH_ENTRY(list->next); - list_del_init(&bh->b_assoc_buffers); + __remove_assoc_queue(bh); + if (buffer_write_io_error(bh) || buffer_file_io_error(bh)) { + clear_buffer_file_io_error(BH_ENTRY(list)); + err = -EIO; + } if (buffer_dirty(bh) || buffer_locked(bh)) { list_add(&bh->b_assoc_buffers, &tmp); if (buffer_dirty(bh)) { @@ -953,7 +963,8 @@ void invalidate_inode_buffers(struct ino { if (inode_has_buffers(inode)) { struct address_space *mapping = &inode->i_data; - struct list_head *list = &mapping->private_list; + struct list_head *list = + &mapping->metadata_buffers.b_assoc_buffers; struct address_space *buffer_mapping = mapping->assoc_mapping; spin_lock(&buffer_mapping->private_lock); @@ -975,7 +986,8 @@ int remove_inode_buffers(struct inode *i if (inode_has_buffers(inode)) { struct address_space *mapping = &inode->i_data; - struct list_head *list = &mapping->private_list; + struct list_head *list = + &mapping->metadata_buffers.b_assoc_buffers; struct address_space *buffer_mapping = mapping->assoc_mapping; spin_lock(&buffer_mapping->private_lock); @@ -2957,8 +2969,15 @@ drop_buffers(struct page *page, struct b do { struct buffer_head *next = bh->b_this_page; - if (!list_empty(&bh->b_assoc_buffers)) + if (!list_empty(&bh->b_assoc_buffers)) { + if (buffer_write_io_error(bh) || + buffer_file_io_error(bh)) { + struct buffer_head *prev = BH_ENTRY(bh-> + b_assoc_buffers.prev); + set_buffer_file_io_error(prev); + } __remove_assoc_queue(bh); + } bh = next; } while (bh != head); *buffers_to_free = head; diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/fs-writeback.c linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c --- linux-2.6.16-rc1/fs/fs-writeback.c 2006-01-15 00:20:12.000000000 +0100 +++ linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c 2006-01-22 23:32:37.000000000 +0100 @@ -609,7 +609,7 @@ EXPORT_SYMBOL(sync_inode); * written and waited upon. * * OSYNC_DATA: i_mapping's dirty data - * OSYNC_METADATA: the buffers at i_mapping->private_list + * OSYNC_METADATA: the buffers at i_mapping->metadata_buffers * OSYNC_INODE: the inode itself */ diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/inode.c linux-2.6.16-rc1-1-private_list/fs/inode.c --- linux-2.6.16-rc1/fs/inode.c 2006-01-23 00:46:45.000000000 +0100 +++ linux-2.6.16-rc1-1-private_list/fs/inode.c 2006-01-23 00:48:26.000000000 +0100 @@ -199,7 +199,7 @@ void inode_init_once(struct inode *inode rwlock_init(&inode->i_data.tree_lock); spin_lock_init(&inode->i_data.i_mmap_lock); spin_lock_init(&inode->i_data.private_lock); - INIT_LIST_HEAD(&inode->i_data.private_list); + INIT_LIST_HEAD(&inode->i_data.metadata_buffers.b_assoc_buffers); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear); spin_lock_init(&inode->i_lock); @@ -413,7 +413,7 @@ static int can_unuse(struct inode *inode * inode is still freeable, proceed. The right inode is found 99.9% of the * time in testing on a 4-way. * - * If the inode has metadata buffers attached to mapping->private_list then + * If the inode has metadata buffers attached to mapping->metadata_buffers then * try to remove them. */ static void prune_icache(int nr_to_scan) diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/buffer_head.h linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h --- linux-2.6.16-rc1/include/linux/buffer_head.h 2006-01-17 21:44:07.000000000 +0100 +++ linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h 2006-01-22 21:40:09.000000000 +0100 @@ -32,6 +32,7 @@ enum bh_state_bits { BH_Write_EIO, /* I/O error on write */ BH_Ordered, /* ordered write */ BH_Eopnotsupp, /* operation not supported (barrier) */ + BH_File_EIO, /* Some other buffer in the file had IO error */ BH_PrivateStart,/* not a state bit, but the first bit available * for private allocation by other entities @@ -119,6 +120,7 @@ BUFFER_FNS(Boundary, boundary) BUFFER_FNS(Write_EIO, write_io_error) BUFFER_FNS(Ordered, ordered) BUFFER_FNS(Eopnotsupp, eopnotsupp) +BUFFER_FNS(File_EIO, file_io_error) #define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK) #define touch_buffer(bh) mark_page_accessed(bh->b_page) diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/fs.h linux-2.6.16-rc1-1-private_list/include/linux/fs.h --- linux-2.6.16-rc1/include/linux/fs.h 2006-01-17 21:44:07.000000000 +0100 +++ linux-2.6.16-rc1-1-private_list/include/linux/fs.h 2006-01-23 00:51:10.000000000 +0100 @@ -381,8 +381,10 @@ struct address_space { unsigned long flags; /* error bits/gfp mask */ struct backing_dev_info *backing_dev_info; /* device readahead, etc */ spinlock_t private_lock; /* for use by the address_space */ - struct list_head private_list; /* ditto */ - struct address_space *assoc_mapping; /* ditto */ + /* Fake buffer head in the list of metadata buffers belonging to block + device address space */ + struct buffer_head metadata_buffers; + struct address_space *assoc_mapping; /* for use by the address_space */ } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but --W/nzBZO5zC0uMSeA--