From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@osdl.org>
Cc: mason@suse.com, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] Fix ext2 error reporting on fsync
Date: Mon, 23 Jan 2006 00:31:29 +0100 [thread overview]
Message-ID: <20060122233126.GC28667@atrey.karlin.mff.cuni.cz> (raw)
In-Reply-To: <20060120132402.51ee5151.akpm@osdl.org>
[-- Attachment #1: Type: text/plain, Size: 2471 bytes --]
> Jan Kara <jack@suse.cz> wrote:
> >
> > > Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > > (If you're proposing that we walk the buffer_head list until we somehow
> > > > > find the list_head or hlist_head which is embedded within the desired
> > > > > address_space then yes2, but how do we know where to terminate the search?)
> > > > Yes, that's it. If the list is not circular, then it's easy to find
> > > > it's head. So what I suggest is to change private_list from a circular
> > > > list of list_head structures to the non-circular one. And because
> > > > non-circular list is already implemented in hlist macros I'd simply use
> > > > those. The only minor hack is that mapping would not contain hlist_head
> > > > but hlist_node with pprev set to NULL so that we easily recognize it when
> > > > traversing backwards. Is it clearer now?
> > >
> > > Ah, yes. It could get pretty inefficient in some corner cases, I guess. A
> > > 4GB file will have up to 1000 buffers on that list so we're looking at
> > > potentially O(1000000) operations.
> > >
> > > Plan B is to stick with a doubly-linked list, but mark the address_space's
> > > list_head by setting bit 0 of its list_head.next to 1.
> > >
> > > The challenge is to implement this in a manner which Linus doesn't notice ;)
> > In the mean time I though of an improvement of our cunning plan ;) We
> > could reserve a bit in bh_state. Something like BH_Neighbor_EIO. If
> > buffer is going to be thrown out from private_list (and has BH_Write_EIO
> > or BH_Neighbor_EIO set) we find another victim (previous buffer on the
> > list) and mark it by BH_Neighbor_EIO. If the buffer was the first one,
> > we can set error on the mapping. By this we get rid of going through the
> > long link list.
> > I guess I like this solution enough so that I start coding ;).
>
> hm.
>
> Another approach would be to allocate a new buffer_head which "belongs" to
> the address_space. So instead of doing:
>
> bh<->bh<->address_space<->bh<->bh
>
> we do:
>
> address_space
> ^
> |
> v
> bh<->bh<->bh'<->bh<->bh
>
> and set a magic bit in bh' which says "this is not a real bh; your
> address_space is at *b_private".
Attached is the first patch combining your and my ideas. It's just for
illustration - it's completely untested.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
[-- Attachment #2: vfs-2.6.16-rc1-1-private_list.diff --]
[-- Type: text/plain, Size: 9957 bytes --]
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/buffer.c linux-2.6.16-rc1-1-private_list/fs/buffer.c
--- linux-2.6.16-rc1/fs/buffer.c 2006-01-17 21:44:01.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/buffer.c 2006-01-23 00:43:16.000000000 +0100
@@ -663,26 +663,26 @@ EXPORT_SYMBOL(mark_buffer_async_write);
*
* The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
* inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->private_list.
+ * management of a list of dependent buffers at ->i_mapping.metadata_buffers.
*
- * Locking is a little subtle: try_to_free_buffers() will remove buffers
- * from their controlling inode's queue when they are being freed. But
- * try_to_free_buffers() will be operating against the *blockdev* mapping
- * at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for private_list is via the private_lock in the address_space
- * which backs the buffers. Which is different from the address_space
- * against which the buffers are listed. So for a particular address_space,
- * mapping->private_lock does *not* protect mapping->private_list! In fact,
- * mapping->private_list will always be protected by the backing blockdev's
- * ->private_lock.
+ * Locking is a little subtle: try_to_free_buffers() will remove buffers from
+ * their controlling inode's queue when they are being freed. But
+ * try_to_free_buffers() will be operating against the *blockdev* mapping at
+ * the time, not against the S_ISREG file which depends on those buffers. So
+ * the locking for metadata_buffers is via the private_lock in the
+ * address_space which backs the buffers. Which is different from the
+ * address_space against which the buffers are listed. So for a particular
+ * address_space, mapping->private_lock does *not* protect
+ * mapping.metadata_buffers! In fact, mapping.metadata_buffers will always be
+ * protected by the backing blockdev's ->private_lock.
*
* Which introduces a requirement: all buffers on an address_space's
- * ->private_list must be from the same address_space: the blockdev's.
+ * metadata_buffers must be from the same address_space: the blockdev's.
*
- * address_spaces which do not place buffers at ->private_list via these
- * utility functions are free to use private_lock and private_list for
- * whatever they want. The only requirement is that list_empty(private_list)
- * be true at clear_inode() time.
+ * address_spaces which do not place buffers at metadata_buffers via these
+ * utility functions are free to use private_lock and metadata_buffers for
+ * whatever they want. The only requirement is that
+ * list_empty(metadata_buffers) be true at clear_inode() time.
*
* FIXME: clear_inode should not call invalidate_inode_buffers(). The
* filesystems should do that. invalidate_inode_buffers() should just go
@@ -713,7 +713,7 @@ static inline void __remove_assoc_queue(
int inode_has_buffers(struct inode *inode)
{
- return !list_empty(&inode->i_data.private_list);
+ return !list_empty(&inode->i_data.metadata_buffers.b_assoc_buffers);
}
/*
@@ -756,7 +756,7 @@ repeat:
* buffers
* @mapping: the mapping which wants those buffers written
*
- * Starts I/O against the buffers at mapping->private_list, and waits upon
+ * Starts I/O against the buffers at mapping->metadata_buffers, and waits upon
* that I/O.
*
* Basically, this is a convenience function for fsync().
@@ -767,11 +767,12 @@ int sync_mapping_buffers(struct address_
{
struct address_space *buffer_mapping = mapping->assoc_mapping;
- if (buffer_mapping == NULL || list_empty(&mapping->private_list))
+ if (buffer_mapping == NULL ||
+ list_empty(&mapping->metadata_buffers.b_assoc_buffers))
return 0;
return fsync_buffers_list(&buffer_mapping->private_lock,
- &mapping->private_list);
+ &mapping->metadata_buffers.b_assoc_buffers);
}
EXPORT_SYMBOL(sync_mapping_buffers);
@@ -807,7 +808,7 @@ void mark_buffer_dirty_inode(struct buff
if (list_empty(&bh->b_assoc_buffers)) {
spin_lock(&buffer_mapping->private_lock);
list_move_tail(&bh->b_assoc_buffers,
- &mapping->private_list);
+ &mapping->metadata_buffers.b_assoc_buffers);
spin_unlock(&buffer_mapping->private_lock);
}
}
@@ -899,9 +900,18 @@ static int fsync_buffers_list(spinlock_t
INIT_LIST_HEAD(&tmp);
spin_lock(lock);
+ /* Was there IO error on some already evicted buffer? */
+ if (buffer_file_io_error(BH_ENTRY(list))) {
+ clear_buffer_file_io_error(BH_ENTRY(list));
+ err = -EIO;
+ }
while (!list_empty(list)) {
bh = BH_ENTRY(list->next);
- list_del_init(&bh->b_assoc_buffers);
+ __remove_assoc_queue(bh);
+ if (buffer_write_io_error(bh) || buffer_file_io_error(bh)) {
+ clear_buffer_file_io_error(BH_ENTRY(list));
+ err = -EIO;
+ }
if (buffer_dirty(bh) || buffer_locked(bh)) {
list_add(&bh->b_assoc_buffers, &tmp);
if (buffer_dirty(bh)) {
@@ -953,7 +963,8 @@ void invalidate_inode_buffers(struct ino
{
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
- struct list_head *list = &mapping->private_list;
+ struct list_head *list =
+ &mapping->metadata_buffers.b_assoc_buffers;
struct address_space *buffer_mapping = mapping->assoc_mapping;
spin_lock(&buffer_mapping->private_lock);
@@ -975,7 +986,8 @@ int remove_inode_buffers(struct inode *i
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
- struct list_head *list = &mapping->private_list;
+ struct list_head *list =
+ &mapping->metadata_buffers.b_assoc_buffers;
struct address_space *buffer_mapping = mapping->assoc_mapping;
spin_lock(&buffer_mapping->private_lock);
@@ -2957,8 +2969,15 @@ drop_buffers(struct page *page, struct b
do {
struct buffer_head *next = bh->b_this_page;
- if (!list_empty(&bh->b_assoc_buffers))
+ if (!list_empty(&bh->b_assoc_buffers)) {
+ if (buffer_write_io_error(bh) ||
+ buffer_file_io_error(bh)) {
+ struct buffer_head *prev = BH_ENTRY(bh->
+ b_assoc_buffers.prev);
+ set_buffer_file_io_error(prev);
+ }
__remove_assoc_queue(bh);
+ }
bh = next;
} while (bh != head);
*buffers_to_free = head;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/fs-writeback.c linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c
--- linux-2.6.16-rc1/fs/fs-writeback.c 2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/fs-writeback.c 2006-01-22 23:32:37.000000000 +0100
@@ -609,7 +609,7 @@ EXPORT_SYMBOL(sync_inode);
* written and waited upon.
*
* OSYNC_DATA: i_mapping's dirty data
- * OSYNC_METADATA: the buffers at i_mapping->private_list
+ * OSYNC_METADATA: the buffers at i_mapping->metadata_buffers
* OSYNC_INODE: the inode itself
*/
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/inode.c linux-2.6.16-rc1-1-private_list/fs/inode.c
--- linux-2.6.16-rc1/fs/inode.c 2006-01-23 00:46:45.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/fs/inode.c 2006-01-23 00:48:26.000000000 +0100
@@ -199,7 +199,7 @@ void inode_init_once(struct inode *inode
rwlock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
spin_lock_init(&inode->i_data.private_lock);
- INIT_LIST_HEAD(&inode->i_data.private_list);
+ INIT_LIST_HEAD(&inode->i_data.metadata_buffers.b_assoc_buffers);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
spin_lock_init(&inode->i_lock);
@@ -413,7 +413,7 @@ static int can_unuse(struct inode *inode
* inode is still freeable, proceed. The right inode is found 99.9% of the
* time in testing on a 4-way.
*
- * If the inode has metadata buffers attached to mapping->private_list then
+ * If the inode has metadata buffers attached to mapping->metadata_buffers then
* try to remove them.
*/
static void prune_icache(int nr_to_scan)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/buffer_head.h linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h
--- linux-2.6.16-rc1/include/linux/buffer_head.h 2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/buffer_head.h 2006-01-22 21:40:09.000000000 +0100
@@ -32,6 +32,7 @@ enum bh_state_bits {
BH_Write_EIO, /* I/O error on write */
BH_Ordered, /* ordered write */
BH_Eopnotsupp, /* operation not supported (barrier) */
+ BH_File_EIO, /* Some other buffer in the file had IO error */
BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
@@ -119,6 +120,7 @@ BUFFER_FNS(Boundary, boundary)
BUFFER_FNS(Write_EIO, write_io_error)
BUFFER_FNS(Ordered, ordered)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
+BUFFER_FNS(File_EIO, file_io_error)
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
#define touch_buffer(bh) mark_page_accessed(bh->b_page)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/include/linux/fs.h linux-2.6.16-rc1-1-private_list/include/linux/fs.h
--- linux-2.6.16-rc1/include/linux/fs.h 2006-01-17 21:44:07.000000000 +0100
+++ linux-2.6.16-rc1-1-private_list/include/linux/fs.h 2006-01-23 00:51:10.000000000 +0100
@@ -381,8 +381,10 @@ struct address_space {
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
spinlock_t private_lock; /* for use by the address_space */
- struct list_head private_list; /* ditto */
- struct address_space *assoc_mapping; /* ditto */
+ /* Fake buffer head in the list of metadata buffers belonging to block
+ device address space */
+ struct buffer_head metadata_buffers;
+ struct address_space *assoc_mapping; /* for use by the address_space */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
prev parent reply other threads:[~2006-01-22 23:31 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-01-11 17:43 [PATCH] Fix ext2 error reporting on fsync Jan Kara
2006-01-12 4:24 ` Andrew Morton
2006-01-12 10:20 ` Jan Kara
2006-01-12 11:52 ` Andrew Morton
2006-01-12 13:58 ` Chris Mason
2006-01-12 14:21 ` Jan Kara
2006-01-12 15:36 ` Chris Mason
2006-01-12 16:32 ` Jan Kara
2006-01-12 14:26 ` Jan Kara
2006-01-12 20:47 ` Andrew Morton
2006-01-13 2:08 ` Chris Mason
2006-01-13 2:16 ` Andrew Morton
2006-01-18 22:46 ` Jan Kara
2006-01-18 23:06 ` Andrew Morton
2006-01-19 12:21 ` Jan Kara
2006-01-19 21:16 ` Andrew Morton
2006-01-20 13:41 ` Jan Kara
2006-01-20 21:24 ` Andrew Morton
2006-01-20 21:31 ` Andrew Morton
2006-01-20 21:33 ` Andrew Morton
2006-01-22 22:55 ` Jan Kara
2006-01-23 0:06 ` Andrew Morton
2006-01-22 22:32 ` Jan Kara
2006-01-22 23:31 ` Jan Kara [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060122233126.GC28667@atrey.karlin.mff.cuni.cz \
--to=jack@suse.cz \
--cc=akpm@osdl.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mason@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).