[PATCH 0/3] Fix IO error reporting on fsync()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] Fix IO error reporting on fsync()
@ 2006-10-06 11:49 Jan Kara
  2006-10-06 11:55 ` [PATCH 1/3] " Jan Kara
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Jan Kara @ 2006-10-06 11:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, jack

  Hello,

  current code in buffer.c has two pitfalls that cause problems with IO
error reporting of filesystems using mapping->private_list for their
metadata buffers (e.g. ext2).
  The first problem is that end_io_async_write() does not mark IO error
in the buffer flags, only in the page flags. Hence fsync_buffers_list()
does not find out that some IO error has occured and will not report it.
  The second problem is that buffers from private_list can be freed
(e.g. under memory pressure) and if fsync_buffer_list() is called after
that moment, IO error is lost - note that metadata buffers mark AS_EIO
on the *device mapping* not on the inode mapping.
  Following series of three patches tries to fix these problems. The
approach I took (after some discussions with Andrew) is introducing
dummy buffer_head in the mapping instead of private_list. This dummy
buffer head serves as a head of metadata buffer list and also collects
IO errors from other buffers on the list (see the third patch for more
details). This is kind of compromise between introducing a pointer to
inode's address_space into each buffer and between using list_head
instead of buffer_head and playing some dirty tricks to recognize that
one particular list_head is actually from address_space and not from
buffer_head. Any suggestions for improvements welcome.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] Fix IO error reporting on fsync()
  2006-10-06 11:49 [PATCH 0/3] Fix IO error reporting on fsync() Jan Kara
@ 2006-10-06 11:55 ` Jan Kara
  2006-10-06 11:56 ` [PATCH 2/3] " Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2006-10-06 11:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

When IO error happens during async write, we have to mark buffer as having IO
error too. Otherwise IO error reporting for filesystems like ext2 does not
work.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.18/fs/buffer.c linux-2.6.18-1-mark_error_buffer/fs/buffer.c
--- linux-2.6.18/fs/buffer.c	2006-09-27 13:08:34.000000000 +0200
+++ linux-2.6.18-1-mark_error_buffer/fs/buffer.c	2006-10-06 13:05:29.000000000 +0200
@@ -589,6 +589,7 @@ static void end_buffer_async_write(struc
 			       bdevname(bh->b_bdev, b));
 		}
 		set_bit(AS_EIO, &page->mapping->flags);
+		set_buffer_write_io_error(bh);
 		clear_buffer_uptodate(bh);
 		SetPageError(page);
 	}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/3] Fix IO error reporting on fsync()
  2006-10-06 11:49 [PATCH 0/3] Fix IO error reporting on fsync() Jan Kara
  2006-10-06 11:55 ` [PATCH 1/3] " Jan Kara
@ 2006-10-06 11:56 ` Jan Kara
  2006-10-06 11:57 ` [PATCH 3/3] " Jan Kara
  2006-10-07  6:06 ` [PATCH 0/3] " Andrew Morton
  3 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2006-10-06 11:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

Reorganize headers with buffer heads. Separate buffer_head declaration and
functions handling it so that fs.h can see the whole buffer head structure.
This is needed so that address_space can contain a dummy buffer_head.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -ruNpX /home/jack/.kerndiffexclude linux-2.6.18-1-mark_error_buffer/include/linux/buffer_head.h linux-2.6.18-2-buffer_headers/include/linux/buffer_head.h
--- linux-2.6.18-1-mark_error_buffer/include/linux/buffer_head.h	2006-09-27 13:09:04.000000000 +0200
+++ linux-2.6.18-2-buffer_headers/include/linux/buffer_head.h	2006-10-06 13:11:06.000000000 +0200
@@ -7,132 +7,9 @@
 #ifndef _LINUX_BUFFER_HEAD_H
 #define _LINUX_BUFFER_HEAD_H
 
-#include <linux/types.h>
+#include <linux/buffer_head_struct.h>
 #include <linux/fs.h>
-#include <linux/linkage.h>
 #include <linux/pagemap.h>
-#include <linux/wait.h>
-#include <asm/atomic.h>
-
-enum bh_state_bits {
-	BH_Uptodate,	/* Contains valid data */
-	BH_Dirty,	/* Is dirty */
-	BH_Lock,	/* Is locked */
-	BH_Req,		/* Has been submitted for I/O */
-	BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise
-			  * IO completion of other buffers in the page
-			  */
-
-	BH_Mapped,	/* Has a disk mapping */
-	BH_New,		/* Disk mapping was newly created by get_block */
-	BH_Async_Read,	/* Is under end_buffer_async_read I/O */
-	BH_Async_Write,	/* Is under end_buffer_async_write I/O */
-	BH_Delay,	/* Buffer is not yet allocated on disk */
-	BH_Boundary,	/* Block is followed by a discontiguity */
-	BH_Write_EIO,	/* I/O error on write */
-	BH_Ordered,	/* ordered write */
-	BH_Eopnotsupp,	/* operation not supported (barrier) */
-
-	BH_PrivateStart,/* not a state bit, but the first bit available
-			 * for private allocation by other entities
-			 */
-};
-
-#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
-
-struct page;
-struct buffer_head;
-struct address_space;
-typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
-
-/*
- * Historically, a buffer_head was used to map a single block
- * within a page, and of course as the unit of I/O through the
- * filesystem and block layers.  Nowadays the basic I/O unit
- * is the bio, and buffer_heads are used for extracting block
- * mappings (via a get_block_t call), for tracking state within
- * a page (via a page_mapping) and for wrapping bio submission
- * for backward compatibility reasons (e.g. submit_bh).
- */
-struct buffer_head {
-	unsigned long b_state;		/* buffer state bitmap (see above) */
-	struct buffer_head *b_this_page;/* circular list of page's buffers */
-	struct page *b_page;		/* the page this bh is mapped to */
-
-	sector_t b_blocknr;		/* start block number */
-	size_t b_size;			/* size of mapping */
-	char *b_data;			/* pointer to data within the page */
-
-	struct block_device *b_bdev;
-	bh_end_io_t *b_end_io;		/* I/O completion */
- 	void *b_private;		/* reserved for b_end_io */
-	struct list_head b_assoc_buffers; /* associated with another mapping */
-	atomic_t b_count;		/* users using this buffer_head */
-};
-
-/*
- * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
- * and buffer_foo() functions.
- */
-#define BUFFER_FNS(bit, name)						\
-static inline void set_buffer_##name(struct buffer_head *bh)		\
-{									\
-	set_bit(BH_##bit, &(bh)->b_state);				\
-}									\
-static inline void clear_buffer_##name(struct buffer_head *bh)		\
-{									\
-	clear_bit(BH_##bit, &(bh)->b_state);				\
-}									\
-static inline int buffer_##name(const struct buffer_head *bh)		\
-{									\
-	return test_bit(BH_##bit, &(bh)->b_state);			\
-}
-
-/*
- * test_set_buffer_foo() and test_clear_buffer_foo()
- */
-#define TAS_BUFFER_FNS(bit, name)					\
-static inline int test_set_buffer_##name(struct buffer_head *bh)	\
-{									\
-	return test_and_set_bit(BH_##bit, &(bh)->b_state);		\
-}									\
-static inline int test_clear_buffer_##name(struct buffer_head *bh)	\
-{									\
-	return test_and_clear_bit(BH_##bit, &(bh)->b_state);		\
-}									\
-
-/*
- * Emit the buffer bitops functions.   Note that there are also functions
- * of the form "mark_buffer_foo()".  These are higher-level functions which
- * do something in addition to setting a b_state bit.
- */
-BUFFER_FNS(Uptodate, uptodate)
-BUFFER_FNS(Dirty, dirty)
-TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
-TAS_BUFFER_FNS(Lock, locked)
-BUFFER_FNS(Req, req)
-TAS_BUFFER_FNS(Req, req)
-BUFFER_FNS(Mapped, mapped)
-BUFFER_FNS(New, new)
-BUFFER_FNS(Async_Read, async_read)
-BUFFER_FNS(Async_Write, async_write)
-BUFFER_FNS(Delay, delay)
-BUFFER_FNS(Boundary, boundary)
-BUFFER_FNS(Write_EIO, write_io_error)
-BUFFER_FNS(Ordered, ordered)
-BUFFER_FNS(Eopnotsupp, eopnotsupp)
-
-#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
-#define touch_buffer(bh)	mark_page_accessed(bh->b_page)
-
-/* If we *know* page->private refers to buffer_heads */
-#define page_buffers(page)					\
-	({							\
-		BUG_ON(!PagePrivate(page));			\
-		((struct buffer_head *)page_private(page));	\
-	})
-#define page_has_buffers(page)	PagePrivate(page)
 
 /*
  * Declarations
diff -ruNpX /home/jack/.kerndiffexclude linux-2.6.18-1-mark_error_buffer/include/linux/buffer_head_struct.h linux-2.6.18-2-buffer_headers/include/linux/buffer_head_struct.h
--- linux-2.6.18-1-mark_error_buffer/include/linux/buffer_head_struct.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.18-2-buffer_headers/include/linux/buffer_head_struct.h	2006-10-06 13:26:42.000000000 +0200
@@ -0,0 +1,135 @@
+/*
+ * include/linux/buffer_head.h
+ *
+ * Everything to do with buffer_heads.
+ */
+
+#ifndef _LINUX_BUFFER_HEAD_STRUCT_H
+#define _LINUX_BUFFER_HEAD_STRUCT_H
+
+#include <linux/types.h>
+#include <linux/linkage.h>
+#include <linux/wait.h>
+#include <asm/atomic.h>
+
+enum bh_state_bits {
+	BH_Uptodate,	/* Contains valid data */
+	BH_Dirty,	/* Is dirty */
+	BH_Lock,	/* Is locked */
+	BH_Req,		/* Has been submitted for I/O */
+	BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise
+			  * IO completion of other buffers in the page
+			  */
+
+	BH_Mapped,	/* Has a disk mapping */
+	BH_New,		/* Disk mapping was newly created by get_block */
+	BH_Async_Read,	/* Is under end_buffer_async_read I/O */
+	BH_Async_Write,	/* Is under end_buffer_async_write I/O */
+	BH_Delay,	/* Buffer is not yet allocated on disk */
+	BH_Boundary,	/* Block is followed by a discontiguity */
+	BH_Write_EIO,	/* I/O error on write */
+	BH_Ordered,	/* ordered write */
+	BH_Eopnotsupp,	/* operation not supported (barrier) */
+
+	BH_PrivateStart,/* not a state bit, but the first bit available
+			 * for private allocation by other entities
+			 */
+};
+
+#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
+
+struct page;
+struct buffer_head;
+struct address_space;
+typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
+
+/*
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
+ */
+struct buffer_head {
+	unsigned long b_state;		/* buffer state bitmap (see above) */
+	struct buffer_head *b_this_page;/* circular list of page's buffers */
+	struct page *b_page;		/* the page this bh is mapped to */
+
+	sector_t b_blocknr;		/* start block number */
+	size_t b_size;			/* size of mapping */
+	char *b_data;			/* pointer to data within the page */
+
+	struct block_device *b_bdev;
+	bh_end_io_t *b_end_io;		/* I/O completion */
+ 	void *b_private;		/* reserved for b_end_io */
+	struct list_head b_assoc_buffers; /* associated with another mapping */
+	atomic_t b_count;		/* users using this buffer_head */
+};
+
+/*
+ * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
+ * and buffer_foo() functions.
+ */
+#define BUFFER_FNS(bit, name)						\
+static inline void set_buffer_##name(struct buffer_head *bh)		\
+{									\
+	set_bit(BH_##bit, &(bh)->b_state);				\
+}									\
+static inline void clear_buffer_##name(struct buffer_head *bh)		\
+{									\
+	clear_bit(BH_##bit, &(bh)->b_state);				\
+}									\
+static inline int buffer_##name(const struct buffer_head *bh)		\
+{									\
+	return test_bit(BH_##bit, &(bh)->b_state);			\
+}
+
+/*
+ * test_set_buffer_foo() and test_clear_buffer_foo()
+ */
+#define TAS_BUFFER_FNS(bit, name)					\
+static inline int test_set_buffer_##name(struct buffer_head *bh)	\
+{									\
+	return test_and_set_bit(BH_##bit, &(bh)->b_state);		\
+}									\
+static inline int test_clear_buffer_##name(struct buffer_head *bh)	\
+{									\
+	return test_and_clear_bit(BH_##bit, &(bh)->b_state);		\
+}									\
+
+/*
+ * Emit the buffer bitops functions.   Note that there are also functions
+ * of the form "mark_buffer_foo()".  These are higher-level functions which
+ * do something in addition to setting a b_state bit.
+ */
+BUFFER_FNS(Uptodate, uptodate)
+BUFFER_FNS(Dirty, dirty)
+TAS_BUFFER_FNS(Dirty, dirty)
+BUFFER_FNS(Lock, locked)
+TAS_BUFFER_FNS(Lock, locked)
+BUFFER_FNS(Req, req)
+TAS_BUFFER_FNS(Req, req)
+BUFFER_FNS(Mapped, mapped)
+BUFFER_FNS(New, new)
+BUFFER_FNS(Async_Read, async_read)
+BUFFER_FNS(Async_Write, async_write)
+BUFFER_FNS(Delay, delay)
+BUFFER_FNS(Boundary, boundary)
+BUFFER_FNS(Write_EIO, write_io_error)
+BUFFER_FNS(Ordered, ordered)
+BUFFER_FNS(Eopnotsupp, eopnotsupp)
+
+#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
+#define touch_buffer(bh)	mark_page_accessed(bh->b_page)
+
+/* If we *know* page->private refers to buffer_heads */
+#define page_buffers(page)					\
+	({							\
+		BUG_ON(!PagePrivate(page));			\
+		((struct buffer_head *)page_private(page));	\
+	})
+#define page_has_buffers(page)	PagePrivate(page)
+
+#endif /* _LINUX_BUFFER_HEAD_STRUCT_H */
diff -ruNpX /home/jack/.kerndiffexclude linux-2.6.18-1-mark_error_buffer/include/linux/fs.h linux-2.6.18-2-buffer_headers/include/linux/fs.h
--- linux-2.6.18-1-mark_error_buffer/include/linux/fs.h	2006-09-27 13:09:04.000000000 +0200
+++ linux-2.6.18-2-buffer_headers/include/linux/fs.h	2006-10-06 13:09:38.000000000 +0200
@@ -236,6 +236,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/buffer_head_struct.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -256,7 +257,6 @@ extern void __init inode_init_early(void
 extern void __init mnt_init(unsigned long);
 extern void __init files_init(unsigned long);
 
-struct buffer_head;
 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
diff -ruNpX /home/jack/.kerndiffexclude linux-2.6.18-1-mark_error_buffer/include/linux/mm.h linux-2.6.18-2-buffer_headers/include/linux/mm.h
--- linux-2.6.18-1-mark_error_buffer/include/linux/mm.h	2006-10-02 18:27:46.000000000 +0200
+++ linux-2.6.18-2-buffer_headers/include/linux/mm.h	2006-10-06 13:09:38.000000000 +0200
@@ -759,6 +759,8 @@ static inline int handle_mm_fault(struct
 }
 #endif
 
+struct writeback_control;
+struct file_ra_state;
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 void install_arg_page(struct vm_area_struct *, struct page *, unsigned long);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] Fix IO error reporting on fsync()
  2006-10-06 11:49 [PATCH 0/3] Fix IO error reporting on fsync() Jan Kara
  2006-10-06 11:55 ` [PATCH 1/3] " Jan Kara
  2006-10-06 11:56 ` [PATCH 2/3] " Jan Kara
@ 2006-10-06 11:57 ` Jan Kara
  2006-10-07  6:06 ` [PATCH 0/3] " Andrew Morton
  3 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2006-10-06 11:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

When IO error happens on metadata buffer, buffer is freed from memory
and later fsync() is called, filesystems like ext2 fail to report EIO.
We solve the problem by introducing fake buffer head in the list of
metadata buffers attached to an address_space and a special buffer flag
meaning that some metadata buffer had an IO error. When buffer with
IO error or the new metadata IO error flag is being removed from memory,
previous buffer in the metadata list is marked with the metadata IO error
flag. Hence the information about IO error is retained in memory (at least
in the fake buffer in the beginning of the list) and fsync() can check it.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.18-2-buffer_headers/fs/buffer.c linux-2.6.18-3-fsync_fix/fs/buffer.c
--- linux-2.6.18-2-buffer_headers/fs/buffer.c	2006-10-06 13:05:29.000000000 +0200
+++ linux-2.6.18-3-fsync_fix/fs/buffer.c	2006-10-06 13:14:21.000000000 +0200
@@ -663,26 +663,26 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  *
  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->private_list.
+ * management of a list of dependent buffers at ->i_mapping.metadata_buffers.
  *
- * Locking is a little subtle: try_to_free_buffers() will remove buffers
- * from their controlling inode's queue when they are being freed.  But
- * try_to_free_buffers() will be operating against the *blockdev* mapping
- * at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for private_list is via the private_lock in the address_space
- * which backs the buffers.  Which is different from the address_space 
- * against which the buffers are listed.  So for a particular address_space,
- * mapping->private_lock does *not* protect mapping->private_list!  In fact,
- * mapping->private_list will always be protected by the backing blockdev's
- * ->private_lock.
+ * Locking is a little subtle: try_to_free_buffers() will remove buffers from
+ * their controlling inode's queue when they are being freed.  But
+ * try_to_free_buffers() will be operating against the *blockdev* mapping at
+ * the time, not against the S_ISREG file which depends on those buffers.  So
+ * the locking for metadata_buffers is via the private_lock in the
+ * address_space which backs the buffers.  Which is different from the
+ * address_space against which the buffers are listed.  So for a particular
+ * address_space, mapping->private_lock does *not* protect
+ * mapping.metadata_buffers! In fact, mapping.metadata_buffers will always be
+ * protected by the backing blockdev's ->private_lock.
  *
  * Which introduces a requirement: all buffers on an address_space's
- * ->private_list must be from the same address_space: the blockdev's.
+ * metadata_buffers must be from the same address_space: the blockdev's.
  *
- * address_spaces which do not place buffers at ->private_list via these
- * utility functions are free to use private_lock and private_list for
- * whatever they want.  The only requirement is that list_empty(private_list)
- * be true at clear_inode() time.
+ * address_spaces which do not place buffers at metadata_buffers via these
+ * utility functions are free to use private_lock and metadata_buffers for
+ * whatever they want.  The only requirement is that
+ * list_empty(metadata_buffers) be true at clear_inode() time.
  *
  * FIXME: clear_inode should not call invalidate_inode_buffers().  The
  * filesystems should do that.  invalidate_inode_buffers() should just go
@@ -713,7 +713,7 @@ static inline void __remove_assoc_queue(
 
 int inode_has_buffers(struct inode *inode)
 {
-	return !list_empty(&inode->i_data.private_list);
+	return !list_empty(&inode->i_data.metadata_buffers.b_assoc_buffers);
 }
 
 /*
@@ -756,7 +756,7 @@ repeat:
  *                        buffers
  * @mapping: the mapping which wants those buffers written
  *
- * Starts I/O against the buffers at mapping->private_list, and waits upon
+ * Starts I/O against the buffers at mapping->metadata_buffers, and waits upon
  * that I/O.
  *
  * Basically, this is a convenience function for fsync().
@@ -767,11 +767,11 @@ int sync_mapping_buffers(struct address_
 {
 	struct address_space *buffer_mapping = mapping->assoc_mapping;
 
-	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
+	if (buffer_mapping == NULL)
 		return 0;
 
 	return fsync_buffers_list(&buffer_mapping->private_lock,
-					&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 }
 EXPORT_SYMBOL(sync_mapping_buffers);
 
@@ -806,7 +806,7 @@ void mark_buffer_dirty_inode(struct buff
 	if (list_empty(&bh->b_assoc_buffers)) {
 		spin_lock(&buffer_mapping->private_lock);
 		list_move_tail(&bh->b_assoc_buffers,
-				&mapping->private_list);
+				&mapping->metadata_buffers.b_assoc_buffers);
 		spin_unlock(&buffer_mapping->private_lock);
 	}
 }
@@ -898,9 +898,18 @@ static int fsync_buffers_list(spinlock_t
 	INIT_LIST_HEAD(&tmp);
 
 	spin_lock(lock);
+	/* Was there IO error on some already evicted buffer? */
+	if (buffer_file_io_error(BH_ENTRY(list))) {
+		clear_buffer_file_io_error(BH_ENTRY(list));
+		err = -EIO;
+	}
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
-		list_del_init(&bh->b_assoc_buffers);
+		__remove_assoc_queue(bh);
+		if (buffer_write_io_error(bh) || buffer_file_io_error(bh)) {
+			clear_buffer_file_io_error(BH_ENTRY(list));
+			err = -EIO;
+		}
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
 			if (buffer_dirty(bh)) {
@@ -952,7 +961,8 @@ void invalidate_inode_buffers(struct ino
 {
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -974,7 +984,8 @@ int remove_inode_buffers(struct inode *i
 
 	if (inode_has_buffers(inode)) {
 		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->private_list;
+		struct list_head *list =
+			&mapping->metadata_buffers.b_assoc_buffers;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -2947,11 +2958,14 @@ drop_buffers(struct page *page, struct b
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	int ioerror = 0;
 
 	bh = head;
 	do {
 		if (buffer_write_io_error(bh) && page->mapping)
 			set_bit(AS_EIO, &page->mapping->flags);
+		if (buffer_write_io_error(bh) || buffer_file_io_error(bh))
+			ioerror = 1;
 		if (buffer_busy(bh))
 			goto failed;
 		bh = bh->b_this_page;
@@ -2960,8 +2974,14 @@ drop_buffers(struct page *page, struct b
 	do {
 		struct buffer_head *next = bh->b_this_page;
 
-		if (!list_empty(&bh->b_assoc_buffers))
+		if (!list_empty(&bh->b_assoc_buffers)) {
+			if (ioerror) {
+				struct buffer_head *prev = BH_ENTRY(bh->
+					b_assoc_buffers.prev);
+				set_buffer_file_io_error(prev);
+			}
 			__remove_assoc_queue(bh);
+		}
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.18-2-buffer_headers/fs/fs-writeback.c linux-2.6.18-3-fsync_fix/fs/fs-writeback.c
--- linux-2.6.18-2-buffer_headers/fs/fs-writeback.c	2006-09-27 13:08:35.000000000 +0200
+++ linux-2.6.18-3-fsync_fix/fs/fs-writeback.c	2006-10-06 13:14:21.000000000 +0200
@@ -613,7 +613,7 @@ EXPORT_SYMBOL(sync_inode);
  * written and waited upon.
  *
  *    OSYNC_DATA:     i_mapping's dirty data
- *    OSYNC_METADATA: the buffers at i_mapping->private_list
+ *    OSYNC_METADATA: the buffers at i_mapping->metadata_buffers
  *    OSYNC_INODE:    the inode itself
  */
 
diff -rupX /home/jack/.kerndiffexclude linux-2.6.18-2-buffer_headers/fs/inode.c linux-2.6.18-3-fsync_fix/fs/inode.c
--- linux-2.6.18-2-buffer_headers/fs/inode.c	2006-09-27 13:08:38.000000000 +0200
+++ linux-2.6.18-3-fsync_fix/fs/inode.c	2006-10-06 13:14:21.000000000 +0200
@@ -196,7 +196,7 @@ void inode_init_once(struct inode *inode
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
-	INIT_LIST_HEAD(&inode->i_data.private_list);
+ 	INIT_LIST_HEAD(&inode->i_data.metadata_buffers.b_assoc_buffers);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
@@ -408,7 +408,7 @@ static int can_unuse(struct inode *inode
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
+ * If the inode has metadata buffers attached to mapping->metadata_buffers then
  * try to remove them.
  */
 static void prune_icache(int nr_to_scan)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.18-2-buffer_headers/include/linux/buffer_head_struct.h linux-2.6.18-3-fsync_fix/include/linux/buffer_head_struct.h
--- linux-2.6.18-2-buffer_headers/include/linux/buffer_head_struct.h	2006-10-06 13:26:42.000000000 +0200
+++ linux-2.6.18-3-fsync_fix/include/linux/buffer_head_struct.h	2006-10-06 13:27:10.000000000 +0200
@@ -30,6 +30,7 @@ enum bh_state_bits {
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
+	BH_File_EIO,	/* Some other buffer in the file had IO error */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -120,6 +121,7 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
+BUFFER_FNS(File_EIO, file_io_error)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.18-2-buffer_headers/include/linux/fs.h linux-2.6.18-3-fsync_fix/include/linux/fs.h
--- linux-2.6.18-2-buffer_headers/include/linux/fs.h	2006-10-06 13:09:38.000000000 +0200
+++ linux-2.6.18-3-fsync_fix/include/linux/fs.h	2006-10-06 13:14:21.000000000 +0200
@@ -400,8 +400,10 @@ struct address_space {
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
 	spinlock_t		private_lock;	/* for use by the address_space */
-	struct list_head	private_list;	/* ditto */
-	struct address_space	*assoc_mapping;	/* ditto */
+	/* Fake buffer head in the list of metadata buffers belonging to block
+	   device address space */
+	struct buffer_head	metadata_buffers;	
+	struct address_space	*assoc_mapping;	/* for use by the address_space */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Fix IO error reporting on fsync()
  2006-10-06 11:49 [PATCH 0/3] Fix IO error reporting on fsync() Jan Kara
                   ` (2 preceding siblings ...)
  2006-10-06 11:57 ` [PATCH 3/3] " Jan Kara
@ 2006-10-07  6:06 ` Andrew Morton
  2006-10-09 11:40   ` Jan Kara
  2006-10-09 14:47   ` [PATCH] Try to avoid a pessimistic vmalloc() recursion Eric Dumazet
  3 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2006-10-07  6:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel

On Fri, 6 Oct 2006 13:49:47 +0200
Jan Kara <jack@suse.cz> wrote:

>   current code in buffer.c has two pitfalls that cause problems with IO
> error reporting of filesystems using mapping->private_list for their
> metadata buffers (e.g. ext2).
>   The first problem is that end_io_async_write() does not mark IO error
> in the buffer flags, only in the page flags. Hence fsync_buffers_list()
> does not find out that some IO error has occured and will not report it.
>   The second problem is that buffers from private_list can be freed
> (e.g. under memory pressure) and if fsync_buffer_list() is called after
> that moment, IO error is lost - note that metadata buffers mark AS_EIO
> on the *device mapping* not on the inode mapping.
>   Following series of three patches tries to fix these problems. The
> approach I took (after some discussions with Andrew) is introducing
> dummy buffer_head in the mapping instead of private_list. This dummy
> buffer head serves as a head of metadata buffer list and also collects
> IO errors from other buffers on the list (see the third patch for more
> details). This is kind of compromise between introducing a pointer to
> inode's address_space into each buffer and between using list_head
> instead of buffer_head and playing some dirty tricks to recognize that
> one particular list_head is actually from address_space and not from
> buffer_head. Any suggestions for improvements welcome.

This is really complex, and enlarges the inode by quite a lot, which hurts.

What about putting an address_space* into the buffer_head?  Transfer the
EIO state into the address_space within, say, __remove_assoc_queue()?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Fix IO error reporting on fsync()
  2006-10-07  6:06 ` [PATCH 0/3] " Andrew Morton
@ 2006-10-09 11:40   ` Jan Kara
  2006-10-09 18:20     ` Andrew Morton
  2006-10-09 14:47   ` [PATCH] Try to avoid a pessimistic vmalloc() recursion Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2006-10-09 11:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

> On Fri, 6 Oct 2006 13:49:47 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> >   current code in buffer.c has two pitfalls that cause problems with IO
> > error reporting of filesystems using mapping->private_list for their
> > metadata buffers (e.g. ext2).
> >   The first problem is that end_io_async_write() does not mark IO error
> > in the buffer flags, only in the page flags. Hence fsync_buffers_list()
> > does not find out that some IO error has occured and will not report it.
> >   The second problem is that buffers from private_list can be freed
> > (e.g. under memory pressure) and if fsync_buffer_list() is called after
> > that moment, IO error is lost - note that metadata buffers mark AS_EIO
> > on the *device mapping* not on the inode mapping.
> >   Following series of three patches tries to fix these problems. The
> > approach I took (after some discussions with Andrew) is introducing
> > dummy buffer_head in the mapping instead of private_list. This dummy
> > buffer head serves as a head of metadata buffer list and also collects
> > IO errors from other buffers on the list (see the third patch for more
> > details). This is kind of compromise between introducing a pointer to
> > inode's address_space into each buffer and between using list_head
> > instead of buffer_head and playing some dirty tricks to recognize that
> > one particular list_head is actually from address_space and not from
> > buffer_head. Any suggestions for improvements welcome.
> 
> This is really complex, and enlarges the inode by quite a lot, which hurts.
  I agree (at least with the second part ;). I can write a patch which
keeps the inode size but the code will be uglier... Another possibility
is to put there just a buffer_head pointer and allocate buffer head
dynamically. That has an advantage that only filesystems using metadata_list
has to bear the memory cost...

> What about putting an address_space* into the buffer_head?  Transfer the
> EIO state into the address_space within, say, __remove_assoc_queue()?
  Yes, that's of course possible. But it enlarges each buffer head by 4
bytes (or 8 on 64-bit arch). Hmm, but you are right that on the systems
I've looked at this would actually be less memory. OK, I'll write this
version of the patch.
								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Try to avoid a pessimistic vmalloc() recursion
  2006-10-07  6:06 ` [PATCH 0/3] " Andrew Morton
  2006-10-09 11:40   ` Jan Kara
@ 2006-10-09 14:47   ` Eric Dumazet
  2006-10-09 16:25     ` Nick Piggin
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2006-10-09 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 688 bytes --]

__vmalloc_area_node() is a litle bit pessimist when allocating space for 
storing struct page pointers.

When allocating more than 4 MB on ia32, or 2 MB on x86_64,  
__vmalloc_area_node() has to allocate more than PAGE_SIZE bytes to store 
pointers to  page structs. This means that two TLB translations are needed to 
access data.

This patch tries a kmalloc() call, then only if this first attempt failed, a 
vmalloc() is performed. (Later, at vfree() time we chose kfree() or vfree() 
with a test on flags & VM_VPAGES : no change is needed) 

Most of the time, the first kmalloc() should be OK, so we reduce TLB usage.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: vmalloc.patch --]
[-- Type: text/plain, Size: 702 bytes --]

--- linux-2.6.18/mm/vmalloc.c	2006-10-09 13:58:13.000000000 +0200
+++ linux-2.6.18-ed/mm/vmalloc.c	2006-10-09 14:04:11.000000000 +0200
@@ -426,12 +426,12 @@
 	array_size = (nr_pages * sizeof(struct page *));

 	area->nr_pages = nr_pages;
+	pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
 	/* Please note that the recursion is strictly bounded. */
-	if (array_size > PAGE_SIZE) {
+	if (!pages && array_size > PAGE_SIZE) {
 		pages = __vmalloc_node(array_size, gfp_mask, PAGE_KERNEL, node);
 		area->flags |= VM_VPAGES;
-	} else
-		pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
+	}
 	area->pages = pages;
 	if (!area->pages) {
 		remove_vm_area(area->addr);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Try to avoid a pessimistic vmalloc() recursion
  2006-10-09 14:47   ` [PATCH] Try to avoid a pessimistic vmalloc() recursion Eric Dumazet
@ 2006-10-09 16:25     ` Nick Piggin
  2006-10-09 19:43       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-10-09 16:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, linux-kernel

Eric Dumazet wrote:
> __vmalloc_area_node() is a litle bit pessimist when allocating space for 
> storing struct page pointers.
> 
> When allocating more than 4 MB on ia32, or 2 MB on x86_64,  
> __vmalloc_area_node() has to allocate more than PAGE_SIZE bytes to store 
> pointers to  page structs. This means that two TLB translations are needed to 
> access data.
> 
> This patch tries a kmalloc() call, then only if this first attempt failed, a 
> vmalloc() is performed. (Later, at vfree() time we chose kfree() or vfree() 
> with a test on flags & VM_VPAGES : no change is needed) 
> 
> Most of the time, the first kmalloc() should be OK, so we reduce TLB usage.

But this is only TLB usage when managing (read: freeing) the vmalloc pages,
isn't it? Not when actually accessing the data.

I'd be inclined to NACK this, unless you can show an improvement somewhere:
it is suboptimal to even _try_ allocating higher order pages.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Fix IO error reporting on fsync()
  2006-10-09 11:40   ` Jan Kara
@ 2006-10-09 18:20     ` Andrew Morton
  2006-10-10 11:56       ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-10-09 18:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel

On Mon, 9 Oct 2006 13:40:41 +0200
Jan Kara <jack@suse.cz> wrote:

> > What about putting an address_space* into the buffer_head?  Transfer the
> > EIO state into the address_space within, say, __remove_assoc_queue()?
>   Yes, that's of course possible. But it enlarges each buffer head by 4
> bytes (or 8 on 64-bit arch).

I suspect we could get that back by removing buffer_head.b_bdev.  That's
not a trivial thing to do, but should be feasible.

We can't just do bh->b_page->mapping->host->i_sb->s_bdev because of races
with trunate, plus the general horror of it all.  But I expect that all
callers of submit_bh() have the blockdev* easily available by other means,
so adding a `struct block_device*' argument to submit_bh() would get us
there.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Try to avoid a pessimistic vmalloc() recursion
  2006-10-09 16:25     ` Nick Piggin
@ 2006-10-09 19:43       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2006-10-09 19:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

Nick Piggin a écrit :
> Eric Dumazet wrote:
>> __vmalloc_area_node() is a litle bit pessimist when allocating space 
>> for storing struct page pointers.
>>
>> When allocating more than 4 MB on ia32, or 2 MB on x86_64,  
>> __vmalloc_area_node() has to allocate more than PAGE_SIZE bytes to 
>> store pointers to  page structs. This means that two TLB translations 
>> are needed to access data.
>>
>> This patch tries a kmalloc() call, then only if this first attempt 
>> failed, a vmalloc() is performed. (Later, at vfree() time we chose 
>> kfree() or vfree() with a test on flags & VM_VPAGES : no change is 
>> needed)
>> Most of the time, the first kmalloc() should be OK, so we reduce TLB 
>> usage.
> 
> But this is only TLB usage when managing (read: freeing) the vmalloc pages,
> isn't it? Not when actually accessing the data.

Yes indeed...
I was trying to reduce time taken by a processes handling lot of files (thus 
vmalloc()ing fdtables and fdset). I noticed a high oprofile hit in 
fget_light(). I suspected overhead caused by vmalloc(), but obviously, once 
the vmalloc() mapping is done, the array of pointers wont be used until vfree().

> 
> I'd be inclined to NACK this, unless you can show an improvement somewhere:
> it is suboptimal to even _try_ allocating higher order pages.
> 

Your point is valid. And it seems there is not much cpu used to linearly scan 
vmlist (to find the vm_struct), at least on my little servers.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Fix IO error reporting on fsync()
  2006-10-09 18:20     ` Andrew Morton
@ 2006-10-10 11:56       ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2006-10-10 11:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

> On Mon, 9 Oct 2006 13:40:41 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > > What about putting an address_space* into the buffer_head?  Transfer the
> > > EIO state into the address_space within, say, __remove_assoc_queue()?
> >   Yes, that's of course possible. But it enlarges each buffer head by 4
> > bytes (or 8 on 64-bit arch).
> 
> I suspect we could get that back by removing buffer_head.b_bdev.  That's
> not a trivial thing to do, but should be feasible.
> 
> We can't just do bh->b_page->mapping->host->i_sb->s_bdev because of races
> with trunate, plus the general horror of it all.  But I expect that all
> callers of submit_bh() have the blockdev* easily available by other means,
> so adding a `struct block_device*' argument to submit_bh() would get us
> there.
  OK. For now I've written and tested the patch with the pointer in
buffer head. Later we can have a look at shrinking buffer_head again...

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2006-10-10 11:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-06 11:49 [PATCH 0/3] Fix IO error reporting on fsync() Jan Kara
2006-10-06 11:55 ` [PATCH 1/3] " Jan Kara
2006-10-06 11:56 ` [PATCH 2/3] " Jan Kara
2006-10-06 11:57 ` [PATCH 3/3] " Jan Kara
2006-10-07  6:06 ` [PATCH 0/3] " Andrew Morton
2006-10-09 11:40   ` Jan Kara
2006-10-09 18:20     ` Andrew Morton
2006-10-10 11:56       ` Jan Kara
2006-10-09 14:47   ` [PATCH] Try to avoid a pessimistic vmalloc() recursion Eric Dumazet
2006-10-09 16:25     ` Nick Piggin
2006-10-09 19:43       ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox