* [patch] fs: improved handling of page and buffer IO errors
@ 2008-10-21 11:21 Nick Piggin
2008-10-21 12:52 ` Miklos Szeredi
2008-10-21 16:16 ` Andi Kleen
0 siblings, 2 replies; 41+ messages in thread
From: Nick Piggin @ 2008-10-21 11:21 UTC (permalink / raw)
To: Andrew Morton, Linux Memory Management List, linux-fsdevel
IO error handling in the core mm/fs still doesn't seem perfect, but with
the recent round of patches and this one, it should be getting on the
right track.
I kind of get the feeling some people would rather forget about all this
and brush it under the carpet. Hopefully I'm mistaken, but if anybody
disagrees with my assertion that error handling, and data integrity
semantics are first-class correctness issues, and therefore are more
important than all other non-correctness problems... speak now and let's
discuss that, please.
Otherwise, unless anybody sees obvious problems with this, hopefully it
can go into -mm for some wider testing (I've tested it with a few filesystems
so far with no immediate problems).
Thanks,
Nick
--
- Don't clear PageUptodate or buffer_uptodate on write failure (this could go
BUG in the VM, and really, the data in the page is still the most uptodate
even if we can't write it back to disk). If we decide to invalidate a page on
write failure (silly, because then the app can't retry the write), the
correct way to invalidate a page is via the pagecache invalidation calls,
which will do the right thing with mapped pages for example.
- Don't assume !PageUptodate == EIO. Pages can be invalidated or reclaimed,
in which case the read should be retried.
- Haven't gone through filesystems yet, but this gets core code into better
shape.
- Warnings or bugs can come about because we have !uptodate pages mapped into
page tables, !uptodate && dirty pages or buffers, etc.
example:
WARNING: at /home/npiggin/usr/src/linux-2.6/fs/buffer.c:1185 mark_buffer_dirty+0x7a/0xc0()
Call Trace:
bd541ca8: [<60033b76>] warn_on_slowpath+0x56/0x80
bd541cd8: [<6004aa37>] wake_up_bit+0x27/0x40
bd541cf8: [<600a6fd2>] __writeback_single_inode+0xf2/0x360
bd541d28: [<60053f13>] debug_mutex_free_waiter+0x23/0x50
bd541d48: [<601f6ea9>] __mutex_lock_slowpath+0x149/0x220
bd541db8: [<600688a0>] pdflush+0x0/0x1c0
bd541dc8: [<600ac51a>] mark_buffer_dirty+0x7a/0xc0
bd541de8: [<600d8125>] ext2_write_super+0x45/0x90
bd541e08: [<6008c846>] sync_supers+0x76/0xc0
bd541e28: [<600682b0>] wb_kupdate+0x30/0x110
bd541e98: [<600688a0>] pdflush+0x0/0x1c0
bd541ea8: [<6006898c>] pdflush+0xec/0x1c0
bd541eb8: [<60068280>] wb_kupdate+0x0/0x110
bd541f08: [<6004a628>] kthread+0x58/0x90
bd541f48: [<60025b47>] run_kernel_thread+0x47/0x50
bd541f58: [<6004a5d0>] kthread+0x0/0x90
bd541f98: [<60025b28>] run_kernel_thread+0x28/0x50
bd541fc8: [<600166f7>] new_thread_handler+0x67/0xa0
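(Editorial aside: the read-side rules the first two bullets describe can be sketched as a userspace model. All names below are invented for illustration, not kernel API; the point is that PG_error alone means -EIO, while a !uptodate page without an error was invalidated or reclaimed and should be re-looked-up and retried.)

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the patched read-side checks in filemap.c. */
enum read_outcome { READ_OK, READ_RETRY, READ_EIO };

struct page_model {
    bool uptodate;   /* PG_uptodate: page was brought up to date */
    bool error;      /* PG_error: IO on the page failed */
    bool mapped;     /* page->mapping != NULL: still in the pagecache */
};

/* Decision made with the page locked, after a readpage attempt. */
static enum read_outcome check_after_read(struct page_model p)
{
    if (p.error)
        return READ_EIO;          /* hard IO failure: report it */
    if (!p.uptodate || !p.mapped)
        return READ_RETRY;        /* invalidated/reclaimed: re-find the page */
    return READ_OK;               /* valid data, proceed */
}
```

The reordering matters: !PageUptodate on its own no longer implies -EIO.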
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1147,20 +1147,17 @@ readpage:
error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
- if (!PageUptodate(page)) {
- if (page->mapping == NULL) {
- /*
- * invalidate_inode_pages got it
- */
- unlock_page(page);
- page_cache_release(page);
- goto find_page;
- }
+ if (PageError(page)) {
unlock_page(page);
shrink_readahead_size_eio(filp, ra);
error = -EIO;
goto readpage_error;
}
+ if (!PageUptodate(page) || !page->mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto find_page;
+ }
unlock_page(page);
}
@@ -1576,7 +1573,7 @@ page_not_uptodate:
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
- if (!PageUptodate(page))
+ if (unlikely(PageError(page)))
error = -EIO;
}
page_cache_release(page);
@@ -1701,6 +1698,7 @@ retry:
unlock_page(page);
goto out;
}
+ ClearPageError(page);
err = filler(data, page);
if (err < 0) {
page_cache_release(page);
@@ -1722,7 +1720,7 @@ EXPORT_SYMBOL(read_cache_page_async);
* Read into the page cache. If a page already exists, and PageUptodate() is
* not set, try to fill the page then wait for it to become unlocked.
*
- * If the page does not get brought uptodate, return -EIO.
+ * If the page IO fails, return -EIO.
*/
struct page *read_cache_page(struct address_space *mapping,
pgoff_t index,
@@ -1735,7 +1733,7 @@ struct page *read_cache_page(struct addr
if (IS_ERR(page))
goto out;
wait_on_page_locked(page);
- if (!PageUptodate(page)) {
+ if (PageError(page)) {
page_cache_release(page);
page = ERR_PTR(-EIO);
}
@@ -2052,6 +2050,9 @@ again:
*
* Instead, we have to bring it uptodate here.
*/
+ if (PageError(page))
+ return -EIO;
+
ret = aops->readpage(file, page);
page_cache_release(page);
if (ret) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2336,10 +2336,11 @@ static int do_swap_page(struct mm_struct
if (unlikely(!pte_same(*page_table, orig_pte)))
goto out_nomap;
- if (unlikely(!PageUptodate(page))) {
+ if (unlikely(PageError(page))) {
ret = VM_FAULT_SIGBUS;
goto out_nomap;
}
+ VM_BUG_ON(!PageUptodate(page)); /* page_io.c should guarantee this */
/* The page isn't present yet, go ahead with the fault. */
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1278,7 +1278,7 @@ repeat:
page_cache_release(swappage);
goto repeat;
}
- if (!PageUptodate(swappage)) {
+ if (PageError(swappage)) {
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
unlock_page(swappage);
@@ -1286,6 +1286,11 @@ repeat:
error = -EIO;
goto failed;
}
+ /*
+ * swap cache doesn't get invalidated, so if there was no error it
+ * should be uptodate
+ */
+ VM_BUG_ON(!PageUptodate(swappage));
if (filepage) {
shmem_swp_set(info, entry, 0);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -121,8 +121,11 @@ static void __end_buffer_read_notouch(st
if (uptodate) {
set_buffer_uptodate(bh);
} else {
- /* This happens, due to failed READA attempts. */
- clear_buffer_uptodate(bh);
+ if (buffer_uptodate(bh)) {
+ WARN_ON_ONCE(1);
+ /* This happens, due to failed READA attempts. */
+ clear_buffer_uptodate(bh);
+ }
}
unlock_buffer(bh);
}
@@ -142,7 +145,10 @@ void end_buffer_write_sync(struct buffer
char b[BDEVNAME_SIZE];
if (uptodate) {
- set_buffer_uptodate(bh);
+ if (!buffer_uptodate(bh)) {
+ WARN_ON_ONCE(1);
+ set_buffer_uptodate(bh);
+ }
} else {
if (!buffer_eopnotsupp(bh) && printk_ratelimit()) {
buffer_io_error(bh);
@@ -151,7 +157,6 @@ void end_buffer_write_sync(struct buffer
bdevname(bh->b_bdev, b));
}
set_buffer_write_io_error(bh);
- clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
put_bh(bh);
@@ -393,7 +398,10 @@ static void end_buffer_async_read(struct
if (uptodate) {
set_buffer_uptodate(bh);
} else {
- clear_buffer_uptodate(bh);
+ if (buffer_uptodate(bh)) {
+ WARN_ON_ONCE(1);
+ clear_buffer_uptodate(bh);
+ }
if (printk_ratelimit())
buffer_io_error(bh);
SetPageError(page);
@@ -453,7 +461,10 @@ static void end_buffer_async_write(struc
page = bh->b_page;
if (uptodate) {
- set_buffer_uptodate(bh);
+ if (!buffer_uptodate(bh)) {
+ WARN_ON_ONCE(1);
+ set_buffer_uptodate(bh);
+ }
} else {
if (printk_ratelimit()) {
buffer_io_error(bh);
@@ -463,7 +474,6 @@ static void end_buffer_async_write(struc
}
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
- clear_buffer_uptodate(bh);
SetPageError(page);
}
@@ -1998,8 +2008,6 @@ int block_write_begin(struct file *file,
status = __block_prepare_write(inode, page, start, end, get_block);
if (unlikely(status)) {
- ClearPageUptodate(page);
-
if (ownpage) {
unlock_page(page);
page_cache_release(page);
@@ -2372,10 +2380,7 @@ int block_prepare_write(struct page *pag
get_block_t *get_block)
{
struct inode *inode = page->mapping->host;
- int err = __block_prepare_write(inode, page, from, to, get_block);
- if (err)
- ClearPageUptodate(page);
- return err;
+ return __block_prepare_write(inode, page, from, to, get_block);
}
int block_commit_write(struct page *page, unsigned from, unsigned to)
@@ -2752,16 +2757,19 @@ has_buffers:
/* Ok, it's mapped. Make sure it's up-to-date */
if (!PageUptodate(page)) {
+again:
err = mapping->a_ops->readpage(NULL, page);
if (err) {
page_cache_release(page);
goto out;
}
lock_page(page);
- if (!PageUptodate(page)) {
+ if (PageError(page)) {
err = -EIO;
goto unlock;
}
+ if (!PageUptodate(page))
+ goto again;
if (page_has_buffers(page))
goto has_buffers;
}
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -112,11 +112,16 @@ static int page_cache_pipe_buf_confirm(s
/*
* Uh oh, read-error from disk.
*/
- if (!PageUptodate(page)) {
+ if (PageError(page)) {
err = -EIO;
goto error;
}
+ if (!PageUptodate(page)) {
+ err = -ENODATA;
+ goto error;
+ }
+
/*
* Page is ok afterall, we are done.
*/
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -53,7 +53,11 @@ static void mpage_end_io_read(struct bio
if (uptodate) {
SetPageUptodate(page);
} else {
- ClearPageUptodate(page);
+ if (PageUptodate(page)) {
+ /* let's get rid of this case ASAP */
+ WARN_ON_ONCE(1);
+ ClearPageUptodate(page);
+ }
SetPageError(page);
}
unlock_page(page);
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -77,7 +77,10 @@ void end_swap_bio_read(struct bio *bio,
if (!uptodate) {
SetPageError(page);
- ClearPageUptodate(page);
+ if (PageUptodate(page)) {
+ WARN_ON_ONCE(1);
+ ClearPageUptodate(page);
+ }
printk(KERN_ALERT "Read-error on swap-device (%u:%u:%Lu)\n",
imajor(bio->bi_bdev->bd_inode),
iminor(bio->bi_bdev->bd_inode),
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -936,8 +936,10 @@ retry:
unlock_page(page);
ret = 0;
}
- if (ret || (--nr_to_write <= 0))
+ if (ret || (--nr_to_write <= 0)) {
done = 1;
+ break;
+ }
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
done = 1;
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 11:21 [patch] fs: improved handling of page and buffer IO errors Nick Piggin
@ 2008-10-21 12:52 ` Miklos Szeredi
2008-10-21 12:59 ` steve
2008-10-22 13:16 ` Nick Piggin
2008-10-21 16:16 ` Andi Kleen
1 sibling, 2 replies; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 12:52 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, Nick Piggin wrote:
> IO error handling in the core mm/fs still doesn't seem perfect, but with
> the recent round of patches and this one, it should be getting on the
> right track.
>
> I kind of get the feeling some people would rather forget about all this
> and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> disagrees with my assertion that error handling, and data integrity
> semantics are first-class correctness issues, and therefore are more
> important than all other non-correctness problems... speak now and let's
> discuss that, please.
I agree that error handling is important. But careful: some
filesystems (NFS I know) don't set PG_error on async read errors, and
changing the semantics of ->readpage() from returning EIO to retrying
will potentially cause infinite loops. And no casual testing will
reveal those because persistent read errors are extremely rare.
So I think a better approach would be to do
error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
if (PageError(page) || !PageUptodate(page)) {
unlock_page(page);
shrink_readahead_size_eio(filp, ra);
error = -EIO;
goto readpage_error;
}
if (!page->mapping) {
unlock_page(page);
page_cache_release(page);
goto find_page;
}
etc...
Is there a case where retrying in case of !PageUptodate() makes any
sense?
Thanks,
Miklos
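(Editorial aside: the infinite-loop risk Miklos raises can be modelled in plain C. The helper below is hypothetical; it mimics a filesystem whose ->readpage() reports failure only through its return value or other channels and never sets PG_error on the page. A retry cap stands in for the hang.)

```c
#include <assert.h>
#include <stdbool.h>

struct page_model { bool uptodate, error; };

/* Mimics an NFS-like async read failure: the page stays !uptodate,
 * but PG_error is never set. */
static void readpage_never_sets_pg_error(struct page_model *p)
{
    p->uptodate = false;   /* read failed; error flag left untouched */
}

/* The proposed loop: -EIO only on PG_error, otherwise retry.
 * Returns the number of attempts made, capped to avoid hanging. */
static int attempts_until_giving_up(struct page_model *p, int cap)
{
    int tries = 0;
    while (tries < cap) {
        tries++;
        readpage_never_sets_pg_error(p);
        if (p->error || p->uptodate)
            break;
        /* neither flag set: loop again, potentially forever */
    }
    return tries;
}
```

With such a filesystem the loop only ever stops at the cap, which is the hang being warned about.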
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 12:52 ` Miklos Szeredi
@ 2008-10-21 12:59 ` steve
2008-10-21 13:14 ` Miklos Szeredi
2008-10-22 22:23 ` Mark Fasheh
2008-10-22 13:16 ` Nick Piggin
1 sibling, 2 replies; 41+ messages in thread
From: steve @ 2008-10-21 12:59 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 02:52:45PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, Nick Piggin wrote:
> > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > the recent round of patches and this one, it should be getting on the
> > right track.
> >
> > I kind of get the feeling some people would rather forget about all this
> > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > disagrees with my assertion that error handling, and data integrity
> > semantics are first-class correctness issues, and therefore are more
> > important than all other non-correctness problems... speak now and let's
> > discuss that, please.
>
> I agree that error handling is important. But careful: some
> filesystems (NFS I know) don't set PG_error on async read errors, and
> changing the semantics of ->readpage() from returning EIO to retrying
> will potentially cause infinite loops. And no casual testing will
> reveal those because peristent read errors are extremely rare.
>
> So I think a better aproach would be to do
>
> error = lock_page_killable(page);
> if (unlikely(error))
> goto readpage_error;
> if (PageError(page) || !PageUptodate(page)) {
> unlock_page(page);
> shrink_readahead_size_eio(filp, ra);
> error = -EIO;
> goto readpage_error;
> }
> if (!page->mapping) {
> unlock_page(page);
> page_cache_release(page);
> goto find_page;
> }
>
> etc...
>
> Is there a case where retrying in case of !PageUptodate() makes any
> sense?
>
Yes... cluster filesystems. It's very important in case a readpage
races with a lock demotion. Since the introduction of page_mkwrite
that hasn't worked quite right, but by retrying when the page is
not uptodate, that should fix the problem,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 12:59 ` steve
@ 2008-10-21 13:14 ` Miklos Szeredi
2008-10-21 13:38 ` steve
2008-10-22 22:23 ` Mark Fasheh
1 sibling, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 13:14 UTC (permalink / raw)
To: steve; +Cc: miklos, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, steve@chygwyn.com wrote:
> > Is there a case where retrying in case of !PageUptodate() makes any
> > sense?
> >
> Yes... cluster filesystems. Its very important in case a readpage
> races with a lock demotion. Since the introduction of page_mkwrite
> that hasn't worked quite right, but by retrying when the page is
> not uptodate, that should fix the problem,
I see.
Could you please give some more details? In particular I don't know
what's lock demotion in this context. And how page_mkwrite() come
into the picture?
Thanks,
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 13:14 ` Miklos Szeredi
@ 2008-10-21 13:38 ` steve
2008-10-21 14:32 ` Miklos Szeredi
2008-10-21 14:35 ` Evgeniy Polyakov
0 siblings, 2 replies; 41+ messages in thread
From: steve @ 2008-10-21 13:38 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 03:14:48PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, steve@chygwyn.com
> > > Is there a case where retrying in case of !PageUptodate() makes any
> > > sense?
> > >
> > Yes... cluster filesystems. Its very important in case a readpage
> > races with a lock demotion. Since the introduction of page_mkwrite
> > that hasn't worked quite right, but by retrying when the page is
> > not uptodate, that should fix the problem,
>
> I see.
>
> Could you please give some more details? In particular I don't know
> what's lock demotion in this context. And how page_mkwrite() come
> into the picture?
>
> Thanks,
> Miklos
page_mkwrite is only in the picture really because it's the last
time that code was changed. At that point GFS2 adopted
->filemap_fault() rather than using its own page fault
routine.
So here are the basics of locking, so far as GFS2 goes, although
other cluster filesystems are similar. Let's suppose that on
node A, an inode is in cache and it's being read/written and
on node B, another process wants to perform some operation
(read/write/etc) on the same inode.
Node B requests a lock via the dlm which causes Node A to
receive a callback. The callback sets a flag in the glock[*]
state on Node A corresponding to the inode in question. This
results in all future requests for that particular lock on
Node A blocking. Also at that time, we unmap any mapped pages
relating to that inode, flush any dirty data back onto the
disk (and if the request was for an exclusive lock, invalidate
the pages as well). So that's what I was referring to above as
lock demotion.
Once that's done, the dlm/glock is dropped (again notification is via
the dlm) and if Node A has outstanding requests queued up, it
re-requests the glock. This is a slightly simplified explanation
but, I hope it gives the general drift.
So to return to the original subject, in order to allow all
this locking to occur with no lock ordering problems, we have
to define a suitable ordering of page locks vs. glocks, and the
ordering that we use is that glocks must come before page locks. The
full ordering of locks in GFS2 is in Documentation/filesystems/gfs2-glocks.txt
As a result of that, the VFS needs reads (and page_mkwrite) to
retry when !PageUptodate() in case the returned page has been
invalidated at any time when the page lock has been dropped.
Obviously we hope that this doesn't happen too often since it's
very inefficient (and we have a system to try and reduce the
frequency of such events) but it can and does happen at more
or less any time, so the vfs needs to take that into account.
I hope that makes some kind of sense... let me know if it's
not clear,
Steve.
[*] The glock layer is a state machine which is associated with each
dlm lock and performs the required actions in response to dlm messages
and filesystem requests to keep the page cache coherent.
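(Editorial aside: the race Steve describes can be modelled single-threaded. The names below are invented; the point is only that a demotion landing between ->readpage() completing and the caller's recheck leaves the page !uptodate without any error set, so the correct response is to retry, not to return -EIO.)

```c
#include <assert.h>
#include <stdbool.h>

struct page_model { bool uptodate; };

static int pending_demotions;   /* remote-node callbacks still queued */

/* A readpage that succeeds, but whose result may immediately be
 * invalidated by a racing lock demotion from another node. */
static void readpage_with_races(struct page_model *p)
{
    p->uptodate = true;
    if (pending_demotions > 0) {
        pending_demotions--;
        p->uptodate = false;    /* demotion invalidated the page */
    }
}

/* The retry the VFS needs: loop until the page stays uptodate. */
static int read_retrying(struct page_model *p)
{
    int tries = 0;
    do {
        readpage_with_races(p);
        tries++;
    } while (!p->uptodate);
    return tries;
}
```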
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 13:38 ` steve
@ 2008-10-21 14:32 ` Miklos Szeredi
2008-10-21 15:09 ` steve
2008-10-21 14:35 ` Evgeniy Polyakov
1 sibling, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 14:32 UTC (permalink / raw)
To: steve; +Cc: miklos, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, steve@chygwyn.com wrote:
> Once thats done, the dlm/glock is dropped (again notification is via
> the dlm) and if Node A has outstanding requests queued up, it
> re-requests the glock. This is a slightly simplified explanation
> but, I hope it gives the general drift.
Yes, thanks.
> So to return to the original subject, in order to allow all
> this locking to occur with no lock ordering problems, we have
> to define a suitable ordering of page locks vs. glocks, and the
> ordering that we use is that glocks must come before page locks. The
> full ordering of locks in GFS2 is in Documentation/filesystems/gfs2-glocks.txt
>
> As a result of that, the VFS needs reads (and page_mkwrite) to
> retry when !PageUptodate() in case the returned page has been
> invalidated at any time when the page lock has been dropped.
Since this commit PG_uptodate isn't cleared on invalidate:
commit 84209e02de48d72289650cc5a7ae8dd18223620f
Author: Miklos Szeredi <mszeredi@suse.cz>
Date: Fri Aug 1 20:28:47 2008 +0200
mm: dont clear PG_uptodate on truncate/invalidate
Testing for !page->mapping, however, is a reliable way to detect both
truncation and invalidation.
So the page can have the following states:
!PG_uptodate -> page has not been read
PG_uptodate && page->mapping -> page has been read and is valid
PG_uptodate && !page->mapping -> page has been read but no longer valid
So PG_uptodate does not reflect the validity of the data, only whether
the data was ever made up-to-date.
Does this make sense? Should it be documented somewhere?
Thanks,
Miklos
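(Editorial aside: Miklos's three-state table can be written down directly; this encoding is illustrative only.)

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct page_model {
    bool uptodate;   /* PG_uptodate */
    bool mapped;     /* page->mapping != NULL */
};

/* The three states listed above, after commit 84209e02
 * ("mm: dont clear PG_uptodate on truncate/invalidate"). */
static const char *classify(struct page_model p)
{
    if (!p.uptodate)
        return "never read";                 /* contents are junk */
    if (p.mapped)
        return "read and valid";
    return "read but no longer valid";       /* truncated/invalidated */
}
```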
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 13:38 ` steve
2008-10-21 14:32 ` Miklos Szeredi
@ 2008-10-21 14:35 ` Evgeniy Polyakov
2008-10-21 14:59 ` steve
1 sibling, 1 reply; 41+ messages in thread
From: Evgeniy Polyakov @ 2008-10-21 14:35 UTC (permalink / raw)
To: steve; +Cc: Miklos Szeredi, npiggin, akpm, linux-mm, linux-fsdevel
Hi.
On Tue, Oct 21, 2008 at 02:38:14PM +0100, steve@chygwyn.com (steve@chygwyn.com) wrote:
> As a result of that, the VFS needs reads (and page_mkwrite) to
> retry when !PageUptodate() in case the returned page has been
> invalidated at any time when the page lock has been dropped.
Doesn't it happen under appropriate inode lock being held,
which is a main serialization point?
--
Evgeniy Polyakov
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 14:35 ` Evgeniy Polyakov
@ 2008-10-21 14:59 ` steve
2008-10-21 16:20 ` Miklos Szeredi
0 siblings, 1 reply; 41+ messages in thread
From: steve @ 2008-10-21 14:59 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Miklos Szeredi, npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 06:35:18PM +0400, Evgeniy Polyakov wrote:
> Hi.
>
> On Tue, Oct 21, 2008 at 02:38:14PM +0100, steve@chygwyn.com (steve@chygwyn.com) wrote:
> > As a result of that, the VFS needs reads (and page_mkwrite) to
> > retry when !PageUptodate() in case the returned page has been
> > invalidated at any time when the page lock has been dropped.
>
> Doesn't it happen under appropriate inode lock being held,
> which is a main serialization point?
>
> --
> Evgeniy Polyakov
No, I guess it might be possible, but for the time being it is
its own "glock" plus the page lock dependency. I'd have to
think quite hard about what the consequences of using the
inode lock would be.
Of course we do demand the inode lock as well in some cases
since the vfs has already grabbed it before calling
into the filesystem when it's required. Because of that and
where we run the glock state machine from, it would be rather
complicated to make that work I suspect,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 14:32 ` Miklos Szeredi
@ 2008-10-21 15:09 ` steve
2008-10-21 16:13 ` Miklos Szeredi
0 siblings, 1 reply; 41+ messages in thread
From: steve @ 2008-10-21 15:09 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 04:32:21PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, steve@chygwyn.com
> > Once thats done, the dlm/glock is dropped (again notification is via
> > the dlm) and if Node A has outstanding requests queued up, it
> > re-requests the glock. This is a slightly simplified explanation
> > but, I hope it gives the general drift.
>
> Yes, thanks.
>
> > So to return to the original subject, in order to allow all
> > this locking to occur with no lock ordering problems, we have
> > to define a suitable ordering of page locks vs. glocks, and the
> > ordering that we use is that glocks must come before page locks. The
> > full ordering of locks in GFS2 is in Documentation/filesystems/gfs2-glocks.txt
> >
> > As a result of that, the VFS needs reads (and page_mkwrite) to
> > retry when !PageUptodate() in case the returned page has been
> > invalidated at any time when the page lock has been dropped.
>
> Since this commit PG_uptodate isn't cleared on invalidate:
>
> commit 84209e02de48d72289650cc5a7ae8dd18223620f
> Author: Miklos Szeredi <mszeredi@suse.cz>
> Date: Fri Aug 1 20:28:47 2008 +0200
>
> mm: dont clear PG_uptodate on truncate/invalidate
>
> Testing for !page->mapping, however, is a reliable way to detect both
> truncation and invalidation.
>
> So the page can have the following states:
>
> !PG_uptodate -> page has not been read
> PG_uptodate && page->mapping -> page has been read and is valid
> PG_uptodate && !page->mapping -> page has been read but no longer valid
>
> So PG_uptodate does not reflect the validity of the data, only whether
> the data was ever made up-to-date.
>
> Does this make sense? Should it be documented somewhere?
>
Well I'm not sure why we'd need to distinguish between "page has not been read"
and "page has been read but no longer valid". I guess I don't understand why
those two cases are not the same from the vfs and filesystem points of
view.
I'm sure it should be documented :-) It certainly seems confusing, and if we
want to keep this scheme, can we change PG_uptodate to PG_wasread or
PG_usedonce or something like that which more clearly reflects its
purpose in that case,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 15:09 ` steve
@ 2008-10-21 16:13 ` Miklos Szeredi
2008-10-22 12:51 ` Jamie Lokier
0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 16:13 UTC (permalink / raw)
To: steve; +Cc: miklos, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, steve@chygwyn.com wrote:
> Well I'm not sure why we'd need to distinguish between "page has not
> been read" and "page has been read but no longer valid". I guess I
> don't understand why those two cases are not the same from the vfs
> and filesystem points of view.
In the first case the page contains random bytes, in the second case
it contains actual file data, which has become stale, but at some
point in time it _was_ the contents of the file.
This is a very important distinction for splice(2) for example.
Splice does not actually copy data into the pipe buffer, only
references the pages. And it can reference pages which are not yet
up-to-date. So when the buffers are consumed from the pipe, the
splice code needs to know if the page contains random junk (never
brought up-to-date) or data that is, or once was, valid.
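(Editorial aside: this is exactly the distinction the fs/splice.c hunk in the patch draws. A hypothetical userspace rendering, with errno values inlined as stand-in constants:)

```c
#include <assert.h>
#include <stdbool.h>

struct page_model { bool uptodate, error; };

#define MODEL_EIO     (-5)    /* stand-ins for the kernel errno values */
#define MODEL_ENODATA (-61)

/* Models the patched page_cache_pipe_buf_confirm(): a page with
 * PG_error holds junk left by a failed read (-EIO), a page never
 * brought uptodate holds junk that was never filled (-ENODATA),
 * and an uptodate page holds data that is, or once was, valid. */
static int pipe_buf_confirm(struct page_model p)
{
    if (p.error)
        return MODEL_EIO;
    if (!p.uptodate)
        return MODEL_ENODATA;
    return 0;
}
```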
> I'm sure it should be documented :-) it certainly seems confusing and if we
> want to keep this scheme, can we change PG_uptodate to PG_wasread or
> PG_usedonce or something like that which more clearly reflects its
> purpose in that case,
I'm not going to argue about the name :)
Thanks,
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 11:21 [patch] fs: improved handling of page and buffer IO errors Nick Piggin
2008-10-21 12:52 ` Miklos Szeredi
@ 2008-10-21 16:16 ` Andi Kleen
2008-10-21 16:30 ` steve
2008-10-22 10:31 ` Nick Piggin
1 sibling, 2 replies; 41+ messages in thread
From: Andi Kleen @ 2008-10-21 16:16 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel
Nick Piggin <npiggin@suse.de> writes:
> IO error handling in the core mm/fs still doesn't seem perfect, but with
> the recent round of patches and this one, it should be getting on the
> right track.
>
> I kind of get the feeling some people would rather forget about all this
> and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> disagrees with my assertion that error handling, and data integrity
> semantics are first-class correctness issues, and therefore are more
> important than all other non-correctness problems... speak now and let's
> discuss that, please.
>
> Otherwise, unless anybody sees obvious problems with this, hopefully it
> can go into -mm for some wider testing (I've tested it with a few filesystems
> so far and no immediate problems)
I think the first step to get these more robust in the future would be to
have a standard regression test testing these paths. Otherwise it'll
bit-rot sooner or later again.
-Andi
--
ak@linux.intel.com
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 14:59 ` steve
@ 2008-10-21 16:20 ` Miklos Szeredi
2008-10-21 16:25 ` steve
2008-10-21 16:28 ` Miklos Szeredi
0 siblings, 2 replies; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 16:20 UTC (permalink / raw)
To: steve; +Cc: zbr, miklos, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, steve@chygwyn.com wrote:
> No, I guess it might be possible, but for the time being it is
> its own "glock" plus the page lock dependency. I'd have to
> think quite hard about what the consequences of using the
> inode lock would be.
>
> Of course we do demand the inode lock as well in some cases
> since the vfs has already grabbed it before calling
> into the filesystem when its required. Because of that and
> where we run the glock state machine from, it would be rather
> complicated to make that work I suspect,
BTW, why do you want strict coherency for memory mappings? It's not
something POSIX mandates. It's not even something that Linux always
did.
If I were an application writer, I'd never try to rely on mmap
coherency without the appropriate magic msync() calls.
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:20 ` Miklos Szeredi
@ 2008-10-21 16:25 ` steve
2008-10-21 16:28 ` Miklos Szeredi
1 sibling, 0 replies; 41+ messages in thread
From: steve @ 2008-10-21 16:25 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: zbr, npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 06:20:17PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, steve@chygwyn.co wrote:
> > No, I guess it might be possible, but for the time being it is
> > its own "glock" plus the page lock dependency. I'd have to
> > think quite hard about what the consequences of using the
> > inode lock would be.
> >
> > Of course we do demand the inode lock as well in some cases
> > since the vfs has already grabbed it before calling
> > into the filesystem when its required. Because of that and
> > where we run the glock state machine from, it would be rather
> > complicated to make that work I suspect,
>
> BTW, why do you want strict coherency for memory mappings? It's not
> something POSIX mandates. It's not even something that Linux always
> did.
>
It's something that GFS has always done, and so we've tried to keep
that feature in GFS2. I think we do (at least I do) try to suggest
to people that they shouldn't be relying on this, but we've always
tried to make it work anyway, at least on the principle of least
surprise.
> If I were an application writer, I'd never try to rely on mmap
> coherency without the appropriate magic msync() calls.
>
> Miklos
Yes, I'd agree, but I write kernel code, not applications :-)
Thanks for the explanation on splice, I'll take a look at that
code now and try to understand it in more detail,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:20 ` Miklos Szeredi
2008-10-21 16:25 ` steve
@ 2008-10-21 16:28 ` Miklos Szeredi
2008-10-21 16:29 ` Matthew Wilcox
1 sibling, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-21 16:28 UTC (permalink / raw)
To: steve; +Cc: zbr, miklos, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, 21 Oct 2008, Miklos Szeredi wrote:
> BTW, why do you want strict coherency for memory mappings? It's not
> something POSIX mandates. It's not even something that Linux always
> did.
Or does, for that matter, on those architectures which have virtually
addressed caches.
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:28 ` Miklos Szeredi
@ 2008-10-21 16:29 ` Matthew Wilcox
2008-10-22 12:48 ` Jamie Lokier
0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2008-10-21 16:29 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: steve, zbr, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, Oct 21, 2008 at 06:28:01PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, Miklos Szeredi wrote:
> > BTW, why do you want strict coherency for memory mappings? It's not
> > something POSIX mandates. It's not even something that Linux always
> > did.
>
> Or does, for that matter, on those architectures which have virtually
> addressed caches.
Careful with those slurs you're throwing around. PA-RISC carefully
aligns its mmaps so they are coherent.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:16 ` Andi Kleen
@ 2008-10-21 16:30 ` steve
2008-10-22 10:31 ` Nick Piggin
1 sibling, 0 replies; 41+ messages in thread
From: steve @ 2008-10-21 16:30 UTC (permalink / raw)
To: Andi Kleen
Cc: Nick Piggin, Andrew Morton, Linux Memory Management List,
linux-fsdevel
Hi,
On Tue, Oct 21, 2008 at 06:16:24PM +0200, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
>
> > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > the recent round of patches and this one, it should be getting on the
> > right track.
> >
> > I kind of get the feeling some people would rather forget about all this
> > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > disagrees with my assertion that error handling, and data integrity
> > semantics are first-class correctness issues, and therefore are more
> > important than all other non-correctness problems... speak now and let's
> > discuss that, please.
> >
> > Otherwise, unless anybody sees obvious problems with this, hopefully it
> > can go into -mm for some wider testing (I've tested it with a few filesystems
> > so far and no immediate problems)
>
> I think the first step to get these more robust in the future would be to
> have a standard regression test testing these paths. Otherwise it'll
> bit-rot sooner or later again.
>
> -Andi
>
I have a plan to (at some stage, when I get some time!) create some
mechanism which will allow the mounting of multiple GFS2 filesystems
on a single device, on the same node. i.e. like a cluster but multiple
mounts from a single node. Currently we can get half way there by
"cloning" a block device with dm, but our locking doesn't support that
configuration at the moment.
Given that, it should then be possible to run cluster tests on a single
node across several mounts of the same filesystem, and thus allow
much easier testing (there is of course no practical reason to allow
such a configuration aside from testing),
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:16 ` Andi Kleen
2008-10-21 16:30 ` steve
@ 2008-10-22 10:31 ` Nick Piggin
2008-10-22 18:46 ` Brad Boyer
2008-10-22 23:07 ` Dave Chinner
1 sibling, 2 replies; 41+ messages in thread
From: Nick Piggin @ 2008-10-22 10:31 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel
On Tue, Oct 21, 2008 at 06:16:24PM +0200, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
>
> > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > the recent round of patches and this one, it should be getting on the
> > right track.
> >
> > I kind of get the feeling some people would rather forget about all this
> > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > disagrees with my assertion that error handling, and data integrity
> > semantics are first-class correctness issues, and therefore are more
> > important than all other non-correctness problems... speak now and let's
> > discuss that, please.
> >
> > Otherwise, unless anybody sees obvious problems with this, hopefully it
> > can go into -mm for some wider testing (I've tested it with a few filesystems
> > so far and no immediate problems)
>
> I think the first step to get these more robust in the future would be to
> have a standard regression test testing these paths. Otherwise it'll
> bit-rot sooner or later again.
The problem I've had with testing is that it's hard to trigger a specific
path for a given error, because write IO especially can be quite non
deterministic, and the filesystem or kernel may give up at various points.
I agree, but I just don't know exactly how they can be turned into
standard tests. Some filesystems like XFS seem to completely shut down
quite easily on IO errors. Others like ext2 can't really unwind from
a failure in a multi-block operation (eg. allocating a block to an
inode) if an error is detected, and it just gets ignored.
I am testing, but mainly just random failure injections and seeing if
things go bug or go undetected etc.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:29 ` Matthew Wilcox
@ 2008-10-22 12:48 ` Jamie Lokier
2008-10-22 13:45 ` Matthew Wilcox
0 siblings, 1 reply; 41+ messages in thread
From: Jamie Lokier @ 2008-10-22 12:48 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Miklos Szeredi, steve, zbr, npiggin, akpm, linux-mm,
linux-fsdevel
Matthew Wilcox wrote:
> On Tue, Oct 21, 2008 at 06:28:01PM +0200, Miklos Szeredi wrote:
> > On Tue, 21 Oct 2008, Miklos Szeredi wrote:
> > > BTW, why do you want strict coherency for memory mappings? It's not
> > > something POSIX mandates. It's not even something that Linux always
> > > did.
> >
> > Or does, for that matter, on those architectures which have virtually
> > addressed caches.
>
> Careful with those slurs you're throwing around. PA-RISC carefully
> aligns its mmaps so they are coherent.
(Unless you use MAP_FIXED?)
Last time I looked at the coherency code, there appeared to be a few
bugs on some architectures, but I didn't have the architectures to
test and confirm them. It was a long time ago, in the 2.4 era though.
-- Jamie
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 16:13 ` Miklos Szeredi
@ 2008-10-22 12:51 ` Jamie Lokier
2008-10-22 14:08 ` Miklos Szeredi
0 siblings, 1 reply; 41+ messages in thread
From: Jamie Lokier @ 2008-10-22 12:51 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: steve, npiggin, akpm, linux-mm, linux-fsdevel
Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, steve@chygwyn.com wrote:
> > Well I'm not sure why we'd need to distinguish between "page has not
> > been read" and "page has been read but no longer valid". I guess I
> > don't understand why those two cases are not the same from the vfs
> > and filesystem points of view.
>
> In the first case the page contains random bytes, in the second case
> it contains actual file data, which has become stale, but at some
> point in time it _was_ the contents of the file.
>
> This is a very important distinction for splice(2) for example.
> Splice does not actually copy data into the pipe buffer, only
> references the pages. And it can reference pages which are not yet
> up-to-date. So when the buffers are consumed from the pipe, the
> splice code needs to know if the page contains random junk (never
> brought up-to-date) or data that is, or once was, valid.
So GFS goes to great lengths to ensure that read/write are coherent,
so are mmaps (writable or not), but _splice_ is not coherent in the
sense that it can send invalid but non-random data? :-)
Also, is there still a problem where the data is "valid" but part of
the page may have been zero'd by truncate, which is then transmitted
by splice?
-- Jamie
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 12:52 ` Miklos Szeredi
2008-10-21 12:59 ` steve
@ 2008-10-22 13:16 ` Nick Piggin
2008-10-22 20:09 ` Miklos Szeredi
1 sibling, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2008-10-22 13:16 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-mm, linux-fsdevel
On Tue, Oct 21, 2008 at 02:52:45PM +0200, Miklos Szeredi wrote:
> On Tue, 21 Oct 2008, Nick Piggin wrote:
> > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > the recent round of patches and this one, it should be getting on the
> > right track.
> >
> > I kind of get the feeling some people would rather forget about all this
> > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > disagrees with my assertion that error handling, and data integrity
> > semantics are first-class correctness issues, and therefore are more
> > important than all other non-correctness problems... speak now and let's
> > discuss that, please.
>
> I agree that error handling is important. But careful: some
> filesystems (NFS I know) don't set PG_error on async read errors, and
> changing the semantics of ->readpage() from returning EIO to retrying
> will potentially cause infinite loops. And no casual testing will
OK, they'll just need to be fixed.
> reveal those because persistent read errors are extremely rare.
Same as other read errors I guess. We need to be doing more testing
of error cases. I've been doing it a little bit recently for a couple
of block-based filesystems... but the hardest part, I think, is
ensuring that each filesystem actually does sane things.
> So I think a better aproach would be to do
>
> error = lock_page_killable(page);
> if (unlikely(error))
> goto readpage_error;
> if (PageError(page) || !PageUptodate(page)) {
> unlock_page(page);
> shrink_readahead_size_eio(filp, ra);
> error = -EIO;
> goto readpage_error;
> }
> if (!page->mapping) {
> unlock_page(page);
> page_cache_release(page);
> goto find_page;
> }
>
> etc...
>
> Is there a case where retrying in case of !PageUptodate() makes any
> sense?
Invalidate I guess is covered now (I don't exactly like the solution,
but it's what we have for now). Truncate hmm, I thought that still
clears PageUptodate, but it doesn't seem to either?
Maybe we can use !PageUptodate, with care, for read errors. It might
actually be a bit preferable in the sense that PageError can just be
used for write errors only.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 12:48 ` Jamie Lokier
@ 2008-10-22 13:45 ` Matthew Wilcox
2008-10-22 14:02 ` Miklos Szeredi
0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2008-10-22 13:45 UTC (permalink / raw)
To: Jamie Lokier
Cc: Miklos Szeredi, steve, zbr, npiggin, akpm, linux-mm,
linux-fsdevel
On Wed, Oct 22, 2008 at 01:48:29PM +0100, Jamie Lokier wrote:
> Matthew Wilcox wrote:
> > Careful with those slurs you're throwing around. PA-RISC carefully
> > aligns its mmaps so they are coherent.
>
> (Unless you use MAP_FIXED?)
Doctor, it hurts when I point this gun at my foot and pull the trigger
...
> Last time I looked at the coherency code, there appeared to be a few
> bugs on some architectures, but I didn't have the architectures to
> test and confirm them. It was a long time ago, in the 2.4 era though.
I think there's a few parisc machines floating around looking for good
homes if you're interested.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 13:45 ` Matthew Wilcox
@ 2008-10-22 14:02 ` Miklos Szeredi
2008-10-22 14:35 ` Matthew Wilcox
0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-22 14:02 UTC (permalink / raw)
To: matthew; +Cc: jamie, miklos, steve, zbr, npiggin, akpm, linux-mm, linux-fsdevel
On Wed, 22 Oct 2008, Matthew Wilcox wrote:
> On Wed, Oct 22, 2008 at 01:48:29PM +0100, Jamie Lokier wrote:
> > Matthew Wilcox wrote:
> > > Careful with those slurs you're throwing around. PA-RISC carefully
> > > aligns its mmaps so they are coherent.
> >
> > (Unless you use MAP_FIXED?)
>
> Doctor, it hurts when I point this gun at my foot and pull the trigger
> ...
And remap_file_pages() also. Neither that nor MAP_FIXED are widely
used, but still, coherency is not a completely solved issue.
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 12:51 ` Jamie Lokier
@ 2008-10-22 14:08 ` Miklos Szeredi
0 siblings, 0 replies; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-22 14:08 UTC (permalink / raw)
To: jamie; +Cc: miklos, steve, npiggin, akpm, linux-mm, linux-fsdevel
On Wed, 22 Oct 2008, Jamie Lokier wrote:
> So GFS goes to great lengths to ensure that read/write are coherent,
> so are mmaps (writable or not), but _splice_ is not coherent in the
> sense that it can send invalid but non-random data? :-)
Splice is not coherent in any sense on any filesystem :)
Your idea about COWing the page would be nice, and I think it may even
be implementable. Currently the biggest problem with splice is the
lack of users; we'd have to solve that first somehow.
> Also, is there still a problem where the data is "valid" but part of
> the page may have been zero'd by truncate, which is then transmitted
> by splice?
Yes.
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 14:02 ` Miklos Szeredi
@ 2008-10-22 14:35 ` Matthew Wilcox
2008-10-22 14:45 ` Miklos Szeredi
0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2008-10-22 14:35 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: jamie, steve, zbr, npiggin, akpm, linux-mm, linux-fsdevel
On Wed, Oct 22, 2008 at 04:02:22PM +0200, Miklos Szeredi wrote:
> On Wed, 22 Oct 2008, Matthew Wilcox wrote:
> > On Wed, Oct 22, 2008 at 01:48:29PM +0100, Jamie Lokier wrote:
> > > Matthew Wilcox wrote:
> > > > Careful with those slurs you're throwing around. PA-RISC carefully
> > > > aligns its mmaps so they are coherent.
> > >
> > > (Unless you use MAP_FIXED?)
> >
> > Doctor, it hurts when I point this gun at my foot and pull the trigger
> > ...
>
> And remap_file_pages() also. Neither that nor MAP_FIXED are widely
> used, but still, coherency is not a completely solved issue.
remap_file_pages() only hurts if you map the same page more than once
(which is permitted, but again, I don't think anyone actually does
that).
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 14:35 ` Matthew Wilcox
@ 2008-10-22 14:45 ` Miklos Szeredi
2008-10-23 13:48 ` Matthew Wilcox
0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-22 14:45 UTC (permalink / raw)
To: matthew; +Cc: miklos, jamie, steve, zbr, npiggin, akpm, linux-mm, linux-fsdevel
On Wed, 22 Oct 2008, Matthew Wilcox wrote:
> remap_file_pages() only hurts if you map the same page more than once
> (which is permitted, but again, I don't think anyone actually does
> that).
This is getting very off-topic... but remap_file_pages() is just like
MAP_FIXED in that the address at which a page is mapped is determined
by the caller, not the kernel. So coherency with other, independent
mapping of the file is basically up to chance.
Do we care? I very much hope not. Why do PA-RISC and friends care at
all about mmap coherency? Is it driven by real-world problems or just to
be safe?
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 10:31 ` Nick Piggin
@ 2008-10-22 18:46 ` Brad Boyer
2008-10-22 20:19 ` Andi Kleen
2008-10-23 7:08 ` Nick Piggin
2008-10-22 23:07 ` Dave Chinner
1 sibling, 2 replies; 41+ messages in thread
From: Brad Boyer @ 2008-10-22 18:46 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Wed, Oct 22, 2008 at 12:31:13PM +0200, Nick Piggin wrote:
> The problem I've had with testing is that it's hard to trigger a specific
> path for a given error, because write IO especially can be quite non
> deterministic, and the filesystem or kernel may give up at various points.
>
> I agree, but I just don't know exactly how they can be turned into
> standard tests. Some filesystems like XFS seem to completely shut down
> quite easily on IO errors. Others like ext2 can't really unwind from
> a failure in a multi-block operation (eg. allocating a block to an
> inode) if an error is detected, and it just gets ignored.
>
> I am testing, but mainly just random failure injections and seeing if
> things go bug or go undetected etc.
Something that might be useful for this kind of testing is a block
device that is just a map onto a real block device but allows the
user to configure it to generate various errors. If we could set it
to always error on either read or write of particular block ranges,
or randomly choose blocks to error from a pattern it could easily
trigger many of the error paths. There was something like this on the
Sega Dreamcast developer kit. The special version of the system used
by developers had this sort of thing implemented in hardware in the
link to the optical drive and this made error testing much easier. It
should be possible to do something similar with a software driver.
Brad Boyer
flar@allandria.com
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 13:16 ` Nick Piggin
@ 2008-10-22 20:09 ` Miklos Szeredi
0 siblings, 0 replies; 41+ messages in thread
From: Miklos Szeredi @ 2008-10-22 20:09 UTC (permalink / raw)
To: npiggin; +Cc: miklos, akpm, linux-mm, linux-fsdevel
On Wed, 22 Oct 2008, Nick Piggin wrote:
> Invalidate I guess is covered now (I don't exactly like the solution,
> but it's what we have for now). Truncate hmm, I thought that still
> clears PageUptodate, but it doesn't seem to either?
Right. Linus's reasons for this change:
"But I'd really like to get that PG_uptodate bit just fixed - both
wrt writeout errors and wrt truncate/holepunch. We had some similar
issues wrt ext3 (?) inode buffers, where removing the uptodate bit
actually ended up being a mistake."
My thoughts are:
a) clearing both PG_uptodate *and* page->mapping is redundant
b) the page contents do not actually change in either the whole-page
truncate or the invalidate case, so the up-to-date state shouldn't
change either.
> Maybe we can use !PageUptodate, with care, for read errors. It might
> actually be a bit preferable in the sense that PageError can just be
> used for write errors only.
That's fine by me, some filesystems do set PageError even on read, but
it doesn't matter, since they obviously won't set PageUptodate in that
case.
Miklos
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 18:46 ` Brad Boyer
@ 2008-10-22 20:19 ` Andi Kleen
2008-10-23 7:08 ` Nick Piggin
1 sibling, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2008-10-22 20:19 UTC (permalink / raw)
To: Brad Boyer
Cc: Nick Piggin, Andi Kleen, Andrew Morton,
Linux Memory Management List, linux-fsdevel
> Something that might be useful for this kind of testing is a block
> device that is just a map onto a real block device but allows the
> user to configure it to generate various errors. If we could set it
Both MD and DM have such injectors. That's not the problem. The problem
is the lack of a standard test suite that sets it all up and exercises
the relevant paths in the VM and FS.
-Andi
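[Editorial note: the DM injectors Andi refers to can be set up roughly as below. A sketch only: the device path is a placeholder, the "error" target is long-standing, and the "flakey" target and its feature arguments appeared in later kernels. Do not run against a device holding data you care about.]

```shell
# Map the whole of /dev/sdb1 so that every I/O to it fails with -EIO.
SECTORS=$(blockdev --getsz /dev/sdb1)
dmsetup create errdev --table "0 $SECTORS error"

# Alternate 55 seconds of working I/O with 5 seconds where all I/O fails:
# <start> <len> flakey <dev> <offset> <up_interval> <down_interval>
dmsetup create flakydev --table "0 $SECTORS flakey /dev/sdb1 0 55 5"
```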
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-21 12:59 ` steve
2008-10-21 13:14 ` Miklos Szeredi
@ 2008-10-22 22:23 ` Mark Fasheh
2008-10-23 9:59 ` steve
1 sibling, 1 reply; 41+ messages in thread
From: Mark Fasheh @ 2008-10-22 22:23 UTC (permalink / raw)
To: steve; +Cc: Miklos Szeredi, npiggin, akpm, linux-mm, linux-fsdevel
On Tue, Oct 21, 2008 at 01:59:15PM +0100, steve@chygwyn.com wrote:
> Hi,
>
> On Tue, Oct 21, 2008 at 02:52:45PM +0200, Miklos Szeredi wrote:
> > On Tue, 21 Oct 2008, Nick Piggin wrote:
> > > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > > the recent round of patches and this one, it should be getting on the
> > > right track.
> > >
> > > I kind of get the feeling some people would rather forget about all this
> > > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > > disagrees with my assertion that error handling, and data integrity
> > > semantics are first-class correctness issues, and therefore are more
> > > important than all other non-correctness problems... speak now and let's
> > > discuss that, please.
> >
> > I agree that error handling is important. But careful: some
> > filesystems (NFS I know) don't set PG_error on async read errors, and
> > changing the semantics of ->readpage() from returning EIO to retrying
> > will potentially cause infinite loops. And no casual testing will
> > reveal those because persistent read errors are extremely rare.
> >
> > So I think a better aproach would be to do
> >
> > error = lock_page_killable(page);
> > if (unlikely(error))
> > goto readpage_error;
> > if (PageError(page) || !PageUptodate(page)) {
> > unlock_page(page);
> > shrink_readahead_size_eio(filp, ra);
> > error = -EIO;
> > goto readpage_error;
> > }
> > if (!page->mapping) {
> > unlock_page(page);
> > page_cache_release(page);
> > goto find_page;
> > }
> >
> > etc...
> >
> > Is there a case where retrying in case of !PageUptodate() makes any
> > sense?
> >
> Yes... cluster filesystems. It's very important in case a readpage
> races with a lock demotion. Since the introduction of page_mkwrite
> that hasn't worked quite right, but by retrying when the page is
> not uptodate, that should fix the problem,
Btw, at least for the readpage case, a return of AOP_TRUNCATED_PAGE should
be checked for, which would indicate (along with !PageUptodate()) whether we
need to retry the read. page_mkwrite though, as you point out, is a
different story.
--Mark
--
Mark Fasheh
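[Editorial note: the retry logic Mark describes might look roughly like the following kernel-style sketch. This is not from any real tree; the labels follow the do_generic_file_read() convention quoted earlier in the thread, and the exact locking is an assumption:]

```c
	/* Sketch: retry ->readpage() when the page was truncated or
	 * invalidated underneath us, but treat a !uptodate page after a
	 * successful readpage as a hard read error. */
	error = mapping->a_ops->readpage(filp, page);
	if (error == AOP_TRUNCATED_PAGE) {
		page_cache_release(page);
		goto find_page;		/* page gone: look it up again */
	}
	if (error)
		goto readpage_error;
	if (!PageUptodate(page)) {
		lock_page(page);
		if (!page->mapping) {	/* invalidated: retry */
			unlock_page(page);
			page_cache_release(page);
			goto find_page;
		}
		unlock_page(page);
		error = -EIO;		/* genuine read failure */
		goto readpage_error;
	}
```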
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 10:31 ` Nick Piggin
2008-10-22 18:46 ` Brad Boyer
@ 2008-10-22 23:07 ` Dave Chinner
2008-10-23 7:07 ` Nick Piggin
1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2008-10-22 23:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Wed, Oct 22, 2008 at 12:31:13PM +0200, Nick Piggin wrote:
> On Tue, Oct 21, 2008 at 06:16:24PM +0200, Andi Kleen wrote:
> > Nick Piggin <npiggin@suse.de> writes:
> >
> > > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > > the recent round of patches and this one, it should be getting on the
> > > right track.
> > >
> > > I kind of get the feeling some people would rather forget about all this
> > > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > > disagrees with my assertion that error handling, and data integrity
> > > semantics are first-class correctness issues, and therefore are more
> > > important than all other non-correctness problems... speak now and let's
> > > discuss that, please.
> > >
> > > Otherwise, unless anybody sees obvious problems with this, hopefully it
> > > can go into -mm for some wider testing (I've tested it with a few filesystems
> > > so far and no immediate problems)
> >
> > I think the first step to get these more robust in the future would be to
> > have a standard regression test testing these paths. Otherwise it'll
> > bit-rot sooner or later again.
>
> The problem I've had with testing is that it's hard to trigger a specific
> path for a given error, because write IO especially can be quite non
> deterministic, and the filesystem or kernel may give up at various points.
>
> I agree, but I just don't know exactly how they can be turned into
> standard tests. Some filesystems like XFS seem to completely shut down
> quite easily on IO errors.
XFS only shuts down on unrecoverable metadata I/O errors, and those
I/Os do not go through the generic code paths at all (they go
through the XFS buffer layer instead). Errors in data I/O will
not trigger shutdowns at all - instead they get reported to the
application doing the I/O...
> Others like ext2 can't really unwind from
> a failure in a multi-block operation (eg. allocating a block to an
> inode) if an error is detected, and it just gets ignored.
It's in these sorts of situations that XFS will trigger a shutdown.
> I am testing, but mainly just random failure injections and seeing if
> things go bug or go undetected etc.
I'd start with just data I/O. e.g. preallocate a large file, then do
reads and writes to it and inject errors into those I/Os. That
should avoid all the problems of I/O errors on metadata I/O and the
problems that generates. If you do synchronous I/O, then the errors
should pop straight out to the test application as well. All
filesystems should report the same errors with this sort of test.
You could do the same thing for metadata read operations. e.g. build
a large directory structure, then do read operations on it (readdir,
stat, etc) and inject errors into each of those. All filesystems
should return the (EIO) error to the application in this case.
Those two cases should be pretty generic and deterministic - they
both avoid the difficult problem of determining what the response
to an I/O error during metadata modification should be....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 23:07 ` Dave Chinner
@ 2008-10-23 7:07 ` Nick Piggin
2008-10-23 9:44 ` steve
0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2008-10-23 7:07 UTC (permalink / raw)
To: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Thu, Oct 23, 2008 at 10:07:15AM +1100, Dave Chinner wrote:
> On Wed, Oct 22, 2008 at 12:31:13PM +0200, Nick Piggin wrote:
> > On Tue, Oct 21, 2008 at 06:16:24PM +0200, Andi Kleen wrote:
> > > Nick Piggin <npiggin@suse.de> writes:
> > >
> > > > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > > > the recent round of patches and this one, it should be getting on the
> > > > right track.
> > > >
> > > > I kind of get the feeling some people would rather forget about all this
> > > > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > > > disagrees with my assertion that error handling, and data integrity
> > > > semantics are first-class correctness issues, and therefore are more
> > > > important than all other non-correctness problems... speak now and let's
> > > > discuss that, please.
> > > >
> > > > Otherwise, unless anybody sees obvious problems with this, hopefully it
> > > > can go into -mm for some wider testing (I've tested it with a few filesystems
> > > > so far and no immediate problems)
> > >
> > > I think the first step to get these more robust in the future would be to
> > > have a standard regression test testing these paths. Otherwise it'll
> > > bit-rot sooner or later again.
> >
> > The problem I've had with testing is that it's hard to trigger a specific
> > path for a given error, because write IO especially can be quite non
> > deterministic, and the filesystem or kernel may give up at various points.
> >
> > I agree, but I just don't know exactly how they can be turned into
> > standard tests. Some filesystems like XFS seem to completely shut down
> > quite easily on IO errors.
>
> XFS only shuts down on unrecoverable metadata I/O errors, and those
> I/Os do not go through the generic code paths at all (they go
> through the XFS buffer layer instead). Errors in data I/O will
> not trigger shutdowns at all - instead they get reported to the
> application doing the I/O...
Yes, but just for my testing of injecting random errors, it's hard to
get meaningful tests because of the shutdown problem. Not a criticism :)
Of the few fses I tested, XFS definitely seemed to be the best (at least
it detects and makes effort to handle failures somehow).
> > Others like ext2 can't really unwind from
> > a failure in a multi-block operation (eg. allocating a block to an
> > inode) if an error is detected, and it just gets ignored.
>
> It's these sorts of situations that XFS will trigger a shutdown.
>
> > I am testing, but mainly just random failure injections and seeing if
> > things go bug or go undetected etc.
>
> I'd start with just data I/O. e.g. preallocate a large file, then do
> reads and writes to it and inject errors into those I/Os. That
> should avoid all the problems of I/O errors on metadata I/O and the
> problems that generates. If you do synchronous I/O, then the errors
> should pop straight out to the test application as well. All
> filesystems should report the same errors with this sort of test.
Yeah, that's probably a good idea. That should still provide all or
nearly all coverage for the core VM.
> You could do the same thing for metadata read operations. e.g. build
> a large directory structure, then do read operations on it (readdir,
> stat, etc) and inject errors into each of those. All filesystems
> should return the (EIO) error to the application in this case.
>
> Those two cases should be pretty generic and deterministic - they
> both avoid the difficult problem of determining what the response
> to an I/O error during metadata modifcation should be....
Good suggestion.
I'll see what I can do. I'm using the fault injection stuff, which I
don't think can distinguish metadata, so I might just have to work
out a bio flag or something we can send down to the block layer to
distinguish.
Thanks,
Nick
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 18:46 ` Brad Boyer
2008-10-22 20:19 ` Andi Kleen
@ 2008-10-23 7:08 ` Nick Piggin
1 sibling, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2008-10-23 7:08 UTC (permalink / raw)
To: Brad Boyer
Cc: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Wed, Oct 22, 2008 at 11:46:24AM -0700, Brad Boyer wrote:
> On Wed, Oct 22, 2008 at 12:31:13PM +0200, Nick Piggin wrote:
> > The problem I've had with testing is that it's hard to trigger a specific
> > path for a given error, because write IO especially can be quite non
> > deterministic, and the filesystem or kernel may give up at various points.
> >
> > I agree, but I just don't know exactly how they can be turned into
> > standard tests. Some filesystems like XFS seem to completely shut down
> > quite easily on IO errors. Others like ext2 can't really unwind from
> > a failure in a multi-block operation (eg. allocating a block to an
> > inode) if an error is detected, and it just gets ignored.
> >
> > I am testing, but mainly just random failure injections and seeing if
> > things go bug or go undetected etc.
>
> Something that might be useful for this kind of testing is a block
> device that is just a map onto a real block device but allows the
> user to configure it to generate various errors. If we could set it
> to always error on either read or write of particular block ranges,
or randomly choose blocks to error from a pattern, it could easily
trigger many of the error paths. There was something like this on the
> Sega Dreamcast developer kit. The special version of the system used
> by developers had this sort of thing implemented in hardware in the
link to the optical drive, and this made error testing much easier. It
> should be possible to do something similar with a software driver.
We have this failure injection stuff for the block layer, which is
what I've been using. It probably doesn't quite cover all the possible
failure modes of block devices, but it seems to be a reasonable start.
It could use some extending to distinguish reads vs writes, and data
vs metadata, as Dave pointed out.
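For reference, the block-layer failure injection Nick mentions is driven through debugfs. A minimal configuration might look like the following (requires root; the device name sdb is an example):

```shell
# Requires CONFIG_FAULT_INJECTION_DEBUG_FS=y and CONFIG_FAIL_MAKE_REQUEST=y.
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true

cd /sys/kernel/debug/fail_make_request
echo 1   > interval      # consider every request
echo 10  > probability   # fail ~10% of the considered requests
echo -1  > times         # no limit on the number of failures
echo 1   > verbose       # log each injected failure

# Enable injection for one device only.
echo 1 > /sys/block/sdb/make-it-fail
```

Note that these knobs give random failures only; there is no read-vs-write or data-vs-metadata filter, which is exactly the gap being discussed.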
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 7:07 ` Nick Piggin
@ 2008-10-23 9:44 ` steve
2008-10-23 11:15 ` Nick Piggin
0 siblings, 1 reply; 41+ messages in thread
From: steve @ 2008-10-23 9:44 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
Hi,
On Thu, Oct 23, 2008 at 09:07:11AM +0200, Nick Piggin wrote:
> On Thu, Oct 23, 2008 at 10:07:15AM +1100, Dave Chinner wrote:
[snip]
>
> > You could do the same thing for metadata read operations. e.g. build
> > a large directory structure, then do read operations on it (readdir,
> > stat, etc) and inject errors into each of those. All filesystems
> > should return the (EIO) error to the application in this case.
> >
> > Those two cases should be pretty generic and deterministic - they
> > both avoid the difficult problem of determining what the response
> > to an I/O error during metadata modification should be....
>
> Good suggestion.
>
> I'll see what I can do. I'm using the fault injection stuff, which I
> don't think can distinguish metadata, so I might just have to work
> out a bio flag or something we can send down to the block layer to
> distinguish.
>
> Thanks,
> Nick
>
Don't we already have such a flag? I know that it's not set in all
the correct places in GFS2 so far, but I've gradually been fixing
them to include BIO_RW_META where appropriate.
Also, it occurs to me that we can use FIEMAP to discover where a
particular file is, and that would then allow us to target error
injection at particular blocks of the file.
Given that we can cover xattr blocks with FIEMAP too[*], and that at
least with GFS2 and similar filesystems the inode number is the
block number of the inode, the only thing missing, in terms of
locating inode data & metadata, would be the indirect blocks,
Steve.
[*] GFS2's FIEMAP doesn't support xattrs yet, but it's on my todo list
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 22:23 ` Mark Fasheh
@ 2008-10-23 9:59 ` steve
2008-10-23 10:21 ` Nick Piggin
0 siblings, 1 reply; 41+ messages in thread
From: steve @ 2008-10-23 9:59 UTC (permalink / raw)
To: Mark Fasheh; +Cc: Miklos Szeredi, npiggin, akpm, linux-mm, linux-fsdevel
Hi,
On Wed, Oct 22, 2008 at 03:23:16PM -0700, Mark Fasheh wrote:
> On Tue, Oct 21, 2008 at 01:59:15PM +0100, steve@chygwyn.com wrote:
> > Hi,
> >
> > On Tue, Oct 21, 2008 at 02:52:45PM +0200, Miklos Szeredi wrote:
> > > On Tue, 21 Oct 2008, Nick Piggin wrote:
> > > > IO error handling in the core mm/fs still doesn't seem perfect, but with
> > > > the recent round of patches and this one, it should be getting on the
> > > > right track.
> > > >
> > > > I kind of get the feeling some people would rather forget about all this
> > > > and brush it under the carpet. Hopefully I'm mistaken, but if anybody
> > > > disagrees with my assertion that error handling, and data integrity
> > > > semantics are first-class correctness issues, and therefore are more
> > > > important than all other non-correctness problems... speak now and let's
> > > > discuss that, please.
> > >
> > > I agree that error handling is important. But careful: some
> > > filesystems (NFS I know) don't set PG_error on async read errors, and
> > > changing the semantics of ->readpage() from returning EIO to retrying
> > > will potentially cause infinite loops. And no casual testing will
> > > reveal those because persistent read errors are extremely rare.
> > >
> > > So I think a better approach would be to do
> > >
> > > 	error = lock_page_killable(page);
> > > 	if (unlikely(error))
> > > 		goto readpage_error;
> > > 	if (PageError(page) || !PageUptodate(page)) {
> > > 		unlock_page(page);
> > > 		shrink_readahead_size_eio(filp, ra);
> > > 		error = -EIO;
> > > 		goto readpage_error;
> > > 	}
> > > 	if (!page->mapping) {
> > > 		unlock_page(page);
> > > 		page_cache_release(page);
> > > 		goto find_page;
> > > 	}
> > >
> > > etc...
> > >
> > > Is there a case where retrying in case of !PageUptodate() makes any
> > > sense?
> > >
> > Yes... cluster filesystems. It's very important in case a readpage
> > races with a lock demotion. Since the introduction of page_mkwrite
> > that hasn't worked quite right, but by retrying when the page is
> > not uptodate, that should fix the problem,
>
> Btw, at least for the readpage case, a return of AOP_TRUNCATED_PAGE should
> be checked for, which would indicate (along with !PageUptodate()) whether we
> need to retry the read. page_mkwrite though, as you point out, is a
> different story.
> --Mark
>
Yes, and although I probably didn't make it clear, I was thinking
specifically of the page fault path, where both readpage and
page_mkwrite hang out.
Also, I've looked through all the current GFS2 code and it seems to
be correct in relation to Miklos' point on PageUptodate() vs
page->mapping == NULL, so I don't think any changes are required there,
but obviously that needs to be taken into account in filemap_fault wrt
retrying in the lock demotion case. In other words, we should be
testing for page->mapping == NULL rather than !PageUptodate() in that
case,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 9:59 ` steve
@ 2008-10-23 10:21 ` Nick Piggin
2008-10-23 10:52 ` steve
0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2008-10-23 10:21 UTC (permalink / raw)
To: steve; +Cc: Mark Fasheh, Miklos Szeredi, akpm, linux-mm, linux-fsdevel
On Thu, Oct 23, 2008 at 10:59:49AM +0100, steve@chygwyn.com wrote:
> > Btw, at least for the readpage case, a return of AOP_TRUNCATED_PAGE should
> > be checked for, which would indicate (along with !PageUptodate()) whether we
> > need to retry the read. page_mkwrite though, as you point out, is a
> > different story.
> > --Mark
> >
> Yes, and although I probably didn't make it clear I was thinking
> specifically of the page fault path there where both readpage and
> page_mkwrite hang out.
>
> Also, I've looked through all the current GFS2 code and it seems to
> be correct in relation to Miklos' point on PageUptodate() vs
> page->mapping == NULL so I don't think any changes are required there,
> but obviously that needs to be taken into account in filemap_fault wrt
> to retrying in the lock demotion case. In other words we should be
> testing for page->mapping == NULL rather than !PageUptodate() in that
> case,
PageUptodate is OK for the filemap_fault check AFAIKS, because it does
a find_lock_page and runs the check under lock (so it can't be truncated
or invalidated), in order to prevent fault vs truncate / invalidate races.
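In sketch form (paraphrased pseudocode, not verbatim mm/filemap.c), the ordering Nick describes looks like this:

```c
/* Sketch of the filemap_fault pattern. Truncate and invalidate both
 * take the page lock, so any state checked while the lock is held
 * cannot change under us. */
retry_find:
	page = find_lock_page(mapping, index);	/* returned locked */
	if (!page)
		/* ... read the page in, then goto retry_find ... */;
	if (!page->mapping) {			/* truncated or invalidated */
		unlock_page(page);
		page_cache_release(page);
		goto retry_find;		/* retry, don't report EIO */
	}
	if (!PageUptodate(page))
		/* a genuine read error: handle or propagate it */;
	/* locked, attached and uptodate: safe to install in page tables */
```

Because both checks run under the page lock, !PageUptodate here cannot be a transient invalidation artifact, which is why it is a valid error test in this path.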
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 10:21 ` Nick Piggin
@ 2008-10-23 10:52 ` steve
2008-10-23 11:07 ` Nick Piggin
0 siblings, 1 reply; 41+ messages in thread
From: steve @ 2008-10-23 10:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Mark Fasheh, Miklos Szeredi, akpm, linux-mm, linux-fsdevel
Hi,
On Thu, Oct 23, 2008 at 12:21:00PM +0200, Nick Piggin wrote:
> On Thu, Oct 23, 2008 at 10:59:49AM +0100, steve@chygwyn.com wrote:
> > > Btw, at least for the readpage case, a return of AOP_TRUNCATED_PAGE should
> > > be checked for, which would indicate (along with !PageUptodate()) whether we
> > > need to retry the read. page_mkwrite though, as you point out, is a
> > > different story.
> > > --Mark
> > >
> > Yes, and although I probably didn't make it clear I was thinking
> > specifically of the page fault path there where both readpage and
> > page_mkwrite hang out.
> >
> > Also, I've looked through all the current GFS2 code and it seems to
> > be correct in relation to Miklos' point on PageUptodate() vs
> > page->mapping == NULL so I don't think any changes are required there,
> > but obviously that needs to be taken into account in filemap_fault wrt
> > to retrying in the lock demotion case. In other words we should be
> > testing for page->mapping == NULL rather than !PageUptodate() in that
> > case,
>
> PageUptodate is OK for the filemap_fault check AFAIKS, because it does
> a find_lock_page and runs the check under lock (so it can't be truncated
> or invalidated), in order to prevent fault vs truncate / invalidate races.
>
Ah yes, I see now. Sorry, my fault (no pun intended!). It's the test after
readpage that I was thinking of, for which I'd previously posted a
patch under the subject "Potential fix to filemap_fault()" and you'd
responded with a more comprehensive patch, back in July,
Steve.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 10:52 ` steve
@ 2008-10-23 11:07 ` Nick Piggin
0 siblings, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2008-10-23 11:07 UTC (permalink / raw)
To: steve; +Cc: Mark Fasheh, Miklos Szeredi, akpm, linux-mm, linux-fsdevel
On Thu, Oct 23, 2008 at 11:52:11AM +0100, steve@chygwyn.com wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 12:21:00PM +0200, Nick Piggin wrote:
> > On Thu, Oct 23, 2008 at 10:59:49AM +0100, steve@chygwyn.com wrote:
> > > > Btw, at least for the readpage case, a return of AOP_TRUNCATED_PAGE should
> > > > be checked for, which would indicate (along with !PageUptodate()) whether we
> > > > need to retry the read. page_mkwrite though, as you point out, is a
> > > > different story.
> > > > --Mark
> > > >
> > > Yes, and although I probably didn't make it clear I was thinking
> > > specifically of the page fault path there where both readpage and
> > > page_mkwrite hang out.
> > >
> > > Also, I've looked through all the current GFS2 code and it seems to
> > > be correct in relation to Miklos' point on PageUptodate() vs
> > > page->mapping == NULL so I don't think any changes are required there,
> > > but obviously that needs to be taken into account in filemap_fault wrt
> > > to retrying in the lock demotion case. In other words we should be
> > > testing for page->mapping == NULL rather than !PageUptodate() in that
> > > case,
> >
> > PageUptodate is OK for the filemap_fault check AFAIKS, because it does
> > a find_lock_page and runs the check under lock (so it can't be truncated
> > or invalidated), in order to prevent fault vs truncate / invalidate races.
> >
>
> Ah yes, I see now. Sorry, my fault (no pun intended!). It's the test after
> readpage that I was thinking of, for which I'd previously posted a
> patch under the subject "Potential fix to filemap_fault()" and you'd
> responded with a more comprehensive patch, back in July,
Yes, this is roughly the same patch, although probably that part of it
was valid when you posted it as we probably still cleared PG_uptodate
on invalidate at that time...
I still think there are some improvements we can make though, so I
have picked up the patch again and started trying to do more testing
of it.
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 9:44 ` steve
@ 2008-10-23 11:15 ` Nick Piggin
2008-10-23 22:48 ` Dave Chinner
0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2008-10-23 11:15 UTC (permalink / raw)
To: steve
Cc: Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Thu, Oct 23, 2008 at 10:44:16AM +0100, steve@chygwyn.com wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 09:07:11AM +0200, Nick Piggin wrote:
> > On Thu, Oct 23, 2008 at 10:07:15AM +1100, Dave Chinner wrote:
> [snip]
> >
> > > You could do the same thing for metadata read operations. e.g. build
> > > a large directory structure, then do read operations on it (readdir,
> > > stat, etc) and inject errors into each of those. All filesystems
> > > should return the (EIO) error to the application in this case.
> > >
> > > Those two cases should be pretty generic and deterministic - they
> > > both avoid the difficult problem of determining what the response
> > > to an I/O error during metadata modification should be....
> >
> > Good suggestion.
> >
> > I'll see what I can do. I'm using the fault injection stuff, which I
> > don't think can distinguish metadata, so I might just have to work
> > out a bio flag or something we can send down to the block layer to
> > distinguish.
> >
> > Thanks,
> > Nick
> >
>
> Don't we already have such a flag? I know that its not set in all
> the correct places in GFS2 so far, but I've gradually been fixing
> them to include BIO_RW_META where appropriate.
That should probably work. It seems to be very incomplete (GFS2
being one of the few exceptions). Though adding more support in
ext2 and the buffer layer should be enough for me to start with,
and shouldn't be too hard.
> Also it occurs to me that we can use FIEMAP to discover where a
> particular file is and that would then allow us to target error
> injection at particular blocks of the file.
>
> Given that we can cover xattr blocks with FIEMAP too[*], and that at
> least with GFS2 and similar filesystems the inode number is the
> block number of the inode, the only thing that would be missing,
> in terms of locating inode data & metadata, would be the indirect blocks,
That would be interesting. I'm finding it hard to come up with
good ways to trigger a lot of the cases just for the single-file case,
so I won't need to do anything so advanced yet ;)
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-22 14:45 ` Miklos Szeredi
@ 2008-10-23 13:48 ` Matthew Wilcox
0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2008-10-23 13:48 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: jamie, steve, zbr, npiggin, akpm, linux-mm, linux-fsdevel
On Wed, Oct 22, 2008 at 04:45:24PM +0200, Miklos Szeredi wrote:
> On Wed, 22 Oct 2008, Matthew Wilcox wrote:
> > remap_file_pages() only hurts if you map the same page more than once
> > (which is permitted, but again, I don't think anyone actually does
> > that).
>
> This is getting very offtopic... but remap_file_pages() is just like
> MAP_FIXED, that the address at which a page is mapped is determined by
> the caller, not the kernel. So coherency with other, independent
> mapping of the file is basically up to chance.
Oh, right, I see.
> Do we care? I very much hope not. Why do PA-RISC and friends care at
> all about mmap coherency? Is it real-world problem driven or just to
> be safe?
One reason to care is performance. If two tasks have libc mapped
coherently, then the addresses will collide in the cache and they'll
use the same cachelines. I don't know how much applications care about
correctness with mmap -- they certainly do with sysv shm. I probably
knew about some applications that broke when I was working on this code --
back in 2002.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 11:15 ` Nick Piggin
@ 2008-10-23 22:48 ` Dave Chinner
2008-10-24 1:05 ` Nick Piggin
0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2008-10-23 22:48 UTC (permalink / raw)
To: Nick Piggin
Cc: steve, Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Thu, Oct 23, 2008 at 01:15:00PM +0200, Nick Piggin wrote:
> On Thu, Oct 23, 2008 at 10:44:16AM +0100, steve@chygwyn.com wrote:
> > Hi,
> >
> > On Thu, Oct 23, 2008 at 09:07:11AM +0200, Nick Piggin wrote:
> > > On Thu, Oct 23, 2008 at 10:07:15AM +1100, Dave Chinner wrote:
> > [snip]
> > >
> > > > You could do the same thing for metadata read operations. e.g. build
> > > > a large directory structure, then do read operations on it (readdir,
> > > > stat, etc) and inject errors into each of those. All filesystems
> > > > should return the (EIO) error to the application in this case.
> > > >
> > > > Those two cases should be pretty generic and deterministic - they
> > > > both avoid the difficult problem of determining what the response
> > > > > to an I/O error during metadata modification should be....
> > >
> > > Good suggestion.
> > >
> > > I'll see what I can do. I'm using the fault injection stuff, which I
> > > don't think can distinguish metadata, so I might just have to work
> > > out a bio flag or something we can send down to the block layer to
> > > distinguish.
> > >
> > > Thanks,
> > > Nick
> > >
> >
> > Don't we already have such a flag? I know that its not set in all
> > the correct places in GFS2 so far, but I've gradually been fixing
> > them to include BIO_RW_META where appropriate.
>
> That should probably work. It seems to be very incomplete (GFS2
> being one of the few exceptions). Though adding more support in
> ext2 and buffer layer should be enough for me to start with,
> and shouldn't be too hard.
I've posted patches to tag XFS metadata with BIO_RW_META in the
past, but that patch set had performance implications for different I/O
schedulers so it never went further than just a patch. If I
leave all the BIO_RW_SYNC tagging for the metadata bios, then
a single line change to add BIO_RW_META should not have any
performance impact....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [patch] fs: improved handling of page and buffer IO errors
2008-10-23 22:48 ` Dave Chinner
@ 2008-10-24 1:05 ` Nick Piggin
0 siblings, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2008-10-24 1:05 UTC (permalink / raw)
To: steve, Andi Kleen, Andrew Morton, Linux Memory Management List,
linux-fsdevel
On Fri, Oct 24, 2008 at 09:48:04AM +1100, Dave Chinner wrote:
> On Thu, Oct 23, 2008 at 01:15:00PM +0200, Nick Piggin wrote:
> > On Thu, Oct 23, 2008 at 10:44:16AM +0100, steve@chygwyn.com wrote:
> > > Hi,
> > >
> > > On Thu, Oct 23, 2008 at 09:07:11AM +0200, Nick Piggin wrote:
> > > > On Thu, Oct 23, 2008 at 10:07:15AM +1100, Dave Chinner wrote:
> > > [snip]
> > > >
> > > > > You could do the same thing for metadata read operations. e.g. build
> > > > > a large directory structure, then do read operations on it (readdir,
> > > > > stat, etc) and inject errors into each of those. All filesystems
> > > > > should return the (EIO) error to the application in this case.
> > > > >
> > > > > Those two cases should be pretty generic and deterministic - they
> > > > > both avoid the difficult problem of determining what the response
> > > > > to an I/O error during metadata modification should be....
> > > >
> > > > Good suggestion.
> > > >
> > > > I'll see what I can do. I'm using the fault injection stuff, which I
> > > > don't think can distinguish metadata, so I might just have to work
> > > > out a bio flag or something we can send down to the block layer to
> > > > distinguish.
> > > >
> > > > Thanks,
> > > > Nick
> > > >
> > >
> > > Don't we already have such a flag? I know that its not set in all
> > > the correct places in GFS2 so far, but I've gradually been fixing
> > > them to include BIO_RW_META where appropriate.
> >
> > That should probably work. It seems to be very incomplete (GFS2
> > being one of the few exceptions). Though adding more support in
> > ext2 and buffer layer should be enough for me to start with,
> > and shouldn't be too hard.
I have a patch for (most of) buffer and ext2 btw. Seems to work OK.
> I've posted patches to tag XFS metadata with BIO_RW_META in the
> past, but that patch set had performance implications for different I/O
> schedulers so it never went further than just a patch. If I
Hmm, yes CFQ does do something with meta requests. It's a pity you
can't just add the annotations and file bugs with CFQ if it hurts
performance :P
> leave all the BIO_RW_SYNC tagging for the metadata bios, then
> a single line change to add BIO_RW_META should not have any
> performance impact....
Sorry, I don't understand what you mean here.
end of thread, other threads:[~2008-10-24 1:05 UTC | newest]
Thread overview: 41+ messages
2008-10-21 11:21 [patch] fs: improved handling of page and buffer IO errors Nick Piggin
2008-10-21 12:52 ` Miklos Szeredi
2008-10-21 12:59 ` steve
2008-10-21 13:14 ` Miklos Szeredi
2008-10-21 13:38 ` steve
2008-10-21 14:32 ` Miklos Szeredi
2008-10-21 15:09 ` steve
2008-10-21 16:13 ` Miklos Szeredi
2008-10-22 12:51 ` Jamie Lokier
2008-10-22 14:08 ` Miklos Szeredi
2008-10-21 14:35 ` Evgeniy Polyakov
2008-10-21 14:59 ` steve
2008-10-21 16:20 ` Miklos Szeredi
2008-10-21 16:25 ` steve
2008-10-21 16:28 ` Miklos Szeredi
2008-10-21 16:29 ` Matthew Wilcox
2008-10-22 12:48 ` Jamie Lokier
2008-10-22 13:45 ` Matthew Wilcox
2008-10-22 14:02 ` Miklos Szeredi
2008-10-22 14:35 ` Matthew Wilcox
2008-10-22 14:45 ` Miklos Szeredi
2008-10-23 13:48 ` Matthew Wilcox
2008-10-22 22:23 ` Mark Fasheh
2008-10-23 9:59 ` steve
2008-10-23 10:21 ` Nick Piggin
2008-10-23 10:52 ` steve
2008-10-23 11:07 ` Nick Piggin
2008-10-22 13:16 ` Nick Piggin
2008-10-22 20:09 ` Miklos Szeredi
2008-10-21 16:16 ` Andi Kleen
2008-10-21 16:30 ` steve
2008-10-22 10:31 ` Nick Piggin
2008-10-22 18:46 ` Brad Boyer
2008-10-22 20:19 ` Andi Kleen
2008-10-23 7:08 ` Nick Piggin
2008-10-22 23:07 ` Dave Chinner
2008-10-23 7:07 ` Nick Piggin
2008-10-23 9:44 ` steve
2008-10-23 11:15 ` Nick Piggin
2008-10-23 22:48 ` Dave Chinner
2008-10-24 1:05 ` Nick Piggin