Clear PG_error before reading a page

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Clear PG_error before reading a page
@ 2007-05-15 14:37 Johann Lombardi
  2007-05-15 17:11 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Johann Lombardi @ 2007-05-15 14:37 UTC (permalink / raw)
  To: linux-kernel

We've observed that transient disk errors can result in persistent I/O issues
at the filesystem level. I can at least put forward the problem with ext2/ext3
and scsci_debug.

Here are the steps to reproduce:
* format an ext3 fs on a SCSI disk simulated by scsci_debug
* mount the fs and create a 10MB file
* unmount/remount the fs
* enable the medium error flag in scsi_debug (please note that I've slightly
  modified scsi_debug to return a medium error indication when any sector >=
  0x1234 is read)
* try to read the file (as expected, got -EIO)
* disable medium error injection in scsi_debug
* try to read the file (keep getting -EIO)

In fact, when the underlying device experiences errors, block_read_full_page()
calls ext3_get_block() which correctly returns an I/O error.
Consequently, block_read_full_page() set the PG_error flag.
However, the problem is that afterwards, when the device no longer returns
any errors, PG_error is never cleared and, as a consequence, we are still
getting -EIO at the filesystem level.

Granted that block_read_full_page() always tries to fill a full page,
I think it should first clear the PG_error flag and set it back if a call
to get_block() fails. Besides, we should handle this in block_read_full_page
but also in other places that do full page reads. Any comments?
I am not able to reproduce the problem with the patch below.

Johann

Signed-off-by: Johann Lombardi <johann@clusterfs.com>
--

--- linux-2.6.12.6.orig/fs/buffer.c	2005-08-29 18:55:27.000000000 +0200
+++ linux-2.6.12.6/fs/buffer.c	2007-05-11 15:51:03.000000000 +0200
@@ -2078,6 +2078,7 @@ int block_read_full_page(struct page *pa
 	int fully_mapped = 1;

 	BUG_ON(!PageLocked(page));
+	ClearPageError(page);
 	blocksize = 1 << inode->i_blkbits;
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, blocksize, 0);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-15 14:37 Clear PG_error before reading a page Johann Lombardi
@ 2007-05-15 17:11 ` Andrew Morton
  2007-05-15 21:01   ` Johann Lombardi
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-05-15 17:11 UTC (permalink / raw)
  To: Johann Lombardi; +Cc: linux-kernel

On Tue, 15 May 2007 16:37:26 +0200 Johann Lombardi <johann@clusterfs.com> wrote:

> We've observed that transient disk errors can result in persistent I/O issues
> at the filesystem level. I can at least put forward the problem with ext2/ext3
> and scsci_debug.
> 
> Here are the steps to reproduce:
> * format an ext3 fs on a SCSI disk simulated by scsci_debug
> * mount the fs and create a 10MB file
> * unmount/remount the fs
> * enable the medium error flag in scsi_debug (please note that I've slightly
>   modified scsi_debug to return a medium error indication when any sector >=
>   0x1234 is read)
> * try to read the file (as expected, got -EIO)
> * disable medium error injection in scsi_debug
> * try to read the file (keep getting -EIO)
> 
> In fact, when the underlying device experiences errors, block_read_full_page()
> calls ext3_get_block() which correctly returns an I/O error.
> Consequently, block_read_full_page() set the PG_error flag.
> However, the problem is that afterwards, when the device no longer returns
> any errors, PG_error is never cleared and, as a consequence, we are still
> getting -EIO at the filesystem level.
> 
> Granted that block_read_full_page() always tries to fill a full page,
> I think it should first clear the PG_error flag and set it back if a call
> to get_block() fails. Besides, we should handle this in block_read_full_page
> but also in other places that do full page reads. Any comments?
> I am not able to reproduce the problem with the patch below.
> 
> Johann
> 
> Signed-off-by: Johann Lombardi <johann@clusterfs.com>
> --
> 
> --- linux-2.6.12.6.orig/fs/buffer.c	2005-08-29 18:55:27.000000000 +0200
> +++ linux-2.6.12.6/fs/buffer.c	2007-05-11 15:51:03.000000000 +0200
> @@ -2078,6 +2078,7 @@ int block_read_full_page(struct page *pa
>  	int fully_mapped = 1;
>  
>  	BUG_ON(!PageLocked(page));
> +	ClearPageError(page);
>  	blocksize = 1 << inode->i_blkbits;
>  	if (!page_has_buffers(page))
>  		create_empty_buffers(page, blocksize, 0);

We need to make sure that this page has PG_uptodate cleared, so
that a re-read is forced.  And the affected buffer_head, if any, should have
buffer_uptodate() cleared.

This change might have horrid interactions with readahead and various
application access patterns: if for some reason a dud sector takes 30
seconds of driver futzing to return -EIO and someone (application or
kernel) tries to read the same sector tens or hundreds of times, suckiness
ensues.  This will be hard to test for.

Most reads don't (or shouldn't) go through block_read_full_page(). 
mpage_readpages() does the heavy lifting.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-15 17:11 ` Andrew Morton
@ 2007-05-15 21:01   ` Johann Lombardi
  2007-05-15 21:23     ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Johann Lombardi @ 2007-05-15 21:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Tue, May 15, 2007 at 10:11:44AM -0700, Andrew Morton wrote:
> We need to make sure that this page has PG_uptodate cleared, so
> that a re-read is forced.  And the affected buffer_head, if any, should have
> buffer_uptodate() cleared.

ok.

> This change might have horrid interactions with readahead and various
> application access patterns: if for some reason a dud sector takes 30
> seconds of driver futzing to return -EIO and someone (application or
> kernel) tries to read the same sector tens or hundreds of times, suckiness
> ensues.  This will be hard to test for.

Yes, I agree. So according to you, what's the best way to address the
initial problem (i.e. PG_error never cleared)?

> Most reads don't (or shouldn't) go through block_read_full_page().
> mpage_readpages() does the heavy lifting.

Yes, indeed. However, as soon as a call to get_block() fails,
do_mpage_readpage() will call block_read_full_page() which will attach
buffers to this page.
Consequently, all subsequent reads will go through block_read_full_page().

Johann

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-15 21:01   ` Johann Lombardi
@ 2007-05-15 21:23     ` Andrew Morton
  2007-05-16 15:39       ` Johann Lombardi
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-05-15 21:23 UTC (permalink / raw)
  To: Johann Lombardi; +Cc: linux-kernel

On Tue, 15 May 2007 23:01:24 +0200
Johann Lombardi <johann@clusterfs.com> wrote:

> On Tue, May 15, 2007 at 10:11:44AM -0700, Andrew Morton wrote:
> > We need to make sure that this page has PG_uptodate cleared, so
> > that a re-read is forced.  And the affected buffer_head, if any, should have
> > buffer_uptodate() cleared.
> 
> ok.
> 
> > This change might have horrid interactions with readahead and various
> > application access patterns: if for some reason a dud sector takes 30
> > seconds of driver futzing to return -EIO and someone (application or
> > kernel) tries to read the same sector tens or hundreds of times, suckiness
> > ensues.  This will be hard to test for.
> 
> Yes, I agree. So according to you, what's the best way to address the
> initial problem (i.e. PG_error never cleared)?

ooh, hard.

umm, a very safe way would be to free the dang page.  Make sure that we
have no references on it and that we have a ref on the inode (ie: we're
within a syscall which was passed an fd) then run
invalidate_mapping_pages()

> > Most reads don't (or shouldn't) go through block_read_full_page().
> > mpage_readpages() does the heavy lifting.
> 
> Yes, indeed. However, as soon as a call to get_block() fails,
> do_mpage_readpage() will call block_read_full_page() which will attach
> buffers to this page.
> Consequently, all subsequent reads will go through block_read_full_page().

hm, confused.  Why is get_block() failing?  That has to go and read
metadata.  If get_block() failed then we don't know what blocks to read to
bring this page uptodate, so the pagecache page should remain in state
!PageUptodate(), !PageError().  But then, we shouldn't have populated
pagecache at that offset at all.

I think I'm missing something here.  I suspect you're referring to a mix
of reading the blockdev via /dev/hda1 and then using the already-populated
pagecache as filesystem metadata, or something?

Is the PageError page part of an S_ISREG file, or is it part of an S_ISBLK
file?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-15 21:23     ` Andrew Morton
@ 2007-05-16 15:39       ` Johann Lombardi
  2007-05-16 15:49         ` Nick Piggin
  2007-05-16 16:12         ` Andrew Morton
  0 siblings, 2 replies; 10+ messages in thread
From: Johann Lombardi @ 2007-05-16 15:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote:
> > Yes, indeed. However, as soon as a call to get_block() fails,
> > do_mpage_readpage() will call block_read_full_page() which will attach
> > buffers to this page.
> > Consequently, all subsequent reads will go through block_read_full_page().
> 
> hm, confused.  Why is get_block() failing?  That has to go and read
> metadata. 

In fact, I am referring to the first part of my test case (i.e. mount the ext3
fs, enable medium errors in scsi_debug and try to read a file from the fs).

So, when I try to read a file, ext3_get_block() needs to read metadata from the
disk. However, given that the SCSI disk simulated by scsi_debug reports medium
errors, ext3_get_block() returns EIO to the caller (i.e. do_mpage_readpage()).
That's why get_block() is failing.

Then, do_mpage_readpage() calls block_read_full_page() (via "goto confused").
block_read_full_page() attaches buffers to this page and calls ext3_get_block()
which fails for the same reason as before. Consequently, block_read_full_page()
sets the PG_error flag.
Moreover, all subsequent readpage calls will go through block_read_full_page()
because the page has now buffers attached.

Basically, my problem is that afterwards, when the device no longer returns
any errors, the PG_error flag is never cleared and, as a result, I keep
getting -EIO. That's the problem I'd like to address.

> If get_block() failed then we don't know what blocks to read to
> bring this page uptodate, so the pagecache page should remain in state
> !PageUptodate(), !PageError().  But then, we shouldn't have populated
> pagecache at that offset at all.

Yes, indeed. do_generic_mapping_read() doesn't populate the pagecache if an
error occurred.
Still, __do_page_cache_readahead()->read_pages()->ext3_readpages()->mpage_readpages()
does populate the pagecache even if the page reads failed.

> I think I'm missing something here.  I suspect you're referring to a mix
> of reading the blockdev via /dev/hda1 and then using the already-populated
> pagecache as filesystem metadata, or something?

I don't think so, unless I'm missing something.

> Is the PageError page part of an S_ISREG file, or is it part of an S_ISBLK
> file?

The PageError pages are part of a regular file and have been read through
readahead.

Johann

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-16 15:39       ` Johann Lombardi
@ 2007-05-16 15:49         ` Nick Piggin
  2007-05-16 16:12         ` Andrew Morton
  1 sibling, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2007-05-16 15:49 UTC (permalink / raw)
  To: Johann Lombardi; +Cc: Andrew Morton, linux-kernel

Johann Lombardi wrote:
> On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote:

>>If get_block() failed then we don't know what blocks to read to
>>bring this page uptodate, so the pagecache page should remain in state
>>!PageUptodate(), !PageError().  But then, we shouldn't have populated
>>pagecache at that offset at all.
> 
> 
> Yes, indeed. do_generic_mapping_read() doesn't populate the pagecache if an
> error occurred.
> Still, __do_page_cache_readahead()->read_pages()->ext3_readpages()->mpage_readpages()
> does populate the pagecache even if the page reads failed.

Doesn't really matter. If the read failed then the page is not uptodate.
If it didn't, then it isn't error.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-16 15:39       ` Johann Lombardi
  2007-05-16 15:49         ` Nick Piggin
@ 2007-05-16 16:12         ` Andrew Morton
  2007-05-17 11:42           ` Johann Lombardi
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-05-16 16:12 UTC (permalink / raw)
  To: Johann Lombardi; +Cc: linux-kernel

On Wed, 16 May 2007 17:39:19 +0200 Johann Lombardi <johann@clusterfs.com> wrote:

> On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote:
> > > Yes, indeed. However, as soon as a call to get_block() fails,
> > > do_mpage_readpage() will call block_read_full_page() which will attach
> > > buffers to this page.
> > > Consequently, all subsequent reads will go through block_read_full_page().
> > 
> > hm, confused.  Why is get_block() failing?  That has to go and read
> > metadata. 
> 
> In fact, I am referring to the first part of my test case (i.e. mount the ext3
> fs, enable medium errors in scsi_debug and try to read a file from the fs).
> 
> So, when I try to read a file, ext3_get_block() needs to read metadata from the
> disk. However, given that the SCSI disk simulated by scsi_debug reports medium
> errors, ext3_get_block() returns EIO to the caller (i.e. do_mpage_readpage()).
> That's why get_block() is failing.
> 
> Then, do_mpage_readpage() calls block_read_full_page() (via "goto confused").
> block_read_full_page() attaches buffers to this page and calls ext3_get_block()
> which fails for the same reason as before. Consequently, block_read_full_page()
> sets the PG_error flag.
> Moreover, all subsequent readpage calls will go through block_read_full_page()
> because the page has now buffers attached.
> 
> Basically, my problem is that afterwards, when the device no longer returns
> any errors, the PG_error flag is never cleared and, as a result, I keep
> getting -EIO. That's the problem I'd like to address.
> 

hm, OK.  So, where are we up to?

I still worry about the fact that changes in this area could cause the
kernel to do a *lot* more IO attempts against failed devices, or failed
sectors.  We already have a few problems in that area.

What is the actual real-world operational scenario here?  Would it be a
hotplugged disk?  A transient network failure in a SAN?  IOW, is it
something from which the kernel should automatically recover, or it is a
situation in which manual intervention would be better?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-16 16:12         ` Andrew Morton
@ 2007-05-17 11:42           ` Johann Lombardi
  2007-05-17 16:47             ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Johann Lombardi @ 2007-05-17 11:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Wed, May 16, 2007 at 09:12:17AM -0700, Andrew Morton wrote:
> > Basically, my problem is that afterwards, when the device no longer returns
> > any errors, the PG_error flag is never cleared and, as a result, I keep
> > getting -EIO. That's the problem I'd like to address.
> > 
> 
> hm, OK.  So, where are we up to?

Once the errors reported by the underlying device are corrected, we must
unmount/remount the filesystem if we want to use it.
In fact, since readahead ignores I/O errors, the pagecache is populated
with pages having the PG_error flag set and buffers attached.
Since PG_error is then never cleared, we keep getting EIO despite that
the underlying device works just fine.

> What is the actual real-world operational scenario here?  Would it be a
> hotplugged disk?  A transient network failure in a SAN?  IOW, is it
> something from which the kernel should automatically recover, or it is a
> situation in which manual intervention would be better?

The real-world operational scenario is a storage system reporting medium
errors which can be corrected by a manual intervention.

Johann

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-17 11:42           ` Johann Lombardi
@ 2007-05-17 16:47             ` Andrew Morton
  2007-05-29 17:06               ` Johann Lombardi
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2007-05-17 16:47 UTC (permalink / raw)
  To: Johann Lombardi; +Cc: linux-kernel

On Thu, 17 May 2007 13:42:50 +0200 Johann Lombardi <johann@clusterfs.com> wrote:

> > What is the actual real-world operational scenario here?  Would it be a
> > hotplugged disk?  A transient network failure in a SAN?  IOW, is it
> > something from which the kernel should automatically recover, or it is a
> > situation in which manual intervention would be better?
> 
> The real-world operational scenario is a storage system reporting medium
> errors which can be corrected by a manual intervention.

So running ioctl(BLKFLSBUF) against the device will fix things up?

Perhaps not, and we should invalidate all the file data as well. The
only ways we have of doing that are umount+mount or /proc/sys/vm/drop_caches.

I'm not sure what is the best thing to do here - there are advantages to
caching the fact that the device (or some of it) is unreadable.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Clear PG_error before reading a page
  2007-05-17 16:47             ` Andrew Morton
@ 2007-05-29 17:06               ` Johann Lombardi
  0 siblings, 0 replies; 10+ messages in thread
From: Johann Lombardi @ 2007-05-29 17:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 17, 2007 at 09:47:30AM -0700, Andrew Morton wrote:
> So running ioctl(BLKFLSBUF) against the device will fix things up?

No, 'blockdev --flushbufs /dev/sdb' doesn't help.

> Perhaps not, and we should invalidate all the file data as well. The
> only ways we have of doing that are umount+mount or /proc/sys/vm/drop_caches.

'echo > /proc/sys/vm/drop_caches' works. Thanks.

Johann

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-05-29 17:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-15 14:37 Clear PG_error before reading a page Johann Lombardi
2007-05-15 17:11 ` Andrew Morton
2007-05-15 21:01   ` Johann Lombardi
2007-05-15 21:23     ` Andrew Morton
2007-05-16 15:39       ` Johann Lombardi
2007-05-16 15:49         ` Nick Piggin
2007-05-16 16:12         ` Andrew Morton
2007-05-17 11:42           ` Johann Lombardi
2007-05-17 16:47             ` Andrew Morton
2007-05-29 17:06               ` Johann Lombardi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox