* Clear PG_error before reading a page @ 2007-05-15 14:37 Johann Lombardi 2007-05-15 17:11 ` Andrew Morton 0 siblings, 1 reply; 10+ messages in thread From: Johann Lombardi @ 2007-05-15 14:37 UTC (permalink / raw) To: linux-kernel We've observed that transient disk errors can result in persistent I/O issues at the filesystem level. I can at least put forward the problem with ext2/ext3 and scsci_debug. Here are the steps to reproduce: * format an ext3 fs on a SCSI disk simulated by scsci_debug * mount the fs and create a 10MB file * unmount/remount the fs * enable the medium error flag in scsi_debug (please note that I've slightly modified scsi_debug to return a medium error indication when any sector >= 0x1234 is read) * try to read the file (as expected, got -EIO) * disable medium error injection in scsi_debug * try to read the file (keep getting -EIO) In fact, when the underlying device experiences errors, block_read_full_page() calls ext3_get_block() which correctly returns an I/O error. Consequently, block_read_full_page() set the PG_error flag. However, the problem is that afterwards, when the device no longer returns any errors, PG_error is never cleared and, as a consequence, we are still getting -EIO at the filesystem level. Granted that block_read_full_page() always tries to fill a full page, I think it should first clear the PG_error flag and set it back if a call to get_block() fails. Besides, we should handle this in block_read_full_page but also in other places that do full page reads. Any comments? I am not able to reproduce the problem with the patch below. Johann Signed-off-by: Johann Lombardi <johann@clusterfs.com> -- --- linux-2.6.12.6.orig/fs/buffer.c 2005-08-29 18:55:27.000000000 +0200 +++ linux-2.6.12.6/fs/buffer.c 2007-05-11 15:51:03.000000000 +0200 @@ -2078,6 +2078,7 @@ int block_read_full_page(struct page *pa int fully_mapped = 1; BUG_ON(!PageLocked(page)); + ClearPageError(page); blocksize = 1 << inode->i_blkbits; if (!page_has_buffers(page)) create_empty_buffers(page, blocksize, 0); ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-15 14:37 Clear PG_error before reading a page Johann Lombardi @ 2007-05-15 17:11 ` Andrew Morton 2007-05-15 21:01 ` Johann Lombardi 0 siblings, 1 reply; 10+ messages in thread From: Andrew Morton @ 2007-05-15 17:11 UTC (permalink / raw) To: Johann Lombardi; +Cc: linux-kernel On Tue, 15 May 2007 16:37:26 +0200 Johann Lombardi <johann@clusterfs.com> wrote: > We've observed that transient disk errors can result in persistent I/O issues > at the filesystem level. I can at least put forward the problem with ext2/ext3 > and scsci_debug. > > Here are the steps to reproduce: > * format an ext3 fs on a SCSI disk simulated by scsci_debug > * mount the fs and create a 10MB file > * unmount/remount the fs > * enable the medium error flag in scsi_debug (please note that I've slightly > modified scsi_debug to return a medium error indication when any sector >= > 0x1234 is read) > * try to read the file (as expected, got -EIO) > * disable medium error injection in scsi_debug > * try to read the file (keep getting -EIO) > > In fact, when the underlying device experiences errors, block_read_full_page() > calls ext3_get_block() which correctly returns an I/O error. > Consequently, block_read_full_page() set the PG_error flag. > However, the problem is that afterwards, when the device no longer returns > any errors, PG_error is never cleared and, as a consequence, we are still > getting -EIO at the filesystem level. > > Granted that block_read_full_page() always tries to fill a full page, > I think it should first clear the PG_error flag and set it back if a call > to get_block() fails. Besides, we should handle this in block_read_full_page > but also in other places that do full page reads. Any comments? > I am not able to reproduce the problem with the patch below. > > Johann > > Signed-off-by: Johann Lombardi <johann@clusterfs.com> > -- > > --- linux-2.6.12.6.orig/fs/buffer.c 2005-08-29 18:55:27.000000000 +0200 > +++ linux-2.6.12.6/fs/buffer.c 2007-05-11 15:51:03.000000000 +0200 > @@ -2078,6 +2078,7 @@ int block_read_full_page(struct page *pa > int fully_mapped = 1; > > BUG_ON(!PageLocked(page)); > + ClearPageError(page); > blocksize = 1 << inode->i_blkbits; > if (!page_has_buffers(page)) > create_empty_buffers(page, blocksize, 0); We need to make sure that this page has PG_uptodate cleared, so that a re-read is forced. And the affected buffer_head, if any, should have buffer_uptodate() cleared. This change might have horrid interactions with readahead and various application access patterns: if for some reason a dud sector takes 30 seconds of driver futzing to return -EIO and someone (application or kernel) tries to read the same sector tens or hundreds of times, suckiness ensues. This will be hard to test for. Most reads don't (or shouldn't) go through block_read_full_page(). mpage_readpages() does the heavy lifting. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-15 17:11 ` Andrew Morton @ 2007-05-15 21:01 ` Johann Lombardi 2007-05-15 21:23 ` Andrew Morton 0 siblings, 1 reply; 10+ messages in thread From: Johann Lombardi @ 2007-05-15 21:01 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tue, May 15, 2007 at 10:11:44AM -0700, Andrew Morton wrote: > We need to make sure that this page has PG_uptodate cleared, so > that a re-read is forced. And the affected buffer_head, if any, should have > buffer_uptodate() cleared. ok. > This change might have horrid interactions with readahead and various > application access patterns: if for some reason a dud sector takes 30 > seconds of driver futzing to return -EIO and someone (application or > kernel) tries to read the same sector tens or hundreds of times, suckiness > ensues. This will be hard to test for. Yes, I agree. So according to you, what's the best way to address the initial problem (i.e. PG_error never cleared)? > Most reads don't (or shouldn't) go through block_read_full_page(). > mpage_readpages() does the heavy lifting. Yes, indeed. However, as soon as a call to get_block() fails, do_mpage_readpage() will call block_read_full_page() which will attach buffers to this page. Consequently, all subsequent reads will go through block_read_full_page(). Johann ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-15 21:01 ` Johann Lombardi @ 2007-05-15 21:23 ` Andrew Morton 2007-05-16 15:39 ` Johann Lombardi 0 siblings, 1 reply; 10+ messages in thread From: Andrew Morton @ 2007-05-15 21:23 UTC (permalink / raw) To: Johann Lombardi; +Cc: linux-kernel On Tue, 15 May 2007 23:01:24 +0200 Johann Lombardi <johann@clusterfs.com> wrote: > On Tue, May 15, 2007 at 10:11:44AM -0700, Andrew Morton wrote: > > We need to make sure that this page has PG_uptodate cleared, so > > that a re-read is forced. And the affected buffer_head, if any, should have > > buffer_uptodate() cleared. > > ok. > > > This change might have horrid interactions with readahead and various > > application access patterns: if for some reason a dud sector takes 30 > > seconds of driver futzing to return -EIO and someone (application or > > kernel) tries to read the same sector tens or hundreds of times, suckiness > > ensues. This will be hard to test for. > > Yes, I agree. So according to you, what's the best way to address the > initial problem (i.e. PG_error never cleared)? ooh, hard. umm, a very safe way would be to free the dang page. Make sure that we have no references on it and that we have a ref on the inode (ie: we're within a syscall which was passed an fd) then run invalidate_mapping_pages() > > Most reads don't (or shouldn't) go through block_read_full_page(). > > mpage_readpages() does the heavy lifting. > > Yes, indeed. However, as soon as a call to get_block() fails, > do_mpage_readpage() will call block_read_full_page() which will attach > buffers to this page. > Consequently, all subsequent reads will go through block_read_full_page(). hm, confused. Why is get_block() failing? That has to go and read metadata. If get_block() failed then we don't know what blocks to read to bring this page uptodate, so the pagecache page should remain in state !PageUptodate(), !PageError(). But then, we shouldn't have populated pagecache at that offset at all. I think I'm missing something here. I suspect you're referring to a mix of reading the blockdev via /dev/hda1 and then using the already-populated pagecache as filesystem metadata, or something? Is the PageError page part of an S_ISREG file, or is it part of an S_ISBLK file? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-15 21:23 ` Andrew Morton @ 2007-05-16 15:39 ` Johann Lombardi 2007-05-16 15:49 ` Nick Piggin 2007-05-16 16:12 ` Andrew Morton 0 siblings, 2 replies; 10+ messages in thread From: Johann Lombardi @ 2007-05-16 15:39 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote: > > Yes, indeed. However, as soon as a call to get_block() fails, > > do_mpage_readpage() will call block_read_full_page() which will attach > > buffers to this page. > > Consequently, all subsequent reads will go through block_read_full_page(). > > hm, confused. Why is get_block() failing? That has to go and read > metadata. In fact, I am referring to the first part of my test case (i.e. mount the ext3 fs, enable medium errors in scsi_debug and try to read a file from the fs). So, when I try to read a file, ext3_get_block() needs to read metadata from the disk. However, given that the SCSI disk simulated by scsi_debug reports medium errors, ext3_get_block() returns EIO to the caller (i.e. do_mpage_readpage()). That's why get_block() is failing. Then, do_mpage_readpage() calls block_read_full_page() (via "goto confused"). block_read_full_page() attaches buffers to this page and calls ext3_get_block() which fails for the same reason as before. Consequently, block_read_full_page() sets the PG_error flag. Moreover, all subsequent readpage calls will go through block_read_full_page() because the page has now buffers attached. Basically, my problem is that afterwards, when the device no longer returns any errors, the PG_error flag is never cleared and, as a result, I keep getting -EIO. That's the problem I'd like to address. > If get_block() failed then we don't know what blocks to read to > bring this page uptodate, so the pagecache page should remain in state > !PageUptodate(), !PageError(). But then, we shouldn't have populated > pagecache at that offset at all. Yes, indeed. do_generic_mapping_read() doesn't populate the pagecache if an error occurred. Still, __do_page_cache_readahead()->read_pages()->ext3_readpages()->mpage_readpages() does populate the pagecache even if the page reads failed. > I think I'm missing something here. I suspect you're referring to a mix > of reading the blockdev via /dev/hda1 and then using the already-populated > pagecache as filesystem metadata, or something? I don't think so, unless I'm missing something. > Is the PageError page part of an S_ISREG file, or is it part of an S_ISBLK > file? The PageError pages are part of a regular file and have been read through readahead. Johann ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-16 15:39 ` Johann Lombardi @ 2007-05-16 15:49 ` Nick Piggin 2007-05-16 16:12 ` Andrew Morton 1 sibling, 0 replies; 10+ messages in thread From: Nick Piggin @ 2007-05-16 15:49 UTC (permalink / raw) To: Johann Lombardi; +Cc: Andrew Morton, linux-kernel Johann Lombardi wrote: > On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote: >>If get_block() failed then we don't know what blocks to read to >>bring this page uptodate, so the pagecache page should remain in state >>!PageUptodate(), !PageError(). But then, we shouldn't have populated >>pagecache at that offset at all. > > > Yes, indeed. do_generic_mapping_read() doesn't populate the pagecache if an > error occurred. > Still, __do_page_cache_readahead()->read_pages()->ext3_readpages()->mpage_readpages() > does populate the pagecache even if the page reads failed. Doesn't really matter. If the read failed then the page is not uptodate. If it didn't, then it isn't error. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-16 15:39 ` Johann Lombardi 2007-05-16 15:49 ` Nick Piggin @ 2007-05-16 16:12 ` Andrew Morton 2007-05-17 11:42 ` Johann Lombardi 1 sibling, 1 reply; 10+ messages in thread From: Andrew Morton @ 2007-05-16 16:12 UTC (permalink / raw) To: Johann Lombardi; +Cc: linux-kernel On Wed, 16 May 2007 17:39:19 +0200 Johann Lombardi <johann@clusterfs.com> wrote: > On Tue, May 15, 2007 at 02:23:39PM -0700, Andrew Morton wrote: > > > Yes, indeed. However, as soon as a call to get_block() fails, > > > do_mpage_readpage() will call block_read_full_page() which will attach > > > buffers to this page. > > > Consequently, all subsequent reads will go through block_read_full_page(). > > > > hm, confused. Why is get_block() failing? That has to go and read > > metadata. > > In fact, I am referring to the first part of my test case (i.e. mount the ext3 > fs, enable medium errors in scsi_debug and try to read a file from the fs). > > So, when I try to read a file, ext3_get_block() needs to read metadata from the > disk. However, given that the SCSI disk simulated by scsi_debug reports medium > errors, ext3_get_block() returns EIO to the caller (i.e. do_mpage_readpage()). > That's why get_block() is failing. > > Then, do_mpage_readpage() calls block_read_full_page() (via "goto confused"). > block_read_full_page() attaches buffers to this page and calls ext3_get_block() > which fails for the same reason as before. Consequently, block_read_full_page() > sets the PG_error flag. > Moreover, all subsequent readpage calls will go through block_read_full_page() > because the page has now buffers attached. > > Basically, my problem is that afterwards, when the device no longer returns > any errors, the PG_error flag is never cleared and, as a result, I keep > getting -EIO. That's the problem I'd like to address. > hm, OK. So, where are we up to? I still worry about the fact that changes in this area could cause the kernel to do a *lot* more IO attempts against failed devices, or failed sectors. We already have a few problems in that area. What is the actual real-world operational scenario here? Would it be a hotplugged disk? A transient network failure in a SAN? IOW, is it something from which the kernel should automatically recover, or it is a situation in which manual intervention would be better? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-16 16:12 ` Andrew Morton @ 2007-05-17 11:42 ` Johann Lombardi 2007-05-17 16:47 ` Andrew Morton 0 siblings, 1 reply; 10+ messages in thread From: Johann Lombardi @ 2007-05-17 11:42 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Wed, May 16, 2007 at 09:12:17AM -0700, Andrew Morton wrote: > > Basically, my problem is that afterwards, when the device no longer returns > > any errors, the PG_error flag is never cleared and, as a result, I keep > > getting -EIO. That's the problem I'd like to address. > > > > hm, OK. So, where are we up to? Once the errors reported by the underlying device are corrected, we must unmount/remount the filesystem if we want to use it. In fact, since readahead ignores I/O errors, the pagecache is populated with pages having the PG_error flag set and buffers attached. Since PG_error is then never cleared, we keep getting EIO despite that the underlying device works just fine. > What is the actual real-world operational scenario here? Would it be a > hotplugged disk? A transient network failure in a SAN? IOW, is it > something from which the kernel should automatically recover, or it is a > situation in which manual intervention would be better? The real-world operational scenario is a storage system reporting medium errors which can be corrected by a manual intervention. Johann ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-17 11:42 ` Johann Lombardi @ 2007-05-17 16:47 ` Andrew Morton 2007-05-29 17:06 ` Johann Lombardi 0 siblings, 1 reply; 10+ messages in thread From: Andrew Morton @ 2007-05-17 16:47 UTC (permalink / raw) To: Johann Lombardi; +Cc: linux-kernel On Thu, 17 May 2007 13:42:50 +0200 Johann Lombardi <johann@clusterfs.com> wrote: > > What is the actual real-world operational scenario here? Would it be a > > hotplugged disk? A transient network failure in a SAN? IOW, is it > > something from which the kernel should automatically recover, or it is a > > situation in which manual intervention would be better? > > The real-world operational scenario is a storage system reporting medium > errors which can be corrected by a manual intervention. So running ioctl(BLKFLSBUF) against the device will fix things up? Perhaps not, and we should invalidate all the file data as well. The only ways we have of doing that are umount+mount or /proc/sys/vm/drop_caches. I'm not sure what is the best thing to do here - there are advantages to caching the fact that the device (or some of it) is unreadable. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Clear PG_error before reading a page 2007-05-17 16:47 ` Andrew Morton @ 2007-05-29 17:06 ` Johann Lombardi 0 siblings, 0 replies; 10+ messages in thread From: Johann Lombardi @ 2007-05-29 17:06 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Thu, May 17, 2007 at 09:47:30AM -0700, Andrew Morton wrote: > So running ioctl(BLKFLSBUF) against the device will fix things up? No, 'blockdev --flushbufs /dev/sdb' doesn't help. > Perhaps not, and we should invalidate all the file data as well. The > only ways we have of doing that are umount+mount or /proc/sys/vm/drop_caches. 'echo > /proc/sys/vm/drop_caches' works. Thanks. Johann ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2007-05-29 17:06 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-15 14:37 Clear PG_error before reading a page Johann Lombardi 2007-05-15 17:11 ` Andrew Morton 2007-05-15 21:01 ` Johann Lombardi 2007-05-15 21:23 ` Andrew Morton 2007-05-16 15:39 ` Johann Lombardi 2007-05-16 15:49 ` Nick Piggin 2007-05-16 16:12 ` Andrew Morton 2007-05-17 11:42 ` Johann Lombardi 2007-05-17 16:47 ` Andrew Morton 2007-05-29 17:06 ` Johann Lombardi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox