Re: [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)
@ 2004-07-26 23:36 Martins Krikis
  2004-07-26 23:57 ` Marcelo Tosatti
  0 siblings, 1 reply; 5+ messages in thread
From: Martins Krikis @ 2004-07-26 23:36 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, arjanv, mkrikis

[-- Attachment #1: Type: text/plain, Size: 3481 bytes --]

Marcelo,

I had already written the subject off as "nobody cares"; I'm very 
glad you responded. Some more comments are below and I'm attaching
a new patch.

> The correct thing seems to pass IO error ("!uptodate") to the upper 
> ->b_end_io callback only in case both components of an IO operation
> have 
> failed. As long as at least one of the mirrors have successfully 
> finished the IO, we should continue operation, providing reliability,
> AFAICS. 

I think you are talking about the RAID-1 read case. In this case
the ataraid subdrivers just pick one disk among all the possible
mirrors and submit the request to that. ataraid_end_request() is
not involved. The fact that most ataraid subdrivers won't resubmit
the IO to a different mirror if it fails is a very small problem
compared to what I'm picking on here.

And that is the following. For RAID1 writes, the IOs have to be
submitted to all the mirrors. If the IO for _any_ of them fails,
the upper levels should be notified of the error, because the
data integrity is broken across mirrors. (With the exception of
those few subdrivers that are capable of keeping working with the
volume in "degraded" mode.) Ataraid_end_request() is widely used
for the RAID1 write case and thus carries the danger of never
notifying the upper levels that anything went wrong. 

Furthermore, for RAID0 reads and writes, the ataraid subdrivers
widely employ the ataraid_split_request() function in order to 
get the IO size to fit in just one strip and thus be applicable
to just one disk. The ataraid_split_request() just splits the 
request in halves and may get invoked again on each part if they
still don't fit. Each pair of new IOs that are produced this way 
are tied together through the use of ataraid_end_request(). Thus, 
in theory there could be a whole binary tree of requests formed 
from the "original request". And again, the trouble from using
ataraid_end_request() is that most of these IOs in the tree could
go wrong, but as long as the last one succeeds, a chain of
ataraid_end_request()-s will happily report a successful IO
completion to each other and eventually back to the upper levels.
The userland application may never learn about the massive data
corruption that has just occured. Again, it is my strong belief
that if _any_ "component IO" fails, the whole "original IO" should
be failed. 

Therefore, the patch keeps track of cumulative IO status by using
an otherwise unused bit from the b_state field in each IO's "parent
buffer_head". Other solutions exist and I mentioned it before, but I
think this is the least intrusive and does not require extra memory.

> About error reporting, indeed the current scheme is very bad and
> doesnt
> correctly report errors. We should at least print some warning on 
> ataraid_end_request() informing of IO errors. 

I added a printk to the patch to follow your suggestion,
but I think the main thing is to let the upper levels know.

> Its not OK to mess with BH_PrivateStart, no, jbd/xfs use it
> for their own purposes.

That's very unfortunate, because the comments in fs.h for 2.4 kernels
and in buffer_head.h for 2.6 kernels make it look like all the bits
starting with BH_PrivateStart are fair game. Any chance of replacing
the inviting comment with a big warning, or better yet, officially
incorporating all the bits used in jbd.h into the b_state?

Best regards,

  Martins Krikis

P.S. The patch is untested. I don't have a setup to do this easily.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: ataraid.patch --]
[-- Type: text/x-patch, Size: 1248 bytes --]

--- linux/drivers/ide/raid/ataraid.c.orig	2004-07-13 23:45:21.000000000 -0400
+++ linux/drivers/ide/raid/ataraid.c	2004-07-26 18:52:21.000000000 -0400
@@ -34,7 +34,8 @@

 #include "ataraid.h"

-                                        
+#define BH_IOFailure (BH_PrivateStart + 5) /* jbd.h uses +0 to +4 */
+
 static int ataraid_hardsect_size[256];
 static int ataraid_blksize_size[256];

@@ -153,7 +154,17 @@ void ataraid_end_request(struct buffer_h
 	if (private==NULL)
 		BUG();

+	if (!uptodate) { 
+		set_bit(BH_IOFailure, &private->parent->b_state);
+		printk(KERN_ERR "ataraid: IO error on major %d minor %d\n",
+		       MAJOR(bh->b_rdev), MINOR(bh->b_rdev));
+	}		       
+	
 	if (atomic_dec_and_test(&private->count)) {
+		if (test_bit(BH_IOFailure, &private->parent->b_state)) {
+			uptodate = 0; /* fail the completed original I/O */
+			clear_bit(BH_IOFailure, &private->parent->b_state);
+		}
 		private->parent->b_end_io(private->parent,uptodate);
 		private->parent = NULL;
 		kfree(private);
@@ -194,6 +205,8 @@ static void ataraid_split_request(reques

 	bh2->b_data +=  bh->b_size/2;

+	clear_bit(BH_IOFailure, &bh->b_state); /* this bit tracks success */
+
 	generic_make_request(rw,bh1);
 	generic_make_request(rw,bh2);
 }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)
  2004-07-26 23:36 [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels) Martins Krikis
@ 2004-07-26 23:57 ` Marcelo Tosatti
  2004-07-27  5:15   ` Martins Krikis
  0 siblings, 1 reply; 5+ messages in thread
From: Marcelo Tosatti @ 2004-07-26 23:57 UTC (permalink / raw)
  To: Martins Krikis, arjanv; +Cc: linux-kernel

On Mon, Jul 26, 2004 at 07:36:01PM -0400, Martins Krikis wrote:
> Marcelo,
> 
> I had already written the subject off as "nobody cares"; I'm very 
> glad you responded. Some more comments are below and I'm attaching
> a new patch.

:)

> > The correct thing seems to pass IO error ("!uptodate") to the upper 
> > ->b_end_io callback only in case both components of an IO operation
> > have 
> > failed. As long as at least one of the mirrors have successfully 
> > finished the IO, we should continue operation, providing reliability,
> > AFAICS. 

Note, I'm really ignorant about ataraid, so I might miss something, 
as you just pointed out that I have :) 

Anyway, lets keep going...

> I think you are talking about the RAID-1 read case. In this case
> the ataraid subdrivers just pick one disk among all the possible
> mirrors and submit the request to that. ataraid_end_request() is
> not involved. The fact that most ataraid subdrivers won't resubmit
> the IO to a different mirror if it fails is a very small problem
> compared to what I'm picking on here.
> 
> And that is the following. For RAID1 writes, the IOs have to be
> submitted to all the mirrors. If the IO for _any_ of them fails,
> the upper levels should be notified of the error, because the
> data integrity is broken across mirrors. (With the exception of
> those few subdrivers that are capable of keeping working with the
> volume in "degraded" mode.) 

I supposed all of them would be able to work in degraded mode? No?

> Ataraid_end_request() is widely used
> for the RAID1 write case and thus carries the danger of never
> notifying the upper levels that anything went wrong. 
> 
> Furthermore, for RAID0 reads and writes, the ataraid subdrivers
> widely employ the ataraid_split_request() function in order to 
> get the IO size to fit in just one strip and thus be applicable
> to just one disk. The ataraid_split_request() just splits the 
> request in halves and may get invoked again on each part if they
> still don't fit. Each pair of new IOs that are produced this way 
> are tied together through the use of ataraid_end_request(). Thus, 
> in theory there could be a whole binary tree of requests formed 
> from the "original request". And again, the trouble from using
> ataraid_end_request() is that most of these IOs in the tree could
> go wrong, but as long as the last one succeeds, a chain of
> ataraid_end_request()-s will happily report a successful IO
> completion to each other and eventually back to the upper levels.
> The userland application may never learn about the massive data
> corruption that has just occured. Again, it is my strong belief
> that if _any_ "component IO" fails, the whole "original IO" should
> be failed. 
> 
> Therefore, the patch keeps track of cumulative IO status by using
> an otherwise unused bit from the b_state field in each IO's "parent
> buffer_head". Other solutions exist and I mentioned it before, but I
> think this is the least intrusive and does not require extra memory.
> 
> > About error reporting, indeed the current scheme is very bad and
> > doesnt
> > correctly report errors. We should at least print some warning on 
> > ataraid_end_request() informing of IO errors. 
> 
> I added a printk to the patch to follow your suggestion,
> but I think the main thing is to let the upper levels know.
> 
> > Its not OK to mess with BH_PrivateStart, no, jbd/xfs use it
> > for their own purposes.
> 
> That's very unfortunate, because the comments in fs.h for 2.4 kernels
> and in buffer_head.h for 2.6 kernels make it look like all the bits
> starting with BH_PrivateStart are fair game. Any chance of replacing
> the inviting comment with a big warning, or better yet, officially
> incorporating all the bits used in jbd.h into the b_state?

Well, the comment says

        BH_PrivateStart,/* not a state bit, but the first bit available
                         * for private allocation by other entities
                         */

And "other entities" are some of the journalling filesystems.

When I read the comment, I can't figure out that using it 
freely from anywhere is "fair game". But you seem to understand
otherwise.

Yes, we could stick a stronger indication of "journalling fs'es for instance 
use this". Incorporating the bits in jbd.h into b_state doesnt seem right
because BH_PrivateStart is a generic thing, and not a JBD only thing. 
Does that make sense?

> Best regards,
> 
>   Martins Krikis
> 
> P.S. The patch is untested. I don't have a setup to do this easily.


The "#define BH_IOFailure (BH_PrivateStart + 5) /* jbd.h uses +0 to +4 */" 
is not the cleanest thing in the world, but...

Lets ask the guy who wrote this. Arjan, what you think of this patch?

>From my POV, better to report IO errors than not at all.

> --- linux/drivers/ide/raid/ataraid.c.orig	2004-07-13 23:45:21.000000000 -0400
> +++ linux/drivers/ide/raid/ataraid.c	2004-07-26 18:52:21.000000000 -0400
> @@ -34,7 +34,8 @@
>  
>  #include "ataraid.h"
>  
> -                                        
> +#define BH_IOFailure (BH_PrivateStart + 5) /* jbd.h uses +0 to +4 */
> +
>  static int ataraid_hardsect_size[256];
>  static int ataraid_blksize_size[256];
>  
> @@ -153,7 +154,17 @@ void ataraid_end_request(struct buffer_h
>  	if (private==NULL)
>  		BUG();
>  
> +	if (!uptodate) { 
> +		set_bit(BH_IOFailure, &private->parent->b_state);
> +		printk(KERN_ERR "ataraid: IO error on major %d minor %d\n",
> +		       MAJOR(bh->b_rdev), MINOR(bh->b_rdev));
> +	}		       
> +	
>  	if (atomic_dec_and_test(&private->count)) {
> +		if (test_bit(BH_IOFailure, &private->parent->b_state)) {
> +			uptodate = 0; /* fail the completed original I/O */
> +			clear_bit(BH_IOFailure, &private->parent->b_state);
> +		}
>  		private->parent->b_end_io(private->parent,uptodate);
>  		private->parent = NULL;
>  		kfree(private);
> @@ -194,6 +205,8 @@ static void ataraid_split_request(reques
>  
>  	bh2->b_data +=  bh->b_size/2;
>  
> +	clear_bit(BH_IOFailure, &bh->b_state); /* this bit tracks success */
> +
>  	generic_make_request(rw,bh1);
>  	generic_make_request(rw,bh2);
>  }


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)
  2004-07-26 23:57 ` Marcelo Tosatti
@ 2004-07-27  5:15   ` Martins Krikis
  0 siblings, 0 replies; 5+ messages in thread
From: Martins Krikis @ 2004-07-27  5:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, mkrikis, arjanv

--- Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:

> > And that is the following. For RAID1 writes, the IOs have to be
> > submitted to all the mirrors. If the IO for _any_ of them fails,
> > the upper levels should be notified of the error, because the
> > data integrity is broken across mirrors. (With the exception of
> > those few subdrivers that are capable of keeping working with the
> > volume in "degraded" mode.) 
> 
> I supposed all of them would be able to work in degraded mode? No?

Well, I don't want to give definite answers for drivers I haven't
been involved with. Yet my belief is that none of the ataraid
subdrivers that are currently included with the official kernel
sources can do that. Actually, I should not have chosen the words
"keeping working" because they surely will, in blissful ignorance
of any errors that are happening or any metadata that may need
updating. I'll stop here as this is not a direction that I want
to take this discussion. And my apologies if what I said is not
correct. I think the main idea is that this is ataraid, not md,
and the bar is set much lower on purpose. But that shouldn't
preclude letting the user know that errors are happening, and 
that's why I first wrote.

> > That's very unfortunate, because the comments in fs.h for 2.4
> kernels
> > and in buffer_head.h for 2.6 kernels make it look like all the bits
> > starting with BH_PrivateStart are fair game. Any chance of
> replacing
> > the inviting comment with a big warning, or better yet, officially
> > incorporating all the bits used in jbd.h into the b_state?
> 
> Well, the comment says
> 
>         BH_PrivateStart,/* not a state bit, but the first bit
> available
>                          * for private allocation by other entities
>                          */
> 
> And "other entities" are some of the journalling filesystems.
> 
> When I read the comment, I can't figure out that using it 
> freely from anywhere is "fair game". But you seem to understand
> otherwise.

Yes, I had indeed understood that any new modules may make free
use of this bit as it suits their needs. A new block device
driver, for example, is also an "other entity". The author may
check what the layers below it do (if any, e.g., scsi hba drivers),
see that nobody is touching BH_PrivateSync and assume that it
is OK to use it in the block device driver. Or in an ataraid
patch :-). Turns out that the bit is off-limits, however, because
journaling filesystems use it and block device drivers living below 
the filesystems could cause problems by using this bit. That's
not good, the block device driver writer should not have to
worry about what the higher layers are doing. While it may not
be the best analogy, when we kmalloc memory, it is ours to use,
no worries about who else will use it after we're done. We may
have to initialize it but we certainly don't have to return it
to the state that we received it in before releasing. Going back 
to the BH_PrivateSync bit, even a person who notices jbd.h using
the bit, may assume that perhaps the filesystems only use the bit
when nobody can interfere with its use. So now I know that it's
apparently not the case, but I sure had my hopes up still very
recently... (Yes, I know that the argument can be made that the
block device driver shouldn't use it either, because somebody
on an even lower level may decide to use it... I don't have a
good response to that; I can't demand of filesystems what I'm
not doing myself; I was just hoping for a little miracle, I guess).

> Yes, we could stick a stronger indication of "journalling fs'es for
> instance 
> use this". Incorporating the bits in jbd.h into b_state doesnt seem
> right
> because BH_PrivateStart is a generic thing, and not a JBD only thing.
> 
> Does that make sense?

Yes and no. If it's a truly "generic thing" then a block device
driver should be able to use it. It sounds more like it's a
"filesystem-specific thing, not to be used by any module that's
not a filesystem". Well, I don't know what the ideal solution is, 
but I hope I explained how I found the comment misleading. It seems
to me that such bit-ownership disputes may come up again, unless
either the kernel decides who owns it or the comment says that
the bit is used on first-come-first-served basis (and therefore
new module authors must check the whole source tree for previous
uses). Or perhaps, the kernel can just specify that so many bits
are for use at filesystem level, so many next bits for use by
block devices (and so on if there are any other interested parties).
But we can probably just leave all this as is. This discussion
will leave enough of an email-trace for anybody who is serious
to figure out where things stand.

> The "#define BH_IOFailure (BH_PrivateStart + 5) /* jbd.h uses +0 to
> +4 */" 
> is not the cleanest thing in the world, but...

I know :-).   In my original post I expressed an alternative---have
an extra field in ataraid_bh_private structure for this purpose.
But I hate using more space for something that really is just
one bit worth of information (per original IO). 

  Martins Krikis

__________________________________
Do you Yahoo!?
Yahoo! Mail Address AutoComplete - You start. We finish.
http://promotions.yahoo.com/new_mail 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)
@ 2004-07-14  4:11 Martins Krikis
  2004-07-25 22:38 ` Marcelo Tosatti
  0 siblings, 1 reply; 5+ messages in thread
From: Martins Krikis @ 2004-07-14  4:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: mkrikis

[-- Attachment #1: Type: text/plain, Size: 1592 bytes --]

I know that interest in 2.4 kernels and ataraid at this point
is probably minimal, and I myself don't use ataraid_end_request.

However, I would appreciate if somebody could tell me whether a
patch like the one attached is acceptable or whether the use of
the BH_PrivateStart flag (style issues aside) can introduce any
new problems. In particular, I've noticed that ext3 and xfs also
use this flag.

The bug that the patch attempts to solve is the following.
Ataraid_end_request() uses the success/failure of the last
one of component I/Os as the success/failure of the complete
I/O (to a RAID volume). Thus, even if all the first components
fail, as long as the last one succeeds, it will report the
complete I/O as a success, and the user will not even get
an indication of any errors. The code is used by all ataraid
subdrivers except iswraid, as far as I know.

If using the BH_PrivateStart is not appropriate in this module,
a different free bit from b_state can be chosen. Protection
against component-I/O failure can also be achieved by introducing
a new field in ataraid_bh_private. Ideally the subdrivers would
also clear the flag/field, although for a truly unused flag
in b_state this can be skipped, so a flag-based solution can
be both quicker and need no extra space, I think.

Anyway, I'm very interested to hear whether anybody cares
and whether it is OK to use BH_PrivateStart in block device
drivers.

Thanks,

  Martins Krikis

__________________________________
Do you Yahoo!?
Take Yahoo! Mail with you! Get it on your mobile phone.
http://mobile.yahoo.com/maildemo 

[-- Attachment #2: ataraid.patch --]
[-- Type: application/octet-stream, Size: 949 bytes --]

--- linux/drivers/ide/raid/ataraid.c.orig	2004-07-13 23:45:21.000000000 -0400
+++ linux/drivers/ide/raid/ataraid.c	2004-07-13 23:45:32.000000000 -0400
@@ -153,7 +153,14 @@ void ataraid_end_request(struct buffer_h
 	if (private==NULL)
 		BUG();

+	if (!uptodate) /* record failures of components of the original I/O */
+		set_bit(BH_PrivateStart, &private->parent->b_state);
+	
 	if (atomic_dec_and_test(&private->count)) {
+		if (test_bit(BH_PrivateStart, &private->parent->b_state)) {
+			uptodate = 0; /* fail the completed original I/O */
+			clear_bit(BH_PrivateStart, &private->parent->b_state);
+		}
 		private->parent->b_end_io(private->parent,uptodate);
 		private->parent = NULL;
 		kfree(private);
@@ -194,6 +201,8 @@ static void ataraid_split_request(reques

 	bh2->b_data +=  bh->b_size/2;

+	clear_bit(BH_PrivateStart, &bh->b_state); /* this bit tracks success */
+
 	generic_make_request(rw,bh1);
 	generic_make_request(rw,bh2);
 }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels)
  2004-07-14  4:11 Martins Krikis
@ 2004-07-25 22:38 ` Marcelo Tosatti
  0 siblings, 0 replies; 5+ messages in thread
From: Marcelo Tosatti @ 2004-07-25 22:38 UTC (permalink / raw)
  To: Martins Krikis; +Cc: linux-kernel, arjanv


Hi Martins,

On Tue, Jul 13, 2004 at 09:11:47PM -0700, Martins Krikis wrote:
> I know that interest in 2.4 kernels and ataraid at this point
> is probably minimal, and I myself don't use ataraid_end_request.
> 
> However, I would appreciate if somebody could tell me whether a
> patch like the one attached is acceptable or whether the use of
> the BH_PrivateStart flag (style issues aside) can introduce any
> new problems. In particular, I've noticed that ext3 and xfs also
> use this flag.
> 
> The bug that the patch attempts to solve is the following.
> Ataraid_end_request() uses the success/failure of the last
> one of component I/Os as the success/failure of the complete
> I/O (to a RAID volume). Thus, even if all the first components
> fail, as long as the last one succeeds, it will report the
> complete I/O as a success, and the user will not even get
> an indication of any errors. The code is used by all ataraid
> subdrivers except iswraid, as far as I know.

Right, ataraid uses the status ("uptodate") of the last component of 
an IO to inform the real buffer_head IO status, and that is quite bad.

The correct thing seems to pass IO error ("!uptodate") to the upper 
->b_end_io callback only in case both components of an IO operation have 
failed. As long as at least one of the mirrors have successfully 
finished the IO, we should continue operation, providing reliability,
AFAICS. 

About error reporting, indeed the current scheme is very bad and doesnt
correctly report errors. We should at least print some warning on 
ataraid_end_request() informing of IO errors. 

ataraid seems to have been designed with poor IO error handling from 
the beginning.

Arjan, could you please comment on this? 

> If using the BH_PrivateStart is not appropriate in this module,
> a different free bit from b_state can be chosen. Protection
> against component-I/O failure can also be achieved by introducing
> a new field in ataraid_bh_private. Ideally the subdrivers would
> also clear the flag/field, although for a truly unused flag
> in b_state this can be skipped, so a flag-based solution can
> be both quicker and need no extra space, I think.
> 
> Anyway, I'm very interested to hear whether anybody cares
> and whether it is OK to use BH_PrivateStart in block device
> drivers.

Its not OK to mess with BH_PrivateStart, no, jbd/xfs use it
for their own purposes.

Anyway, thanks for the bug report. 


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-07-27  5:16 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-26 23:36 [RFC][PATCH] ataraid_end_request hides errors (all? 2.4 kernels) Martins Krikis
2004-07-26 23:57 ` Marcelo Tosatti
2004-07-27  5:15   ` Martins Krikis
  -- strict thread matches above, loose matches on Subject: below --
2004-07-14  4:11 Martins Krikis
2004-07-25 22:38 ` Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.