* [PATCH 1/2] Avoid bio_endio recursion
@ 2008-06-24 5:22 Mikulas Patocka
2008-06-24 6:08 ` Neil Brown
2008-06-24 8:07 ` Jens Axboe
0 siblings, 2 replies; 20+ messages in thread
From: Mikulas Patocka @ 2008-06-24 5:22 UTC (permalink / raw)
To: linux-kernel; +Cc: axboe
Hi
bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm),
bio_end_io may call bio_endio again, up to an unspecified length.
The crash because of stack overflow was really observed on sparc64. And
this recursion was one of the contributing factors (using 9 stack frames
--- that is 1728 bytes).
This patch removes the recursion.
Mikulas
--
Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in turn
call bio_endio again. When this recursion happens, put the new bio to the queue
and process it later, from the top-level bio_endio.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Index: linux-2.6.26-rc5-devel/fs/bio.c
===================================================================
--- linux-2.6.26-rc5-devel.orig/fs/bio.c 2008-06-18 23:48:45.000000000 +0200
+++ linux-2.6.26-rc5-devel/fs/bio.c 2008-06-19 00:15:56.000000000 +0200
@@ -1168,6 +1168,27 @@
**/
void bio_endio(struct bio *bio, int error)
{
+ static DEFINE_PER_CPU(struct bio **, bio_end_queue) = { NULL };
+ struct bio ***bio_end_queue_ptr;
+ struct bio *bio_queue;
+
+ unsigned long flags;
+
+ local_irq_save(flags);
+ bio_end_queue_ptr = &__get_cpu_var(bio_end_queue);
+
+ if (*bio_end_queue_ptr) {
+ **bio_end_queue_ptr = bio;
+ *bio_end_queue_ptr = &bio->bi_next;
+ bio->bi_next = NULL;
+ goto ret;
+ }
+
+ bio_queue = NULL;
+queue_empty_next_bio:
+ *bio_end_queue_ptr = &bio_queue;
+next_bio:
+
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -1175,6 +1196,17 @@
if (bio->bi_end_io)
bio->bi_end_io(bio, error);
+
+ if (bio_queue) {
+ bio = bio_queue;
+ bio_queue = bio->bi_next;
+ if (!bio_queue) goto queue_empty_next_bio;
+ goto next_bio;
+ }
+ *bio_end_queue_ptr = NULL;
+
+ret:
+ local_irq_restore(flags);
}
void bio_pair_release(struct bio_pair *bp)
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-24 5:22 [PATCH 1/2] Avoid bio_endio recursion Mikulas Patocka @ 2008-06-24 6:08 ` Neil Brown 2008-06-24 14:36 ` Mikulas Patocka 2008-06-24 8:07 ` Jens Axboe 1 sibling, 1 reply; 20+ messages in thread From: Neil Brown @ 2008-06-24 6:08 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-kernel, axboe On Tuesday June 24, mpatocka@redhat.com wrote: > Hi > > bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), > bio_end_io may call bio_endio again, up to an unspecified length. > > The crash because of stack overflow was really observed on sparc64. And > this recursion was one of the contributing factors (using 9 stack frames > --- that is 1728 bytes). > > This patch removes the recursion. This is very cool, thanks! A close mirror of the recursion avoidance in generic_make_request. You use a per-cpu queue were generic_make_request uses a per-task queue. This is fitting as bi_end_io doesn't have a process context, but is supposed to be fast and able to run with interrupts disabled, so tying to a cpu is no problem. > + > + bio_queue = NULL; > +queue_empty_next_bio: > + *bio_end_queue_ptr = &bio_queue; > +next_bio: > + > if (error) > clear_bit(BIO_UPTODATE, &bio->bi_flags); > else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) > @@ -1175,6 +1196,17 @@ > > if (bio->bi_end_io) > bio->bi_end_io(bio, error); > + > + if (bio_queue) { > + bio = bio_queue; > + bio_queue = bio->bi_next; > + if (!bio_queue) goto queue_empty_next_bio; > + goto next_bio; checkpatch.pl doesn't like that: ERROR: trailing statements should be on next line and I don't either. I would not bother with the mini-optimisation at all. Discard the queue_empty_next_bio label and replace the "if () goto" with if (!bio_queue) *bio_end_queue_ptr = &bio_queue; and leave gcc to optimise the assignment if it wants to. Reviewed-by: NeilBrown <neilb@suse.de> Thanks, NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-24 6:08 ` Neil Brown @ 2008-06-24 14:36 ` Mikulas Patocka 0 siblings, 0 replies; 20+ messages in thread From: Mikulas Patocka @ 2008-06-24 14:36 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, axboe On Tue, 24 Jun 2008, Neil Brown wrote: > On Tuesday June 24, mpatocka@redhat.com wrote: >> Hi >> >> bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), >> bio_end_io may call bio_endio again, up to an unspecified length. >> >> The crash because of stack overflow was really observed on sparc64. And >> this recursion was one of the contributing factors (using 9 stack frames >> --- that is 1728 bytes). >> >> This patch removes the recursion. > > This is very cool, thanks! A close mirror of the recursion > avoidance in generic_make_request. > > You use a per-cpu queue were generic_make_request uses a per-task > queue. This is fitting as bi_end_io doesn't have a process context, > but is supposed to be fast and able to run with interrupts disabled, > so tying to a cpu is no problem. Yes. I think "current" variable can't be used in irq context, it would blow with irq-stacks (or access some weird unknown memory). I had another version of the patch that doesn't disable interrupts and only disables preempt and uses local_t atomic cpu-local variables. It is somehow more tricky, because interrupt can be triggered any time while processing the queue and it can add anything to the queue. Then I realized that bio_endio runs most time with disabled interrupts anyway, so it'd be better to just disable interrupts and don't do that local_cmpxchg trickery. Mikulas >> + >> + bio_queue = NULL; >> +queue_empty_next_bio: >> + *bio_end_queue_ptr = &bio_queue; >> +next_bio: >> + >> if (error) >> clear_bit(BIO_UPTODATE, &bio->bi_flags); >> else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) >> @@ -1175,6 +1196,17 @@ >> >> if (bio->bi_end_io) >> bio->bi_end_io(bio, error); >> + >> + if (bio_queue) { >> + bio = bio_queue; >> + bio_queue = bio->bi_next; >> + if (!bio_queue) goto queue_empty_next_bio; >> + goto next_bio; > > checkpatch.pl doesn't like that: > ERROR: trailing statements should be on next line > > and I don't either. I would not bother with the mini-optimisation at > all. > Discard the queue_empty_next_bio label and replace the "if () goto" > with > if (!bio_queue) > *bio_end_queue_ptr = &bio_queue; > > and leave gcc to optimise the assignment if it wants to. > > Reviewed-by: NeilBrown <neilb@suse.de> > > Thanks, > NeilBrown > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-24 5:22 [PATCH 1/2] Avoid bio_endio recursion Mikulas Patocka 2008-06-24 6:08 ` Neil Brown @ 2008-06-24 8:07 ` Jens Axboe 2008-06-24 14:27 ` Mikulas Patocka 1 sibling, 1 reply; 20+ messages in thread From: Jens Axboe @ 2008-06-24 8:07 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-kernel On Tue, Jun 24 2008, Mikulas Patocka wrote: > Hi > > bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), > bio_end_io may call bio_endio again, up to an unspecified length. > > The crash because of stack overflow was really observed on sparc64. And > this recursion was one of the contributing factors (using 9 stack frames > --- that is 1728 bytes). Looks good, I like the concept. Can you please make it a little less goto driven, though? The next_bio and goto next_bio could just be a while(). -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-24 8:07 ` Jens Axboe @ 2008-06-24 14:27 ` Mikulas Patocka 2008-06-25 8:24 ` Jens Axboe 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-06-24 14:27 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, Neil Brown On Tue, 24 Jun 2008, Jens Axboe wrote: > On Tue, Jun 24 2008, Mikulas Patocka wrote: >> Hi >> >> bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), >> bio_end_io may call bio_endio again, up to an unspecified length. >> >> The crash because of stack overflow was really observed on sparc64. And >> this recursion was one of the contributing factors (using 9 stack frames >> --- that is 1728 bytes). > > Looks good, I like the concept. Can you please make it a little less > goto driven, though? The next_bio and goto next_bio could just be a > while(). > > -- > Jens Axboe > Hi. This is the patch, slightly de-goto-ized. (it still contains one, I think that while (1) { ... break ... } is no better readable than goto). I found another problem in my previous patch, I forgot about the "error" variable (it would cause misbehavior for example if disk fails, submits an error and raid driver turns this failure into success). We need to save the error variable somewhere in the bio, there is no other place where it could be placed. I temporarily saved it to bi_idx, because it's unused at this place. Mikulas -- Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in turn call bio_endio again. When this recursion happens, put the new bio to the queue and process it later, from the top-level bio_endio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Index: linux-2.6.26-rc7-devel/fs/bio.c =================================================================== --- linux-2.6.26-rc7-devel.orig/fs/bio.c 2008-06-23 13:49:45.000000000 +0200 +++ linux-2.6.26-rc7-devel/fs/bio.c 2008-06-24 11:17:24.000000000 +0200 @@ -1168,13 +1168,44 @@ **/ void bio_endio(struct bio *bio, int error) { + static DEFINE_PER_CPU(struct bio **, bio_end_queue) = { NULL }; + struct bio ***bio_end_queue_ptr; + struct bio *bio_queue; + + unsigned long flags; + + bio->bi_idx = error; if (error) clear_bit(BIO_UPTODATE, &bio->bi_flags); else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) - error = -EIO; + bio->bi_idx = -EIO; + + local_irq_save(flags); + bio_end_queue_ptr = &__get_cpu_var(bio_end_queue); + + if (*bio_end_queue_ptr) { + **bio_end_queue_ptr = bio; + *bio_end_queue_ptr = &bio->bi_next; + bio->bi_next = NULL; + } else { + bio_queue = NULL; + *bio_end_queue_ptr = &bio_queue; + +next_bio: + if (bio->bi_end_io) + bio->bi_end_io(bio, (short)bio->bi_idx); + + if (bio_queue) { + bio = bio_queue; + bio_queue = bio->bi_next; + if (!bio_queue) + *bio_end_queue_ptr = &bio_queue; + goto next_bio; + } + *bio_end_queue_ptr = NULL; + } - if (bio->bi_end_io) - bio->bi_end_io(bio, error); + local_irq_restore(flags); } void bio_pair_release(struct bio_pair *bp) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-24 14:27 ` Mikulas Patocka @ 2008-06-25 8:24 ` Jens Axboe 2008-06-26 0:13 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Jens Axboe @ 2008-06-25 8:24 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-kernel, Neil Brown On Tue, Jun 24 2008, Mikulas Patocka wrote: > > > On Tue, 24 Jun 2008, Jens Axboe wrote: > > >On Tue, Jun 24 2008, Mikulas Patocka wrote: > >>Hi > >> > >>bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), > >>bio_end_io may call bio_endio again, up to an unspecified length. > >> > >>The crash because of stack overflow was really observed on sparc64. And > >>this recursion was one of the contributing factors (using 9 stack frames > >>--- that is 1728 bytes). > > > >Looks good, I like the concept. Can you please make it a little less > >goto driven, though? The next_bio and goto next_bio could just be a > >while(). > > > >-- > >Jens Axboe > > > > Hi. > > This is the patch, slightly de-goto-ized. (it still contains one, I think > that while (1) { ... break ... } is no better readable than goto). Sure, that looks better. > I found another problem in my previous patch, I forgot about the "error" > variable (it would cause misbehavior for example if disk fails, submits an > error and raid driver turns this failure into success). We need to save > the error variable somewhere in the bio, there is no other place where it > could be placed. I temporarily saved it to bi_idx, because it's unused at > this place. I don't think bi_idx is a fantastic idea, I could easily imagine the bi_end_io function wanting to do a segment loop on the bio. Use bi_phys_segments instead (or bi_hw_segemnts, no difference), they should only be used when queuing and building IO, not for completion purposes. And put a big fat comment there explaining the overload. Plus they are just a cache, so if you use either of those and at the same time clear BIO_SEG_VALID in bi_flags, then it's guarenteed to be safe. Also please put the per-cpu definition outside of bio_endio(). And I don't think you need to disable interrupts, a plain preempt_disable() / preempt_enable() should be enough. -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-25 8:24 ` Jens Axboe @ 2008-06-26 0:13 ` Mikulas Patocka 2008-06-26 7:07 ` Jens Axboe 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-06-26 0:13 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, Neil Brown On Wed, 25 Jun 2008, Jens Axboe wrote: > On Tue, Jun 24 2008, Mikulas Patocka wrote: >> >> >> On Tue, 24 Jun 2008, Jens Axboe wrote: >> >>> On Tue, Jun 24 2008, Mikulas Patocka wrote: >>>> Hi >>>> >>>> bio_endio calls bi_end_io callback. In case of stacked devices (raid, dm), >>>> bio_end_io may call bio_endio again, up to an unspecified length. >>>> >>>> The crash because of stack overflow was really observed on sparc64. And >>>> this recursion was one of the contributing factors (using 9 stack frames >>>> --- that is 1728 bytes). >>> >>> Looks good, I like the concept. Can you please make it a little less >>> goto driven, though? The next_bio and goto next_bio could just be a >>> while(). >>> >>> -- >>> Jens Axboe >>> >> >> Hi. >> >> This is the patch, slightly de-goto-ized. (it still contains one, I think >> that while (1) { ... break ... } is no better readable than goto). > > Sure, that looks better. > >> I found another problem in my previous patch, I forgot about the "error" >> variable (it would cause misbehavior for example if disk fails, submits an >> error and raid driver turns this failure into success). We need to save >> the error variable somewhere in the bio, there is no other place where it >> could be placed. I temporarily saved it to bi_idx, because it's unused at >> this place. > > I don't think bi_idx is a fantastic idea, I could easily imagine the > bi_end_io function wanting to do a segment loop on the bio. Use > bi_phys_segments instead (or bi_hw_segemnts, no difference), they should > only be used when queuing and building IO, not for completion purposes. > And put a big fat comment there explaining the overload. Plus they are > just a cache, so if you use either of those and at the same time clear > BIO_SEG_VALID in bi_flags, then it's guarenteed to be safe. Here is new patch that uses bi_phys_segments. > Also please put the per-cpu definition outside of bio_endio(). And I > don't think you need to disable interrupts, a plain preempt_disable() / > preempt_enable() should be enough. Disabling interrupts is needed. If you disable only preempt, interrupt can come at any point and add anything to the queue. You'd have to use local_t and local_cmpxchg tricks then and the code would be much more ugly. I found that the code is called with interrupts disabled most time, so it doesn't matter if I disable them again. Mikulas > -- > Jens Axboe Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in turn call bio_endio again. When this recursion happens, put the new bio to the queue and process it later, from the top-level bio_endio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- fs/bio.c | 38 +++++++++++++++++++++++++++++++++++--- include/linux/bio.h | 3 +++ 2 files changed, 38 insertions(+), 3 deletions(-) Index: linux-2.6.26-rc7-devel/fs/bio.c =================================================================== --- linux-2.6.26-rc7-devel.orig/fs/bio.c 2008-06-23 13:49:45.000000000 +0200 +++ linux-2.6.26-rc7-devel/fs/bio.c 2008-06-25 18:09:48.000000000 +0200 @@ -1166,15 +1166,47 @@ void bio_check_pages_dirty(struct bio *b * bio unless they own it and thus know that it has an end_io * function. **/ +static DEFINE_PER_CPU(struct bio **, bio_end_queue) = { NULL }; + void bio_endio(struct bio *bio, int error) { + struct bio ***bio_end_queue_ptr; + struct bio *bio_queue; + + unsigned long flags; + + bio->bi_phys_segments = error; if (error) clear_bit(BIO_UPTODATE, &bio->bi_flags); else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) - error = -EIO; + bio->bi_phys_segments = -EIO; + + local_irq_save(flags); + bio_end_queue_ptr = &__get_cpu_var(bio_end_queue); + + if (*bio_end_queue_ptr) { + **bio_end_queue_ptr = bio; + *bio_end_queue_ptr = &bio->bi_next; + bio->bi_next = NULL; + } else { + bio_queue = NULL; + *bio_end_queue_ptr = &bio_queue; + +next_bio: + if (bio->bi_end_io) + bio->bi_end_io(bio, (short)bio->bi_phys_segments); + + if (bio_queue) { + bio = bio_queue; + bio_queue = bio->bi_next; + if (!bio_queue) + *bio_end_queue_ptr = &bio_queue; + goto next_bio; + } + *bio_end_queue_ptr = NULL; + } - if (bio->bi_end_io) - bio->bi_end_io(bio, error); + local_irq_restore(flags); } void bio_pair_release(struct bio_pair *bp) Index: linux-2.6.26-rc7-devel/include/linux/bio.h =================================================================== --- linux-2.6.26-rc7-devel.orig/include/linux/bio.h 2008-06-25 18:06:59.000000000 +0200 +++ linux-2.6.26-rc7-devel/include/linux/bio.h 2008-06-25 18:08:08.000000000 +0200 @@ -86,6 +86,9 @@ struct bio { /* Number of segments in this BIO after * physical address coalescing is performed. + * + * When ending a bio request in bio_endio, this field is temporarily + * (ab)used to keep the error code. */ unsigned short bi_phys_segments; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-26 0:13 ` Mikulas Patocka @ 2008-06-26 7:07 ` Jens Axboe 2008-07-02 4:09 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Jens Axboe @ 2008-06-26 7:07 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-kernel, Neil Brown On Wed, Jun 25 2008, Mikulas Patocka wrote: > On Wed, 25 Jun 2008, Jens Axboe wrote: > > >On Tue, Jun 24 2008, Mikulas Patocka wrote: > >> > >> > >>On Tue, 24 Jun 2008, Jens Axboe wrote: > >> > >>>On Tue, Jun 24 2008, Mikulas Patocka wrote: > >>>>Hi > >>>> > >>>>bio_endio calls bi_end_io callback. In case of stacked devices (raid, > >>>>dm), > >>>>bio_end_io may call bio_endio again, up to an unspecified length. > >>>> > >>>>The crash because of stack overflow was really observed on sparc64. And > >>>>this recursion was one of the contributing factors (using 9 stack frames > >>>>--- that is 1728 bytes). > >>> > >>>Looks good, I like the concept. Can you please make it a little less > >>>goto driven, though? The next_bio and goto next_bio could just be a > >>>while(). > >>> > >>>-- > >>>Jens Axboe > >>> > >> > >>Hi. > >> > >>This is the patch, slightly de-goto-ized. (it still contains one, I think > >>that while (1) { ... break ... } is no better readable than goto). > > > >Sure, that looks better. > > > >>I found another problem in my previous patch, I forgot about the "error" > >>variable (it would cause misbehavior for example if disk fails, submits an > >>error and raid driver turns this failure into success). We need to save > >>the error variable somewhere in the bio, there is no other place where it > >>could be placed. I temporarily saved it to bi_idx, because it's unused at > >>this place. > > > >I don't think bi_idx is a fantastic idea, I could easily imagine the > >bi_end_io function wanting to do a segment loop on the bio. Use > >bi_phys_segments instead (or bi_hw_segemnts, no difference), they should > >only be used when queuing and building IO, not for completion purposes. > >And put a big fat comment there explaining the overload. Plus they are > >just a cache, so if you use either of those and at the same time clear > >BIO_SEG_VALID in bi_flags, then it's guarenteed to be safe. > > Here is new patch that uses bi_phys_segments. > > >Also please put the per-cpu definition outside of bio_endio(). And I > >don't think you need to disable interrupts, a plain preempt_disable() / > >preempt_enable() should be enough. > > Disabling interrupts is needed. If you disable only preempt, interrupt can > come at any point and add anything to the queue. You'd have to use local_t > and local_cmpxchg tricks then and the code would be much more ugly. I > found that the code is called with interrupts disabled most time, so it > doesn't matter if I disable them again. Right, that wont work of course. Completions are typically done through a softirq, so it is not currently done with hard interrupts disabled. So it's a bit of a shame to impose this extra restriction, bio unrolling is necessarily a really short operation. We could reenable around the bi_end_io() call, but then we'd need to save and restore for each bio. > > Mikulas > > >-- > >Jens Axboe > > Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in > turn > call bio_endio again. When this recursion happens, put the new bio to the > queue > and process it later, from the top-level bio_endio. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > --- > fs/bio.c | 38 +++++++++++++++++++++++++++++++++++--- > include/linux/bio.h | 3 +++ > 2 files changed, 38 insertions(+), 3 deletions(-) > > Index: linux-2.6.26-rc7-devel/fs/bio.c > =================================================================== > --- linux-2.6.26-rc7-devel.orig/fs/bio.c 2008-06-23 > 13:49:45.000000000 +0200 > +++ linux-2.6.26-rc7-devel/fs/bio.c 2008-06-25 18:09:48.000000000 +0200 > @@ -1166,15 +1166,47 @@ void bio_check_pages_dirty(struct bio *b > * bio unless they own it and thus know that it has an end_io > * function. > **/ > +static DEFINE_PER_CPU(struct bio **, bio_end_queue) = { NULL }; > + > void bio_endio(struct bio *bio, int error) > { > + struct bio ***bio_end_queue_ptr; > + struct bio *bio_queue; > + > + unsigned long flags; > + > + bio->bi_phys_segments = error; Please make that bio->bi_flags &= ~(1 << BIO_SEG_VALID); bio->bi_phys_segments = error; as I mentioned, since then we have nothing to worry about. Otherwise bio_phys_segments() could return a completely wrong count. > if (error) > clear_bit(BIO_UPTODATE, &bio->bi_flags); > else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) > - error = -EIO; > + bio->bi_phys_segments = -EIO; > + > + local_irq_save(flags); > + bio_end_queue_ptr = &__get_cpu_var(bio_end_queue); > + > + if (*bio_end_queue_ptr) { > + **bio_end_queue_ptr = bio; > + *bio_end_queue_ptr = &bio->bi_next; > + bio->bi_next = NULL; > + } else { > + bio_queue = NULL; > + *bio_end_queue_ptr = &bio_queue; > + > +next_bio: > + if (bio->bi_end_io) > + bio->bi_end_io(bio, (short)bio->bi_phys_segments); > + > + if (bio_queue) { > + bio = bio_queue; > + bio_queue = bio->bi_next; > + if (!bio_queue) > + *bio_end_queue_ptr = &bio_queue; > + goto next_bio; > + } > + *bio_end_queue_ptr = NULL; > + } > > - if (bio->bi_end_io) > - bio->bi_end_io(bio, error); > + local_irq_restore(flags); > } > > void bio_pair_release(struct bio_pair *bp) > Index: linux-2.6.26-rc7-devel/include/linux/bio.h > =================================================================== > --- linux-2.6.26-rc7-devel.orig/include/linux/bio.h 2008-06-25 > 18:06:59.000000000 +0200 > +++ linux-2.6.26-rc7-devel/include/linux/bio.h 2008-06-25 > 18:08:08.000000000 +0200 > @@ -86,6 +86,9 @@ struct bio { > > /* Number of segments in this BIO after > * physical address coalescing is performed. > + * > + * When ending a bio request in bio_endio, this field is temporarily > + * (ab)used to keep the error code. > */ > unsigned short bi_phys_segments; > -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-06-26 7:07 ` Jens Axboe @ 2008-07-02 4:09 ` Mikulas Patocka 2008-07-02 8:00 ` Alan Cox 2008-07-02 8:25 ` Jens Axboe 0 siblings, 2 replies; 20+ messages in thread From: Mikulas Patocka @ 2008-07-02 4:09 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, Neil Brown > Right, that wont work of course. Completions are typically done through > a softirq, so it is not currently done with hard interrupts disabled. I thought, from hardirq - that's what IDE is doing. And they are called with interrupts disabled (maybe unless you specify unmaskirq, which is not default). What block driver does completions with softirq? ... and why? > So it's a bit of a shame to impose this extra restriction, bio unrolling > is necessarily a really short operation. We could reenable around the > bi_end_io() call, but then we'd need to save and restore for each bio. I found another weird thing --- why does local_irq_save() execute that microcoded "cli" instruction even if interrupts are already disabled? That could be definitely optimized. And does local_irq_restore() need to execute even more costy "popf" when it makes a transition from disabled to disabled state? What's local_irq_restore semantics --- is it allowed to use local_irq_restore for transition from interrupt-enabled state into interrupt-disabled state? > Please make that > > bio->bi_flags &= ~(1 << BIO_SEG_VALID); > bio->bi_phys_segments = error; OK, here is the updated patch: Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in turn call bio_endio again. When this recursion happens, put the new bio to the queue and process it later, from the top-level bio_endio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- fs/bio.c | 39 ++++++++++++++++++++++++++++++++++++--- include/linux/bio.h | 3 +++ 2 files changed, 39 insertions(+), 3 deletions(-) Index: linux-2.6.26-rc8-devel/fs/bio.c =================================================================== --- linux-2.6.26-rc8-devel.orig/fs/bio.c 2008-07-02 04:20:43.000000000 +0200 +++ linux-2.6.26-rc8-devel/fs/bio.c 2008-07-02 04:22:15.000000000 +0200 @@ -1166,15 +1166,48 @@ void bio_check_pages_dirty(struct bio *b * bio unless they own it and thus know that it has an end_io * function. **/ +static DEFINE_PER_CPU(struct bio **, bio_end_queue) = { NULL }; + void bio_endio(struct bio *bio, int error) { + struct bio ***bio_end_queue_ptr; + struct bio *bio_queue; + + unsigned long flags; + + bio->bi_flags &= ~(1 << BIO_SEG_VALID); + bio->bi_phys_segments = error; if (error) clear_bit(BIO_UPTODATE, &bio->bi_flags); else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) - error = -EIO; + bio->bi_phys_segments = -EIO; + + local_irq_save(flags); + bio_end_queue_ptr = &__get_cpu_var(bio_end_queue); + + if (*bio_end_queue_ptr) { + **bio_end_queue_ptr = bio; + *bio_end_queue_ptr = &bio->bi_next; + bio->bi_next = NULL; + } else { + bio_queue = NULL; + *bio_end_queue_ptr = &bio_queue; + +next_bio: + if (bio->bi_end_io) + bio->bi_end_io(bio, (short)bio->bi_phys_segments); + + if (bio_queue) { + bio = bio_queue; + bio_queue = bio->bi_next; + if (!bio_queue) + *bio_end_queue_ptr = &bio_queue; + goto next_bio; + } + *bio_end_queue_ptr = NULL; + } - if (bio->bi_end_io) - bio->bi_end_io(bio, error); + local_irq_restore(flags); } void bio_pair_release(struct bio_pair *bp) Index: linux-2.6.26-rc8-devel/include/linux/bio.h =================================================================== --- linux-2.6.26-rc8-devel.orig/include/linux/bio.h 2008-07-02 04:20:43.000000000 +0200 +++ linux-2.6.26-rc8-devel/include/linux/bio.h 2008-07-02 04:21:16.000000000 +0200 @@ -86,6 +86,9 @@ struct bio { /* Number of segments in this BIO after * physical address coalescing is performed. + * + * When ending a bio request in bio_endio, this field is temporarily + * (ab)used to keep the error code. */ unsigned short bi_phys_segments; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-02 4:09 ` Mikulas Patocka @ 2008-07-02 8:00 ` Alan Cox 2008-07-03 21:03 ` Mikulas Patocka 2008-07-02 8:25 ` Jens Axboe 1 sibling, 1 reply; 20+ messages in thread From: Alan Cox @ 2008-07-02 8:00 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, Neil Brown On Wed, 2 Jul 2008 00:09:22 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote: > > Right, that wont work of course. Completions are typically done through > > a softirq, so it is not currently done with hard interrupts disabled. > > I thought, from hardirq - that's what IDE is doing. And they are called Even IDE will sometimes complete from a timer on an error. > And does local_irq_restore() need to execute even more costy "popf" when > it makes a transition from disabled to disabled state? What's > local_irq_restore semantics --- is it allowed to use local_irq_restore for > transition from interrupt-enabled state into interrupt-disabled state? If you are worried about performance the network layer has _irq variants of various functions that are faster and can only be called from the right context (eg kfree_skb_irq), so you could do two versions of that code. Alan ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-02 8:00 ` Alan Cox @ 2008-07-03 21:03 ` Mikulas Patocka 0 siblings, 0 replies; 20+ messages in thread From: Mikulas Patocka @ 2008-07-03 21:03 UTC (permalink / raw) To: Alan Cox; +Cc: Jens Axboe, linux-kernel, Neil Brown On Wed, 2 Jul 2008, Alan Cox wrote: > On Wed, 2 Jul 2008 00:09:22 -0400 (EDT) > Mikulas Patocka <mpatocka@redhat.com> wrote: > >>> Right, that wont work of course. Completions are typically done through >>> a softirq, so it is not currently done with hard interrupts disabled. >> >> I thought, from hardirq - that's what IDE is doing. And they are called > > Even IDE will sometimes complete from a timer on an error. > >> And does local_irq_restore() need to execute even more costy "popf" when >> it makes a transition from disabled to disabled state? What's >> local_irq_restore semantics --- is it allowed to use local_irq_restore for >> transition from interrupt-enabled state into interrupt-disabled state? > > If you are worried about performance the network layer has _irq variants > of various functions that are faster and can only be called from the > right context (eg kfree_skb_irq), so you could do two versions of that > code. > > Alan I was thinking about a general optimization --- something like this. On Core2 it doesn't make much sense --- cli is fast there (just 11 ticks). On Pentium 4 it would make sense, because cli is much more costy there. Mikulas --- include/asm-x86/irqflags.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.26-rc8/include/asm-x86/irqflags.h =================================================================== --- linux-2.6.26-rc8.orig/include/asm-x86/irqflags.h 2008-06-02 23:13:32.000000000 +0200 +++ linux-2.6.26-rc8/include/asm-x86/irqflags.h 2008-07-03 22:53:32.000000000 +0200 @@ -101,7 +101,8 @@ static inline unsigned long __raw_local_ { unsigned long flags = __raw_local_save_flags(); - raw_local_irq_disable(); + if (flags & X86_EFLAGS_IF) + raw_local_irq_disable(); return flags; } ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-02 4:09 ` Mikulas Patocka 2008-07-02 8:00 ` Alan Cox @ 2008-07-02 8:25 ` Jens Axboe 2008-07-03 21:08 ` Mikulas Patocka 1 sibling, 1 reply; 20+ messages in thread From: Jens Axboe @ 2008-07-02 8:25 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-kernel, Neil Brown On Wed, Jul 02 2008, Mikulas Patocka wrote: > >Right, that wont work of course. Completions are typically done through > >a softirq, so it is not currently done with hard interrupts disabled. > > I thought, from hardirq - that's what IDE is doing. And they are called > with interrupts disabled (maybe unless you specify unmaskirq, which is not > default). What block driver does completions with softirq? ... and why? The key word is 'typically', the old IDE driver really isn't used very much. The SCSI layer and eg cciss uses the block layer softirq completions, so that is what 99% of the uses will be. > >So it's a bit of a shame to impose this extra restriction, bio unrolling > >is necessarily a really short operation. We could reenable around the > >bi_end_io() call, but then we'd need to save and restore for each bio. > > I found another weird thing --- why does local_irq_save() execute that > microcoded "cli" instruction even if interrupts are already disabled? That > could be definitely optimized. > > And does local_irq_restore() need to execute even more costy "popf" when > it makes a transition from disabled to disabled state? What's > local_irq_restore semantics --- is it allowed to use local_irq_restore for > transition from interrupt-enabled state into interrupt-disabled state? That's a question for the arch people. Probably nested _save and _restore aren't optimized. > > >Please make that > > > > bio->bi_flags &= ~(1 << BIO_SEG_VALID); > > bio->bi_phys_segments = error; > > OK, here is the updated patch: > > Avoid recursion on bio_endio. bio_endio calls bio->bi_end_io which may in > turn > call bio_endio again. When this recursion happens, put the new bio to the > queue > and process it later, from the top-level bio_endio. Thanks that looks good, I'll throw this into the testing mix and see what happens. > void bio_pair_release(struct bio_pair *bp) > Index: linux-2.6.26-rc8-devel/include/linux/bio.h > =================================================================== > --- linux-2.6.26-rc8-devel.orig/include/linux/bio.h 2008-07-02 > 04:20:43.000000000 +0200 > +++ linux-2.6.26-rc8-devel/include/linux/bio.h 2008-07-02 > 04:21:16.000000000 +0200 > @@ -86,6 +86,9 @@ struct bio { > > /* Number of segments in this BIO after > * physical address coalescing is performed. > + * > + * When ending a bio request in bio_endio, this field is temporarily > + * (ab)used to keep the error code. > */ > unsigned short bi_phys_segments; I'll drop this comment, we don't really need it. Users must use bio_phys_segments() to retrieve the segment count anyway, in which case the count is ALWAYS valid. -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-02 8:25 ` Jens Axboe @ 2008-07-03 21:08 ` Mikulas Patocka 2008-07-03 21:04 ` Alan Cox 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-07-03 21:08 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, Neil Brown On Wed, 2 Jul 2008, Jens Axboe wrote: > On Wed, Jul 02 2008, Mikulas Patocka wrote: >>> Right, that wont work of course. Completions are typically done through >>> a softirq, so it is not currently done with hard interrupts disabled. >> >> I thought, from hardirq - that's what IDE is doing. And they are called >> with interrupts disabled (maybe unless you specify unmaskirq, which is not >> default). What block driver does completions with softirq? ... and why? > > The key word is 'typically', the old IDE driver really isn't used very > much. The SCSI layer and eg cciss uses the block layer softirq > completions, so that is what 99% of the uses will be. I use the old IDE driver, I don't see a reason why driver should create SCSI requests and lower layer translate them to ATA commands. Just to look, this is the first version of the patch without disabling interrupts. If you are concerned about disabling interrupts, you can go this way. Mikulas Index: linux-2.6.26-rc5-devel/fs/bio.c =================================================================== --- linux-2.6.26-rc5-devel.orig/fs/bio.c 2008-06-18 16:05:45.000000000 +0200 +++ linux-2.6.26-rc5-devel/fs/bio.c 2008-06-18 23:27:28.000000000 +0200 @@ -1168,6 +1168,25 @@ **/ void bio_endio(struct bio *bio, int error) { + static DEFINE_PER_CPU(local_t, bio_end_queue) = LOCAL_INIT(0); + local_t *bio_end_queue_ptr; + unsigned long l; + + bio->bi_next = NULL; + + bio_end_queue_ptr = &get_cpu_var(bio_end_queue); + if (local_read(bio_end_queue_ptr)) { +insert_race: + bio->bi_next = (void *)local_read(bio_end_queue_ptr); + if (unlikely(local_cmpxchg(bio_end_queue_ptr, (unsigned long)bio->bi_next, (unsigned long)bio) != (unsigned long)bio->bi_next)) + goto insert_race; + put_cpu_var(bio_end_queue); + return; + } + + local_set(bio_end_queue_ptr, 1); + +process_bio: if (error) clear_bit(BIO_UPTODATE, &bio->bi_flags); else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) @@ -1175,6 +1194,19 @@ if (bio->bi_end_io) bio->bi_end_io(bio, error); + +remove_race: + l = local_read(bio_end_queue_ptr); + if (l != 1) { + bio = (void *)l; + if (unlikely(local_cmpxchg(bio_end_queue_ptr, (unsigned long)bio, (unsigned long)bio->bi_next) != (unsigned long)bio)) + goto remove_race; + goto process_bio; + } + if (unlikely(local_cmpxchg(bio_end_queue_ptr, 1, 0) != 1)) + goto remove_race; + + put_cpu_var(bio_end_queue); } void bio_pair_release(struct bio_pair *bp) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 21:08 ` Mikulas Patocka @ 2008-07-03 21:04 ` Alan Cox 2008-07-03 22:54 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Alan Cox @ 2008-07-03 21:04 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, Neil Brown > I use the old IDE driver, I don't see a reason why driver should create > SCSI requests and lower layer translate them to ATA commands. Because modern drives are basically SCSI devices. Once you turn on NCQ and stuff you need all the midlayer queueing magic ... Either way the bio thing wants fixing. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 21:04 ` Alan Cox @ 2008-07-03 22:54 ` Mikulas Patocka 2008-07-03 23:00 ` Alan Cox 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-07-03 22:54 UTC (permalink / raw) To: Alan Cox; +Cc: Jens Axboe, linux-kernel, Neil Brown On Thu, 3 Jul 2008, Alan Cox wrote: >> I use the old IDE driver, I don't see a reason why driver should create >> SCSI requests and lower layer translate them to ATA commands. > > Because modern drives are basically SCSI devices. Once you turn on NCQ > and stuff you need all the midlayer queueing magic ... There is no piece of SCSI protocol in SATA disks. The command and response FIS of SATA standard contain legacy IDE registers, there's nothing about SCSI. Even the NCQ commands are done via IDE registers, not SCSI command block. It seems like someone wanted to save few weeks of coding by reusing the SCSI disk queuing ... and created a lot of other problems (example: if you have disk error with IDE driver, it dumps IDE registers into log ... if you have disk error with SATA driver, it dumps sense key ... but there's no sense key in SATA standard ... the driver just had to make up one because it pretends to be SCSI). Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 22:54 ` Mikulas Patocka @ 2008-07-03 23:00 ` Alan Cox 2008-07-03 23:51 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Alan Cox @ 2008-07-03 23:00 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, Neil Brown > SCSI. Even the NCQ commands are done via IDE registers, not SCSI command > block. It seems like someone wanted to save few weeks of coding by reusing Thats really a trivial question of encoding, in all but actual byte encoding it is SCSI. Once you look at SAS and ATAPI it's all the more clear. > the SCSI disk queuing ... and created a lot of other problems (example: if > you have disk error with IDE driver, it dumps IDE registers into log ... > if you have disk error with SATA driver, it dumps sense key ... but Actually it dumps both - and that behaviour was chosen to match existing drivers in a large part (eg 3ware). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 23:00 ` Alan Cox @ 2008-07-03 23:51 ` Mikulas Patocka 2008-07-03 23:44 ` Alan Cox 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-07-03 23:51 UTC (permalink / raw) To: Alan Cox; +Cc: Jens Axboe, linux-kernel, Neil Brown On Fri, 4 Jul 2008, Alan Cox wrote: >> SCSI. Even the NCQ commands are done via IDE registers, not SCSI command >> block. It seems like someone wanted to save few weeks of coding by reusing > > Thats really a trivial question of encoding, in all but actual > byte encoding it is SCSI. Once you look at SAS and ATAPI it's all the > more clear. For SAS and ATAPI you're right - that's SCSI. But not for SATA disks. The only thing SCSI and SATA disk have common is that they may process concurrently more requests. There's nothing else. Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 23:51 ` Mikulas Patocka @ 2008-07-03 23:44 ` Alan Cox 2008-07-04 3:26 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Alan Cox @ 2008-07-03 23:44 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, Neil Brown > But not for SATA disks. The only thing SCSI and SATA disk have common is > that they may process concurrently more requests. There's nothing else. The only material difference between SATA and SCSI is the encoding of the command blocks and responses. Alan ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-03 23:44 ` Alan Cox @ 2008-07-04 3:26 ` Mikulas Patocka 2008-07-04 8:11 ` Alan Cox 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-07-04 3:26 UTC (permalink / raw) To: Alan Cox; +Cc: Jens Axboe, linux-kernel, Neil Brown On Fri, 4 Jul 2008, Alan Cox wrote: > >> But not for SATA disks. The only thing SCSI and SATA disk have common is >> that they may process concurrently more requests. There's nothing else. > > The only material difference between SATA and SCSI is the encoding of the > command blocks and responses. > > Alan That statement is true for any two block device protocol. Then, why doesn't Linux use SCSI for all block devices? --- add scsi command block to struct bio and we can pass them directly to controller driver :) Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] Avoid bio_endio recursion 2008-07-04 3:26 ` Mikulas Patocka @ 2008-07-04 8:11 ` Alan Cox 0 siblings, 0 replies; 20+ messages in thread From: Alan Cox @ 2008-07-04 8:11 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, Neil Brown > That statement is true for any two block device protocol. Then, why Not it isn't. The mentality of quite a few devices is very different to SCSI in the way the express queueing and caching. You cannot for example map I2O Block onto SCSI at all. > doesn't Linux use SCSI for all block devices? --- add scsi command block > to struct bio and we can pass them directly to controller driver That was discussed, along with continuing to split scsi and block aspects of queueing, tagging and error recovery apart. Windows btw does generally follow that 'everything is SCSI' approach ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2008-07-04 8:30 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-06-24 5:22 [PATCH 1/2] Avoid bio_endio recursion Mikulas Patocka 2008-06-24 6:08 ` Neil Brown 2008-06-24 14:36 ` Mikulas Patocka 2008-06-24 8:07 ` Jens Axboe 2008-06-24 14:27 ` Mikulas Patocka 2008-06-25 8:24 ` Jens Axboe 2008-06-26 0:13 ` Mikulas Patocka 2008-06-26 7:07 ` Jens Axboe 2008-07-02 4:09 ` Mikulas Patocka 2008-07-02 8:00 ` Alan Cox 2008-07-03 21:03 ` Mikulas Patocka 2008-07-02 8:25 ` Jens Axboe 2008-07-03 21:08 ` Mikulas Patocka 2008-07-03 21:04 ` Alan Cox 2008-07-03 22:54 ` Mikulas Patocka 2008-07-03 23:00 ` Alan Cox 2008-07-03 23:51 ` Mikulas Patocka 2008-07-03 23:44 ` Alan Cox 2008-07-04 3:26 ` Mikulas Patocka 2008-07-04 8:11 ` Alan Cox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox