* [BUG] raid1: barrier retry does not work correctly with write-behind
@ 2006-08-03 15:44 Paul Clements
2006-08-04 1:47 ` Neil Brown
0 siblings, 1 reply; 4+ messages in thread
From: Paul Clements @ 2006-08-03 15:44 UTC (permalink / raw)
To: linux-raid, neilb
Description:
-----------
When a filesystem sends a barrier write down to raid1, raid1 tries to
pass the write down to its component devices. However, if one or more of
the devices return EOPNOTSUPP, it means that they do not support
barriers, and raid1 must strip the barrier out of the write and retry
it. Now, this all works fine in most cases. However, in a raid1 mirror
with write-behind enabled, we end up getting a corrupt bio sent down for
the write retry. The bug is in the following code (raid1.c, line 1409):
} else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
/* some requests in the r1bio were BIO_RW_BARRIER
* requests which failed with -EOPNOTSUPP. Hohumm..
* Better resubmit without the barrier.
* We know which devices to resubmit for, because
* all others have had their bios[] entry cleared.
* We already have a nr_pending reference on these rdevs.
*/
int i;
clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
clear_bit(R1BIO_Barrier, &r1_bio->state);
for (i=0; i < conf->raid_disks; i++)
if (r1_bio->bios[i])
atomic_inc(&r1_bio->remaining);
for (i=0; i < conf->raid_disks; i++)
if (r1_bio->bios[i]) {
struct bio_vec *bvec;
int j;
bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
The r1_bio->master_bio may already have had end_io() called and been
freed by the time we bio_clone() it. This results in an invalid bio
being sent down to one (or more) of the raid1 components. In my testing,
the original 4K barrier write ended up being a zero length write when it
was retried.
Solution:
--------
Can we just bio_alloc instead of cloning? I think we have all the sector
and offset data in the r1_bio. We could then bio_add_page() the pages
into the new bio, possibly? Or could we just use the original bio that
failed, being careful to clean up fields (I think we used to do this but
Neil changed the behavior because of some problems).
--
Paul
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [BUG] raid1: barrier retry does not work correctly with write-behind
2006-08-03 15:44 [BUG] raid1: barrier retry does not work correctly with write-behind Paul Clements
@ 2006-08-04 1:47 ` Neil Brown
2006-08-04 3:09 ` Paul Clements
0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2006-08-04 1:47 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid
On Thursday August 3, paul.clements@steeleye.com wrote:
>
>
> The r1_bio->master_bio may already have had end_io() called and been
> freed by the time we bio_clone() it. This results in an invalid bio
> being sent down to one (or more) of the raid1 components. In my testing,
> the original 4K barrier write ended up being a zero length write when it
> was retried.
Oh bother...
>
> Solution:
> --------
> Can we just bio_alloc instead of cloning? I think we have all the sector
> and offset data in the r1_bio. We could then bio_add_page() the pages
> into the new bio, possibly? Or could we just use the original bio that
> failed, being careful to clean up fields (I think we used to do this but
> Neil changed the behavior because of some problems).
Yes we did. I don't remember exactly what the problem was, but there
are lots of fields in a bio, and the rules about which ones can be
modified by drives are not terribly clear.
Let's see...
The drive can modify:
bi_io_vec->bv_offset end_that_request_first
bi_io_vec->bv_len end_that_request_first
bi_sector raid0/make_request
bi_bdev raid0/make_request
bi_idx end_that_request_first
bi_size
But not (I hope)
bi_io_vec->bv_page
bi_rw
bi_vcnt
bi_max_vecs
bi_end_io
bi_private
And these can easily be reset or zeroed
bi_next
bi_phys_segments
bi_hw_segments
bi_hw_front_size
bi_hw_back_size
bi_flags ???
sector, bdev, size are all remembered in r1_bio.
That leaves bi_idx and an array od len/offset pairs that we need
to preserve.
So I guess the first step is to change alloc_behind_pages to
return a new 'struct bio_vec' array rather than just a list of pages,
and we should keep that array attached to the raid1_bio.
Sound reasonable?
NeilBrown
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [BUG] raid1: barrier retry does not work correctly with write-behind
2006-08-04 1:47 ` Neil Brown
@ 2006-08-04 3:09 ` Paul Clements
2006-08-04 3:35 ` Paul Clements
0 siblings, 1 reply; 4+ messages in thread
From: Paul Clements @ 2006-08-04 3:09 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
> sector, bdev, size are all remembered in r1_bio.
> That leaves bi_idx and an array od len/offset pairs that we need
> to preserve.
>
> So I guess the first step is to change alloc_behind_pages to
> return a new 'struct bio_vec' array rather than just a list of pages,
> and we should keep that array attached to the raid1_bio.
I think bio_clone gives us that already. I may have missed something but
I think we have everything we need:
When a bio comes into raid1's make_request we bio_clone for each drive
and attach those to r1_bio->bios. We also have behind_pages, which
contains the pages. I think maybe instead of cloning r1_bio->master_bio,
we can just clone r1_bio->bios[i]. Does that make sense?
Let me try that.
--
Paul
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [BUG] raid1: barrier retry does not work correctly with write-behind
2006-08-04 3:09 ` Paul Clements
@ 2006-08-04 3:35 ` Paul Clements
0 siblings, 0 replies; 4+ messages in thread
From: Paul Clements @ 2006-08-04 3:35 UTC (permalink / raw)
To: linux-raid; +Cc: Neil Brown
Paul Clements wrote:
> I think bio_clone gives us that already. I may have missed something but
> I think we have everything we need:
>
> When a bio comes into raid1's make_request we bio_clone for each drive
> and attach those to r1_bio->bios. We also have behind_pages, which
> contains the pages. I think maybe instead of cloning r1_bio->master_bio,
> we can just clone r1_bio->bios[i]. Does that make sense?
Never mind, of course that won't work...it's basically the same as using
the failed bio.
Yes, so we need a copy of the bvec too. Would it be easier, or less
risky just to create an entire "spare" bio, cloned from the master bio
during make_request as the other r1_bio->bios[] are, and then use that
if we need it? Maybe too much overhead, but it should guarantee that
none of the fields have been monkey-ed with by an underlying disk. May
be easier than trying to allocate and construct bvecs by hand?
--
Paul
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-08-04 3:35 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-03 15:44 [BUG] raid1: barrier retry does not work correctly with write-behind Paul Clements
2006-08-04 1:47 ` Neil Brown
2006-08-04 3:09 ` Paul Clements
2006-08-04 3:35 ` Paul Clements
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).