From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: [BUG,PATCH] raid1 behind write ordering (barrier) protection
Date: Tue, 3 Dec 2013 10:08:13 +1100
Message-ID: <20131203100813.67814984@notabene.brown>
References: <528E72C8.7050909@linux.vnet.ibm.com>
 <529CBFBD.9070009@linux.vnet.ibm.com>
In-Reply-To: <529CBFBD.9070009@linux.vnet.ibm.com>
Sender: linux-raid-owner@vger.kernel.org
To: Brett Russ
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Mon, 02 Dec 2013 12:13:33 -0500 Brett Russ wrote:

> Let's first agree there's a bug here needing fixing.

Sorry for not responding earlier.
Yes, it does look like a bug, so it needs fixing.

> Then, we can discuss how to achieve a modern fix without the use of
> barriers as originally intended in the patch below.

I'm not thrilled with the idea of keeping a list of outstanding requests.
Maybe we'll have to go that way, but I'd like to explore other options
first.

How about just keeping a record of whether there is a BIO_FLUSH request
outstanding on each "behind" leg?  While there is, we don't submit new
requests.  So we have a queue of bios for each leg which are waiting for
a BIO_FLUSH to complete, and we send them on down as soon as it does.

Would that suit, do you think?

Thanks,
NeilBrown

> thanks,
> BR
>
> On 11/21/2013 03:53 PM, Brett Russ wrote:
> > Hi,
> >
> > We are in the process of porting some mods we made against an older,
> > barrier-equipped kernel to a newer kernel without barrier support.
> > Our environment is unique in that we use both disk write cache and MD
> > raid1 behind writes.  We see that the behind write feature introduces an
> > I/O ordering bug at the MD level; the bug is defined in greater detail in
> > the patch header below.
> >
> > The patch header and content are below, as ported to a RHEL 6.4 based
> > kernel which has moved to flush/FUA.  However, bios that used to be marked
> > BIO_RW_BARRIER are now, ineffectively for our purposes, marked
> > BIO_FLUSH|BIO_FUA.  Pretend we wanted a barrier here, as that's what we
> > need.
> >
> > Looking for thoughts from the community on the best way(s) to solve this
> > ordering issue.  One proposal we're considering is forcing overlapping
> > writes to block until the first behind write completes, as long as they're
> > in process context.
> >
> > Appreciate the look,
> > BR
> >
> > ----
> >
> > This patch ensures that barriers will be synchronous on the write-behind
> > legs, i.e. they will not be allowed to go behind.
> >
> > This patch adds a new capability to the md raid1 behind write feature
> > which allows us to ensure ordering when reads and/or writes go to a member
> > device which allows behind writes.  The problem we're trying to solve here
> > is:
> >
> > 1.  write1 to sector X comes to MD
> > 1a. write1 sent to primary leg
> > 1b. write1 sent to secondary leg which is write behind
> > 2.  write1 completes on primary leg, upper layer is ack'd
> > 3.  write2 to sector X comes to MD
> > 3a. write2 sent to primary leg
> > 3b. write2 sent to secondary leg which is write behind
> > 4.  secondary leg now has two overlapping writes to it; layers beneath
> >     have freedom to reorder them, and big problems ensue.
> >
> > The solution is for MD to maintain a list of outstanding behind writes,
> > allowing us to check all incoming I/O going to the behind member(s).  I/O
> > is handled as follows:
> >
> > Reads
> >
> > Reads to the behind leg used to block if any behind writes were outstanding.
> > Now they only block if they overlap an outstanding behind write.
> >
> > Writes
> >
> > Writes to the behind leg check if they overlap any outstanding behind
> > writes.  If they do, they are marked as a barrier on the behind leg only
> > so ordering is ensured.  If they don't, they are issued as normal.  In
> > either case, however, they are added to a list of outstanding behind
> > writes.
> >
> > Any write that overlaps an in-progress write on the behind list needs to
> > be marked as a barrier to prevent reordering.  However, it is overkill to
> > send a barrier to both (primary and write-mostly) legs of the array.
> >
> > For raid1, the upper layer wouldn't send down an overlapping write unless
> > at least one non-write-mostly leg has ack'd.  So there is no reason to
> > mark that leg as a barrier.
> >
> > Signed-off-by: Aniket Kulkarni
> > Signed-off-by: Brett Russ
> > Signed-off-by: Jay Wentworth
> >
> > Index: b/drivers/md/raid1.c
> > ===================================================================
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -368,10 +368,15 @@ static void close_write(struct r1bio *r1
> >  {
> >  	/* it really is the end of this request */
> >  	if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
> > +		unsigned long flags;
> > +		struct r1conf *conf = r1_bio->mddev->private;
> >  		/* free extra copy of the data pages */
> >  		int i = r1_bio->behind_page_count;
> >  		while (i--)
> >  			safe_put_page(r1_bio->behind_bvecs[i].bv_page);
> > +		spin_lock_irqsave(&conf->device_lock, flags);
> > +		list_del(&r1_bio->behind_list);
> > +		spin_unlock_irqrestore(&conf->device_lock, flags);
> >  		kfree(r1_bio->behind_bvecs);
> >  		r1_bio->behind_bvecs = NULL;
> >  	}
> > @@ -958,6 +963,36 @@ do_sync_io:
> >  	pr_debug("%dB behind alloc failed, doing sync I/O\n", bio->bi_size);
> >  }
> >
> > +
> > +/* This function allows us to determine if an incoming request overlaps the
> > + * sector range of any outstanding behind writes.  Knowing this will allow us
> > + * to prevent writes and/or reads from potential reordering below MD.
> > + */
> > +static int test_overlap_and_enq_behind_writes(struct r1bio *newreq, int enq)
> > +{
> > +	unsigned long flags;
> > +	struct r1conf *conf;
> > +	struct r1bio *listreq;
> > +	sector_t newreq_end;
> > +	int olap = 0;
> > +
> > +	conf = newreq->mddev->private;
> > +	newreq_end = newreq->sector + newreq->sectors;
> > +
> > +	spin_lock_irqsave(&conf->device_lock, flags);
> > +	list_for_each_entry(listreq, &conf->behind_list, behind_list) {
> > +		if ((newreq_end > listreq->sector) &&
> > +		    (newreq->sector < listreq->sector + listreq->sectors)) {
> > +			olap = 1;
> > +			break;
> > +		}
> > +	}
> > +	if (enq)
> > +		list_add(&newreq->behind_list, &conf->behind_list);
> > +	spin_unlock_irqrestore(&conf->device_lock, flags);
> > +	return olap;
> > +}
> > +
> >  static int make_request(struct mddev *mddev, struct bio * bio)
> >  {
> >  	struct r1conf *conf = mddev->private;
> > @@ -969,11 +1004,11 @@ static int make_request(struct mddev *md
> >  	unsigned long flags;
> >  	const int rw = bio_data_dir(bio);
> >  	const bool do_sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
> > -	const unsigned long do_flush_fua = (bio->bi_rw & (BIO_FLUSH | BIO_FUA));
> >  	struct md_rdev *blocked_rdev;
> >  	int first_clone;
> >  	int sectors_handled;
> >  	int max_sectors;
> > +	unsigned int do_flush_fua, do_behind_flush;
> >
> >  	/*
> >  	 * Register the new request and wait if the reconstruction
> > @@ -1047,7 +1082,7 @@ read_again:
> >  		mirror = conf->mirrors + rdisk;
> >
> >  		if (test_bit(WriteMostly, &mirror->rdev->flags) &&
> > -		    bitmap) {
> > +		    bitmap && test_overlap_and_enq_behind_writes(r1_bio, 0)) {
> >  			/* Reading from a write-mostly device must
> >  			 * take care not to over-take any writes
> >  			 * that are 'behind'
> > @@ -1216,6 +1251,35 @@ read_again:
> >  	}
> >  	sectors_handled = r1_bio->sector + max_sectors - bio->bi_sector;
> >
> > +	do_flush_fua = 0;
> > +	do_behind_flush = 0;
> > +
> > +	/* If this is a flush/fua request don't
> > +	 * ever let it go "behind".  Keep all the
> > +	 * mirrors in sync.
> > +	 */
> > +	if (bio_rw_flagged(bio, BIO_FLUSH | BIO_FUA)) {
> > +		set_bit(R1BIO_BehindIO, &r1_bio->state);
> > +		do_flush_fua = bio->bi_rw & (BIO_FLUSH | BIO_FUA);
> > +	}
> > +	else if (bitmap && mddev->bitmap_info.max_write_behind) {
> > +		if (atomic_read(&bitmap->behind_writes)
> > +		    < mddev->bitmap_info.max_write_behind &&
> > +		    !waitqueue_active(&bitmap->behind_wait))
> > +			set_bit(R1BIO_BehindIO, &r1_bio->state);
> > +		/* If this write is going behind, enqueue it on a
> > +		 * behind_list.  Also check if this write overlaps any existing
> > +		 * behind writes.  If it does, mark it as a flush (barrier) on the
> > +		 * behind leg(s) only so ordering is guaranteed.  Any writes
> > +		 * *beyond* the max_write_behind limit won't go behind and
> > +		 * therefore can't result in an overlap and thus aren't
> > +		 * enqueued in the behind_list.
> > +		 */
> > +		if (test_overlap_and_enq_behind_writes(
> > +			r1_bio, test_bit(R1BIO_BehindIO, &r1_bio->state)))
> > +			do_behind_flush = BIO_FLUSH;
> > +	}
> > +
> >  	atomic_set(&r1_bio->remaining, 1);
> >  	atomic_set(&r1_bio->behind_remaining, 0);
> >
> > @@ -1261,7 +1325,6 @@ read_again:
> >  			if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
> >  				atomic_inc(&r1_bio->behind_remaining);
> >  		}
> > -
> >  		r1_bio->bios[i] = mbio;
> >
> >  		mbio->bi_sector	= (r1_bio->sector +
> > @@ -1271,6 +1334,15 @@ read_again:
> >  		mbio->bi_rw = WRITE | do_flush_fua | (do_sync << BIO_RW_SYNCIO);
> >  		mbio->bi_private = r1_bio;
> >
> > +		if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
> > +			/* Any write that goes to the write-mostly leg AND
> > +			 * overlaps an outstanding behind write needs to be a
> > +			 * barrier.  However, the non-write-mostly leg doesn't
> > +			 * need a barrier, because we wouldn't have an
> > +			 * overlapping write unless the non-write-mostly io
> > +			 * has been ack'd.
> > +			 */
> > +			mbio->bi_rw |= do_behind_flush;
> > +
> >  		atomic_inc(&r1_bio->remaining);
> >  		spin_lock_irqsave(&conf->device_lock, flags);
> >  		bio_list_add(&conf->pending_bio_list, mbio);
> > @@ -2843,6 +2915,7 @@ static struct r1conf *setup_conf(struct
> >  	conf->raid_disks = mddev->raid_disks;
> >  	conf->mddev = mddev;
> >  	INIT_LIST_HEAD(&conf->retry_list);
> > +	INIT_LIST_HEAD(&conf->behind_list);
> >
> >  	spin_lock_init(&conf->resync_lock);
> >  	init_waitqueue_head(&conf->wait_barrier);
> > Index: b/drivers/md/raid1.h
> > ===================================================================
> > --- a/drivers/md/raid1.h
> > +++ b/drivers/md/raid1.h
> > @@ -51,6 +51,7 @@ struct r1conf {
> >
> >  	/* queue pending writes to be submitted on unplug */
> >  	struct bio_list		pending_bio_list;
> > +	struct list_head	behind_list; /* outstanding behind writes */
> >  	int			pending_count;
> >
> >  	/* for use when syncing mirrors:
> > @@ -126,6 +127,7 @@ struct r1bio {
> >  	int			read_disk;
> >
> >  	struct list_head	retry_list;
> > +	struct list_head	behind_list;
> >  	/* Next two are only valid when R1BIO_BehindIO is set */
> >  	struct bio_vec		*behind_bvecs;
> >  	int			behind_page_count;