Re: async writes in raid1

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: async writes in raid1
       [not found] <3E9B00F6.6C46EBBC@SteelEye.com>
@ 2003-04-14 19:24 ` Peter T. Breuer
  2003-04-15 18:14   ` Paul Clements
  0 siblings, 1 reply; 7+ messages in thread
From: Peter T. Breuer @ 2003-04-14 19:24 UTC (permalink / raw)
  To: Paul Clements; +Cc: linux-raid

"A month of sundays ago Paul Clements wrote:"
> coding (the design is complete) on some stuff for md/raid1. I'm hoping
> to use your code as a starting point. On top of that, I'll be adding:
> 
> * disk storage for the bitmap

Sounds good.

Coincidently I just had to add an ondisk hashmap for ENBD (another project)
in order to cache request IDs for transport protocols that don't handle
repeats and retries on their own. I used a zone allocator to dispense
items of memory.

You have to be careful never to leave the map in an incoherent state
(even if somebody pulls the plug). That's much easier with a bitmap!

> * changing bits in the bitmaps to 16-bit counters (so that a bit won't
> be cleared prematurely when there are other outstanding I/Os to a block,
> which could result in data loss on the backup, if a failure coincides)

The current bitmap implementation already has 16-bit counters for each 
"page" (4096 bytes, i.e. mapping 32K ondisk blocks), to which it falls
back to in case it can't get the bitmap page itself. Actually, it
has one 16bit counter for every 2K ondisk blocks (256B of bitmap)

That was the most extra capital cost I was willing to invest at startup.
It's about one one-hundredth of the maximum size of the bitmap,
invested in counters that are "just in case".

So the bitmap can always be relied on to be accurate in lumps of 2MB or
so, even when there is no memory.

> * allowing the bitmap to be rescaled -- i.e., changing the amount of
> data that each bit represents

I really don't agree with that :-).  But maybe it can be done!  It's
just I personally see no way anyone can count in anything except 1K
blocks with the current raid code the way it is.

Originally I wrote the patch with more generality, but realized that the
generality could never be made use of - there are too many assumptions
in the extant raid code that one block = 1KB.

> * modifying the code so that it can be more easily leveraged by the
> other raid drivers, as Neil has indicated that that would be better

It is my intention also - I don't know of anything that stops it being
extended. I'll have a look at raid5 if you like.

> Some comments below on your newest additions...

Ta.

> >  1) in make_request:
> >     duplicating the bh->b_data from the original request into a local
> >     buffer obtained via kmalloc (if kmalloc fails, don't set the flag
> >     in the rh1_bh for async writes). 
> 
> You might want a pool of these...otherwise, under memory pressure, the
> whole thing degrades to synchronous mode...which may be OK for some
> applications, but probably not all...

Yes. My intention is only to do testing at the moment. I don't see
anything wrong with degrading to sync mode, since I actually don't
believe there is any advantage to async writes in normal use :-).

> >  2) in the mirrored bh's end_request:
> >     if we're the first mbh and async flag is set, ack the original bh.
> 
> Actually, I think you want to make sure, in the async case, that the
> write to the _primary_ disk completes, as opposed to just the first
> write...I think you could get the I/O completing out of order, right?

Hmm .. that's a point. It's not a question of out of order, so much as
we want to give the maximum chance of getting one successful write and
reporting success. This way we may be unlucky and hit a device that has
just failed, and report failure to the user, when we should have gone on
to try another device and report success back when we write to it.
It's just a case of adding "&& uptodate" to the below conditional
in raid1_end_bh_io (the end_req for a mirrored i/o)...

        /* if nobody has done the final end_io yet, do it now */
        if (!test_and_set_bit(R1BH_AsyncPhase, &r1_bh->state)) {

                PRINTK(KERN_DEBUG "raid1: sync end i/o on sectors %lu-%lu\n",
                        bh->b_rsector, bh->b_rsector + (bh->b_size >> 9) - 1);

                io_request_done(bh->b_rsector, conf,
                        test_bit(R1BH_SyncPhase, &r1_bh->state));
                bh->b_end_io(bh, uptodate);
        }

Should I add "&& uptodate" to the condition? I'm inclined to!

I don't see that there's any particular attempt to enforce write order
in raid.  Correct me if I'm wrong!  I think that requests are submitted
asynchronously to the various devices, and we just wait for the i/o to
complete (in sync mode, for all writes to complete). That can in
principle lead to arbitrary reordering all by itself.

Adding something to preserve write order might be sensible.

> > Doubts: do I have to point mbh->b_page at something new too? There are
> > occasional tests scattered in the code (on read, as far as I can see)
> > to make sure it corresponds correctly to the b_data field.
> 
> I don't believe this is necessary, since you already have logic to make
> sure that the page gets freed after the last I/O is complete.

OK - it's not my logic, I believe, but rather whatever naturally
happens in the existing raid and generic block device code. This
needs study, but the memory management in the raid code is not
completely transparant.

Peter

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: async writes in raid1
  2003-04-14 19:24 ` async writes in raid1 Peter T. Breuer
@ 2003-04-15 18:14   ` Paul Clements
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Clements @ 2003-04-15 18:14 UTC (permalink / raw)
  To: ptb; +Cc: linux-raid

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago Paul Clements wrote:"

> > * allowing the bitmap to be rescaled -- i.e., changing the amount of
> > data that each bit represents
> 
> I really don't agree with that :-).  But maybe it can be done!  It's
> just I personally see no way anyone can count in anything except 1K
> blocks with the current raid code the way it is.
> 
> Originally I wrote the patch with more generality, but realized that the
> generality could never be made use of - there are too many assumptions
> in the extant raid code that one block = 1KB.

The translation from 1K blocks to whatever-sized bitmap blocks (or
"chunks") will be done only when marking or unmarking the bitmap (and
incrementing/decrementing the block counters). The block counters allow
us to know when multiple 1K block writes are actually on the same
"chunk" so that everything is handled correctly.

--
Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: async writes in raid1
@ 2003-04-14 21:27 Chuck Ebbert
  2003-04-15 13:50 ` Paul Clements
  0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2003-04-14 21:27 UTC (permalink / raw)
  To: ptb@it.uc3m.es; +Cc: linux-raid

ptb wrote:


> I've done a few million async writes and haven't died yet. Of course I
> don't know if this procedure works always.


 What happens if the system crashes before all the writes complete?

 

--
 Chuck

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: async writes in raid1
  2003-04-14 21:27 Chuck Ebbert
@ 2003-04-15 13:50 ` Paul Clements
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Clements @ 2003-04-15 13:50 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: ptb@it.uc3m.es, linux-raid

Chuck Ebbert wrote:
> 
> ptb wrote:
> 
> > I've done a few million async writes and haven't died yet. Of course I
> > don't know if this procedure works always.
> 
>  What happens if the system crashes before all the writes complete?

It depends which system crashes...

Peter's current code does not handle the primary server crashing at all,
since there is no disk backing of the bitmap. If the network fails or
the secondary fails, however, this can be recovered from without data
loss.

--
Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: async writes in raid1
@ 2003-04-14 19:32 Peter T. Breuer
  2003-04-15 18:08 ` Paul Clements
  0 siblings, 1 reply; 7+ messages in thread
From: Peter T. Breuer @ 2003-04-14 19:32 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, linux-raid

"A month of sundays ago ptb wrote:"
> > Actually, I think you want to make sure, in the async case, that the
> > write to the _primary_ disk completes, as opposed to just the first
> > write...I think you could get the I/O completing out of order, right?
> 
> Hmm .. that's a point. It's not a question of out of order, so much as
> we want to give the maximum chance of getting one successful write and

Ooops .. I looked up the wrong bit of code in my reply. I looked at the
final end_req, instead of the mirrored i/o end_req. Yes, don't worry,
it's already done right. The ack happens early only if there is success
to report ("uptodate" is set)..


        /*
         * WRITE:
         *
         * Let's see if all mirrored write operations have finished 
         * already.
         *
         * In any case, do the end io early on the master bh if we are
         * uptodate, and AsyncIO is set on the bh. We set AsyncPhase
         * when this happens, so we don't do it twice, inadvertently.
         */

        if (uptodate
        &&  test_bit(R1BH_AsyncIO, &r1_bh->state)
        && !test_and_set_bit(R1BH_AsyncPhase, &r1_bh->state)) {

                struct buffer_head *mbh = r1_bh->master_bh;

                raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);

                PRINTK(KERN_DEBUG "raid1: async end i/o on sectors %lu-%lu\n",
                        mbh->b_rsector, mbh->b_rsector + (mbh->b_size >> 9) - 1);

                io_request_done(mbh->b_rsector, conf,
                        test_bit(R1BH_SyncPhase, &r1_bh->state));
                mbh->b_end_io(mbh, uptodate);
        }

Peter

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: async writes in raid1
  2003-04-14 19:32 Peter T. Breuer
@ 2003-04-15 18:08 ` Paul Clements
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Clements @ 2003-04-15 18:08 UTC (permalink / raw)
  To: ptb; +Cc: linux-raid

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago ptb wrote:"
> > > Actually, I think you want to make sure, in the async case, that the
> > > write to the _primary_ disk completes, as opposed to just the first
> > > write...I think you could get the I/O completing out of order, right?
> >
> > Hmm .. that's a point. It's not a question of out of order, so much as
> > we want to give the maximum chance of getting one successful write and
> 
> Ooops .. I looked up the wrong bit of code in my reply. I looked at the
> final end_req, instead of the mirrored i/o end_req. Yes, don't worry,
> it's already done right. The ack happens early only if there is success
> to report ("uptodate" is set)..

Right, but what if the data is successfully written to the secondary and
then ACKed back to userland and then the primary server fails before the
data gets written locally (I know this is unlikely, but possible,
nonetheless)? The primary will now have out-of-date data that userland
thinks is committed...

--
Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* async writes in raid1
@ 2003-04-14 14:58 Peter T. Breuer
  0 siblings, 0 replies; 7+ messages in thread
From: Peter T. Breuer @ 2003-04-14 14:58 UTC (permalink / raw)
  To: Paul.Clements; +Cc: neilb, linux-raid

I simply went ahead and implemented async writes by

 1) in make_request:
    duplicating the bh->b_data from the original request into a local
    buffer obtained via kmalloc (if kmalloc fails, don't set the flag
    in the rh1_bh for async writes). Then point all the mirrored bh's
    mbh->b_data at this.

 2) in the mirrored bh's end_request:
    if we're the first mbh and async flag is set, ack the original bh.
    if we're the last  mbh and async flag is set, kfree the mbh->b_data
    that we kmalloc'ed earlier.

I've done a few million async writes and haven't died yet. Of course I
don't know if this procedure works always.

Doubts: do I have to point mbh->b_page at something new too? There are
occasional tests scattered in the code (on read, as far as I can see)
to make sure it corresponds correctly to the b_data field.

This is all in the fr1 driver.

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.14.tgz

fr1.o needs to be loaded with async=1 to activate async writes.

There's still a couple of debugging printks, but you can kill them
yourself.

  raid1: async end i/o on sectors 3956-3957
  raid1: async end i/o on sectors 3958-3959
  raid1: async end i/o on sectors 3960-3961
  raid1: async end i/o on sectors 3962-3963
  raid1: async end i/o on sectors 3964-3965
  raid1: async end i/o on sectors 3966-3967
  bitmap: free page enters with page count 6
  bitmap: free page exits with page count 7

Async writes "can work" because of the bitmap support architecture in
fr1.  The bitmap is marked before the async write and cleared when the
last mirrored bh completes.

It all degrades nicely in out of memory conditions.

Peter

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2003-04-15 18:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <3E9B00F6.6C46EBBC@SteelEye.com>
2003-04-14 19:24 ` async writes in raid1 Peter T. Breuer
2003-04-15 18:14   ` Paul Clements
2003-04-14 21:27 Chuck Ebbert
2003-04-15 13:50 ` Paul Clements
  -- strict thread matches above, loose matches on Subject: below --
2003-04-14 19:32 Peter T. Breuer
2003-04-15 18:08 ` Paul Clements
2003-04-14 14:58 Peter T. Breuer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).