Re: Fast RAID 1 Resync

linux-raid.vger.kernel.org archive mirror

* Re: Fast RAID 1 Resync
@ 2003-01-03  1:22 Philip Cameron
  2003-01-16 23:22 ` bmoon
  0 siblings, 1 reply; 2+ messages in thread
From: Philip Cameron @ 2003-01-03  1:22 UTC (permalink / raw)
  To: linux-raid, neilb

Hi Neil,

(Sorry if this is a repost. I had an error with returned mail)

Thanks for your comments.

I also don't see a need to synchronize disks at mkraid time. It's nice to have
identical disks, but not necessary as long as the result of reading a sector that
has never been written is undefined. An option to do either could be provided if
there is a real need, but adding an option increases complexity, especially
during testing.

I have been thinking of tracking writes to each chunk with a counter. The
counters would be organized into a vector indexed by chunk number. The counter
is incremented by the number of mirrors, including any currently unavailable
mirrors, before the write starts. It is decremented as each mirror's write
completes. So when all writes in a chunk are complete, the counter returns to
zero. If there is a missing mirror, the counter will not return to zero (since
one of the needed writes was not done).

When a disk is pulled, the counters increment but don't return to zero. When the
disk is reinserted, the resync needs to copy only the chunks whose counter did
not return to zero. When the resync of a chunk is complete, set its counter to
zero.
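
A minimal sketch of the counter vector described above, in Python for brevity.
The class and method names are hypothetical (they are not taken from the md
driver), and the sketch ignores locking and real I/O; it only shows how a
missed mirror write leaves a counter non-zero.

```python
class ChunkCounters:
    """Per-chunk outstanding-write counters for a RAID1 set (illustrative only)."""

    def __init__(self, num_chunks, num_mirrors):
        self.num_mirrors = num_mirrors
        self.counters = [0] * num_chunks  # vector indexed by chunk number

    def start_write(self, chunk):
        # Count one pending write per configured mirror, including any
        # mirror that is currently unavailable.
        self.counters[chunk] += self.num_mirrors

    def mirror_write_done(self, chunk):
        # Called once for each mirror that actually completes its write.
        self.counters[chunk] -= 1

    def chunks_needing_resync(self):
        # With I/O idle, a non-zero counter means some mirror missed a write.
        return [c for c, n in enumerate(self.counters) if n != 0]

    def resync_done(self, chunk):
        self.counters[chunk] = 0


cc = ChunkCounters(num_chunks=8, num_mirrors=2)

# Both mirrors present: the write to chunk 5 completes on each of them.
cc.start_write(5)
cc.mirror_write_done(5)
cc.mirror_write_done(5)

# Mirror 1 has been pulled: only mirror 0 completes the write to chunk 3.
cc.start_write(3)
cc.mirror_write_done(3)

dirty = cc.chunks_needing_resync()  # chunk 3 only
```

When the disk is reinserted, only the chunks in `dirty` are copied, and
`resync_done()` is called for each as it finishes.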

To deal with recovery after a crash, I am thinking about using your approach.
Use a bit per counter and set the bit when the counter is non-zero. When a bit
goes from 0 to 1, the updated bit vector is written before starting the write to
the chunk. On reboot after a crash, the bit vector from the selected mirror is
used (the current mechanism is used to select the base disk). The counter is
incremented for each chunk that has a bit that is set. After this, the resync in
the above case can be performed. I don't see a need for timestamps beyond what
is currently being done.
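
The crash-recovery rule above could be sketched as follows. This is only an
illustration under assumptions: `persist` is a hypothetical callback standing
in for the write of the bit vector to disk, and a Python list stands in for
the on-disk copy.

```python
class DirtyBitmap:
    """One bit per chunk counter; the bit is set while the counter is non-zero."""

    def __init__(self, num_chunks, persist):
        self.bits = [0] * num_chunks
        self.persist = persist  # hypothetical: writes the bit vector to stable storage

    def before_write(self, chunk):
        # On a 0 -> 1 transition the updated bit vector must be written
        # before the data write to the chunk is started.
        if not self.bits[chunk]:
            self.bits[chunk] = 1
            self.persist(list(self.bits))


def recover_counters(saved_bits):
    # After a crash the in-memory counters start at zero. Incrementing the
    # counter of every chunk whose saved bit is set makes it non-zero, so the
    # ordinary resync pass copies that chunk.
    return [1 if bit else 0 for bit in saved_bits]


disk = []  # stand-in for the bit vector copies written to the base disk
bm = DirtyBitmap(4, disk.append)
bm.before_write(2)
bm.before_write(2)  # bit already set: no second bit vector write

# ... crash and reboot; the base disk is selected by the existing mechanism ...
counters = recover_counters(disk[-1])
```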

The transitions from 1 to 0 are not all that important since the worst case is
syncing a chunk that is already mirrored. Overall performance can be improved by
delaying the 1 to 0 transition for a few seconds. A lazy write of the bits can
be done every 10 seconds or so if there are no 0 to 1 changes during that interval.
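
A rough sketch of the lazy 1 to 0 handling, assuming a periodic timer calls
`tick()`. The names are hypothetical and time is passed in explicitly so the
example is deterministic; `disk` is an in-memory stand-in for the vector as
last written out.

```python
class LazyBitmap:
    FLUSH_INTERVAL = 10.0  # seconds; "every 10 seconds or so"

    def __init__(self, num_chunks):
        self.mem = [0] * num_chunks   # current in-memory bits
        self.disk = [0] * num_chunks  # bits as last written out
        self.last_write = 0.0         # time of the last bit vector write
        self.writes = 0               # number of bit vector writes issued

    def _write_out(self, now):
        self.disk = list(self.mem)
        self.last_write = now
        self.writes += 1

    def set_bit(self, chunk, now):
        # 0 -> 1 must reach the disk before the data write starts, unless
        # the on-disk copy still shows the bit as set.
        self.mem[chunk] = 1
        if not self.disk[chunk]:
            self._write_out(now)

    def clear_bit(self, chunk):
        # 1 -> 0 only changes memory; a stale 1 on disk costs at most one
        # unnecessary chunk resync.
        self.mem[chunk] = 0

    def tick(self, now):
        # Flush pending clears once no bit vector write has happened for a
        # full interval.
        if self.mem != self.disk and now - self.last_write >= self.FLUSH_INTERVAL:
            self._write_out(now)


bm = LazyBitmap(4)
bm.set_bit(1, now=0.0)  # first write to chunk 1: bit vector written
bm.clear_bit(1)         # write completed: memory only
bm.set_bit(1, now=2.0)  # on-disk bit is still 1: no bit vector write needed
bm.clear_bit(1)
bm.tick(now=5.0)        # too soon, nothing flushed
bm.tick(now=12.0)       # idle long enough: the clear is written out
```

The second `set_bit()` call is where the delay pays off: a chunk that is
rewritten soon after going idle does not trigger another bit vector write.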

I have a third goal: minimize the resync time for a new (replacement) disk. In
this case all of the chunks that have ever been written need to be copied. I
have been thinking about controlling this through a second bit vector. When a
chunk is written for the first time the bit is set and it is never reset. When
the new disk is added, the counters corresponding to all of the bits that have
been set in the vector are incremented and a resync (as above) is performed. As
above, the vector update needs to be done before the write to the chunk. The
length of the resync is then proportional to how much of the disk has been used.
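
The second bit vector might look like the sketch below. As before, the names
are hypothetical and `persist` stands in for writing the vector to disk before
the chunk write is issued.

```python
class EverWritten:
    """One bit per chunk, set on the first write and never reset."""

    def __init__(self, num_chunks, persist):
        self.bits = [0] * num_chunks
        self.persist = persist  # hypothetical: writes the vector to stable storage

    def before_write(self, chunk):
        # The vector update must be on disk before the write to the chunk.
        if not self.bits[chunk]:
            self.bits[chunk] = 1
            self.persist(list(self.bits))


def prepare_new_disk(counters, ever_written):
    # Increment the counter of every chunk that has ever been written, then
    # return the chunks the ordinary resync pass would copy.
    for chunk, used in enumerate(ever_written):
        if used:
            counters[chunk] += 1
    return [c for c, n in enumerate(counters) if n != 0]


saved = []  # stand-in for the on-disk copies of the vector
ew = EverWritten(6, saved.append)
for chunk in (0, 4, 0):
    ew.before_write(chunk)  # the repeated write to chunk 0 costs no vector write

counters = [0] * 6
to_copy = prepare_new_disk(counters, ew.bits)  # chunks 0 and 4
```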

A hardware note: the system has two IO assemblies, each of which contains a PCI
bus, a SCSI HBA and 3 hot-pluggable SCSI disk slots. We are using 72GB disks.
Two-mirror RAID1 sets are the most practical configuration.

Phil Cameron

>> 
>> Hi,
>  
>
>>  You have two quite different, though admittedly similar, goals here.
>>   1/ quick resync when a recently removed drive is re-added.
>>   2/ quick resync after an unclean shutdown.
>> 
>>  I would treat these quite separately.
>> 
>>  For the latter I would have a bitmap which was written to disk
>>  whenever a bit was set and eventually after a bit was cleared.
>>  I would use a 16 bit counter for each 'chunk'.
>>  If the high bit is clear, it stores the number of outstanding writes
>>  on that chunk.
>>  If the high bit is set, it stores some sort of time stamp of when the
>>  number of outstanding writes hit zero.
>>  Every time you write the bitmap, you increment this timestamp.
>>  So when you schedule a write, you only need to write out the bitmap
>>  first if the 16bit number of this chunk has the highbit set and has a
>>  timestamp which is different to the current one - which means that
>>  the bitmap has been written out with a zero in this slot.
>>  So:
>>    On write, if highbit clear, increment counter
>>              if highbit set and timestamp matches, set counter to 1
>>                 and set bit in bitmap
>>              if highbit set and timestamp doesn't match, set
>>                 bit in bitmap, schedule write, set counter to 1
>>    On write complete,
>>         decrement counter.  If it hits zero, set to timestamp with
>>         high bit set, clear the bitmap bit, and schedule a bitmap
>>         writeout a few seconds hence.
>> 
>>  For the former I would just hold a separate bitmap, one bit per
>>  chunk.
>>  While all drives are working, this bitmap would be all zeros.
>>  Whenever a write fails to write to all drives, the relevant bit gets
>>  set.
>>  When a recently failed drive comes back online, we resync all chunks
>>  that have that bit set.
>> 
>>  I don't see a particular need to sync the drives at device creation
>>  time, but I would like to keep the option of doing so.  I don't
>>  really care which behaviour is the default.
>> 
>> NeilBrown
>  
>
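
For reference, the 16-bit counter/timestamp rules quoted above can be sketched
as follows. This is one reading of the description, not code from md. The
names are hypothetical, the initial state (every chunk idle since timestamp 0,
current timestamp 1) is an assumption, timestamp wrap-around is ignored, and
the delayed writeout is left to the caller.

```python
HIGH = 0x8000  # high bit of the 16-bit per-chunk word
MASK = 0x7FFF  # low 15 bits: write count or timestamp


class StampedCounters:
    def __init__(self, num_chunks):
        self.stamp = 1                    # current timestamp (assumed start value)
        self.word = [HIGH] * num_chunks   # assumed: idle since timestamp 0
        self.mem_bits = [0] * num_chunks  # in-memory bitmap
        self.disk_bits = [0] * num_chunks # bitmap as last written out
        self.bitmap_writes = 0

    def write_bitmap(self):
        # Every bitmap write increments the timestamp.
        self.disk_bits = list(self.mem_bits)
        self.stamp = (self.stamp + 1) & MASK
        self.bitmap_writes += 1

    def start_write(self, chunk):
        w = self.word[chunk]
        if not (w & HIGH):
            self.word[chunk] = w + 1      # writes already outstanding
            return
        self.mem_bits[chunk] = 1
        self.word[chunk] = 1
        if (w & MASK) != self.stamp:
            # The bitmap has been written since this chunk went idle, so the
            # on-disk bit is 0 and must be set before the data write.
            self.write_bitmap()

    def write_complete(self, chunk):
        self.word[chunk] -= 1
        if self.word[chunk] == 0:
            self.word[chunk] = HIGH | self.stamp
            self.mem_bits[chunk] = 0
            # A bitmap writeout a few seconds hence would be scheduled here.


sc = StampedCounters(4)
sc.start_write(2)     # stale timestamp: bitmap written first
sc.write_complete(2)  # chunk idle, stamped with the current timestamp
sc.start_write(2)     # timestamp matches: no bitmap write needed
```

The timestamp comparison is against the single current value (`stamp` here),
which answers whether the on-disk bitmap can already have been written with a
zero for this chunk.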


* Re: Fast RAID 1 Resync
  2003-01-03  1:22 Fast RAID 1 Resync Philip Cameron
@ 2003-01-16 23:22 ` bmoon
  0 siblings, 0 replies; 2+ messages in thread
From: bmoon @ 2003-01-16 23:22 UTC (permalink / raw)
  To: neilb; +Cc: Philip Cameron, linux-raid

Neil,

It is very interesting, but I have a couple of questions about your suggestion,
as follows:

1) Does each bit in the bitmap correspond to one chunk?
2) There is a counter (or timestamp) for each chunk on each mirror, right?
    Then how do we determine whether the timestamp matches? What do you
    compare it with?
3) I think a bit in the bitmap has the same meaning as the MSB (high bit) of
    the counter. Does that mean you need only a counter for each chunk? Am I
    wrong?


 Bo
