* Any benefit to write intent bitmaps on RAID1
@ 2009-04-09 0:24 Steven Ellis
2009-04-09 1:30 ` Bryan Mesich
2009-04-09 5:59 ` Neil Brown
0 siblings, 2 replies; 11+ messages in thread
From: Steven Ellis @ 2009-04-09 0:24 UTC (permalink / raw)
To: Linux RAID
Given that I have a pair of 1TB drives in RAID1, I'd prefer to reduce any
recovery/resync time. Would an internal bitmap help dramatically, and are
there any other benefits?
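For concreteness, a minimal sketch of the kind of setup being asked about
(the array and device names here are only examples, not taken from this thread):
    # create a two-disk RAID1 with an internal write-intent bitmap
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/sda1 /dev/sdb1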
Steve
--------------------------------------------
Steven Ellis - Technical Director
OpenMedia Limited - The Home of myPVR
email - steven@openmedia.co.nz
website - http://www.openmedia.co.nz
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 0:24 Any benefit to write intent bitmaps on RAID1 Steven Ellis
@ 2009-04-09 1:30 ` Bryan Mesich
2009-04-09 5:59 ` Neil Brown
1 sibling, 0 replies; 11+ messages in thread
From: Bryan Mesich @ 2009-04-09 1:30 UTC (permalink / raw)
To: Steven Ellis; +Cc: Linux RAID
On Thu, Apr 09, 2009 at 12:24:05PM +1200, Steven Ellis wrote:
> Given I have a pair of 1TB drives Raid1 I'd prefer to reduce any recovery
> sync time. Would an internal bitmap help dramatically, and are there any
> other benefits.
>
If one of your drives goes pear-shaped and needs to be replaced,
then no, a write-intent bitmap will not help you. When you
replace the drive, the incoming drive will need to do a full
resync.
Often, though, a read/write error causes the drive to be failed. In
this case, the bad block that caused the error should get re-mapped
by the drive firmware. If you have a write-intent bitmap enabled, a
re-add only resyncs the out-of-sync chunks (the chunk-sized regions
the bitmap tracks).
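A rough sketch of what that looks like in practice (the array and device
names are assumed examples):
    # see which bitmap chunks are marked dirty on the returning member
    mdadm --examine-bitmap /dev/sdb1
    # re-add it; only the chunks flagged in the bitmap get resynced
    mdadm /dev/md0 --re-add /dev/sdb1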
There is some overhead when using a write-intent bitmap, as the
bitmap needs to be updated as data is written to the device. For
most people the overhead is not noticeable. If you really need the
performance, the bitmap can be kept in a file on another disk that
is not a member of the RAID1 array in question.
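For example, the bitmap can live as a file on a filesystem that is not on
the array itself (paths and names below are assumptions):
    # replace the internal bitmap with an external one on another disk
    mdadm --grow --bitmap=none /dev/md0
    mdadm --grow --bitmap=/var/lib/md0-bitmap /dev/md0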
I've used write-intent bitmaps many times in a SAN environment
in which FC initiators mirror two block devices, each coming from
a different FC target. This makes maintenance much easier, since
all we have to do is break the RAID1 mirror on the initiator (we
also get good uptime :). A write-intent bitmap speeds up the
re-syncing process since we only resync the out-of-sync data.
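Roughly, that maintenance cycle looks like this (names are examples only):
    # break the mirror before working on one FC target
    mdadm /dev/md0 --fail /dev/sdc1
    mdadm /dev/md0 --remove /dev/sdc1
    # afterwards re-add it; with a bitmap only the changed chunks resync
    mdadm /dev/md0 --re-add /dev/sdc1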
Bryan
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 0:24 Any benefit to write intent bitmaps on RAID1 Steven Ellis
2009-04-09 1:30 ` Bryan Mesich
@ 2009-04-09 5:59 ` Neil Brown
2009-04-09 6:26 ` Goswin von Brederlow
2009-04-09 22:51 ` Bill Davidsen
1 sibling, 2 replies; 11+ messages in thread
From: Neil Brown @ 2009-04-09 5:59 UTC (permalink / raw)
To: Steven Ellis; +Cc: Linux RAID
On Thursday April 9, steven@openmedia.co.nz wrote:
> Given I have a pair of 1TB drives Raid1 I'd prefer to reduce any recovery
> sync time. Would an internal bitmap help dramatically, and are there any
> other benefits.
Bryan answered some of this but...
- if your machine crashes, then resync will be much faster if you
have a bitmap.
- If one drive becomes disconnected, and then can be reconnected,
recovery will be much faster.
- if one drive fails and has to be replaced, a bitmap makes no
difference(*).
- there might be a performance hit - it is very dependent on your
workload.
- You can add or remove a bitmap at any time, so you can try to
measure the impact on your particular workload fairly easily.
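For instance, assuming an existing array /dev/md0:
    # add an internal write-intent bitmap, run your workload, measure...
    mdadm --grow --bitmap=internal /dev/md0
    # ...and remove it again if the overhead turns out to matter
    mdadm --grow --bitmap=none /dev/md0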
(*) I've been wondering about adding another bitmap which would record
which sections of the array have valid data. Initially nothing would
be valid and so wouldn't need recovery. Every time we write to a new
section we add that section to the 'valid' sections and make sure that
section is in-sync.
When a device was replaced, we would only need to recover the parts of
the array that are known to be invalid.
As filesystems start using the new "invalidate" command for block
devices, we could clear bits for sections that the filesystem says are
not needed any more...
But currently it is just a vague idea.
NeilBrown
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 5:59 ` Neil Brown
@ 2009-04-09 6:26 ` Goswin von Brederlow
2009-04-10 9:04 ` Neil Brown
2009-04-09 22:51 ` Bill Davidsen
1 sibling, 1 reply; 11+ messages in thread
From: Goswin von Brederlow @ 2009-04-09 6:26 UTC (permalink / raw)
To: Neil Brown; +Cc: Steven Ellis, Linux RAID
Neil Brown <neilb@suse.de> writes:
> (*) I've been wondering about adding another bitmap which would record
> which sections of the array have valid data. Initially nothing would
> be valid and so wouldn't need recovery. Every time we write to a new
> section we add that section to the 'valid' sections and make sure that
> section is in-sync.
> When a device was replaced, we would only need to recover the parts of
> the array that are known to be invalid.
> As filesystem start using the new "invalidate" command for block
> devices, we could clear bits for sections that the filesystem says are
> not needed any more...
> But currently it is just a vague idea.
>
> NeilBrown
If you are up for experimenting I would go for a completely new
approach. Instead of working with physical blocks and marking where
blocks are used and out of sync, how about adding a mapping layer on
the device and using virtual blocks? You reduce the reported disk size
by maybe 1% so that there are always some spare blocks, and initially
all blocks are unmapped (unused). Then whenever there is a write you
pick an unused block, write to it and change the in-memory mapping of
the logical to the physical block. Every X seconds, or on a barrier or
a sync, you commit the mapping from memory to disk in such a way that
it is synchronized between all disks in the RAID. So every committed
mapping represents a valid RAID set. After the commit of a mapping, all
blocks changed between that mapping and the previous one can be marked
as free again. Better to free against the second-to-last mapping, so
there are always two valid mappings to choose from after a crash.
This would obviously need a lot more space than a bitmap, but space is
(relatively) cheap. One benefit, IMHO, is that sync/barrier would not
have to stop all activity on the RAID and wait for the sync/barrier to
finish. It just has to finalize the mapping for the commit and can then
start a new in-memory mapping while the finalized one is written to
disk.
Just some thoughts,
Goswin
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 5:59 ` Neil Brown
2009-04-09 6:26 ` Goswin von Brederlow
@ 2009-04-09 22:51 ` Bill Davidsen
2009-04-10 9:10 ` Neil Brown
1 sibling, 1 reply; 11+ messages in thread
From: Bill Davidsen @ 2009-04-09 22:51 UTC (permalink / raw)
To: Neil Brown; +Cc: Steven Ellis, Linux RAID
Neil Brown wrote:
> On Thursday April 9, steven@openmedia.co.nz wrote:
>
>> Given I have a pair of 1TB drives Raid1 I'd prefer to reduce any recovery
>> sync time. Would an internal bitmap help dramatically, and are there any
>> other benefits.
>>
>
> Bryan answered some of this but...
>
> - if your machine crashes, then resync will be much faster if you
> have a bitmap.
> - If one drive becomes disconnected, and then can be reconnected,
> recovery will be much faster.
> - if one drive fails and has to be replaced, a bitmap makes no
> difference(*).
> - there might be performance hit - it is very dependant on your
> workload.
> - You can add or remove a bitmap at any time, so you can try to
> measure the impact on your particular workload fairly easily.
>
>
> (*) I've been wondering about adding another bitmap which would record
> which sections of the array have valid data. Initially nothing would
> be valid and so wouldn't need recovery. Every time we write to a new
> section we add that section to the 'valid' sections and make sure that
> section is in-sync.
> When a device was replaced, we would only need to recover the parts of
> the array that are known to be invalid.
> As filesystem start using the new "invalidate" command for block
> devices, we could clear bits for sections that the filesystem says are
> not needed any more...
> But currently it is just a vague idea.
>
It's obvious that this idea would provide a speedup, and it might also
be useful for physical dump software that would save only the "used"
portions of the array. Only you have an idea of how much effort
this would take, although my thought is "very little" for the stable
case and "bunches" for the case of an array size change.
I have been trying to make a COW copy of an entire drive with qemu-img
and then booting it under KVM. Besides giving an interesting slant to
the term "dual boot", I can back up the changes file (a sparse file)
quickly and into a small space with a backup tool that knows about
sparse files. There is lots of room to imagine uses for this if we had it.
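Something along these lines, assuming the source drive is /dev/sda (exact
flags vary with the qemu version):
    # create a sparse qcow2 overlay backed by the raw drive
    qemu-img create -f qcow2 -b /dev/sda overlay.qcow2
    # boot the overlay under KVM; all writes land in overlay.qcow2
    qemu-system-x86_64 -enable-kvm -m 1024 -hda overlay.qcow2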
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
"You are disgraced professional losers. And by the way, give us our money back."
- Representative Earl Pomeroy, Democrat of North Dakota
on the A.I.G. executives who were paid bonuses after a federal bailout.
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 6:26 ` Goswin von Brederlow
@ 2009-04-10 9:04 ` Neil Brown
2009-04-11 2:56 ` Goswin von Brederlow
0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2009-04-10 9:04 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: Steven Ellis, Linux RAID
On Thursday April 9, goswin-v-b@web.de wrote:
> Neil Brown <neilb@suse.de> writes:
>
> > (*) I've been wondering about adding another bitmap which would record
> > which sections of the array have valid data. Initially nothing would
> > be valid and so wouldn't need recovery. Every time we write to a new
> > section we add that section to the 'valid' sections and make sure that
> > section is in-sync.
> > When a device was replaced, we would only need to recover the parts of
> > the array that are known to be invalid.
> > As filesystem start using the new "invalidate" command for block
> > devices, we could clear bits for sections that the filesystem says are
> > not needed any more...
> > But currently it is just a vague idea.
> >
> > NeilBrown
>
> If you are up for experimenting I would go for a completly new
> approach. Instead of working with physical blocks and marking where
> blocks are used and out of sync how about adding a mapping layer on
> the device and using virtual blocks. You reduce the reported disk size
> by maybe 1% to always have some spare blocks and initialy all blocks
> will be unmapped (unused). Then whenever there is a write you pick out
> an unused block, write to it and change the in memory mapping of the
> logical to physical block. Every X seconds, on a barrier or an sync
> you commit the mapping from memory to disk in such a way that it is
> synchronized between all disks in the raid. So every commited mapping
> represents a valid raid set. After the commit of the mapping all
> blocks changed between the mapping and the last can be marked as free
> again. Better use the second last so there are always 2 valid mappings
> to choose from after a crash.
>
> This would obviously need a lot more space than a bitmap but space is
> (relatively) cheap. One benefit imho should be that sync/barrier would
> not have to stop all activity on the raid to wait for the sync/barrier
> to finish. It just has to finalize the mapping for the commit and then
> can start a new in memory mapping while the finalized one writes to
> disk.
While there is obviously real value in this functionality, I can't
help thinking that it belongs in the file system, not the block
device.
But then I've always seen logical volume management as an interim hack
until filesystems were able to span multiple volumes in a sensible
way. As time goes on it seems less and less 'interim'.
I may well implement a filesystem that has this sort of
functionality. I'm very unlikely to implement it in the md layer.
But you never know what will happen...
Thanks for the thoughts.
NeilBrown
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-09 22:51 ` Bill Davidsen
@ 2009-04-10 9:10 ` Neil Brown
0 siblings, 0 replies; 11+ messages in thread
From: Neil Brown @ 2009-04-10 9:10 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Steven Ellis, Linux RAID
On Thursday April 9, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > On Thursday April 9, steven@openmedia.co.nz wrote:
> >
> >> Given I have a pair of 1TB drives Raid1 I'd prefer to reduce any recovery
> >> sync time. Would an internal bitmap help dramatically, and are there any
> >> other benefits.
> >>
> >
> > Bryan answered some of this but...
> >
> > - if your machine crashes, then resync will be much faster if you
> > have a bitmap.
> > - If one drive becomes disconnected, and then can be reconnected,
> > recovery will be much faster.
> > - if one drive fails and has to be replaced, a bitmap makes no
> > difference(*).
> > - there might be performance hit - it is very dependant on your
> > workload.
> > - You can add or remove a bitmap at any time, so you can try to
> > measure the impact on your particular workload fairly easily.
> >
> >
> > (*) I've been wondering about adding another bitmap which would record
> > which sections of the array have valid data. Initially nothing would
> > be valid and so wouldn't need recovery. Every time we write to a new
> > section we add that section to the 'valid' sections and make sure that
> > section is in-sync.
> > When a device was replaced, we would only need to recover the parts of
> > the array that are known to be invalid.
> > As filesystem start using the new "invalidate" command for block
> > devices, we could clear bits for sections that the filesystem says are
> > not needed any more...
> > But currently it is just a vague idea.
> >
>
> It's obvious that this idea would provide a speedup, and might be useful
> in terms of doing some physical dump software which would just save the
> "used" portions of the array. Only you have an idea of how much effort
> this would take, although my thought is "very little" for the stable
> case and "bunches" for the case of an array size change.
The only difficulty I can see with the "size change" case is needing
to find space for a bigger bitmap. If the space exists, you just copy
the bitmap (if needed) and you are done.
If the space doesn't exist, you change the chunk size (space covered
per bit) and use the same space.
The rest, I agree, should be fairly easy.
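For the existing write-intent bitmap that knob already exists; a sketch,
with example sizes:
    # recreate the internal bitmap with a coarser chunk so it fits:
    # each bit then covers 131072 KiB (128 MiB) of the array
    mdadm --grow --bitmap=none /dev/md0
    mdadm --grow --bitmap=internal --bitmap-chunk=131072 /dev/md0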
One possible awkwardness is that every time you write to a new segment
which requires setting a new bit, you would need to kick off a resync
for that segment. That could have an adverse and unpredictable effect
on throughput.
Of course you don't *need* that resync to complete until a reboot, so
you could do it at very low priority; it might be OK.
>
> I have been trying making a COW copy of an entire drive with qemu-img,
> then booting it under KCM, and besides giving an interesting slant to
> the term "dual boot," I can back up the changes files (a sparse file)
> quickly and into small space with a backup which knows about sparse
> files. There is lots of room to imagine uses for this if we had it.
Interesting ideas...
NeilBrown
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-10 9:04 ` Neil Brown
@ 2009-04-11 2:56 ` Goswin von Brederlow
2009-04-11 5:35 ` Neil Brown
0 siblings, 1 reply; 11+ messages in thread
From: Goswin von Brederlow @ 2009-04-11 2:56 UTC (permalink / raw)
To: Neil Brown; +Cc: Goswin von Brederlow, Steven Ellis, Linux RAID
Neil Brown <neilb@suse.de> writes:
> On Thursday April 9, goswin-v-b@web.de wrote:
>> Neil Brown <neilb@suse.de> writes:
>>
>> > (*) I've been wondering about adding another bitmap which would record
>> > which sections of the array have valid data. Initially nothing would
>> > be valid and so wouldn't need recovery. Every time we write to a new
>> > section we add that section to the 'valid' sections and make sure that
>> > section is in-sync.
>> > When a device was replaced, we would only need to recover the parts of
>> > the array that are known to be invalid.
>> > As filesystem start using the new "invalidate" command for block
>> > devices, we could clear bits for sections that the filesystem says are
>> > not needed any more...
>> > But currently it is just a vague idea.
>> >
>> > NeilBrown
>>
>> If you are up for experimenting I would go for a completly new
>> approach. Instead of working with physical blocks and marking where
>> blocks are used and out of sync how about adding a mapping layer on
>> the device and using virtual blocks. You reduce the reported disk size
>> by maybe 1% to always have some spare blocks and initialy all blocks
>> will be unmapped (unused). Then whenever there is a write you pick out
>> an unused block, write to it and change the in memory mapping of the
>> logical to physical block. Every X seconds, on a barrier or an sync
>> you commit the mapping from memory to disk in such a way that it is
>> synchronized between all disks in the raid. So every commited mapping
>> represents a valid raid set. After the commit of the mapping all
>> blocks changed between the mapping and the last can be marked as free
>> again. Better use the second last so there are always 2 valid mappings
>> to choose from after a crash.
>>
>> This would obviously need a lot more space than a bitmap but space is
>> (relatively) cheap. One benefit imho should be that sync/barrier would
>> not have to stop all activity on the raid to wait for the sync/barrier
>> to finish. It just has to finalize the mapping for the commit and then
>> can start a new in memory mapping while the finalized one writes to
>> disk.
>
> While there is obviously real value in this functionality, I can't
> help thinking that it belongs in the file system, not the block
> device.
I believe it is the only way to actually remove the race conditions
inherent in software RAID, and there are some uses that don't work well
with a filesystem. E.g. creating a filesystem with only a swapfile on
it, instead of swapping to the RAID device directly, seems a bit stupid.
The same goes for databases that use block devices.
> But then I've always seen logical volume management as an interim hack
> until filesystems were able to span multiple volumes in a sensible
> way. As time goes on it seems less and less 'interim'.
>
> I may well implement a filesystem that has this sort of
> functionality. I'm very unlikely to implement it in the md layer.
> But you never know what will happen...
ZFS already does this. Btrfs does it too, but only with RAID1. I find,
though, that ZFS doesn't really integrate the two; it just has the RAID
and filesystem layers in a single binary, but still as two separate
layers. That makes changing the layout inflexible, e.g. you can't grow
from 4 to 5 disks per stripe.
> Thanks for the thoughts.
>
> NeilBrown
Regards,
Goswin
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-11 2:56 ` Goswin von Brederlow
@ 2009-04-11 5:35 ` Neil Brown
2009-04-11 8:46 ` Goswin von Brederlow
0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2009-04-11 5:35 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: Steven Ellis, Linux RAID
On Saturday April 11, goswin-v-b@web.de wrote:
> Neil Brown <neilb@suse.de> writes:
>
> > On Thursday April 9, goswin-v-b@web.de wrote:
> >> Neil Brown <neilb@suse.de> writes:
> >>
> >> > (*) I've been wondering about adding another bitmap which would record
> >> > which sections of the array have valid data. Initially nothing would
> >> > be valid and so wouldn't need recovery. Every time we write to a new
> >> > section we add that section to the 'valid' sections and make sure that
> >> > section is in-sync.
> >> > When a device was replaced, we would only need to recover the parts of
> >> > the array that are known to be invalid.
> >> > As filesystem start using the new "invalidate" command for block
> >> > devices, we could clear bits for sections that the filesystem says are
> >> > not needed any more...
> >> > But currently it is just a vague idea.
> >> >
> >> > NeilBrown
> >>
> >> If you are up for experimenting I would go for a completly new
> >> approach. Instead of working with physical blocks and marking where
> >> blocks are used and out of sync how about adding a mapping layer on
> >> the device and using virtual blocks. You reduce the reported disk size
> >> by maybe 1% to always have some spare blocks and initialy all blocks
> >> will be unmapped (unused). Then whenever there is a write you pick out
> >> an unused block, write to it and change the in memory mapping of the
> >> logical to physical block. Every X seconds, on a barrier or an sync
> >> you commit the mapping from memory to disk in such a way that it is
> >> synchronized between all disks in the raid. So every commited mapping
> >> represents a valid raid set. After the commit of the mapping all
> >> blocks changed between the mapping and the last can be marked as free
> >> again. Better use the second last so there are always 2 valid mappings
> >> to choose from after a crash.
> >>
> >> This would obviously need a lot more space than a bitmap but space is
> >> (relatively) cheap. One benefit imho should be that sync/barrier would
> >> not have to stop all activity on the raid to wait for the sync/barrier
> >> to finish. It just has to finalize the mapping for the commit and then
> >> can start a new in memory mapping while the finalized one writes to
> >> disk.
> >
> > While there is obviously real value in this functionality, I can't
> > help thinking that it belongs in the file system, not the block
> > device.
>
> I believe it is the only way to actualy remove the race conditions
> inherent in software raid and there are some uses that don't work well
> with a filesystem. E.g. creating a filesystem with only a swapfile on
> it instead of using a raid device seems a bit stupid. Or for databases
> that use block devices.
I agree that it would remove some races, make resync unnecessary, and
thus remove the small risk of data loss when a system with a degraded
raid5 crashes. I doubt it is the only way, and may not even be a good
way, though I'm not certain.
Your mapping of logical to physical blocks - it would technically need
to map each sector independently, but let's be generous (and fairly
realistic) and map each 4K block independently.
Then with a 1TB device, you have 2**28 entries in the table, each 4
bytes, so 2**30 bytes, or 1 gigabyte.
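Spelled out as a quick shell check:
    # 1 TiB device, 4 KiB blocks, one 4-byte table entry per block
    echo $(( 2**40 / 4096 ))       # 268435456 entries  (2^28)
    echo $(( 2**40 / 4096 * 4 ))   # 1073741824 bytes = 1 GiB of table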
You suggest this table is kept in memory. While memory is cheap, I
don't think it is that cheap yet.
So you would need to make compromises, either not keeping it all in
memory, or having larger block sizes (and so needing to pre-read for
updates), or having a more complicated data structure. Or, more
likely, all of the above.
You could make it work, but there would be a performance hit.
Now look at your cases where a filesystem doesn't work well:
1/ Swap. That is a non-issue. After a crash, the contents of swap
are irrelevant. Without a crash, the races you refer to are
irrelevant.
2/ Databases that use block devices directly. Why do they use the
block device directly rather than using O_DIRECT to a
pre-allocated file? Because they believe that the filesystem
introduces a performance penalty. What reason is there to believe
that the performance penalty of your remapped-raid would
necessarily be less than that of a filesystem? I cannot see one.
BTW an alternative approach to closing those races (assuming that I am
understanding you correctly) is to journal all updates to a separate
device, possibly an SSD or battery-backed RAM. That could have the
added benefit of reducing latency, though it may impact throughput.
I'm not sure that is an approach with a real future either, but it
is a valid alternative.
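As an aside, newer mdadm versions expose exactly this idea for RAID4/5/6
as a journal device; a sketch, with all device names assumed:
    # RAID5 whose updates are journalled to a separate SSD partition first
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1 --write-journal=/dev/nvme0n1p1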
>
> > But then I've always seen logical volume management as an interim hack
> > until filesystems were able to span multiple volumes in a sensible
> > way. As time goes on it seems less and less 'interim'.
> >
> > I may well implement a filesystem that has this sort of
> > functionality. I'm very unlikely to implement it in the md layer.
> > But you never know what will happen...
>
> Zfs already does this. btrfs does it but only with raid1. But I find
> that zfs doesn't really integrate the two, it just has the raid and
> filesystem layer in a single binary but still as 2 seperate layers.
> Makes changing the layout inflexible, e.g. you can't grow from 4 to 5
> disks per stripe.
I thought ZFS was more integrated than that, but I haven't looked
deeply.
My vague notion was that when ZFS wanted to write "some data" it
would break it into sets of N blocks, calculate a parity block for
each N, then write those N+1 blocks to N+1 different devices,
wherever there happened to be unused space. Then the addresses of
those N+1 blocks would be stored in the file metadata, which would be
written in a similar way, possibly with a different (smaller) N.
This idea (which might be completely wrong) implies very tight
integration between the layers.
With this setup you could conceivably change the default N at any
time. Old data wouldn't be relocated, but new writes would be written
with the new N. If you have a background defragmentation process, it
could, over a period of time, arrange for the whole filesystem to be
re-laid out with the new N.
Clearly data would still be recoverable after a single drive failure.
The problem I see with this approach is the cost of recovering to a
hot-spare after device failure. Finding which blocks need to be
written where would require scanning all the metadata on the entire
filesystem. And much of this would not be contiguous, so a lot of
seeking would be involved. I wouldn't be surprised if recovering a
device in a nearly-full filesystem took an order of magnitude longer
with that approach than with md style raid.
Given that observation: maybe I am wrong about RAID-Z. However it is
the only model I can come up with that matches the various snippets I
have heard about it.
(hmm... maybe a secondary indexing scheme could help... might get it
down to taking only twice as long, which could be acceptable ....
maybe I will try implementing that after all and see how it
works... in my spare time)
NeilBrown
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-11 5:35 ` Neil Brown
@ 2009-04-11 8:46 ` Goswin von Brederlow
2009-04-11 13:08 ` Bill Davidsen
0 siblings, 1 reply; 11+ messages in thread
From: Goswin von Brederlow @ 2009-04-11 8:46 UTC (permalink / raw)
To: Neil Brown; +Cc: Goswin von Brederlow, Steven Ellis, Linux RAID
Neil Brown <neilb@suse.de> writes:
> On Saturday April 11, goswin-v-b@web.de wrote:
>> Neil Brown <neilb@suse.de> writes:
>>
>> > On Thursday April 9, goswin-v-b@web.de wrote:
>> >> Neil Brown <neilb@suse.de> writes:
>> >>
>> >> > (*) I've been wondering about adding another bitmap which would record
>> >> > which sections of the array have valid data. Initially nothing would
>> >> > be valid and so wouldn't need recovery. Every time we write to a new
>> >> > section we add that section to the 'valid' sections and make sure that
>> >> > section is in-sync.
>> >> > When a device was replaced, we would only need to recover the parts of
>> >> > the array that are known to be invalid.
>> >> > As filesystem start using the new "invalidate" command for block
>> >> > devices, we could clear bits for sections that the filesystem says are
>> >> > not needed any more...
>> >> > But currently it is just a vague idea.
>> >> >
>> >> > NeilBrown
>> >>
>> >> If you are up for experimenting I would go for a completly new
>> >> approach. Instead of working with physical blocks and marking where
>> >> blocks are used and out of sync how about adding a mapping layer on
>> >> the device and using virtual blocks. You reduce the reported disk size
>> >> by maybe 1% to always have some spare blocks and initialy all blocks
>> >> will be unmapped (unused). Then whenever there is a write you pick out
>> >> an unused block, write to it and change the in memory mapping of the
>> >> logical to physical block. Every X seconds, on a barrier or an sync
>> >> you commit the mapping from memory to disk in such a way that it is
>> >> synchronized between all disks in the raid. So every commited mapping
>> >> represents a valid raid set. After the commit of the mapping all
>> >> blocks changed between the mapping and the last can be marked as free
>> >> again. Better use the second last so there are always 2 valid mappings
>> >> to choose from after a crash.
>> >>
>> >> This would obviously need a lot more space than a bitmap but space is
>> >> (relatively) cheap. One benefit imho should be that sync/barrier would
>> >> not have to stop all activity on the raid to wait for the sync/barrier
>> >> to finish. It just has to finalize the mapping for the commit and then
>> >> can start a new in memory mapping while the finalized one writes to
>> >> disk.
>> >
>> > While there is obviously real value in this functionality, I can't
>> > help thinking that it belongs in the file system, not the block
>> > device.
>>
>> I believe it is the only way to actualy remove the race conditions
>> inherent in software raid and there are some uses that don't work well
>> with a filesystem. E.g. creating a filesystem with only a swapfile on
>> it instead of using a raid device seems a bit stupid. Or for databases
>> that use block devices.
>
> I agree that it would remove some races, make resync unnecessary, and
> thus remove the small risk of data loss when a system with a degraded
> raid5 crashes. I doubt it is the only way, and may not even be a good
> way, though I'm not certain.
OK, not the only way. You could have a journal where you first write
which block is to be updated and with what data, sync, and then write
the data to the actual block. After a crash the journal could just be
replayed.
> Your mapping of logical to physical blocks - it would technically need
> to map each sector independently, but let's be generous (and fairly
> realistic) and map each 4K block independently.
> Then with a 1TB device, you have 2**28 entries in the table, each 4
> bytes, so 2**30 bytes, or 1 gigabyte.
> You suggest this table is kept in memory. While memory is cheap, I
> don't think it is that cheap yet.
> So you would need to make compromises, either not keeping it all in
> memory, or having larger block sizes (and so needing to pre-read for
> updates), or having a more complicated data structure. Or, more
> likely, all of the above.
Plus, as a plain array you would have to keep multiple copies of 1GB.
A B-tree where only the used parts are in memory, or something similar,
would really be necessary. Mapping extents instead of individual blocks
would also be useful, as would a defragmenter that remaps blocks into
larger contiguous segments. But now it has really become complex.
> You could make it work, but there would be a performance hit.
>
> Now look at your cases where a filesystem doesn't work well:
> 1/ Swap. That is a non-issue. After a crash, the contents of swap
> are irrelevant. Without a crash, the races you refer to are
> irrelevant.
What about suspend to swap?
> 2/ Database that use block devices directly. Why do they use the
> block device directly rather than using O_DIRECT to a
> pre-allocated file? Because they believe that the filesystem
> introduces a performance penalty. What reason is there to believe
> that the performance penalty of your remapped-raid would
> necessarily be less than that of a filesystem? I cannot see one.
You are assuming we could change the DB to use files instead. :)
> BTW an alternate approach to closing those races (assuming that I am
> understanding you correctly) is to journal all updates to a separate
> device. Possible an SSD or battery-backed RAM. That could have the
> added benefit of reducing latency, though it may impact throughput.
> I'm not sure if that is an approach with a real future either. But it
> is a valid alternate.
That is what hardware raids do.
>> > But then I've always seen logical volume management as an interim hack
>> > until filesystems were able to span multiple volumes in a sensible
>> > way. As time goes on it seems less and less 'interim'.
>> >
>> > I may well implement a filesystem that has this sort of
>> > functionality. I'm very unlikely to implement it in the md layer.
>> > But you never know what will happen...
>>
>> Zfs already does this. btrfs does it but only with raid1. But I find
>> that zfs doesn't really integrate the two, it just has the raid and
>> filesystem layer in a single binary but still as 2 seperate layers.
>> Makes changing the layout inflexible, e.g. you can't grow from 4 to 5
>> disks per stripe.
>
> I thought ZFS was more integrated than that, but I haven't looked
> deeply.
> My vague notion what that when ZFS wanted to write "some data" it
> would break it into sets of N blocks. calculate a parity block for
> each N, then write those N+1 blocks to N+1 different devices,
> where-ever there happened to be unused space. Then the addresses of
> those N+1 block would be stored in the file metadata which would be
> written a similar way, possibly with a different(smaller) N.
>
> This idea (which might be completely wrong) implies very tight
> integration between the layers.
But first you define a storage pool from a group of X devices with a
certain RAID level. The higher level then uses virtual addresses into
that pool. If you want to grow your ZFS you have to add new disks and
create a new pool from them. None of the docs I've seen mention any
support for changing an existing pool.
> With this setup you could conceivably change the default N at any
> time. Old data wouldn't be relocated, but new writes would be written
> with the new N. If you have a background defragmentation process, it
> could, over a period of time, arrange for the whole filesystem to be
> re-laid out with the new N.
As I understand it the pool creates a virtual->physical mapping and the
higher layers use the virtual address. By increasing the number of
disks in a pool, all physical addresses would change, just like when
growing a RAID, and the higher layers would have to readjust their
addresses. At least that is my understanding.
> Clearly data would still be recoverable after a single drive failure.
>
> The problem I see with this approach is the cost of recovering to a
> hot-spare after device failure. Finding which blocks need to be
> written where would require scanning all the metadata on the entire
> filesystem. And much of this would not be contiguous. So much
> seeking would be involved. I wouldn't be surprised if recovering a
> device in a nearly-full filesystem took an order of magnitude longer
> with that approach than with md style raid.
One huge improvement comes from splitting data and metadata into
separate segments, thereby keeping the metadata close together. If one
also takes care to write the parent of a metadata block before its
child, and defragments them frequently, they should stay pretty linear.
And how much metadata is there in the filesystem? My 4.6TB movie
archive has 30000 inodes used so that would be a few MB of
metadata. Hardly relevant. For a news spool it would look different.
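Rough arithmetic, assuming 256-byte inodes (the inode size here is an
assumption):
    echo $(( 30000 * 256 ))   # 7680000 bytes, i.e. roughly 7 MB of inode data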
> Given that observation: maybe I am wrong about RAID-Z. However it is
> the only model I can come up with that matches the various snippets I
> have heard about it.
>
> (hmm... maybe a secondary indexing scheme could help... might get it
> down to taking only twice as long, with could be acceptable ....
> maybe I will try implementing that after all and see how it
> works... in my spare time)
>
> NeilBrown
The snippets I've read about ZFS lead me to believe that the RAID level
is restricted to the pools. So in effect you just have lots of
internal md devices. Resync speed in ZFS should be exactly like with
normal RAID.
Regards,
Goswin
* Re: Any benefit to write intent bitmaps on RAID1
2009-04-11 8:46 ` Goswin von Brederlow
@ 2009-04-11 13:08 ` Bill Davidsen
0 siblings, 0 replies; 11+ messages in thread
From: Bill Davidsen @ 2009-04-11 13:08 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: Neil Brown, Steven Ellis, Linux RAID
Goswin von Brederlow wrote:
> Neil Brown <neilb@suse.de> writes:
>
>
>> You could make it work, but there would be a performance hit.
>>
>> Now look at your cases where a filesystem doesn't work well:
>> 1/ Swap. That is a non-issue. After a crash, the contents of swap
>> are irrelevant. Without a crash, the races you refer to are
>> irrelevant.
>>
>
> What about suspend to swap?
>
Suspend is a "without a crash" case; I wouldn't want to restore from
swap if the system failed to complete a clean shutdown.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
"You are disgraced professional losers. And by the way, give us our money back."
- Representative Earl Pomeroy, Democrat of North Dakota
on the A.I.G. executives who were paid bonuses after a federal bailout.
Thread overview: 11 messages
2009-04-09 0:24 Any benefit to write intent bitmaps on RAID1 Steven Ellis
2009-04-09 1:30 ` Bryan Mesich
2009-04-09 5:59 ` Neil Brown
2009-04-09 6:26 ` Goswin von Brederlow
2009-04-10 9:04 ` Neil Brown
2009-04-11 2:56 ` Goswin von Brederlow
2009-04-11 5:35 ` Neil Brown
2009-04-11 8:46 ` Goswin von Brederlow
2009-04-11 13:08 ` Bill Davidsen
2009-04-09 22:51 ` Bill Davidsen
2009-04-10 9:10 ` Neil Brown