dm-devel.redhat.com archive mirror
* Queuing of dm-raid1 resyncs to the same underlying block devices
@ 2015-09-26 15:49 Richard Davies
  2015-09-30 13:22 ` Brassow Jonathan
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Davies @ 2015-09-26 15:49 UTC (permalink / raw)
  To: dm-devel

Hi,

Does dm-raid queue resyncs of multiple dm-raid1 arrays, if the underlying
block devices are the same?

Linux md has this feature, e.g.:

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sda2[2] sdb2[1]
      943202240 blocks [2/1] [_U]
      [====>................]  recovery = 21.9% (207167744/943202240)
      finish=20290.8min speed=603K/sec
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid1 sda1[2] sdb1[1]
      67108736 blocks [2/1] [_U]
        resync=DELAYED
      bitmap: 1/1 pages [4KB], 65536KB chunk


After some time investigating, I can't find it in dm-raid.

Please can someone tell me if this is implemented or not?

If it is implemented, where should I look to see it happening?

Thanks,

Richard.


* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-09-26 15:49 Queuing of dm-raid1 resyncs to the same underlying block devices Richard Davies
@ 2015-09-30 13:22 ` Brassow Jonathan
  2015-09-30 14:00   ` Heinz Mauelshagen
  0 siblings, 1 reply; 8+ messages in thread
From: Brassow Jonathan @ 2015-09-30 13:22 UTC (permalink / raw)
  To: device-mapper development; +Cc: Heinz Mauelshagen

I don’t believe it does.  dm-raid does use the same RAID kernel personalities as MD though, so I would think that it could be added.  I’ll check with Heinz and see if he knows.

 brassow

> On Sep 26, 2015, at 10:49 AM, Richard Davies <richard@arachsys.com> wrote:
> 
> Hi,
> 
> Does dm-raid queue resyncs of multiple dm-raid1 arrays, if the underlying
> block devices are the same?
> 
> Linux md has this feature, e.g.:
> 
> # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid1 sda2[2] sdb2[1]
>      943202240 blocks [2/1] [_U]
>      [====>................]  recovery = 21.9% (207167744/943202240)
>      finish=20290.8min speed=603K/sec
>      bitmap: 1/8 pages [4KB], 65536KB chunk
> 
> md0 : active raid1 sda1[2] sdb1[1]
>      67108736 blocks [2/1] [_U]
>        resync=DELAYED
>      bitmap: 1/1 pages [4KB], 65536KB chunk
> 
> 
> After some time investigating, I can't find it in dm-raid.
> 
> Please can someone tell me if this is implemented or not?
> 
> If it is implemented, where should I look to see it happening?
> 
> Thanks,
> 
> Richard.
> 




* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-09-30 13:22 ` Brassow Jonathan
@ 2015-09-30 14:00   ` Heinz Mauelshagen
  2015-09-30 22:20     ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Heinz Mauelshagen @ 2015-09-30 14:00 UTC (permalink / raw)
  To: Brassow Jonathan, device-mapper development


No, lvm/dm-raid does not queue synchronization.

If multiple raid1/4/5/6/10 LVs have their image LVs on the same PVs, resyncs
will run in parallel on all of the respective RAID LVs when requested.

E.g. (see the Cpy%Sync field of the two raid1 LVs created here):
   LV            VG  Attr       LSize   SSize SRes Cpy%Sync Type   #Cpy #Str Stripe SSize PE Ranges
   r1            r   Rwi-a-r--- 512.00m   128        18.75  raid1     2    2  0.03m   128 r1_rimage_0:0-127 r1_rimage_1:0-127
   [r1_rimage_0] r   iwi-aor--- 512.00m   128               linear         1     0m   128 /dev/sdf:1-128
   [r1_rimage_1] r   iwi-aor--- 512.00m   128               linear         1     0m   128 /dev/sdg:1-128
   [r1_rmeta_0]  r   ewi-aor---   4.00m     1               linear         1     0m     1 /dev/sdf:0-0
   [r1_rmeta_1]  r   ewi-aor---   4.00m     1               linear         1     0m     1 /dev/sdg:0-0
   r2            r   Rwi-a-r--- 512.00m   128        31.25  raid1     2    2  0.03m   128 r2_rimage_0:0-127 r2_rimage_1:0-127
   [r2_rimage_0] r   iwi-aor--- 512.00m   128               linear         1     0m   128 /dev/sdf:130-257
   [r2_rimage_1] r   iwi-aor--- 512.00m   128               linear         1     0m   128 /dev/sdg:130-257
   [r2_rmeta_0]  r   ewi-aor---   4.00m     1               linear         1     0m     1 /dev/sdf:129-129
   [r2_rmeta_1]  r   ewi-aor---   4.00m     1               linear         1     0m     1 /dev/sdg:129-129
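
For illustration only (a minimal sketch, assuming a VG named "r" built from
/dev/sdf and /dev/sdg as in the report above), two such raid1 LVs could be
created and watched with:

   lvcreate --type raid1 -m 1 -L 512m -n r1 r /dev/sdf /dev/sdg
   lvcreate --type raid1 -m 1 -L 512m -n r2 r /dev/sdf /dev/sdg
   lvs -a -o +devices r   # both Cpy%Sync values advance in parallel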


Though there's no automatic queueing of (initial) resynchronizations, you can
create the 2 LVs sharing the same PVs with the "--nosync" option to prevent
immediate resynchronization, then run "lvchange --syncaction repair r/r1",
wait for it to finish, and run "lvchange --syncaction repair r/r2" afterwards.

Or create all but one LV with "--nosync" and wait for that one to finish
before using lvchange to start resynchronization of the others.
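
As a concrete sketch (VG/LV names, sizes and devices taken from the example
above, purely illustrative):

   lvcreate --type raid1 -m 1 --nosync -L 512m -n r1 r /dev/sdf /dev/sdg
   lvcreate --type raid1 -m 1 --nosync -L 512m -n r2 r /dev/sdf /dev/sdg
   lvchange --syncaction repair r/r1   # resynchronize r1 first
   lvs r/r1                            # repeat until Cpy%Sync reports 100.00
   lvchange --syncaction repair r/r2   # then resynchronize r2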

BTW:
When you create a raid1/4/5/6/10 LV _and_ never read what you have not
written, "--nosync" can be used anyway in order to avoid the initial
resynchronization load on the devices. Any data written in that case will
update all mirrors/RAID redundancy data.


Heinz


On 09/30/2015 03:22 PM, Brassow Jonathan wrote:
> I don’t believe it does.  dm-raid does use the same RAID kernel personalities as MD though, so I would think that it could be added.  I’ll check with Heinz and see if he knows.
>
>   brassow
>
>> On Sep 26, 2015, at 10:49 AM, Richard Davies <richard@arachsys.com> wrote:
>>
>> Hi,
>>
>> Does dm-raid queue resyncs of multiple dm-raid1 arrays, if the underlying
>> block devices are the same?
>>
>> Linux md has this feature, e.g.:
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
>> md1 : active raid1 sda2[2] sdb2[1]
>>       943202240 blocks [2/1] [_U]
>>       [====>................]  recovery = 21.9% (207167744/943202240)
>>       finish=20290.8min speed=603K/sec
>>       bitmap: 1/8 pages [4KB], 65536KB chunk
>>
>> md0 : active raid1 sda1[2] sdb1[1]
>>       67108736 blocks [2/1] [_U]
>>         resync=DELAYED
>>       bitmap: 1/1 pages [4KB], 65536KB chunk
>>
>>
>> After some time investigating, I can't find it in dm-raid.
>>
>> Please can someone tell me if this is implemented or not?
>>
>> If it is implemented, where should I look to see it happening?
>>
>> Thanks,
>>
>> Richard.
>>



* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-09-30 14:00   ` Heinz Mauelshagen
@ 2015-09-30 22:20     ` Neil Brown
  2015-10-01 10:09       ` Heinz Mauelshagen
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2015-09-30 22:20 UTC (permalink / raw)
  To: Heinz Mauelshagen, Brassow Jonathan, device-mapper development



Heinz Mauelshagen <heinzm@redhat.com> writes:
>
> BTW:
> When you create a raid1/4/5/6/10 LVs _and_ never read what you have not 
> written,
> "--nosync" can be used anyway in order to avoid the initial 
> resynchronization load
> on the devices. Any data written in that case will update all 
> mirrors/raid redundancy data.
>

While this is true for RAID1 and RAID10, and (I think) for the current
implementation of RAID6, it is definitely not true for RAID4/5.

For RAID4/5 a single-block write will be handled by reading
old-data/parity, subtracting the old data from the parity and adding the
new data, then writing out new data/parity.
So if the parity was wrong before, it will be wrong afterwards.
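
In symbols (illustrative): for a stripe with data blocks D_1..D_n and
parity P, the read-modify-write update computes

   P_new = P_old xor D_old xor D_new

which is the correct parity of the new data only if P_old was already the
correct parity of the old data. A wrong P_old therefore stays wrong.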

If the device that new data was written to then fails, the data on it is
lost.

So do this for RAID1/10 if you like, but not for other levels.

NeilBrown





* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-09-30 22:20     ` Neil Brown
@ 2015-10-01 10:09       ` Heinz Mauelshagen
  2015-10-07 21:42         ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Heinz Mauelshagen @ 2015-10-01 10:09 UTC (permalink / raw)
  To: Neil Brown, Brassow Jonathan, device-mapper development



On 10/01/2015 12:20 AM, Neil Brown wrote:
> Heinz Mauelshagen <heinzm@redhat.com> writes:
>> BTW:
>> When you create a raid1/4/5/6/10 LVs _and_ never read what you have not
>> written,
>> "--nosync" can be used anyway in order to avoid the initial
>> resynchronization load
>> on the devices. Any data written in that case will update all
>> mirrors/raid redundancy data.
>>
> While this is true for RAID1 and RAID10, and (I think) for the current
> implementation of RAID6, it is definitely not true for RAID4/5.

Thanks for the clarification.

I find that to be a really bad situation.


>
> For RAID4/5 a single-block write will be handled by reading
> old-data/parity, subtracting the old data from the parity and adding the
> new data, then writing out new data/parity.

Obviously for optimization reasons.

> So if the parity was wrong before, it will be wrong afterwards.

So even overwriting complete stripes in raid4/5/(6) would not ensure correct
parity, thus always requiring an initial sync.

We should think about a solution to avoid it, given ever-growing disk/array
sizes.


Heinz


>
> If the device that new data was written to then fails, the data on it is
> lost.
>
> So do this for RAID1/10 if you like, but not for other levels.
>
> NeilBrown


* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-10-01 10:09       ` Heinz Mauelshagen
@ 2015-10-07 21:42         ` Neil Brown
  2015-10-08 11:50           ` Heinz Mauelshagen
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2015-10-07 21:42 UTC (permalink / raw)
  To: Heinz Mauelshagen, Brassow Jonathan, device-mapper development



Heinz Mauelshagen <heinzm@redhat.com> writes:

> On 10/01/2015 12:20 AM, Neil Brown wrote:
>> Heinz Mauelshagen <heinzm@redhat.com> writes:
>>> BTW:
>>> When you create a raid1/4/5/6/10 LVs _and_ never read what you have not
>>> written,
>>> "--nosync" can be used anyway in order to avoid the initial
>>> resynchronization load
>>> on the devices. Any data written in that case will update all
>>> mirrors/raid redundancy data.
>>>
>> While this is true for RAID1 and RAID10, and (I think) for the current
>> implementation of RAID6, it is definitely not true for RAID4/5.
>
> Thanks for the clarification.
>
> I find that to be a really bad situation.
>
>
>>
>> For RAID4/5 a single-block write will be handled by reading
>> old-data/parity, subtracting the old data from the parity and adding the
>> new data, then writing out new data/parity.
>
> Obviously for optimization reasons.
>
>> So if the parity was wrong before, it will be wrong afterwards.
>
> So even overwriting complete stripes in raid4/5/(6)
> would not ensure correct parity, thus always requiring
> initial sync.

No, over-writing complete stripes will result in correct parity.
Even writing more than half of the data in a stripe will result in
correct parity.

So if you have a filesystem which only ever writes full stripes, then
there is no need to sync at the start.  But I don't know any filesystems
which promise that.

If you don't sync at creation time, then you may be perfectly safe when
a device fails, but I can't promise that.  And without guarantees, RAID
is fairly pointless.

>
> We should think about a solution to avoid it, given ever-growing
> disk/array sizes.

With spinning-rust devices you need to read the entire array ("scrub")
every few weeks just to make sure the media isn't degrading.  When you
do that it is useful to check that the parity is still correct - as a
potential warning sign of problems.
If you don't sync first, then checking the parity doesn't tell you
anything.
And as you have to process the entire array occasionally anyway, you
may as well do it at creation time.
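
For example (illustrative; the md device and VG/LV names are placeholders),
such a periodic check can be requested with:

   echo check > /sys/block/md0/md/sync_action   # md array scrub
   cat /sys/block/md0/md/mismatch_cnt           # mismatches found by the check
   lvchange --syncaction check r/r1             # the same for an LVM RAID LV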

NeilBrown


>
>
> Heinz
>
>
>>
>> If the device that new data was written to then fails, the data on it is
>> lost.
>>
>> So do this for RAID1/10 if you like, but not for other levels.
>>
>> NeilBrown





* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-10-07 21:42         ` Neil Brown
@ 2015-10-08 11:50           ` Heinz Mauelshagen
  2015-10-08 22:01             ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Heinz Mauelshagen @ 2015-10-08 11:50 UTC (permalink / raw)
  To: Neil Brown, Brassow Jonathan, device-mapper development



On 10/07/2015 11:42 PM, Neil Brown wrote:
> Heinz Mauelshagen <heinzm@redhat.com> writes:
>
>> On 10/01/2015 12:20 AM, Neil Brown wrote:
>>> Heinz Mauelshagen <heinzm@redhat.com> writes:
>>>> BTW:
>>>> When you create a raid1/4/5/6/10 LVs _and_ never read what you have not
>>>> written,
>>>> "--nosync" can be used anyway in order to avoid the initial
>>>> resynchronization load
>>>> on the devices. Any data written in that case will update all
>>>> mirrors/raid redundancy data.
>>>>
>>> While this is true for RAID1 and RAID10, and (I think) for the current
>>> implementation of RAID6, it is definitely not true for RAID4/5.
>> Thanks for the clarification.
>>
>> I find that to be a really bad situation.
>>
>>
>>> For RAID4/5 a single-block write will be handled by reading
>>> old-data/parity, subtracting the old data from the parity and adding the
>>> new data, then writing out new data/parity.
>> Obviously for optimization reasons.
>>
>>> So if the parity was wrong before, it will be wrong afterwards.
>> So even overwriting complete stripes in raid4/5/(6)
>> would not ensure correct parity, thus always requiring
>> initial sync.
> No, over-writing complete stripes will result in correct parity.
> Even writing more than half of the data in a stripe will result in
> correct parity.


Useless, as you say, because we can never be sure that any filesystem/dbms/...
up the stack will guarantee >= half-stripe writes initially; even more so
with many devices and large chunk sizes...

>
> So if you have a filesystem which only ever writes full stripes, then
> there is no need to sync at the start.  But I don't know any filesystems
> which promise that.
>
> If you don't sync at creation time, then you may be perfectly safe when
> a device fails, but I can't promise that.  And without guarantees, RAID
> is fairly pointless.

Indeed.

>
>> We should think about a solution to avoid it, given ever-growing
>> disk/array sizes.
> With spinning-rust devices you need to read the entire array ("scrub")
> every few weeks just to make sure the media isn't degrading.  When you
> do that it is useful to check that the parity is still correct - as a
> potential warning sign of problems.
> If you don't sync first, then checking the parity doesn't tell you
> anything.

Yes, aware of this.

My point was avoiding superfluous mass I/O whenever possible.

E.g. keep track of the 'new' state of the array and initialize
parity/syndrome on first access to any given stripe, with the given
performance optimization thereafter.

The metadata to housekeep this could be organized in a b-tree
(e.g. via dm-persistent-data): store just one node defining the whole
array as 'new', split the tree up as we go, and apply a size threshold
so that such metadata cannot grow too big.

Heinz

> And as you have to process the entire array occasionally anyway, you
> may as well do it at creation time.
>
> NeilBrown
>
>
>>
>> Heinz
>>
>>
>>> If the device that new data was written to then fails, the data on it is
>>> lost.
>>>
>>> So do this for RAID1/10 if you like, but not for other levels.
>>>
>>> NeilBrown


* Re: Queuing of dm-raid1 resyncs to the same underlying block devices
  2015-10-08 11:50           ` Heinz Mauelshagen
@ 2015-10-08 22:01             ` Neil Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2015-10-08 22:01 UTC (permalink / raw)
  To: Heinz Mauelshagen, Brassow Jonathan, device-mapper development



Heinz Mauelshagen <heinzm@redhat.com> writes:
>
> E.g. keep track of the 'new' state of the array and initialize
> parity/syndrome on first access to any given stripe with
> the given performance optimization thereafter.
>
> Metadata kept to housekeep this  could be organized in a b-tree
> (e.g. via dm-persistent-data), thus storing just one node
> defining the whole array as 'new' and splitting the tree up
> as we go and have a size threshold to not allow to grow
> such metadata too big.
>

This idea has come up before.  A bitmap has been suggested.  Simpler
than a B-tree, though not as flexible.
It would allow us to do something more meaningful with Discard: record
that the whole region is invalid.

I don't object to the idea, but I find it hard to get excited about.  It
further blurs the line between the filesystem and the storage device,
and duplicates work between the two.

NeilBrown




