linux-lvm.redhat.com archive mirror
* [linux-lvm] pvmove painfully slow on parity RAID
@ 2010-12-29  2:40 Spelic
  2010-12-29 14:02 ` Spelic
  0 siblings, 1 reply; 8+ messages in thread
From: Spelic @ 2010-12-29  2:40 UTC (permalink / raw)
  To: linux-lvm

Hello list

pvmove is painfully slow when the destination is a 6-disk MD raid5: it 
runs at 200-500 KB/s! (kernel 2.6.36.2)
Same for lvconvert add mirror.

By contrast, if the destination is a 4-device MD raid10 (near layout), it 
runs at 60 MB/s, which is much more reasonable (at least a 120-fold 
difference!).
Same for lvconvert add mirror.

How come such a difference?
Are barriers perhaps being issued after every tiny block of data? (This 
could explain the slowness on parity raid.)
If so, could barriers (and hence checkpointing) be issued only every 
100 MB or so?
(However, please note that with lvconvert add mirror I also tried various 
--regionsize settings, but they don't improve the speed much, i.e. +50% 
at most.)
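
For concreteness, here is a minimal sketch of the kind of operation being 
timed; the VG, LV, and device names are only placeholders:

   # Hypothetical names: VG "vg0", LV "lvtest", old PV /dev/sdb1,
   # destination PV /dev/md0 (the 6-disk raid5).
   pvmove -n lvtest /dev/sdb1 /dev/md0
   # Or add a mirror leg on the raid5 PV, optionally with a larger region size:
   lvconvert -m 1 --regionsize 1M vg0/lvtest /dev/md0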

Thanks for any information


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-29  2:40 [linux-lvm] pvmove painfully slow on parity RAID Spelic
@ 2010-12-29 14:02 ` Spelic
  2010-12-30  2:42   ` Stuart D. Gathman
  0 siblings, 1 reply; 8+ messages in thread
From: Spelic @ 2010-12-29 14:02 UTC (permalink / raw)
  To: linux-lvm

On 12/29/2010 03:40 AM, Spelic wrote:
> Hello list
>
> pvmove is painfully slow when the destination is a 6-disk MD raid5: it 
> runs at 200-500 KB/s! (kernel 2.6.36.2)
> Same for lvconvert add mirror.
>
> By contrast, if the destination is a 4-device MD raid10 (near layout), it 
> runs at 60 MB/s, which is much more reasonable (at least a 120-fold 
> difference!).
> Same for lvconvert add mirror.
>

Sorry, yesterday I made a few mistakes computing the speeds.
Here are the times for moving a 200 MB logical volume to various types 
of MD arrays (with either pvmove or lvconvert add mirror: it doesn't 
change much).

It's the destination array that matters, not the source array.

raid5,  8 devices, 1024k chunk:   36 sec  (5.5 MB/s)
raid5,  6 devices, 4096k chunk:   2m18s ?!?! (1.44 MB/s!?)
raid5,  5 devices, 1024k chunk:   25 sec  (8 MB/s)
raid5,  4 devices, 16384k chunk:  41 sec  (4.9 MB/s)
raid10, 4 devices, 1024k chunk, near-copies: 5 sec!  (40 MB/s)
raid1,  2 devices:                3.4 sec! (59 MB/s)
raid1,  2 devices (a second, identical array): 3.4 sec! (59 MB/s)
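
(For reference, a rough sketch of how each timing above can be reproduced; 
all names are placeholders and the size matches the 200 MB LV:)

   # Hypothetical names: VG "vg0", source PV /dev/sdb1, array under test /dev/md0.
   lvcreate -L 200M -n lvtest vg0 /dev/sdb1   # create the test LV on the source PV
   time pvmove -n lvtest /dev/sdb1 /dev/md0   # move it to the array and time it
   time pvmove -n lvtest /dev/md0 /dev/sdb1   # move it back before the next test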

I ran each test multiple times per array with consistent results, so I'm 
pretty sure these numbers are real.
What's happening?
Apart from the striking difference between parity and non-parity raid, 
with parity raid the speed seems to vary almost randomly with the number 
of devices and the chunk size?

I tried various --regionsize settings for lvconvert add mirror, but the 
times didn't change much.

I even tried setting my SATA controller to ignore-FUA mode (it fakes the 
FUA and returns immediately) => no change.

Thanks for any info


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-29 14:02 ` Spelic
@ 2010-12-30  2:42   ` Stuart D. Gathman
  2010-12-30  3:13     ` Spelic
  0 siblings, 1 reply; 8+ messages in thread
From: Stuart D. Gathman @ 2010-12-30  2:42 UTC (permalink / raw)
  To: LVM general discussion and development

On Wed, 29 Dec 2010, Spelic wrote:

> I ran each test multiple times per array with consistent results, so I'm pretty
> sure these numbers are real.
> What's happening?
> Apart from the striking difference between parity and non-parity raid, with
> parity raid the speed seems to vary almost randomly with the number of devices
> and the chunk size?

This is pretty much my experience with parity raid all around, which
is why I stick with raid1 and raid10.

That said, the sequential writes of pvmove should be fast for raid5 *if*
the chunks are aligned so that there is no read/modify/write cycle.

1) Perhaps your test targets are not properly aligned?

2) Perhaps the raid5 implementation (hardware? linux md? 
   experimental lvm raid5?) does a read modify write even when it 
   doesn't have to.

Your numbers sure look like read/modify/write is happening for some reason.
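
One way to check whether read/modify/write is actually happening is to 
watch the array members for reads while the pvmove runs (assuming 
sysstat's iostat is available; the member names below are placeholders):

   # Hypothetical member disks sdb..sdg: a full-stripe sequential write
   # shows almost no reads; read/modify/write shows reads on every member.
   iostat -dxk 1 /dev/sd[b-g]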

-- 
	      Stuart D. Gathman <stuart@bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-30  2:42   ` Stuart D. Gathman
@ 2010-12-30  3:13     ` Spelic
  2010-12-30 19:12       ` Stuart D. Gathman
  0 siblings, 1 reply; 8+ messages in thread
From: Spelic @ 2010-12-30  3:13 UTC (permalink / raw)
  To: linux-lvm

On 12/30/2010 03:42 AM, Stuart D. Gathman wrote:
> On Wed, 29 Dec 2010, Spelic wrote:
>    
>> I ran each test multiple times per array with consistent results, so I'm pretty
>> sure these numbers are real.
>> What's happening?
>> Apart from the striking difference between parity and non-parity raid, with
>> parity raid the speed seems to vary almost randomly with the number of devices
>> and the chunk size?
>>      
> This is pretty much my experience with parity raid all around, which
> is why I stick with raid1 and raid10.
>    

Parity raid is fast for me in normal filesystem operations; that's why 
I suppose some strict sequentiality is being enforced here.

> That said, the sequential writes of pvmove should be fast for raid5 *if*
> the chunks are aligned so that there is no read/modify/write cycle.
>
> 1) Perhaps your test targets are not properly aligned?
>    

Aligned at zero, yes (the arrays are empty right now), but the arrays 
have different chunk sizes and stripe sizes, as I reported, and all of 
them are bigger than the LVM extent size, which is 1 MB for this VG.
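
For what it's worth, the relevant sizes can be checked directly with 
something like this (device and VG names are placeholders):

   mdadm --detail /dev/md0 | grep -i chunk   # chunk size of the destination array
   pvs -o +pe_start /dev/md0                 # where the LVM data area starts on the PV
   vgs -o +vg_extent_size vg0                # the 1 MB extent size mentioned above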

> 2) Perhaps the raid5 implementation (hardware? linux md?
>     experimental lvm raid5?) does a read modify write even when it
>     doesn't have to.
>
> Your numbers sure look like read/modify/write is happening for some reason.
>    

Ok, but strict sequentiality is probably being enforced too aggressively. 
There must be some barrier or flush-and-wait going on for each tiny bit 
of data (for each LVM extent, maybe?). Are you an LVM developer?
Consider that a sequential dd write runs at hundreds of megabytes per 
second on my arrays, not hundreds of... kilobytes!
Even random I/O goes *much* faster than this, as long as one stripe does 
not have to wait for another stripe to be fully updated (i.e. 
sequentiality is not enforced from the application layer).
If pvmove wrote 100 MB before every sync or flush, I'm pretty sure I 
would see speeds almost 100 times higher.


Also, there is still the mystery of why the times appear *randomly* 
related to the number of devices, chunk sizes, and stripe sizes! If the 
rmw cycle were the culprit, how come I see:
raid5, 4 devices, 16384k chunk: 41 sec (4.9 MB/s)
raid5, 6 devices, 4096k chunk: 2m18s ?!?! (1.44 MB/s!?)
The first has a much larger data stripe, 3 x 16384K = 49152K, than the 
second, 5 x 4096K = 20480K!

Thank you


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-30  3:13     ` Spelic
@ 2010-12-30 19:12       ` Stuart D. Gathman
  2010-12-31  3:41         ` Spelic
  0 siblings, 1 reply; 8+ messages in thread
From: Stuart D. Gathman @ 2010-12-30 19:12 UTC (permalink / raw)
  To: LVM general discussion and development

On Thu, 30 Dec 2010, Spelic wrote:

> Also, there is still the mystery of why the times appear *randomly* related to
> the number of devices, chunk sizes, and stripe sizes! If the rmw cycle were the
> culprit, how come I see:
> raid5, 4 devices, 16384k chunk: 41 sec (4.9 MB/s)
> raid5, 6 devices, 4096k chunk: 2m18s ?!?! (1.44 MB/s!?)
> The first has a much larger data stripe, 3 x 16384K = 49152K, than the second,
> 5 x 4096K = 20480K!

Ok, next theory.  Pvmove works by allocating a mirror for each
contiguous segment of the source LV, updating the metadata (how many
metadata copies do you have?), syncing the mirror, updating the metadata
again, then allocating and syncing the next segment, until finished.

Pvmove will be fastest when the source LV consists of a single
contiguous segment.

If you restored the metadata after every test, then the variation by
destination PV would disprove this theory.  But if not, then the slow
pvmoves would be the ones with fragmented source LVs.  The metadata
updates between segments are rather expensive (but necessary).
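
A quick way to see how fragmented the source LV is (and therefore how 
many of those per-segment metadata commits a pvmove will do) is something 
like this; the VG name is a placeholder:

   # One output line per segment: many segments means many mirror
   # allocations and metadata updates during the pvmove.
   lvs --segments -o +devices vg0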

-- 
	      Stuart D. Gathman <stuart@bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-30 19:12       ` Stuart D. Gathman
@ 2010-12-31  3:41         ` Spelic
  2010-12-31 15:36           ` Stuart D. Gathman
  0 siblings, 1 reply; 8+ messages in thread
From: Spelic @ 2010-12-31  3:41 UTC (permalink / raw)
  To: linux-lvm

On 12/30/2010 08:12 PM, Stuart D. Gathman wrote:
> On Thu, 30 Dec 2010, Spelic wrote:
>
>    
>> Also, there is still the mystery of why the times appear *randomly* related to
>> the number of devices, chunk sizes, and stripe sizes! If the rmw cycle were the
>> culprit, how come I see:
>> raid5, 4 devices, 16384k chunk: 41 sec (4.9 MB/s)
>> raid5, 6 devices, 4096k chunk: 2m18s ?!?! (1.44 MB/s!?)
>> The first has a much larger data stripe, 3 x 16384K = 49152K, than the second,
>> 5 x 4096K = 20480K!
>>      
> Ok, next theory.  Pvmove works by allocating a mirror for each
> contiguous segment of the source LV, update metadata

Ok, never mind, I found the problem:
LVM probably uses O_DIRECT, right?
Well, O_DIRECT is abysmally slow on MD parity raid (I just checked with 
dd on the bare MD device) and I don't know why. It's not because of the 
rmw, because it stays slow even the second time I try, when nothing is 
read anymore since all the reads are already in cache.

I understand this probably needs to be fixed on the MD side (I will 
report the problem to linux-raid, though I see it has already been 
discussed there without much result).
However...
...is there any chance you might fix it on the LVM side too, by changing 
LVM to use non-direct I/O so as to "support" MD?
On my raid5 array the difference between direct and non-direct (dd with 
bs=1M or smaller) is 2.1 MB/s versus 250 MB/s, and it would probably be 
even larger on bigger arrays.
Also, on raid10 non-direct is much faster for small transfer sizes like 
bs=4K (28 MB/s vs 160 MB/s) but not at 1M; LVM probably uses small 
transfer sizes though, right?
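
(For reference, a sketch of the kind of dd comparison described above. 
This writes to the raw MD device, so it is destructive and only for an 
empty test array; the device name is a placeholder:)

   # O_DIRECT writes straight to the array, bypassing the page cache:
   dd if=/dev/zero of=/dev/md0 bs=1M count=200 oflag=direct
   # Buffered writes through the page cache, flushed at the end for a fair timing:
   dd if=/dev/zero of=/dev/md0 bs=1M count=200 conv=fdatasync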

Thank you


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-31  3:41         ` Spelic
@ 2010-12-31 15:36           ` Stuart D. Gathman
  2010-12-31 17:23             ` Stuart D. Gathman
  0 siblings, 1 reply; 8+ messages in thread
From: Stuart D. Gathman @ 2010-12-31 15:36 UTC (permalink / raw)
  To: LVM general discussion and development

On Fri, 31 Dec 2010, Spelic wrote:

> Ok, never mind, I found the problem:
> LVM probably uses O_DIRECT, right?
> Well, O_DIRECT is abysmally slow on MD parity raid (I just checked with dd on
> the bare MD device) and I don't know why. It's not because of the rmw, because
> it stays slow even the second time I try, when nothing is read anymore since
> all the reads are already in cache.

The point of O_DIRECT is to *not* use the cache.  Although a write-through
cache would seem to be OK, you have to make sure that ALL writes write-through
the cache, or the data on parity raid will be corrupted.

The R/M/W problem afflicts every level of parity raid in subtle ways.
That's why I don't like it.

-- 
	      Stuart D. Gathman <stuart@bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


* Re: [linux-lvm] pvmove painfully slow on parity RAID
  2010-12-31 15:36           ` Stuart D. Gathman
@ 2010-12-31 17:23             ` Stuart D. Gathman
  0 siblings, 0 replies; 8+ messages in thread
From: Stuart D. Gathman @ 2010-12-31 17:23 UTC (permalink / raw)
  To: LVM general discussion and development

On Fri, 31 Dec 2010, Stuart D. Gathman wrote:

> On Fri, 31 Dec 2010, Spelic wrote:
> 
> > Ok, never mind, I found the problem:
> > LVM probably uses O_DIRECT, right?
> > Well, O_DIRECT is abysmally slow on MD parity raid (I just checked with dd
> > on the bare MD device) and I don't know why. It's not because of the rmw,
> > because it stays slow even the second time I try, when nothing is read
> > anymore since all the reads are already in cache.
> 
> The point of O_DIRECT is to *not* use the cache.  Although a write-through
> cache would seem to be OK, you have to make sure that ALL writes write-through
> the cache, or the data on parity raid will be corrupted.
> 
> The R/M/W problem afflicts every level of parity raid in subtle ways.
> That's why I don't like it.

Plus, any write to *part* of a chunk, even with a write-through cache, still
has to write the *entire* chunk.  So if the chunk size is 64K and pvmove
writes in 32K blocks with O_DIRECT, that is 2 writes of the 64K chunk, even
with the write-through cache (without the cache, it is 2 reads + 2 writes
of the same chunk).

-- 
	      Stuart D. Gathman <stuart@bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

