linux-lvm.redhat.com archive mirror
* [linux-lvm] Wierd lvm2 performance problems
@ 2009-04-18 22:25 Sven Eschenberg
  2009-04-18 23:34 ` Eugene Vilensky
  2009-04-19  6:53 ` Milan Broz
  0 siblings, 2 replies; 11+ messages in thread
From: Sven Eschenberg @ 2009-04-18 22:25 UTC (permalink / raw)
  To: linux-lvm

Dear list,

I tried to create a PV+VG+LV on top of an mdraid array. For no obvious reason 
the LVM volume shows extremely poor performance - the transfer rates are 
as little as 30% of the rates on the md device itself.
In contrast, creating a partition table on top of the array has nearly 
no performance impact.

Does anybody have the slightest idea what might be going wrong here?

Regards

-Sven

P.S.: I tried aligning the first PE in the VG with the mdraid chunk 
size; it made no difference whatsoever.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-18 22:25 [linux-lvm] Wierd lvm2 performance problems Sven Eschenberg
@ 2009-04-18 23:34 ` Eugene Vilensky
  2009-04-19  6:53 ` Milan Broz
  1 sibling, 0 replies; 11+ messages in thread
From: Eugene Vilensky @ 2009-04-18 23:34 UTC (permalink / raw)
  To: LVM general discussion and development


Google brought up this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=232843 . The read-ahead value
for an LV is set to a much lower default than for an md device, but it can
be adjusted from user space (my understanding).
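
For example, to compare and adjust the values from user space (the device
names below are placeholders - substitute your own md device and LV):

  blockdev --getra /dev/md0             # current readahead of the md device
  blockdev --getra /dev/vg0/lv0         # current readahead of the LV
  blockdev --setra 32768 /dev/vg0/lv0   # set LV readahead (in 512-byte sectors)

Note that a value set with blockdev --setra does not survive a reboot.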

Regards,
Eugene Vilensky
evilensky@gmail.com




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-18 22:25 [linux-lvm] Wierd lvm2 performance problems Sven Eschenberg
  2009-04-18 23:34 ` Eugene Vilensky
@ 2009-04-19  6:53 ` Milan Broz
  2009-04-19 15:16   ` Sven Eschenberg
  1 sibling, 1 reply; 11+ messages in thread
From: Milan Broz @ 2009-04-19  6:53 UTC (permalink / raw)
  To: LVM general discussion and development

Sven Eschenberg wrote:
> I tried to create a PV+VG+LV on top of an mdraid array. For no obvious reason 
> the LVM volume shows extremely poor performance - the transfer rates are 
> as little as 30% of the rates on the md device itself.
> In contrast, creating a partition table on top of the array has nearly 
> no performance impact.
> 
> Does anybody have the slightest idea what might be going wrong here?

> P.S.: I tried aligning the first PE in the VG with the mdraid chunk 
> size; it made no difference whatsoever.

How did you check that it is properly aligned?
The problem is usually that the start of the data area is misaligned with
the underlying MD chunk size.

Can you please paste the output of
  pvs -o +pe_start --unit b
and
  cat /sys/block/<your_md_dev>/md/chunk_size

Please use at least lvm2 version 2.02.40 (see lvm version) for creating the
VG - this version has an automatic alignment option when running over MD
(see md_chunk_alignment in /etc/lvm/lvm.conf).
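
The relevant setting sits in the devices section of lvm.conf, roughly:

  devices {
      md_chunk_alignment = 1    # detect the MD chunk size and align the PV data area to it
  }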

Also, increasing the readahead value can help (but only for simple linear
read operations). You can increase the readahead persistently for an LV
with lvchange -r (see the man page).
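
For example (VG and LV names are placeholders):

  lvchange -r 32768 /dev/vg0/lv0   # readahead in 512-byte sectors, stored in the LVM metadata

Unlike a plain blockdev --setra, this is re-applied every time the LV is
activated.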

Milan
--
mbroz@redhat.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-19  6:53 ` Milan Broz
@ 2009-04-19 15:16   ` Sven Eschenberg
  2009-04-20  5:39     ` Luca Berra
  0 siblings, 1 reply; 11+ messages in thread
From: Sven Eschenberg @ 2009-04-19 15:16 UTC (permalink / raw)
  To: LVM general discussion and development

Hi Milan,

Milan Broz wrote:
> Sven Eschenberg wrote:
>> I tried to create a PV+VG+LV on top of an mdraid array. For no obvious reason 
>> the LVM volume shows extremely poor performance - the transfer rates are 
>> as little as 30% of the rates on the md device itself.
>> In contrast, creating a partition table on top of the array has nearly 
>> no performance impact.
>>
>> Does anybody have the slightest idea what might be going wrong here?
> 
>> P.S.: I tried aligning the first PE in the VG with the mdraid chunk 
>> size; it made no difference whatsoever.
> 
> How did you check that it is properly aligned?
> The problem is usually that the start of the data area is misaligned with
> the underlying MD chunk size.

Unfortunately I won't have the box at hand for two days, but I asked md to 
use a chunk size of 2048K, and /proc/mdstat reported 2048K the last time 
I checked.
The VG had a physical extent size of 2M, and with the --dataalignment 
option set to 2M, pvs reported a pe_start value of 2M as well.
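
In command form that was roughly the following (device and VG names are
simplified here):

  pvcreate --dataalignment 2M /dev/md0
  vgcreate -s 2M vg0 /dev/md0
  pvs -o +pe_start --unit k /dev/md0    # reported pe_start = 2048.00k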
> 
> Can you please paste the output of
>   pvs -o +pe_start --unit b
> and
>   cat /sys/block/<your_md_dev>/md/chunk_size
> 
> Please use at least lvm2 version 2.02.40 (see lvm version) for creating the
> VG - this version has an automatic alignment option when running over MD
> (see md_chunk_alignment in /etc/lvm/lvm.conf).
I need to check the lvm version when I am back at the machine, and I will 
look into that option then.
> 
> Also, increasing the readahead value can help (but only for simple linear
> read operations). You can increase the readahead persistently for an LV
> with lvchange -r (see the man page).

I already did that at some point during the night, and it improved things 
dramatically. I set the readahead to 32768, the value md had chosen for 
the underlying md device.
> 
> Milan
> --
> mbroz@redhat.com
> 

Regards

-Sven


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-19 15:16   ` Sven Eschenberg
@ 2009-04-20  5:39     ` Luca Berra
  2009-04-20 13:15       ` Sven Eschenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Luca Berra @ 2009-04-20  5:39 UTC (permalink / raw)
  To: linux-lvm

On Sun, Apr 19, 2009 at 05:16:21PM +0200, Sven Eschenberg wrote:
> Unfortunately I won't have the box at hand for two days, but I asked md to 
> use a chunk size of 2048K, and /proc/mdstat reported 2048K the last time I 
> checked.
> The VG had a physical extent size of 2M, and with the --dataalignment option 
> set to 2M, pvs reported a pe_start value of 2M as well.

If you have a 2M chunk size, a full stripe is 2M*(N-1), where N-1 is the
number of drives in your array minus the redundancy (e.g. for a 5-drive
raid5 the stripe size would be 8M).

L.

-- 
Luca Berra -- bluca@comedia.it
          Communication Media & Services S.r.l.
   /"\
   \ /     ASCII RIBBON CAMPAIGN
    X        AGAINST HTML MAIL
   / \

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-20  5:39     ` Luca Berra
@ 2009-04-20 13:15       ` Sven Eschenberg
  2009-04-20 13:46         ` Luca Berra
  0 siblings, 1 reply; 11+ messages in thread
From: Sven Eschenberg @ 2009-04-20 13:15 UTC (permalink / raw)
  To: LVM general discussion and development

Hi Luca,

Okay, let's assume a chunk size of C. No matter what your md looks like,
the logical md volume consists of a series of size/C chunks; the very
first chunk C0 will hold the LVM header.
If I align the extents with the chunk size and the extents are themselves
chunk-sized, then every extent PEx of my PV corresponds exactly to a chunk
on one of the disks.
Which in turn means that if I want to read PEx I have to read some chunk Cy
on one disk, and PEx+1 will most likely be chunk Cy+1, which resides on a
different physical disk.

So the question is: why would you want to align the first PE to the
stripe size rather than the chunk size?

Regards

-Sven



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-20 13:15       ` Sven Eschenberg
@ 2009-04-20 13:46         ` Luca Berra
  2009-04-20 14:14           ` Sven Eschenberg
  2009-04-21 17:24           ` Sven Eschenberg
  0 siblings, 2 replies; 11+ messages in thread
From: Luca Berra @ 2009-04-20 13:46 UTC (permalink / raw)
  To: linux-lvm

On Mon, Apr 20, 2009 at 03:15:12PM +0200, Sven Eschenberg wrote:
>Hi Luca,
>
>Okay, let's assume a chunk size of C. No matter what your md looks like,
>the logical md volume consists of a series of size/C chunks; the very
>first chunk C0 will hold the LVM header.
>If I align the extents with the chunk size and the extents are themselves
>chunk-sized, then every extent PEx of my PV corresponds exactly to a chunk
>on one of the disks.
>Which in turn means that if I want to read PEx I have to read some chunk Cy
>on one disk, and PEx+1 will most likely be chunk Cy+1, which resides on a
>different physical disk.

correct

>So the question is: why would you want to align the first PE to the
>stripe size rather than the chunk size?

Because when you _write_ incomplete stripes, the raid code needs to do a
read-modify-write of the parity block.

Filesystems like ext3/4 and xfs have the ability to account for the stripe
size in the block allocator to prevent unnecessary read-modify-writes, but
if you do not stripe-align the start of the filesystem you cannot take
advantage of this.
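
As a sketch for ext3 (the numbers assume a 64k chunk, 4k filesystem blocks
and 3 data disks - recompute for your own geometry, and the LV name is just
an example):

  mke2fs -j -E stride=16,stripe-width=48 /dev/vg0/lv0
  # stride       = chunk size / block size        = 64k / 4k = 16
  # stripe-width = stride * number of data disks  = 16 * 3   = 48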

The annoying issue is that you rarely have a (2^n)D+P array, and pe_size
must be a power of 2.
So, for example, given my 3D+1P raid5 the only solution I devised was having
a chunk size which is a power of two, pe_start aligned to the stripe size,
pe_size = chunk size, and I have to remember that every time I extend an LV
it has to be extended to the nearest multiple of 3 LEs.
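
In practice that means always doing something like (names made up):

  lvextend -l +96 /dev/vg0/lv0    # grow by 96 LEs - a multiple of 3, so the
                                  # LV still ends on a full-stripe boundary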

Regards,
L.


-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-20 13:46         ` Luca Berra
@ 2009-04-20 14:14           ` Sven Eschenberg
  2009-04-20 14:39             ` Luca Berra
  2009-04-21 17:24           ` Sven Eschenberg
  1 sibling, 1 reply; 11+ messages in thread
From: Sven Eschenberg @ 2009-04-20 14:14 UTC (permalink / raw)
  To: LVM general discussion and development

Hi Luca,

On Mon, April 20, 2009 15:46, Luca Berra wrote:
> On Mon, Apr 20, 2009 at 03:15:12PM +0200, Sven Eschenberg wrote:
>>Hi Luca,
>>
>>Okay, let's assume a chunk size of C. No matter what your md looks like,
>>the logical md volume consists of a series of size/C chunks; the very
>>first chunk C0 will hold the LVM header.
>>If I align the extents with the chunk size and the extents are themselves
>>chunk-sized, then every extent PEx of my PV corresponds exactly to a chunk
>>on one of the disks.
>>Which in turn means that if I want to read PEx I have to read some chunk Cy
>>on one disk, and PEx+1 will most likely be chunk Cy+1, which resides on a
>>different physical disk.
>
> correct
>
>>So the question is: why would you want to align the first PE to the
>>stripe size rather than the chunk size?
>
> Because when you _write_ incomplete stripes, the raid code needs to do a
> read-modify-write of the parity block.

I hadn't thought of this yet - then again, all the preliminary tests I did
so far were on a 4-disk raid10. I haven't had the time to set up the raid5
volume yet, because the performance issues on the raid10 were already so
'amazing' :-D.

>
> Filesystems like ext3/4 and xfs have the ability to account for the stripe
> size in the block allocator to prevent unnecessary read-modify-writes, but
> if you do not stripe-align the start of the filesystem you cannot take
> advantage of this.
>

Since you mentioned it: What is the specific option (for xfs mainly) to
modify this behavior?

> The annoying issue is that you rarely have a (2^n)D+P array, and pe_size
> must be a power of 2.
> So, for example, given my 3D+1P raid5 the only solution I devised was having
> a chunk size which is a power of two, pe_start aligned to the stripe size,
> pe_size = chunk size, and I have to remember that every time I extend an LV
> it has to be extended to the nearest multiple of 3 LEs.

Ouch, I see - I'm going to be just as lucky as you :-).

Another question arose while I was thinking about this: I actually want to
place the OS on a stripe of mirrors, since this gives me the statistically
best robustness against two failing disks. From what I could read in the md
man page, none of the offered raid10 layouts provides such a setup. Would I
have to first mirror two drives with md and then stripe them together, i.e.
md on top of md?

>
> Regards,
> L.

Regards

-Sven


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-20 14:14           ` Sven Eschenberg
@ 2009-04-20 14:39             ` Luca Berra
  0 siblings, 0 replies; 11+ messages in thread
From: Luca Berra @ 2009-04-20 14:39 UTC (permalink / raw)
  To: linux-lvm

On Mon, Apr 20, 2009 at 04:14:22PM +0200, Sven Eschenberg wrote:
>Hi Luca,
>
>On Mon, April 20, 2009 15:46, Luca Berra wrote:
>> On Mon, Apr 20, 2009 at 03:15:12PM +0200, Sven Eschenberg wrote:
>>>Hi Luca,
>>>
>>>Okay, let's assume a chunk size of C. No matter what your md looks like,
>>>the logical md volume consists of a series of size/C chunks; the very
>>>first chunk C0 will hold the LVM header.
>>>If I align the extents with the chunk size and the extents are themselves
>>>chunk-sized, then every extent PEx of my PV corresponds exactly to a chunk
>>>on one of the disks.
>>>Which in turn means that if I want to read PEx I have to read some chunk Cy
>>>on one disk, and PEx+1 will most likely be chunk Cy+1, which resides on a
>>>different physical disk.
>>
>> correct
>>
>>>So the question is: why would you want to align the first PE to the
>>>stripe size rather than the chunk size?
>>
>> Because when you _write_ incomplete stripes, the raid code needs to do a
>> read-modify-write of the parity block.
>
>I hadn't thought of this yet - then again, all the preliminary tests I did
>so far were on a 4-disk raid10. I haven't had the time to set up the raid5
>volume yet, because the performance issues on the raid10 were already so
>'amazing' :-D.
>
>>
>> Filesystems like ext3/4 and xfs have the ability to account for the stripe
>> size in the block allocator to prevent unnecessary read-modify-writes, but
>> if you do not stripe-align the start of the filesystem you cannot take
>> advantage of this.
>>
>
>Since you mentioned it: What is the specific option (for xfs mainly) to
>modify this behavior?
-d sunit=n (chunk size in 512-byte blocks)
-d swidth=n (stripe size in 512-byte blocks)
or, more conveniently,
-d su=n (chunk size in bytes)
-d sw=n (stripe width as a number of chunks, i.e. the number of data disks)

e.g. mkfs.xfs -d su=64k,sw=3 ...
for a 3+1 raid5 with the default chunk size
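
To double-check the result on a mounted filesystem (the mount point is just
an example):

  xfs_info /mnt        # look at the sunit/swidth values in the output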

>> The annoying issue is that you rarely have a (2^n)D+P array, and pe_size
>> must be a power of 2.
>> So, for example, given my 3D+1P raid5 the only solution I devised was
>> having a chunk size which is a power of two, pe_start aligned to the
>> stripe size, pe_size = chunk size, and I have to remember that every time
>> I extend an LV it has to be extended to the nearest multiple of 3 LEs.
>
>Ouch, I see - I'm going to be just as lucky as you :-).
>
>Another question arose while I was thinking about this: I actually want to
>place the OS on a stripe of mirrors, since this gives me the statistically
>best robustness against two failing disks. From what I could read in the md
>man page, none of the offered raid10 layouts provides such a setup. Would I
>have to first mirror two drives with md and then stripe them together, i.e.
>md on top of md?

I believe raid10 is smart enough, but I am not 100% confident - you could
ask on the linux-raid ML.
Stacking raid devices would be an alternative.
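
A rough sketch of the stacked variant (disk names are placeholders, pick
your own chunk size):

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md3 --level=0 --chunk=2048 --raid-devices=2 /dev/md1 /dev/md2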

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-20 13:46         ` Luca Berra
  2009-04-20 14:14           ` Sven Eschenberg
@ 2009-04-21 17:24           ` Sven Eschenberg
  2009-04-22  7:38             ` Luca Berra
  1 sibling, 1 reply; 11+ messages in thread
From: Sven Eschenberg @ 2009-04-21 17:24 UTC (permalink / raw)
  To: LVM general discussion and development

Hi Luca,

I gave this a little more thought ...

Luca Berra wrote:
> On Mon, Apr 20, 2009 at 03:15:12PM +0200, Sven Eschenberg wrote:
>> Hi Luca,
>>
>> Okay, let's assume a chunk size of C. No matter what your md looks like,
>> the logical md volume consists of a series of size/C chunks; the very
>> first chunk C0 will hold the LVM header.
>> If I align the extents with the chunk size and the extents are themselves
>> chunk-sized, then every extent PEx of my PV corresponds exactly to a chunk
>> on one of the disks.
>> Which in turn means that if I want to read PEx I have to read some chunk Cy
>> on one disk, and PEx+1 will most likely be chunk Cy+1, which resides on a
>> different physical disk.
> 
> correct
> 
>> So the question is: why would you want to align the first PE to the
>> stripe size rather than the chunk size?
> 
> Because when you _write_ incomplete stripes, the raid code needs to do a
> read-modify-write of the parity block.

Okay, the question is: if you modify files at random, how often do you 
really write a full stripe, even if the cache holds back all modifications 
for a couple of minutes? I wonder how often you can take advantage of this 
in normal mixed-load situations.

> 
> Filesystems like ext3/4 and xfs have the ability to account for the stripe
> size in the block allocator to prevent unnecessary read-modify-writes, but
> if you do not stripe-align the start of the filesystem you cannot take
> advantage of this.

Okay, understood, but doesn't this imply that as long as the application 
running on top of an md, and/or an LV on top of an md, cannot take advantage 
of the layout information, it doesn't matter at all? I do see the advantage 
if, e.g., you have an RDBMS that can operate and organize itself on top of a 
block device with a certain layout, or any filesystem that takes this into 
account.
In contrast, if I export the block device as an iSCSI target from a plain 
NAS, this doesn't help me at all.
Now, even if I properly stripe-align pe_start, what happens if I do a 
whole-disk online capacity expansion? Unless LVM can realign everything 
online and the filesystem can realign itself (or update its layout 
accordingly) online, this is pretty much pointless.

> 
> Regards,
> L.
> 

In the end it all comes down to this: in most cases aligning doesn't help, 
at least not if the whole array configuration might change over time - or 
am I mistaken there?

Regards

-Sven

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linux-lvm] Wierd lvm2 performance problems
  2009-04-21 17:24           ` Sven Eschenberg
@ 2009-04-22  7:38             ` Luca Berra
  0 siblings, 0 replies; 11+ messages in thread
From: Luca Berra @ 2009-04-22  7:38 UTC (permalink / raw)
  To: linux-lvm

On Tue, Apr 21, 2009 at 07:24:19PM +0200, Sven Eschenberg wrote:
> Hi Luca,
>
> I gave this a little more thought ...
>
> Luca Berra wrote:
>> Because when you _write_ incomplete stripes, the raid code needs to do a
>> read-modify-write of the parity block.
>
> Okay, the question is: if you modify files at random, how often do you 
> really write a full stripe, even if the cache holds back all modifications 
> for a couple of minutes? I wonder how often you can take advantage of this 
> in normal mixed-load situations.
I am no expert in filesystem internals, but I believe the idea is to
minimize read-modify-writes, not necessarily to always write full stripes.
E.g. with the default raid5 layout of 4+1 disks, 64k chunk, 256k stripe:
if you write an 800k file starting at chunk 1233, the array has to
read-modify-write stripes 308 and 311, and write full stripes 309 and 310.
If the fs were aware of the underlying device, it would try to allocate the
file starting from chunk 1236, resulting in three full stripes and only one
read-modify-write.

>> Filesystems like ext3/4 and xfs have the ability to account for the stripe
>> size in the block allocator to prevent unnecessary read-modify-writes, but
>> if you do not stripe-align the start of the filesystem you cannot take
>> advantage of this.
>
> Okay, understood, but doesn't this imply that as long as the application 
> running on top of an md, and/or an LV on top of an md, cannot take 
> advantage of the layout information, it doesn't matter at all? I do see 
> the advantage if, e.g., you have an RDBMS that can operate and organize 
> itself on top of a block device with a certain layout, or any filesystem 
> that takes this into account.
> In contrast, if I export the block device as an iSCSI target from a plain 
> NAS, this doesn't help me at all.
Probably not, unless the iSCSI client is also optimized.

> Now, even if I properly stripe-align pe_start, what happens if I do a 
> whole-disk online capacity expansion? Unless LVM can realign everything 
> online and the filesystem can realign itself (or update its layout 
> accordingly) online, this is pretty much pointless.
AFAIK LVM cannot realign itself automatically; I believe it is doable
manually by pvmoving away the first PE (or the first n PEs, depending on
the configuration), then vgcfgbackup, vi, vgcfgrestore.
After that you only have to realign the PEs.
Another option is planning for possible capacity upgrades and using
n1*n2*..*nn * chunk_size as the unit for both pe_start and
pe_size * number_of_pe_i_align_lv_size_to (see my previous mail about
non-power-of-two stripe sizes). This is at most 3*4*5*7*chunk_size.
Filesystems _can_ be taught to update their layout (for future writes,
that is): ext3/4 with tune2fs, xfs with the sunit/swidth mount options.
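
E.g. (values assume a 64k chunk, 4k fs blocks and 3 data disks, device names
are made up, and double-check the exact option spellings in the man pages):

  tune2fs -E stride=16,stripe-width=48 /dev/vg0/lv0    # ext3/4, units are fs blocks
  mount -o sunit=128,swidth=384 /dev/vg0/lv0 /mnt      # xfs, units are 512-byte sectors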

> In the end it all comes down to this: in most cases aligning doesn't help, 
> at least not if the whole array configuration might change over time - or 
> am I mistaken there?
It all comes down to this: performance tuning is bound to the environment
we are tuning for. Some choices may give performance boosts in one
environment but be detrimental in another.
Sometimes it is not even clear at the start of a project what the best
route is, and sometimes unforeseen changes disrupt a well-thought-out setup.
Being able to adapt to every possible future change is probably impossible;
still, a little bit of forethought is not completely wasted.

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-04-22  7:39 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2009-04-18 22:25 [linux-lvm] Wierd lvm2 performance problems Sven Eschenberg
2009-04-18 23:34 ` Eugene Vilensky
2009-04-19  6:53 ` Milan Broz
2009-04-19 15:16   ` Sven Eschenberg
2009-04-20  5:39     ` Luca Berra
2009-04-20 13:15       ` Sven Eschenberg
2009-04-20 13:46         ` Luca Berra
2009-04-20 14:14           ` Sven Eschenberg
2009-04-20 14:39             ` Luca Berra
2009-04-21 17:24           ` Sven Eschenberg
2009-04-22  7:38             ` Luca Berra
