* Alignment: XFS + LVM2
@ 2014-05-07 12:43 Marc Caubet
2014-05-08 2:28 ` Stan Hoeppner
0 siblings, 1 reply; 6+ messages in thread
From: Marc Caubet @ 2014-05-07 12:43 UTC (permalink / raw)
To: xfs
Hi all,
I am trying to set up a storage pool with correct disk alignment, and I hope
somebody can help me understand some parts that are unclear to me when
configuring XFS over LVM2.
Actually we have a few storage pools, each with the following settings:
- LSI Controller with 3xRAID6
- Each RAID6 is configured with 10 data disks + 2 for double-parity.
- Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
- The 3x(10+2) configuration was chosen to gain the best performance
and data safety (fewer disks per RAID means a lower probability of data corruption).
From the O.S. side we see:
[root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
Disk /dev/sda: 40000.0 GB, 39999997214720 bytes
255 heads, 63 sectors/track, 4863055 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdb: 40000.0 GB, 39999997214720 bytes
255 heads, 63 sectors/track, 4863055 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdc: 40000.0 GB, 39999997214720 bytes
255 heads, 63 sectors/track, 4863055 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
The idea is to aggregate the above devices and show only 1 storage space.
We did as follows:
vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
Hence, a stripe of the 3 RAID6 arrays in an LV.
And here is my first question: How can I check if the storage and the LV
are correctly aligned?
On the other hand, I have formatted XFS as follows:
mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
So my second question is, are the above 'su' and 'sw' parameters correct on
the current LV configuration? If not, which values should I have and why?
AFAIK su is the stripe size configured on the controller side, but in this
case we have an LV. Also, sw is the number of data disks in a RAID, but
again, we have an LV with 3 stripes, and I am not sure if the number of data
disks should be 30 instead.
Thanks a lot,
--
Marc Caubet Serrabou
PIC (Port d'Informació Científica)
Campus UAB, Edificio D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es
Avis - Aviso - Legal Notice: http://www.ifae.es/legal.html
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Alignment: XFS + LVM2
2014-05-07 12:43 Alignment: XFS + LVM2 Marc Caubet
@ 2014-05-08 2:28 ` Stan Hoeppner
2014-05-08 9:12 ` Marc Caubet
0 siblings, 1 reply; 6+ messages in thread
From: Stan Hoeppner @ 2014-05-08 2:28 UTC (permalink / raw)
To: Marc Caubet, xfs
Everything begins and ends with the workload.
On 5/7/2014 7:43 AM, Marc Caubet wrote:
> Hi all,
>
> I am trying to setup a storage pool with correct disk alignment and I hope
> somebody can help me to understand some unclear parts to me when
> configuring XFS over LVM2.
I'll try. But to be honest, after my first read of your post, a few
things jump out as breaking traditional rules.
The first thing you need to consider is your workload and the type of
read/write patterns it will generate. This document is unfinished, and
unformatted, but reading what is there should be informative:
http://www.hardwarefreak.com/xfs/storage-arch.txt
> Actually we have few storage pools with the following settings each:
>
> - LSI Controller with 3xRAID6
> - Each RAID6 is configured with 10 data disks + 2 for double-parity.
> - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
512e drives may cause data loss. See:
http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
> - 3x(10+2) configuration was considered in order to gain best performance
> and data safety (less disks per RAID less probability of data corruption)
RAID6 is the worst performer of all the RAID levels but gives the best
resilience to multiple drive failure. The reason for using fewer drives
per array has less to do with probability of corruption, but
1. Limiting RMW operations to as few drives as possible, especially for
controllers that do full stripe scrubbing on RMW
2. Lowering bandwidth and time required to rebuild a dead drive, fewer
drives tied up during a rebuild
> From the O.S. side we see:
>
> [root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
...
You omitted crucial information. What is the stripe unit size of each
RAID6?
> The idea is to aggregate the above devices and show only 1 storage space.
> We did as follows:
>
> vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
> lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
You've told LVM that its stripe unit is 4MB, and thus the stripe width
of each RAID6 is 4MB. This is not possible with 10 data spindles.
Again, show the RAID geometry from the LSI tools.
When creating a nested stripe, the stripe unit of the outer stripe (LVM)
must equal the stripe width of each inner stripe (RAID6).
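The rule above can be sanity-checked with simple arithmetic. A minimal sketch (an illustrative addition, not part of the original exchange), assuming the 256 KB per-disk stripe unit and 10 data disks per RAID6 that are stated elsewhere in the thread:

```shell
# Sketch: the outer (LVM/md) stripe unit must equal the inner RAID6
# stripe width. Assumed values: 256 KB per-disk stripe unit, 10 data
# disks per RAID6 (a 10+2 array).
raid6_su_kb=256
raid6_data_disks=10
raid6_width_kb=$((raid6_su_kb * raid6_data_disks))   # inner stripe width

outer_su_kb=$raid6_width_kb   # what lvcreate -I (or mdadm --chunk) should be
echo "inner RAID6 stripe width: ${raid6_width_kb} KB"
echo "required outer stripe unit: ${outer_su_kb} KB (not 4096 KB)"
```

With these assumptions the required outer stripe unit works out to 2560 KB, which is why the `-I 4096` above cannot be right.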
> Hence, stripe of the 3 RAID6 in a LV.
Each RAID6 has ~1.3GB/s of throughput. By striping the 3 arrays into a
nested RAID60 this suggests you need single file throughput greater than
1.3GB/s and that all files are very large. If not, you'd be better off
using a concatenation, and using md to accomplish that instead of LVM.
> And here is my first question: How can I check if the storage and the LV
> are correctly aligned?
Answer is above. But the more important question is whether your
workload wants a stripe or a concatenation.
> On the other hand, I have formatted XFS as follows:
>
> mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
This alignment is not correct. XFS must be aligned to the LVM stripe
geometry. Here you apparently aligned XFS to the RAID6 geometry
instead. Why are you manually specifying a 128M log? If you knew your
workload that well, you would not have made these other mistakes.
In a nutshell, you need to ditch all of this and start over.
> So my second question is, are the above 'su' and 'sw' parameters correct on
> the current LV configuration? If not, which values should I have and why?
> AFAIK su is the stripe size configured in the controller side, but in this
> case we have a LV. Also, sw is the number of data disks in a RAID, but
> again, we have a LV with 3 stripes, and I am not sure if the number of data
> disks should be 30 instead.
Describe your workload and we can tell you how to properly set this up.
Cheers,
Stan
* Re: Alignment: XFS + LVM2
2014-05-08 2:28 ` Stan Hoeppner
@ 2014-05-08 9:12 ` Marc Caubet
2014-05-08 13:04 ` Stan Hoeppner
0 siblings, 1 reply; 6+ messages in thread
From: Marc Caubet @ 2014-05-08 9:12 UTC (permalink / raw)
To: stan; +Cc: xfs
Hi Stan,
thanks for your answer.
> Everything begins and ends with the workload.
>
> On 5/7/2014 7:43 AM, Marc Caubet wrote:
> > Hi all,
> >
> > I am trying to setup a storage pool with correct disk alignment and I
> hope
> > somebody can help me to understand some unclear parts to me when
> > configuring XFS over LVM2.
>
> I'll try. But to be honest, after my first read of your post, a few
> things jump out as breaking traditional rules.
>
> The first thing you need to consider is your workload and the type of
> read/write patterns it will generate. This document is unfinished, and
> unformatted, but reading what is there should be informative:
>
> http://www.hardwarefreak.com/xfs/storage-arch.txt
>
Basically we are moving a lot of data :) That means parallel large files
(GBs) are being written and read all the time. We have a batch
farm with 3.5k cores processing jobs that are constantly reading and
writing to the storage pools (4 PB). Only a few pools (~5% of the total)
contain small files (and only small files).
>
> > Actually we have few storage pools with the following settings each:
> >
> > - LSI Controller with 3xRAID6
> > - Each RAID6 is configured with 10 data disks + 2 for double-parity.
> > - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
>
> 512e drives may cause data loss. See:
> http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
>
Haven't experienced this yet, but good to know, thanks :) On the other
hand, we do not use ZFS.
> > - 3x(10+2) configuration was considered in order to gain best performance
> > and data safety (less disks per RAID less probability of data corruption)
>
> RAID6 is the worst performer of all the RAID levels but gives the best
> resilience to multiple drive failure. The reason for using fewer drives
> per array has less to do with probability of corruption, but
>
> 1. Limiting RMW operations to as few drives as possible, especially for
> controllers that do full stripe scrubbing on RMW
>
> 2. Lowering bandwidth and time required to rebuild a dead drive, fewer
> drives tied up during a rebuild
>
> > From the O.S. side we see:
> >
> > [root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
> ...
>
> You omitted crucial information. What is the stripe unit size of each
> RAID6?
>
Actually the stripe size for each RAID6 is 256KB, but we plan to increase
some pools to 1MB for all their RAIDs, in order to compare
performance for pools containing large files; if this improves things, we will
apply it to the other systems in the future.
> > The idea is to aggregate the above devices and show only 1 storage space.
> > We did as follows:
> >
> > vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
> > lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
>
> You've told LVM that its stripe unit is 4MB, and thus the stripe width
> of each RAID6 is 4MB. This is not possible with 10 data spindles.
> Again, show the RAID geometry from the LSI tools.
>
> When creating a nested stripe, the stripe unit of the outer stripe (LVM)
> must equal the stripe width of each inner stripe (RAID6).
>
Great. Hence, if the RAID6 stripe size is 256k, then the LVM should be
defined with 256k as well, shouldn't it?
> Hence, stripe of the 3 RAID6 in a LV.
>
> Each RAID6 has ~1.3GB/s of throughput. By striping the 3 arrays into a
> nested RAID60 this suggests you need single file throughput greater than
> 1.3GB/s and that all files are very large. If not, you'd be better off
> using a concatenation, and using md to accomplish that instead of LVM.
>
> > And here is my first question: How can I check if the storage and the LV
> > are correctly aligned?
>
> Answer is above. But the more important question is whether your
> workload wants a stripe or a concatenation.
>
> > On the other hand, I have formatted XFS as follows:
> >
> > mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
>
> This alignment is not correct. XFS must be aligned to the LVM stripe
> geometry. Here you apparently aligned XFS to the RAID6 geometry
> instead. Why are you manually specifying a 128M log? If you knew your
> workload that well, you would not have made these other mistakes.
>
We receive several parallel writes all the time, and AFAIK filesystems with
such a write load benefit from a larger log. 128M is the maximum log size.
So how should XFS be formatted then? As you said, it should be aligned with
the LVM stripe; since we have an LV with 3 stripes, is it 256k*3 and sw=30?
Thanks a lot,
--
Marc Caubet Serrabou
* Re: Alignment: XFS + LVM2
2014-05-08 9:12 ` Marc Caubet
@ 2014-05-08 13:04 ` Stan Hoeppner
2014-05-08 13:52 ` Marc Caubet
0 siblings, 1 reply; 6+ messages in thread
From: Stan Hoeppner @ 2014-05-08 13:04 UTC (permalink / raw)
To: Marc Caubet; +Cc: xfs
On 5/8/2014 4:12 AM, Marc Caubet wrote:
> Hi Stan,
>
> thanks for your answer.
>
> Everything begins and ends with the workload.
>>
>> On 5/7/2014 7:43 AM, Marc Caubet wrote:
>>> Hi all,
>>>
>>> I am trying to setup a storage pool with correct disk alignment and I
>> hope
>>> somebody can help me to understand some unclear parts to me when
>>> configuring XFS over LVM2.
>>
>> I'll try. But to be honest, after my first read of your post, a few
>> things jump out as breaking traditional rules.
>>
>> The first thing you need to consider is your workload and the type of
>> read/write patterns it will generate. This document is unfinished, and
>> unformatted, but reading what is there should be informative:
>>
>> http://www.hardwarefreak.com/xfs/storage-arch.txt
>>
>
> Basically we are moving a lot of data :) It means, parallel large files
> (GBs) are being written and read all the time. Basically we have a batch
> farm with 3,5k cores processing jobs that are constantly reading and
> writing to the storage pools (4PBs). Only few pools (~5% of the total)
> contain small files (and only small files).
And these pools are tied together with? Gluster? Ceph?
>>> Actually we have few storage pools with the following settings each:
>>>
>>> - LSI Controller with 3xRAID6
>>> - Each RAID6 is configured with 10 data disks + 2 for double-parity.
>>> - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
>>
>> 512e drives may cause data loss. See:
>> http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
>>
>
> Haven't experienced this yet. But good to know thanks :) On the other
> hand, we do not use zfs
This problem affects all filesystems. If the drive loses power during
an RMW cycle the physical sector is corrupted. As noted, not all 512e
drives may have this problem. And for the bulk of your workload this
shouldn't be an issue. If you have sufficient and properly functioning
UPS it shouldn't be an issue either.
>>> - 3x(10+2) configuration was considered in order to gain best performance
>>> and data safety (less disks per RAID less probability of data corruption)
>>
>> RAID6 is the worst performer of all the RAID levels but gives the best
>> resilience to multiple drive failure. The reason for using fewer drives
>> per array has less to do with probability of corruption, but
>>
>> 1. Limiting RMW operations to as few drives as possible, especially for
>> controllers that do full stripe scrubbing on RMW
>>
>> 2. Lowering bandwidth and time required to rebuild a dead drive, fewer
>> drives tied up during a rebuild
>>
>
>>> From the O.S. side we see:
>>>
>>> [root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
>> ...
>>
>> You omitted crucial information. What is the stripe unit size of each
>> RAID6?
>>
>
> Actually the stripe size for each RAID6 is 256KB but we plan to increase
> some pools to 1MB for all their RAIDs. It will be in order to compare
> performance for pools containing large files and if this improves, we will
> apply it to the other systems in the future.
So currently you have a 2.5MB stripe width per RAID6 and you plan to
test with a 10MB stripe width.
>>> The idea is to aggregate the above devices and show only 1 storage space.
>>> We did as follows:
>>>
>>> vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
>>> lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
>>
>> You've told LVM that its stripe unit is 4MB, and thus the stripe width
>> of each RAID6 is 4MB. This is not possible with 10 data spindles.
>> Again, show the RAID geometry from the LSI tools.
>>
>> When creating a nested stripe, the stripe unit of the outer stripe (LVM)
>> must equal the stripe width of each inner stripe (RAID6).
>>
>
> Great. Hence, if the RAID6 stripe size is 256k then the LVM should be
> defined with 256k as well, isn't it?
No. And according to lvcreate(8) you cannot use LVM for the outer
stripe because you have 10 data spindles per RAID6. "StripeSize" is
limited to power of 2 values. Your RAID6 stripe width is 2560 KB which
is not a power of 2 value. So you must use md. See mdadm(8).
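The power-of-2 constraint is easy to test mechanically: a positive integer v is a power of two exactly when v & (v-1) == 0. An illustrative sketch (added here, not from the original mail):

```shell
# Sketch: lvcreate's StripeSize must be a power of 2 (in KB), so test
# whether a candidate outer stripe unit qualifies.
is_pow2() {
    local v=$1
    [ "$v" -gt 0 ] && [ $((v & (v - 1))) -eq 0 ]
}

# 10 data disks x 256 KB = 2560 KB -> not a power of 2, LVM cannot do it
is_pow2 2560 && echo "2560 KB: LVM ok" || echo "2560 KB: must use md"
# 16 data disks x 256 KB = 4096 KB -> a power of 2, LVM could accept it
is_pow2 4096 && echo "4096 KB: LVM ok" || echo "4096 KB: must use md"
```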
And be careful with terminology. "Stripe unit" is per disk, called
"chunk" by mdadm. "Stripe width" is per array. "Stripe size" is ambiguous.
When nesting stripes, the "stripe width" of the RAID6 becomes the
"stripe unit" of the outer stripe of the resulting RAID60. In essence,
each RAID6 is treated as a "drive" in the outer stripe. For example:
RAID6 stripe unit = 256 KB
RAID6 stripe width = 2560 KB
RAID60 stripe unit = 2560 KB
RAID60 stripe width = 7680 KB
For RAID6 w/1MB stripe unit
RAID6 stripe unit = 1 MB
RAID6 stripe width = 10 MB
RAID60 stripe unit = 10 MB
RAID60 stripe width = 30 MB
This is assuming your stated configuration of 12 drives per RAID6, 10
data spindles, and 3 RAID6 arrays per nested stripe.
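Those geometry numbers can be recomputed mechanically. A small sketch (illustrative, not from the original mail), using the same assumptions of 10 data disks per RAID6 and 3 arrays in the outer stripe:

```shell
# Sketch: recompute the nested-stripe geometry for both candidate
# RAID6 stripe units (256 KB current, 1024 KB planned).
data_disks=10
arrays=3
for su_kb in 256 1024; do
    r6_width=$((su_kb * data_disks))   # RAID6 stripe width
    r60_su=$r6_width                   # becomes the RAID60 stripe unit
    r60_width=$((r60_su * arrays))     # RAID60 stripe width
    echo "su=${su_kb}KB -> RAID6 width=${r6_width}KB," \
         "RAID60 su=${r60_su}KB, RAID60 width=${r60_width}KB"
done
```

The output reproduces the 2560/7680 KB and 10/30 MB figures above.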
>> Hence, stripe of the 3 RAID6 in a LV.
>>
>> Each RAID6 has ~1.3GB/s of throughput. By striping the 3 arrays into a
>> nested RAID60 this suggests you need single file throughput greater than
>> 1.3GB/s and that all files are very large. If not, you'd be better off
>> using a concatenation, and using md to accomplish that instead of LVM.
>>
>>> And here is my first question: How can I check if the storage and the LV
>>> are correctly aligned?
>>
>> Answer is above. But the more important question is whether your
>> workload wants a stripe or a concatenation.
>>
>>> On the other hand, I have formatted XFS as follows:
>>>
>>> mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
lazy-count=1 is the default. No need to specify it.
>> This alignment is not correct. XFS must be aligned to the LVM stripe
>> geometry. Here you apparently aligned XFS to the RAID6 geometry
>> instead. Why are you manually specifying a 128M log? If you knew your
>> workload that well, you would not have made these other mistakes.
>>
>
> We receive several parallel writes all the time, and afaik filesystems with
> such write load benenfit from a larger log. 128M is the maximum log size.
Metadata is journaled, file data is not. Filesystems experiencing a
large amount of metadata modification may benefit from a larger journal
log, however writing many large files in parallel typically doesn't
generate much metadata modification. In addition, with delayed logging
now the default, the amount of data written to the journal is much less
than it used to be. So specifying a log size should not be necessary
with your workload.
> So how XFS should be formatted then? As you specify, should be aligned with
> the LVM stripe, as we have a LV with 3 stripes then 256k*3 and sw=30?
It must be aligned to the outer stripe in the nest, which would be the
LVM geometry if it could work. However, as stated, it appears you
cannot use lvcreate to make the outer stripe because it does not allow a
2560 KiB StripeSize. Destroy the LVM volume and create an md RAID0
device of the 3 RAID6 devices, eg:
$ mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
For making the filesystem and aligning it to the md nested stripe
RAID60, this is all that is required:
$ mkfs.xfs -d su=2560k,sw=3 /dev/md0
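As a rough cross-check (an addition, not from the original mail): xfs_info reports sunit and swidth in filesystem blocks, so with the default 4 KB block size the geometry above should appear as sketched below.

```shell
# Sketch: expected xfs_info sunit/swidth for su=2560k,sw=3 on a
# filesystem with a 4 KB block size (sunit/swidth shown in fs blocks).
su_kb=2560
sw=3
bsize=4096
sunit_blocks=$((su_kb * 1024 / bsize))   # per-unit size in fs blocks
swidth_blocks=$((sunit_blocks * sw))     # full stripe width in fs blocks
echo "expect: sunit=${sunit_blocks} blks, swidth=${swidth_blocks} blks"
```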
Cheers,
Stan
* Re: Alignment: XFS + LVM2
2014-05-08 13:04 ` Stan Hoeppner
@ 2014-05-08 13:52 ` Marc Caubet
2014-05-08 19:49 ` Stan Hoeppner
0 siblings, 1 reply; 6+ messages in thread
From: Marc Caubet @ 2014-05-08 13:52 UTC (permalink / raw)
To: stan; +Cc: xfs
Hi Stan,
once again, thanks for your answer.
> Hi Stan,
> >
> > thanks for your answer.
> >
> > Everything begins and ends with the workload.
> >>
> >> On 5/7/2014 7:43 AM, Marc Caubet wrote:
> >>> Hi all,
> >>>
> >>> I am trying to setup a storage pool with correct disk alignment and I
> >> hope
> >>> somebody can help me to understand some unclear parts to me when
> >>> configuring XFS over LVM2.
> >>
> >> I'll try. But to be honest, after my first read of your post, a few
> >> things jump out as breaking traditional rules.
> >>
> >> The first thing you need to consider is your workload and the type of
> >> read/write patterns it will generate. This document is unfinished, and
> >> unformatted, but reading what is there should be informative:
> >>
> >> http://www.hardwarefreak.com/xfs/storage-arch.txt
> >>
> >
> > Basically we are moving a lot of data :) It means, parallel large files
> > (GBs) are being written and read all the time. Basically we have a batch
> > farm with 3,5k cores processing jobs that are constantly reading and
> > writing to the storage pools (4PBs). Only few pools (~5% of the total)
> > contain small files (and only small files).
>
> And these pools are tied together with? Gluster? Ceph?
>
We are using dCache (http://www.dcache.org/), where a file is written to a
single pool instead of spreading parts among pools as Ceph or Hadoop do. So
large files go entirely to one pool.
> >>> Actually we have few storage pools with the following settings each:
> >>>
> >>> - LSI Controller with 3xRAID6
> >>> - Each RAID6 is configured with 10 data disks + 2 for double-parity.
> >>> - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
> >>
> >> 512e drives may cause data loss. See:
> >> http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
> >>
> >
> > Haven't experienced this yet. But good to know thanks :) On the other
> > hand, we do not use zfs
>
> This problem affects all filesystems. If the drive loses power during
> an RMW cycle the physical sector is corrupted. As noted, not all 512e
> drives may have this problem. And for the bulk of your workload this
> shouldn't be an issue. If you have sufficient and properly functioning
> UPS it shouldn't be an issue either.
>
Actually all LSI controllers have batteries, so I hope it will not happen.
This problem is good to have in mind when we purchase new storage
machines, so thanks :)
>
> >>> - 3x(10+2) configuration was considered in order to gain best
> performance
> >>> and data safety (less disks per RAID less probability of data
> corruption)
> >>
> >> RAID6 is the worst performer of all the RAID levels but gives the best
> >> resilience to multiple drive failure. The reason for using fewer drives
> >> per array has less to do with probability of corruption, but
> >>
> >> 1. Limiting RMW operations to as few drives as possible, especially for
> >> controllers that do full stripe scrubbing on RMW
> >>
> >> 2. Lowering bandwidth and time required to rebuild a dead drive, fewer
> >> drives tied up during a rebuild
> >>
> >
> >>> From the O.S. side we see:
> >>>
> >>> [root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
> >> ...
> >>
> >> You omitted crucial information. What is the stripe unit size of each
> >> RAID6?
> >>
> >
> > Actually the stripe size for each RAID6 is 256KB but we plan to increase
> > some pools to 1MB for all their RAIDs. It will be in order to compare
> > performance for pools containing large files and if this improves, we
> will
> > apply it to the other systems in the future.
>
> So currently you have a 2.5MB stripe width per RAID6 and you plan to
> test with a 10MB stripe width.
>
> >>> The idea is to aggregate the above devices and show only 1 storage
> space.
> >>> We did as follows:
> >>>
> >>> vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
> >>> lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
> >>
> >> You've told LVM that its stripe unit is 4MB, and thus the stripe width
> >> of each RAID6 is 4MB. This is not possible with 10 data spindles.
> >> Again, show the RAID geometry from the LSI tools.
> >>
> >> When creating a nested stripe, the stripe unit of the outer stripe (LVM)
> >> must equal the stripe width of each inner stripe (RAID6).
> >>
> >
> > Great. Hence, if the RAID6 stripe size is 256k then the LVM should be
> > defined with 256k as well, isn't it?
>
> No. And according to lvcreate(8) you cannot use LVM for the outer
> stripe because you have 10 data spindles per RAID6. "StripeSize" is
> limited to power of 2 values. Your RAID6 stripe width is 2560 KB which
> is not a power of 2 value. So you must use md. See mdadm(8).
>
Great, thanks, this is exactly what I needed, and I think I am starting to
understand :) So a RAID6 of 16+2 disks with a 256KB stripe unit will
have a stripe width of 256*16=4096KB, which is a power of 2. Then in this case
LVM2 can be used. Am I correct? Then it seems clear to me that new purchases
will go this way (we have planned a new purchase in the next month and I
am trying to understand this).
> And be careful with terminology. "Stripe unit" is per disk, called
> "chunk" by mdadm. "Stripe width" is per array. "Stripe size" is
> ambiguous.
>
Yes, correct, sorry for the wrong terminology; it is something I am not
used to dealing with :)
>
> When nesting stripes, the "stripe width" of the RAID6 becomes the
> "stripe unit" of the outer stripe of the resulting RAID60. In essence,
> each RAID6 is treated as a "drive" in the outer stripe. For example:
>
> RAID6 stripe unit = 256 KB
> RAID6 stripe width = 2560 KB
> RAID60 stripe unit = 2560 KB
> RAID60 stripe width = 7680 KB
>
> For RAID6 w/1MB stripe unit
>
> RAID6 stripe unit = 1 MB
> RAID6 stripe width = 10 MB
> RAID60 stripe unit = 10 MB
> RAID60 stripe width = 30 MB
>
> This is assuming your stated configuration of 12 drives per RAID6, 10
> data spindles, and 3 RAID6 arrays per nested stripe.
>
> >> Hence, stripe of the 3 RAID6 in a LV.
> >>
> >> Each RAID6 has ~1.3GB/s of throughput. By striping the 3 arrays into a
> >> nested RAID60 this suggests you need single file throughput greater than
> >> 1.3GB/s and that all files are very large. If not, you'd be better off
> >> using a concatenation, and using md to accomplish that instead of LVM.
> >>
> >>> And here is my first question: How can I check if the storage and the
> LV
> >>> are correctly aligned?
> >>
> >> Answer is above. But the more important question is whether your
> >> workload wants a stripe or a concatenation.
> >>
> >>> On the other hand, I have formatted XFS as follows:
> >>>
> >>> mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
>
> lazy-count=1 is the default. No need to specify it.
>
Ok thanks :)
>
> >> This alignment is not correct. XFS must be aligned to the LVM stripe
> >> geometry. Here you apparently aligned XFS to the RAID6 geometry
> >> instead. Why are you manually specifying a 128M log? If you knew your
> >> workload that well, you would not have made these other mistakes.
> >>
> >
> > We receive several parallel writes all the time, and afaik filesystems
> with
> > such a write load benefit from a larger log. 128M is the maximum log size.
>
> Metadata is journaled, file data is not. Filesystems experiencing a
> large amount of metadata modification may benefit from a larger journal
> log, however writing many large files in parallel typically doesn't
> generate much metadata modification. In addition, with delayed logging
> now the default, the amount of data written to the journal is much less
> than it used to be. So specifying a log size should not be necessary
> with your workload.
>
Ok. Then I'll try to remove that.
> > So how XFS should be formatted then? As you specify, should be aligned
> with
> > the LVM stripe, as we have a LV with 3 stripes then 256k*3 and sw=30?
>
> It must be aligned to the outer stripe in the nest, which would be the
> LVM geometry if it could work. However, as stated, it appears you
> cannot use lvcreate to make the outer stripe because it does not allow a
> 2560 KiB StripeSize. Destroy the LVM volume and create an md RAID0
> device of the 3 RAID6 devices, eg:
>
> $ mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
>
> For making the filesystem and aligning it to the md nested stripe
> RAID60, this is all that is required:
>
> $ mkfs.xfs -d su=2560k,sw=3 /dev/md0
>
Perfect! I'll try this with the current server having 3xRAID6(10+2). You
really helped me with that.
Just one final question: if I had 3xRAID6(16+2), the stripe width should be
4096KB (256KB*16), and when applying this to LVM2 it should be:
lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
And then the XFS format should be:
mkfs.xfs -d su=4096k,sw=3 /dev/dcvg_a/dcpool
Is it correct?
Thanks a lot for your help,
--
Marc Caubet Serrabou
* Re: Alignment: XFS + LVM2
2014-05-08 13:52 ` Marc Caubet
@ 2014-05-08 19:49 ` Stan Hoeppner
0 siblings, 0 replies; 6+ messages in thread
From: Stan Hoeppner @ 2014-05-08 19:49 UTC (permalink / raw)
To: Marc Caubet; +Cc: xfs
On 5/8/2014 8:52 AM, Marc Caubet wrote:
> Hi Stan,
>
> once again, thanks for your answer.
You bet.
...
> Actually all LSI controllers have batteries so I hope it will not happen.
> This problem is good to have in mind when we purchase new storage
> machines so thanks :)
Battery or flash backed cache on the controller does not prevent this
problem. This read-modify-write operation is internal to the drive and
transparent to the controller.
...
> Great thanks, this is exactly what I needed and I think I am starting to
> understand then :) So a RAID6 of 16+2 disks with a 256KB stripe unit will
> have a stripe width of 256*16=4096KB, which is a power of 2. Then in this case
> LVM2 can be used. Am I correct? Then seems clear to me that new purchases
> will go in this way (we have planned a new purchase in the next month and I
> am trying to understand this)
Use md for the outer stripe so you have no hardware and no stripe unit
limitations. If you need features of LVM such as snapshots simply layer
a PV and LV atop the md RAID0 device.
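A sketch of that layering (illustrative only; the device and volume-group names are taken from the thread, and the run() wrapper merely echoes each command instead of executing it, so nothing here touches real disks):

```shell
# Sketch: md RAID0 as the outer stripe, with LVM layered on top for
# features like snapshots. run() only prints the commands (dry run).
run() { echo "+ $*"; }

run mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
run pvcreate /dev/md0
run vgcreate dcvg_a /dev/md0
run lvcreate -n dcpool -l 100%FREE dcvg_a   # no -i/-I: md already stripes
run mkfs.xfs -d su=2560k,sw=3 /dev/dcvg_a/dcpool
```

The key design point is that striping happens once, in md; the LV on top is linear, so the XFS su/sw alignment still targets the md RAID0 geometry.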
...
>> It must be aligned to the outer stripe in the nest, which would be the
>> LVM geometry if it could work. However, as stated, it appears you
>> cannot use lvcreate to make the outer stripe because it does not allow a
>> 2560 KiB StripeSize. Destroy the LVM volume and create an md RAID0
>> device of the 3 RAID6 devices, eg:
>>
>> $ mdadm -C /dev/md0 --raid_devices=3 --chunk=2560 --level=0 /dev/sd[abc]
>>
>> For making the filesystem and aligning it to the md nested stripe
>> RAID60, this is all that is required:
>>
>> $ mkfs.xfs -d su=2560k,sw=3 /dev/md0
>>
>
> Perfect! I'll try this with the current server having 3xRAID6(10+2). You
> really helped me with that.
>
> Just one final question, if I had 3*RAID6(16+2) the Stripe Width should be
> 4096 (256KB*16) and when applying this to LVM2 should be:
Again, do not use LVM for the outer stripe.
$ mdadm -C /dev/md0 --raid-devices=3 --chunk=4096 --level=0 /dev/sd[abc]
> And then the XFS format should be:
$ mkfs.xfs -d su=4096k,sw=3 /dev/md0
> Is it correct?
It is now.
> Thanks a lot for your help,
Sure thing.
Cheers,
Stan