* Poor performance -- poor config?
From: Robert Petkus @ 2007-06-20 20:59 UTC (permalink / raw)
To: xfs; +Cc: Petkus Robert
Folks,
I'm trying to configure a system (server + DS4700 disk array) that can
offer the highest performance for our application. We will be reading
and writing multiple threads of 1-2GB files with 1MB block sizes.
DS4700 config:
(16) 500 GB SATA disks
(3) 4+1 RAID 5 arrays and (1) hot spare == (3) 2TB LUNs.
(2) RAID arrays are on controller A, (1) RAID array is on controller B.
512k segment size
Server Config:
IBM x3550, 9GB RAM, RHEL 5 x86_64 (2.6.18)
The (3) LUNs are sdb, sdc {both controller A}, sdd {controller B}
My original goal was to use XFS and create a highly optimized config.
Here is what I came up with:
Create separate partitions for XFS log files: sdd1, sdd2, sdd3 each 150M
-- 128MB is the maximum allowable XFS log size.
The XFS "stripe unit" (su) = 512k to match the DS4700 segment size.
The "stripe width" ((n-1) * sunit) = swidth = 2048k, i.e. sw=4 stripe units.
4k is the max block size allowable on x86_64 since 4k is the max kernel
page size.
[root@~]# mkfs.xfs -l logdev=/dev/sdd1,size=128m -d su=512k -d sw=4 -f
/dev/sdb
[root@~]# mount -t xfs -o
context=system_u:object_r:unconfined_t,noatime,nodiratime,logbufs=8,logdev=/dev/sdd1
/dev/sdb /data0
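For reference, the arithmetic behind those su/sw flags works out like this (a sketch; 4k blocks as above -- xfs_info reports sunit/swidth in filesystem blocks):

```shell
# Arithmetic behind the su/sw choices above (sketch): with a 4k block
# size, xfs_info should report these sunit/swidth values in blocks.
su_bytes=$((512 * 1024))   # 512k DS4700 segment size
sw=4                       # 4 data disks in a 4+1 RAID5
bsize=4096                 # 4k filesystem block size
echo "sunit=$((su_bytes / bsize)) swidth=$((su_bytes * sw / bsize)) blks"
```

Running `xfs_info /data0` after mounting should show the same sunit/swidth values, confirming the geometry took effect.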
And the write performance is lousy compared to ext3 built like so:
[root@~]# mke2fs -j -m 1 -b4096 -E stride=128 /dev/sdc
[root@~]# mount -t ext3 -o
noatime,nodiratime,context="system_u:object_r:unconfined_t:s0",reservation
/dev/sdc /data1
What am I missing?
Thanks!
--
Robert Petkus
RHIC/USATLAS Computing Facility
Brookhaven National Laboratory
Physics Dept. - Bldg. 510A
Upton, New York 11973
http://www.bnl.gov/RHIC
http://www.acf.bnl.gov
* Re: Poor performance -- poor config?
From: Justin Piszcz @ 2007-06-20 21:04 UTC (permalink / raw)
To: Robert Petkus; +Cc: xfs
On Wed, 20 Jun 2007, Robert Petkus wrote:
> Folks,
> I'm trying to configure a system (server + DS4700 disk array) that can offer
> the highest performance for our application. We will be reading and writing
> multiple threads of 1-2GB files with 1MB block sizes.
> [snip]
What speeds are you getting?
Have you tried SW RAID across the 16 drives? If you do that, XFS will
auto-optimize per the physical characteristics of the md array.
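Something like this sketch (hypothetical device names; it assumes the drives could be exported individually as a JBOD, and it would destroy the existing LUNs):

```shell
# Hypothetical sketch: 15-disk md RAID5 plus one hot spare from
# individually exported drives; mkfs.xfs then reads the geometry
# straight from the md device, so no manual su/sw is needed.
mdadm --create /dev/md0 --level=5 --raid-devices=15 \
      --spare-devices=1 --chunk=512 /dev/sd[b-q]
mkfs.xfs /dev/md0                   # sunit/swidth derived from md chunk size
mount -t xfs -o noatime /dev/md0 /data0
```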
Also, from my personal benchmarks, most of those mount options besides
logdev/noatime don't do much with XFS; you're better off with the
defaults plus noatime.
What speeds are you getting for reads and writes, and what do you expect?
How are the drives attached, and what type of controller? PCI?
* Re: Poor performance -- poor config?
From: Robert Petkus @ 2007-06-20 21:16 UTC (permalink / raw)
To: Justin Piszcz; +Cc: xfs
Justin Piszcz wrote:
>
> On Wed, 20 Jun 2007, Robert Petkus wrote:
> [snip]
>
> What speeds are you getting?
dd if=/dev/zero of=/data0/bigfile bs=1024k count=5000
5242880000 bytes (5.2 GB) copied, 149.296 seconds, 35.1 MB/s

dd if=/data0/bigfile of=/dev/null bs=1024k count=5000
5242880000 bytes (5.2 GB) copied, 26.3148 seconds, 199 MB/s

iozone.linux -w -r 1m -s 1g -i0 -t 4 -e -w -f /data0/test1
Children see throughput for 4 initial writers = 28528.59 KB/sec
Parent sees throughput for 4 initial writers  = 25212.79 KB/sec
Min throughput per process                    =  6259.05 KB/sec
Max throughput per process                    =  7548.29 KB/sec
Avg throughput per process                    =  7132.15 KB/sec

iozone.linux -w -r 1m -s 1g -i1 -t 4 -e -w -f /data0/test1
Children see throughput for 4 readers = 3059690.19 KB/sec
Parent sees throughput for 4 readers  = 3055307.71 KB/sec
Min throughput per process            =  757151.81 KB/sec
Max throughput per process            =  776032.62 KB/sec
Avg throughput per process            =  764922.55 KB/sec
>
> Have you tried a SW RAID with the 16 drives, if you do that, XFS will
> auto-optimize per the physical characteristics of the md array.
No, because that would waste an expensive disk array. I've done this
with various JBODs, even a Sun Thumper, with OK results...
>
> Also, most of those mount options besides the logdev/noatime don't do
> much with XFS from my personal benchmarks, you're better off with the
> defaults+noatime.
The security context stuff is in there because I run a strict SELinux
policy, and I need logdev since the log is on a different disk. BTW,
the same filesystem without a separate log device made no difference in
performance.
>
> What speed are you getting reads/writes, what do you expect? How are
> the drives attached/what type of controller? PCI?
I can get ~3x the write performance with ext3. I have a dual-port FC-4 PCIe
HBA connected to (2) IBM DS4700 FC-4 controllers. There is lots of
headroom.
--
Robert Petkus
RHIC/USATLAS Computing Facility
Brookhaven National Laboratory
Physics Dept. - Bldg. 510A
Upton, New York 11973
http://www.bnl.gov/RHIC
http://www.acf.bnl.gov
* Re: Poor performance -- poor config?
From: Justin Piszcz @ 2007-06-20 21:23 UTC (permalink / raw)
To: Robert Petkus; +Cc: xfs
On Wed, 20 Jun 2007, Robert Petkus wrote:
> Justin Piszcz wrote:
> [snip]
> >> What speed are you getting reads/writes, what do you expect? How are the
> >> drives attached/what type of controller? PCI?
> > I can get ~3x write performance with ext3. I have a dual-port FC-4 PCIe HBA
> > connected to (2) IBM DS4700 FC-4 controllers. There is lots of headroom.
ext3 up to 3x as fast? Hrm.. Have you tried the default mkfs.xfs options
[internal journal]? What write speed do you get using the defaults?
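I.e., a baseline along these lines (sketch; same device and mount point as in your original mail):

```shell
# Baseline run with stock mkfs.xfs (internal journal) and defaults
# plus noatime -- sketch, reusing the device/mount point from the thread.
umount /data0 2>/dev/null
mkfs.xfs -f /dev/sdb
mount -t xfs -o noatime /dev/sdb /data0
dd if=/dev/zero of=/data0/bigfile bs=1024k count=5000
```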
What kernel version?
Justin.
* RE: Poor performance -- poor config?
From: Sebastian Brings @ 2007-06-21 6:37 UTC (permalink / raw)
To: Justin Piszcz, Robert Petkus; +Cc: xfs
> -----Original Message-----
> From: Justin Piszcz
> Sent: Wednesday, 20 June 2007 23:24
> To: Robert Petkus
> Cc: xfs@oss.sgi.com
> Subject: Re: Poor performance -- poor config?
> [snip]
Not sure it makes much sense to set stripe unit and width for a RAID
that appears as a single device.
As you state, the "width" of your DS LUN is 4 x 512K == 2MB. If you
don't have write cache enabled, each of your 1MB writes will hit only
two of the four data disks, causing heavy overhead to compute parity.
Write cache mirroring on the DS also limits write performance. And
finally there is an option in the DS to change the cache segment size
from the 16k default to 4k, IIRC. Make sure it is set to 16k.
But still, 35 MB/s for a single sequential write is really poor. It
almost looks like you are getting single-spindle performance.
Sebastian
* Re: Poor performance -- poor config?
From: David Chinner @ 2007-06-21 23:59 UTC (permalink / raw)
To: Sebastian Brings; +Cc: Justin Piszcz, Robert Petkus, xfs
On Thu, Jun 21, 2007 at 08:37:36AM +0200, Sebastian Brings wrote:
> Not sure if it makes much sense to set stripe unit and width for a Raid
> which appears as a single device.
Certainly it does.
That way you get stripe-aligned allocation, and therefore you are
much more likely to get full-stripe-width writes instead of unaligned
writes that force RMW cycles on the RAID controller for parity calculations.
> As you state, the "width" of your DS lun is 4 x 512K == 2MB. In case you
> don't have write cache enabled each of your 1MB writes will cause the DS
> to write to two out of four disks only, causing heavy overhead to create
> parity.
You're assuming stripe-aligned I/O there. That 1MB could hit 3 of the 4
data disks - if you don't have a stripe unit set, that will
be the common case. I.e., it's worse than you think :/
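The chunk-span arithmetic behind that, as a sketch (512k stripe unit, 1MB writes, as in this thread):

```shell
# Number of 512k chunks (and hence data disks) a 1MB write touches,
# for a stripe-aligned offset vs. an offset 256k into a chunk (sketch).
su=$((512 * 1024))                 # stripe unit in bytes
len=$((1024 * 1024))               # 1MB application write
for off in 0 $((256 * 1024)); do
  chunks=$(( ((off + len - 1) / su) - (off / su) + 1 ))
  echo "offset ${off}: ${chunks} chunks"
done
# aligned write -> 2 chunks (2 disks); unaligned -> 3 chunks (3 disks)
```

An aligned 1MB write spans exactly two chunks; an unaligned one commonly spans three, each needing its own read-modify-write for parity.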
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Thread overview: 6+ messages
2007-06-20 20:59 Poor performance -- poor config? Robert Petkus
2007-06-20 21:04 ` Justin Piszcz
2007-06-20 21:16 ` Robert Petkus
2007-06-20 21:23 ` Justin Piszcz
2007-06-21 6:37 ` Sebastian Brings
2007-06-21 23:59 ` David Chinner