* Poor performance -- poor config?
@ 2007-06-20 20:59 Robert Petkus
2007-06-20 21:04 ` Justin Piszcz
0 siblings, 1 reply; 6+ messages in thread
From: Robert Petkus @ 2007-06-20 20:59 UTC (permalink / raw)
To: xfs; +Cc: Petkus Robert
Folks,
I'm trying to configure a system (server + DS4700 disk array) that can
offer the highest performance for our application. We will be reading
and writing multiple threads of 1-2GB files with 1MB block sizes.
DS4700 config:
(16) 500 GB SATA disks
(3) 4+1 RAID 5 arrays and (1) hot spare == (3) 2TB LUNs.
(2) RAID arrays are on controller A, (1) RAID array is on controller B.
512k segment size
Server Config:
IBM x3550, 9GB RAM, RHEL 5 x86_64 (2.6.18)
The (3) LUNs are sdb, sdc {both controller A}, sdd {controller B}
My original goal was to use XFS and create a highly optimized config.
Here is what I came up with:
Create separate partitions for XFS log files: sdd1, sdd2, sdd3 each 150M
-- 128MB is the maximum allowable XFS log size.
The XFS "stripe unit" (su) = 512k to match the DS4700 segment size
The "stripe width" ( (n-1)*sunit )= swidth=2048k = sw=4 (a multiple of su)
The filesystem block size is 4k, the maximum allowed on x86_64 since the
kernel page size there is 4k.
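Spelled out, the geometry that the mkfs flags below encode (assuming 4 data
disks plus 1 parity disk per array, as described above) is:

  su     = 512k                          (one DS4700 segment)
  sw     = 4                             (data disks per RAID 5 array)
  swidth = sw * su = 4 * 512k = 2048k    (one full stripe)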
[root@~]# mkfs.xfs -l logdev=/dev/sdd1,size=128m -d su=512k -d sw=4 -f
/dev/sdb
[root@~]# mount -t xfs -o
context=system_u:object_r:unconfined_t,noatime,nodiratime,logbufs=8,logdev=/dev/sdd1
/dev/sdb /data0
And the write performance is lousy compared to ext3 built like so:
[root@~]# mke2fs -j -m 1 -b4096 -E stride=128 /dev/sdc
[root@~]# mount -t ext3 -o
noatime,nodiratime,context="system_u:object_r:unconfined_t:s0",reservation
/dev/sdc /data1
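(For comparison, the ext3 stride above follows from the same geometry:
stride = segment size / block size = 512k / 4k = 128 blocks.)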
What am I missing?
Thanks!
--
Robert Petkus
RHIC/USATLAS Computing Facility
Brookhaven National Laboratory
Physics Dept. - Bldg. 510A
Upton, New York 11973
http://www.bnl.gov/RHIC
http://www.acf.bnl.gov
* Re: Poor performance -- poor config?
2007-06-20 20:59 Poor performance -- poor config? Robert Petkus
@ 2007-06-20 21:04 ` Justin Piszcz
2007-06-20 21:16 ` Robert Petkus
0 siblings, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2007-06-20 21:04 UTC (permalink / raw)
To: Robert Petkus; +Cc: xfs

On Wed, 20 Jun 2007, Robert Petkus wrote:
> I'm trying to configure a system (server + DS4700 disk array) that can
> offer the highest performance for our application. We will be reading
> and writing multiple threads of 1-2GB files with 1MB block sizes.
> [...]
> What am I missing?

What speeds are you getting?

Have you tried a SW RAID with the 16 drives? If you do that, XFS will
auto-optimize per the physical characteristics of the md array.

Also, most of those mount options besides logdev/noatime don't do much
with XFS in my personal benchmarks; you're better off with the
defaults+noatime.

What speed are you getting for reads/writes, and what do you expect? How
are the drives attached / what type of controller? PCI?
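A minimal sketch of the software-RAID route suggested here, assuming the
enclosure could instead present its disks as plain JBOD and they showed up
as /dev/sdb through /dev/sdq (the device names, chunk size, and spare
layout are hypothetical):

  # build a 15-disk RAID 5 with one hot spare and a 512k chunk (hypothetical layout)
  mdadm --create /dev/md0 --level=5 --raid-devices=15 --spare-devices=1 \
        --chunk=512 /dev/sd[b-q]
  # mkfs.xfs reads the md geometry and sets sunit/swidth on its own
  mkfs.xfs -f /dev/md0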
* Re: Poor performance -- poor config?
2007-06-20 21:04 ` Justin Piszcz
@ 2007-06-20 21:16 ` Robert Petkus
2007-06-20 21:23 ` Justin Piszcz
0 siblings, 1 reply; 6+ messages in thread
From: Robert Petkus @ 2007-06-20 21:16 UTC (permalink / raw)
To: Justin Piszcz; +Cc: xfs

Justin Piszcz wrote:
> What speeds are you getting?

dd if=/dev/zero of=/data0/bigfile bs=1024k count=5000
5242880000 bytes (5.2 GB) copied, 149.296 seconds, 35.1 MB/s

dd if=/data0/bigfile of=/dev/null bs=1024k count=5000
5242880000 bytes (5.2 GB) copied, 26.3148 seconds, 199 MB/s

iozone.linux -w -r 1m -s 1g -i0 -t 4 -e -w -f /data0/test1
Children see throughput for 4 initial writers = 28528.59 KB/sec
Parent sees throughput for 4 initial writers = 25212.79 KB/sec
Min throughput per process = 6259.05 KB/sec
Max throughput per process = 7548.29 KB/sec
Avg throughput per process = 7132.15 KB/sec

iozone.linux -w -r 1m -s 1g -i1 -t 4 -e -w -f /data0/test1
Children see throughput for 4 readers = 3059690.19 KB/sec
Parent sees throughput for 4 readers = 3055307.71 KB/sec
Min throughput per process = 757151.81 KB/sec
Max throughput per process = 776032.62 KB/sec
Avg throughput per process = 764922.55 KB/sec

> Have you tried a SW RAID with the 16 drives? If you do that, XFS will
> auto-optimize per the physical characteristics of the md array.

No, because this would waste an expensive disk array. I've done this with
various JBODs, even a Sun Thumper, with OK results...

> Also, most of those mount options besides logdev/noatime don't do much
> with XFS in my personal benchmarks; you're better off with the
> defaults+noatime.

The security context stuff is in there since I run a strict SELinux
policy. Otherwise, I need logdev since it's on a different disk. BTW, the
same filesystem without a separate log disk made no difference in
performance.

> What speed are you getting for reads/writes, and what do you expect? How
> are the drives attached / what type of controller? PCI?

I can get ~3x the write performance with ext3. I have a dual-port FC-4
PCIe HBA connected to (2) IBM DS4700 FC-4 controllers. There is lots of
headroom.

--
Robert Petkus
RHIC/USATLAS Computing Facility
Brookhaven National Laboratory
Physics Dept. - Bldg. 510A
Upton, New York 11973

http://www.bnl.gov/RHIC
http://www.acf.bnl.gov
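A caveat on those numbers: the 199 MB/s dd read and the ~3 GB/s iozone
reads are almost certainly served from the page cache, since the 5 GB test
file largely fits in the 9 GB of RAM. A quick way to take the cache out of
the picture, assuming the dd shipped with RHEL 5 accepts the direct-I/O
flags (coreutils 5.3 or later):

  dd if=/dev/zero of=/data0/bigfile bs=1024k count=5000 oflag=direct
  dd if=/data0/bigfile of=/dev/null bs=1024k count=5000 iflag=direct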
* Re: Poor performance -- poor config?
2007-06-20 21:16 ` Robert Petkus
@ 2007-06-20 21:23 ` Justin Piszcz
2007-06-21 6:37 ` Sebastian Brings
0 siblings, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2007-06-20 21:23 UTC (permalink / raw)
To: Robert Petkus; +Cc: xfs

On Wed, 20 Jun 2007, Robert Petkus wrote:
> I can get ~3x the write performance with ext3. I have a dual-port FC-4
> PCIe HBA connected to (2) IBM DS4700 FC-4 controllers. There is lots of
> headroom.

EXT3 up to 3x as fast? Hrm.. Have you tried the default mkfs.xfs options
[internal journal]? What write speed do you get using the defaults?

What kernel version?

Justin.
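For reference, the baseline being asked about here would be something like
the following (no stripe hints, internal log; mount point reused from
above):

  mkfs.xfs -f /dev/sdb
  mount -t xfs -o noatime /dev/sdb /data0
  dd if=/dev/zero of=/data0/bigfile bs=1024k count=5000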
* RE: Poor performance -- poor config?
2007-06-20 21:23 ` Justin Piszcz
@ 2007-06-21 6:37 ` Sebastian Brings
2007-06-21 23:59 ` David Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Brings @ 2007-06-21 6:37 UTC (permalink / raw)
To: Justin Piszcz, Robert Petkus; +Cc: xfs

> -----Original Message-----
> From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] On Behalf
> Of Justin Piszcz
> Sent: Wednesday, 20 June 2007 23:24
> To: Robert Petkus
> Cc: xfs@oss.sgi.com
> Subject: Re: Poor performance -- poor config?
> [...]

Not sure it makes much sense to set stripe unit and width for a RAID that
appears as a single device.

As you state, the "width" of your DS LUN is 4 x 512K == 2MB. If you don't
have write cache enabled, each of your 1MB writes will cause the DS to
write to only two of the four disks, causing heavy overhead to create
parity. Write cache mirroring on the DS also limits write performance.
And finally, there is an option on the DS to change the cache segment
size from the 16k default to 4k, IIRC. Make sure it is set to 16k.

But still, 35MB/s for a single sequential write is really poor. It almost
looks like you are getting single-spindle performance only.

Sebastian
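To put rough numbers on that point, using the geometry stated earlier in
the thread and assuming stripe-aligned writes (the next reply covers the
unaligned case):

  full stripe = 4 data disks x 512k segment = 2MB
  1MB write   = only 2 of the 4 segments touched -> controller must read
                old data/parity to recompute parity (read-modify-write)
  2MB aligned = all 4 segments touched -> parity computed from new data,
                no extra reads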
* Re: Poor performance -- poor config?
2007-06-21 6:37 ` Sebastian Brings
@ 2007-06-21 23:59 ` David Chinner
0 siblings, 0 replies; 6+ messages in thread
From: David Chinner @ 2007-06-21 23:59 UTC (permalink / raw)
To: Sebastian Brings; +Cc: Justin Piszcz, Robert Petkus, xfs

On Thu, Jun 21, 2007 at 08:37:36AM +0200, Sebastian Brings wrote:
> Not sure it makes much sense to set stripe unit and width for a RAID
> that appears as a single device.

Certainly it does. That way you get stripe-aligned allocation, and
therefore you are much more likely to get full-stripe-width writes
instead of unaligned writes that force RMW cycles on the RAID controller
for parity calculations.

> As you state, the "width" of your DS LUN is 4 x 512K == 2MB. If you
> don't have write cache enabled, each of your 1MB writes will cause the
> DS to write to only two of the four disks, causing heavy overhead to
> create parity.

You're assuming stripe-aligned I/O there. That 1MB could hit 3 of the 4
data disks - if you don't have a stripe unit set, that will be the
common case. i.e. it's worse than you think :/

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
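A quick sanity check that the filesystem actually recorded the intended
geometry is xfs_info; with 4k blocks, su=512k and sw=4 should show up as
roughly sunit=128 and swidth=512, since those fields are reported in
filesystem blocks (that expectation is an inference from the numbers
above, not captured output):

  xfs_info /data0

If the geometry was missed at mkfs time, sunit and swidth can also be
passed as mount options (in 512-byte units, e.g. -o sunit=1024,swidth=4096),
though setting them at mkfs time is the cleaner fix.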
end of thread, other threads: [~2007-06-25 5:50 UTC | newest]

Thread overview: 6+ messages
2007-06-20 20:59 Poor performance -- poor config? Robert Petkus
2007-06-20 21:04 ` Justin Piszcz
2007-06-20 21:16   ` Robert Petkus
2007-06-20 21:23     ` Justin Piszcz
2007-06-21 6:37        ` Sebastian Brings
2007-06-21 23:59         ` David Chinner