From mboxrd@z Thu Jan 1 00:00:00 1970 From: "--[ UxBoD ]--" Subject: Performance Question Date: Thu, 15 Sep 2011 20:43:27 +0100 (BST) Message-ID: <1eff05ad-e2cc-42a5-b757-dfa40f14e776@office.splatnix.net> Reply-To: "--\[ UxBoD \]--" , device-mapper development Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============3978009683246225922==" Return-path: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com To: dm-devel@redhat.com List-Id: dm-devel.ids --===============3978009683246225922== Content-Type: multipart/alternative; boundary="=_a5debe90-bde7-4d8f-9f04-a4c698cd5f52" --=_a5debe90-bde7-4d8f-9f04-a4c698cd5f52 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

Hello all,

we are about to configure a new storage system that uses the Nexenta OS with sparsely allocated ZVOLs. We wish to present 4TB of storage to a Linux system that has four NICs available to it. We are unsure whether to present one large ZVOL or four smaller ones to make the best use of the NICs available to us. We have set rr_min_io to 100, which we have found offers a good level of performance. This raises an interesting question, though: the multipath.conf man page says that the rr_min_io parameter is the number of IOs across the whole path group before a switch is made to the next path. What constitutes a single IO operation? When a user opens a file for read access, is that one IOP to open the file, X IOs to read the contents, and another to close it? Does each of those SCSI operations happen on the same path, i.e. on the same block device? If a second user comes along and requests data from the same block device, do those requests happen on the same path or on the next one in the path group? We imagine that they will all happen on the same path until rr_min_io is reached and it switches over to the next path.
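The behaviour imagined in the last sentence can be sketched as a toy model. This is only an illustration, not dm-multipath's code: it assumes every request counts as exactly one IO no matter which user issued it (which is precisely the open question), and the function name and path device names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_paths(num_ios, paths, rr_min_io):
    """Toy model: send rr_min_io consecutive IOs down one path, then rotate."""
    if rr_min_io < 1:
        raise ValueError("rr_min_io must be at least 1")
    rotation = cycle(paths)
    current = next(rotation)
    sent_on_current = 0
    assignment = []
    for _ in range(num_ios):
        # Switch to the next path once the current one has had its share.
        if sent_on_current == rr_min_io:
            current = next(rotation)
            sent_on_current = 0
        assignment.append(current)
        sent_on_current += 1
    return assignment


# Hypothetical path devices for one multipath map with four paths.
paths = ["sda", "sdb", "sdc", "sdd"]
schedule = assign_paths(450, paths, rr_min_io=100)
print(Counter(schedule))
```

Under this model a short burst of fewer than 100 IOs never leaves the first path, so only sustained or concurrent load spreads across all four NICs.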
We are trying to squeeze out the maximum performance from our system and we are unable to max out our 4 x 1GbE interfaces. Any thoughts on how we can improve our performance?

--
Thanks, Phil

= --=_a5debe90-bde7-4d8f-9f04-a4c698cd5f52-- --===============3978009683246225922== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline --===============3978009683246225922==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: jp@pour.midcoast.com Subject: performance question Date: Mon, 31 Mar 2003 16:37:39 -0500 (EST) Sender: nfs-admin@lists.sourceforge.net Message-ID: <20030331213739.25218.qmail@pour.midcoast.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from pour.midcoast.com ([206.26.226.21] ident=qmailr) by sc8-sf-list1.sourceforge.net with smtp (Exim 3.31-VA-mm2 #1 (Debian)) id 1906u1-0004tB-00 for ; Mon, 31 Mar 2003 13:33:13 -0800 To: nfs@lists.sourceforge.net Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive: I have looked through the last couple months of mailing lists archives and reviewed the material at nfs.sourceforge.net and the list to netapp's nfs suggestions. I am trying to get real good performance out of NFS. So far the best I've got is about 1/10 of the local speed with dedicated 100mbps ethernet between fairly speedy computers. Here's the setup. Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6 kernel from kernel.org, boots to an ata100 drive, promise rm8000 external hardware raid5 array on adaptec Adaptec AHA-2940U/UW/D controller, 3com 3c905C forced to 100-FD with "/sbin/mii-tool -F 100baseTx-FD eth1". 
coffeepot:~ # mount |grep sda
/dev/sda1 on /shared/home type ext2 (rw,noatime)
/dev/sda2 on /shared/backup type ext2 (rw,noatime)
/dev/sda3 on /shared/logs type ext2 (rw,noatime)

coffeepot:~ # cat /etc/exports
/shared/home 10.0.34.0/24(rw,no_root_squash,async)
/shared/backup/ 10.0.34.0/24(ro,root_squash,async)
/shared/logs 10.0.34.0/24(rw,root_squash,async)

coffeepot:~ # bonnie++ -d /shared/home/jp -s 1600 -r 512 -u jp
Version 1.01d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
coffeepot     1600M 40206  42 39934  13  9989   3 19782  22 21765   5 317.2   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2675  99 +++++ +++ +++++ +++  2759  99 +++++ +++  4360 100

Performance is fairly kickin' here locally.

Connected through a HP4000M switch set for full duplex 100baseT on the same switch linecard for both ports is the client. http://midcoast.com/~jp/10.0.15.2_15-day.png is the network traffic between the two computers showing two bonnie++ tests on the right of the graph. There is no packet loss between the computers when tested with flood pings or regular pings.

Client info.(froth) - Athlon XP2200, Suse 8.1, 2.4.21-pre6 kernel from kernel.org, boots to a ata-100 drive. 3com 3c905C forced to 100-FD with "/sbin/mii-tool -F 100baseTx-FD eth1".
froth:~ # cat /etc/mtab
10.0.34.1:/shared/backup /shared/backup nfs rw,tcp,hard,intr,rsize=1024,wsize=1024,addr=10.0.34.1 0 0
10.0.34.1:/shared/logs /shared/logs nfs rw,tcp,hard,intr,rsize=1024,wsize=1024,addr=10.0.34.1 0 0
10.0.34.1:/shared/home /shared/home nfs rw,udp,hard,intr,rsize=1400,wsize=1400,addr=10.0.34.1 0 0

same bonnie++ command:
Version 1.01d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
froth         1600M  2724   5  2764   4  1395   3  2778   5  2848   3  33.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1175   3  5061  12  2840  10  1208   5  5723  11  1684   4
froth,1600M,2724,5,2764,4,1395,3,2778,5,2848,3,33.5,0,16,1175,3,5061,12,2840,10,1208,5,5723,11,1684,4

I get about 2700 K/sec and seeks go from 317 to 33/sec. The transfer speed matches the network traffic graph. I would like to do better than 2700ish.

What is possible for me to improve without moving to Gig-Ethernet?

I've tried both TCP and UDP NFS, with rsize & wsize of 1024, 1400, 4096, and 8192. The larger two have horrid performance due to packet fragmentation. Like magnitudes worse. 1024, 1400, UDP and TCP all have similar performance for me.

Also, is it possible to clear the counters in nfsstat?

MUCH TIA, Jason

--
/*
Jason Philbrook | Midcoast Internet Solutions - Internet Access,
    KB1IOJ      | Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/ | http://www.midcoast.com/
*/

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with 500 GB of bandwidth!
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Lever, Charles" Subject: RE: performance question Date: Mon, 31 Mar 2003 13:45:24 -0800 Sender: nfs-admin@lists.sourceforge.net Message-ID: <6440EA1A6AA1D5118C6900902745938E07D55480@black.eng.netapp.com> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Cc: Return-path: Received: from mx01.netapp.com ([198.95.226.53]) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 19075v-0001gn-00 for ; Mon, 31 Mar 2003 13:45:31 -0800 To: Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive:

hi jp-

> What is possible for me to improve without moving to Gig-Ethernet?
>
> I've tried both TCP and UDP NFS. rsize & wsize or
> 1024,1400,4096,8192. The larger
> two have horrid performance due to packet fragmentation. Like
> magnitudes worse.
> 1024, 1400, UDP and TCP all have similar performance for me.

this sounds like a network issue. you should use a network performance tool (like iPerf) to measure performance between your client and server, and try to rectify any problems you find there, before you work on NFS performance.

> Also, is it possible to clear the counters in nfsstat?

only via a client reboot.

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with 500 GB of bandwidth!
No other company gives more support or power for your dedicated server http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Trond Myklebust Subject: Re: performance question Date: 01 Apr 2003 07:40:27 +0200 Sender: nfs-admin@lists.sourceforge.net Message-ID: References: <20030331213739.25218.qmail@pour.midcoast.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: nfs@lists.sourceforge.net Return-path: Received: from pat.uio.no ([129.240.130.16] ident=7411) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190EVh-0007tE-00 for ; Mon, 31 Mar 2003 21:40:37 -0800 To: jp@pour.midcoast.com In-Reply-To: <20030331213739.25218.qmail@pour.midcoast.com> Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive: >>>>> " " == jp writes: > Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6 > kernel from kernel.org, boots to an ata100 drive, promise > rm8000 external hardware raid5 array on adaptec Adaptec > AHA-2940U/UW/D controller, 3com 3c905C forced to 100-FD with > "/sbin/mii-tool -F 100baseTx-FD eth1". Why do you have to force it to 100-FD? Cheers, Trond ------------------------------------------------------- This SF.net email is sponsored by: ValueWeb: Dedicated Hosting for just $79/mo with 500 GB of bandwidth! 
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: jp@pour.midcoast.com Subject: Re: performance question Date: Tue, 1 Apr 2003 10:39:50 -0500 (EST) Sender: nfs-admin@lists.sourceforge.net Message-ID: <20030401153950.27556.qmail@pour.midcoast.com> References: <1049188686.19334.20.camel@deskpro02> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: nfs@lists.sourceforge.net Return-path: Received: from pour.midcoast.com ([206.26.226.21] ident=qmailr) by sc8-sf-list1.sourceforge.net with smtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190NnJ-0000wf-00 for ; Tue, 01 Apr 2003 07:35:25 -0800 To: ukh@id.cbs.dk (=?ISO-8859-1?Q?K=E5re?= Hviid) In-Reply-To: <1049188686.19334.20.camel@deskpro02> from "=?ISO-8859-1?Q?K=E5re?= Hviid" at Apr 01, 2003 11:18:05 AM Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive:

Thanks to the several people for responses!

> > Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6 kernel from
> > kernel.org, boots to an ata100 drive, promise rm8000 external hardware
> > raid5 array on adaptec Adaptec AHA-2940U/UW/D controller, 3com 3c905C
> > forced to 100-FD with "/sbin/mii-tool -F 100baseTx-FD eth1".
>
> Fast question: Are you sure the _switch_ is setup to do
> 100FD as well? In my experience, forcing FD on newer
> cards and switches is something that must be done
> carefully. Also, what about link flow control? I'm not
> sure the 3c905c can be forced to do flow control by
> simple means if your switch happens to support it. Try
> the same using N-Way auto-negotiation and check what the
> 3c905c thinks about it.
Flow control on the switch is disabled - the default, I checked. It's also set for 100-FD, like my ethernet cards. I always hard-set ethernet settings because I don't trust autonegotiation under all circumstances.

I installed iperf on both machines and there is not a problem sending large amounts of data between machines.

coffeepot:~ # /usr/local/bin/iperf -s -u

froth:/tmp/iperf-1.7.0 # /usr/local/bin/iperf -c 10.0.34.1 -b 100m
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to 10.0.34.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  5] local 10.0.34.2 port 32876 connected with 10.0.34.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   114 MBytes  95.6 Mbits/sec
[  5] Server Report:
[  5]  0.0-10.0 sec   114 MBytes  95.6 Mbits/sec  0.246 ms    0/81337 (0%)
[  5] Sent 81337 datagrams

froth:/proc # /usr/local/bin/iperf -c 10.0.34.1 -b 90m
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to 10.0.34.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  5] local 10.0.34.2 port 32876 connected with 10.0.34.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   108 MBytes  90.5 Mbits/sec
[  5] Server Report:
[  5]  0.0-10.0 sec   108 MBytes  90.5 Mbits/sec  0.000 ms    0/76925 (0%)
[  5] Sent 76925 datagrams

>
> Cheers,
> --
> Kåre Hviid Sys Admin ukh@id.cbs.dk +45 3815 3075
> Institut for Datalingvistik, Handelshøjskolen i København
>

--
/*
Jason Philbrook | Midcoast Internet Solutions - Internet Access,
    KB1IOJ      | Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/ | http://www.midcoast.com/
*/

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with
500 GB of bandwidth!
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Philippe =?ISO-8859-15?Q?Gramoull=E9?= Subject: Re: performance question Date: Tue, 1 Apr 2003 18:06:33 +0200 Sender: nfs-admin@lists.sourceforge.net Message-ID: <20030401180633.05170a7d.philippe.gramoulle@mmania.com> References: <1049188686.19334.20.camel@deskpro02> <20030401153950.27556.qmail@pour.midcoast.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Return-path: Received: from smtp-103.nerim.net ([62.4.16.103] helo=kraid.nerim.net) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190OHe-0001SX-00 for ; Tue, 01 Apr 2003 08:06:46 -0800 Received: from philou.gramoulle.local (pgramoul.net2.nerim.net [80.65.227.234]) by kraid.nerim.net (Postfix) with SMTP id AA02A40F2D for ; Tue, 1 Apr 2003 18:06:42 +0200 (CEST) To: nfs@lists.sourceforge.net In-Reply-To: <20030401153950.27556.qmail@pour.midcoast.com> Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive:

Hi,

Unless you're using an old exotic Cisco switch, I don't think you should do this, IMHO.

We've had the worst problems doing that, and since we switched to autoneg (with Intel EEpro100 cards) we have never had a single problem.

Thanks,

Philippe

--
Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France

On Tue, 1 Apr 2003 10:39:50 -0500 (EST) jp@pour.midcoast.com wrote:

| I always hard-set ethernet
| settings because I don't trust autonegotiation under all circumstances.
From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Matt Heaton" Subject: Re: performance question Date: Tue, 1 Apr 2003 09:22:13 -0700 Sender: nfs-admin@lists.sourceforge.net Message-ID: <35ff01c2f86a$da9c1880$6601a8c0@userl3x55qxqed> References: <1049188686.19334.20.camel@deskpro02><20030401153950.27556.qmail@pour.midcoast.com> <20030401180633.05170a7d.philippe.gramoulle@mmania.com> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-15" Return-path: Received: from sccrmhc01.attbi.com ([204.127.202.61]) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190OZJ-00068x-00 for ; Tue, 01 Apr 2003 08:25:02 -0800 To: =?ISO-8859-15?Q?Philippe_Gramoull=E9?= , Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive:

I just have to respond to this. I must respectfully disagree. Autonegotiation is tolerable at best. With certain equipment it works flawlessly, but MANY brands autonegotiate correct speeds and duplex, yet still exhibit 2-3% packet loss or intermittent latency (high ping times etc). A perfect example is my Cisco 2940 Catalyst switch and my Alteon/Nortel 180e (layer 2-7 switch). Both switches are high quality and work well, but if you link them up with autonegotiation you will have problems. It will detect proper speeds and duplex, but has speed problems and packet loss. When I contacted BOTH Cisco and Nortel support, they both said autonegotiation is bad news and should be used only to get things up and going. Cisco said there would be no problem if all the products were Cisco, and Nortel said the same thing about Nortel. Just my 2 cents worth, but I have seen this problem on more than 5 devices on my own network alone.

L8r...

Matt

----- Original Message -----
From: "Philippe Gramoullé"
To:
Sent: Tuesday, April 01, 2003 9:06 AM
Subject: Re: [NFS] performance question

Hi,

Unless you're using an old exotic Cisco switch, i don't think you should do this, IMHO.

We've had the worst problems doing that and since we use autoneg ( with intel EEpro100 card) we never had a single problem ever since.

Thanks,

Philippe

--
Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France

On Tue, 1 Apr 2003 10:39:50 -0500 (EST) jp@pour.midcoast.com wrote:

| I always hard-set ethernet
| settings because I don't trust autonegotiation under all circumstances.

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with 500 GB of bandwidth!
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Philippe =?ISO-8859-15?Q?Gramoull=E9?= Subject: Re: performance question Date: Tue, 1 Apr 2003 19:08:01 +0200 Sender: nfs-admin@lists.sourceforge.net Message-ID: <20030401190801.53a516d0.philippe.gramoulle@mmania.com> References: <1049188686.19334.20.camel@deskpro02> <20030401153950.27556.qmail@pour.midcoast.com> <20030401180633.05170a7d.philippe.gramoulle@mmania.com> <35ff01c2f86a$da9c1880$6601a8c0@userl3x55qxqed> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Cc: Return-path: Received: from smtp-101.noc.nerim.net ([62.4.17.101] helo=mallaury.noc.nerim.net) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190PF8-0006UG-00 for ; Tue, 01 Apr 2003 09:08:14 -0800 To: "Matt Heaton" In-Reply-To: <35ff01c2f86a$da9c1880$6601a8c0@userl3x55qxqed> Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing. List-Unsubscribe: , List-Archive:

Hi,

OK, I should have been more precise :)

My recommendations were only for NIC <-> switch links. In the case of switch <-> switch links, you could indeed force the settings without problems.

I was referring to some Linux NFS servers here that had big trouble talking to a switch (sorry, I don't remember the brand) on which the settings were forced.

Thanks,

Philippe

--
Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France

On Tue, 1 Apr 2003 09:22:13 -0700 "Matt Heaton" wrote:

| I just have to respond to this. I must respectfully disagree.
| Autonegotiation is tolerable at best.
| With certain equipment it works flawlessly, but MANY brands autonegotiate
| correct speeds and duplex, but still exhibit 2-3% packet loss or intermittent
| latency (high ping times etc). A perfect example is my cisco 2940 catalyst
| switch and my alteon/nortel 180e (layer 2-7 switch). Both switches are high
| quality and work well, but if you link them up with autonegotiation you will
| have problems. It will detect proper speeds and duplex, but has speed
| problems and packet loss. When contacting BOTH cisco and nortel support
| they both said autonegotiation is bad news and should be used only to get
| things up and going. Cisco said if all the products were cisco then no
| problem, just as nortel said the same thing. Just my 2 cents worth, but I
| have seen this problem on more than 5 devices on my own network alone.
|
| L8r...
|
| Matt
|
| ----- Original Message -----
| From: "Philippe Gramoullé"
| To:
| Sent: Tuesday, April 01, 2003 9:06 AM
| Subject: Re: [NFS] performance question
|
| Hi,
|
| Unless you're using an old exotic Cisco switch, i don't think you should do
| this, IMHO.
|
| We've had the worst problems doing that and since we use autoneg ( with
| intel EEpro100 card)
| we never had a single problem ever since.
|
| Thanks,
|
| Philippe
|
| --
|
| Philippe Gramoullé
| philippe.gramoulle@mmania.com
| Lycos Europe - NOC France
|
| On Tue, 1 Apr 2003 10:39:50 -0500 (EST)
| jp@pour.midcoast.com wrote:
|
| | I always hard-set ethernet
| | settings because I don't trust autonegotiation under all circumstances.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Bogdan Costescu Subject: Re: performance question Date: Tue, 1 Apr 2003 20:45:33 +0200 (CEST) Sender: nfs-admin@lists.sourceforge.net Message-ID: References: <20030401153950.27556.qmail@pour.midcoast.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: =?ISO-8859-1?Q?K=E5re?= Hviid , Return-path: Received: from mail.iwr.uni-heidelberg.de ([129.206.104.30]) by sc8-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 190Qlh-0004lH-00 for ; Tue, 01 Apr 2003 10:45:57 -0800 To: jp@pour.midcoast.com In-Reply-To: <20030401153950.27556.qmail@pour.midcoast.com> Errors-To: nfs-admin@lists.sourceforge.net List-Help: List-Post: List-Subscribe: , List-Id: Discussion of NFS under Linux development, interoperability, and testing.
List-Unsubscribe: , List-Archive:

On Tue, 1 Apr 2003 jp@pour.midcoast.com wrote:

> Flow control on the switch is disabled - the default, I checked. It's also
> set for 100-FD, like my ethernet cards. I always hard-set ethernet
> settings because I don't trust autonegotiation under all circumstances.

People that don't want to be helped should not ask for help any more! I've already warned about forcing speed; check the net driver mailing lists and scyld.com to see why, and also for some words from Donald Becker about why the forcing of speed and duplex ever came into discussion.

> WARNING: option -b implies udp testing

Oh yes, you want to test network quality with UDP... Have you ever thought that NFS needs communication both ways? If you think that your network with forced full-duplex is perfect, try two UDP streams in opposite directions - you should not lose a single packet and should still achieve high bandwidth; and if you want to stress it even more, try UDP packets that do not fit in an Ethernet frame.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with 500 GB of bandwidth!
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Font Bella" Subject: Performance question Date: Thu, 14 Feb 2008 16:40:53 +0100 Message-ID: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 To: linux-nfs@vger.kernel.org Return-path: Received: from fk-out-0910.google.com ([209.85.128.191]:65158 "EHLO fk-out-0910.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758924AbYBNPk6 (ORCPT ); Thu, 14 Feb 2008 10:40:58 -0500 Received: by fk-out-0910.google.com with SMTP id z23so386855fkz.5 for ; Thu, 14 Feb 2008 07:40:54 -0800 (PST) Sender: linux-nfs-owner@vger.kernel.org List-ID:

Hi,

some of our apps are experiencing slow nfs performance in our new cluster, in comparison with the old one. The nfs setups for both clusters are very similar, and we are wondering what's going on. The details of both setups are given below for reference.

The problem seems to occur with apps that do heavy i/o, creating, writing, reading, and deleting many files. However, writing or reading a large file (as measured with `time dd if=/dev/zero of=2gbfile bs=1024 count=2000`) is not slow.

We have performed some tests with the disk benchmark 'dbench', which reports i/o performance dropping from 60 Mb/sec in the old cluster to about 6 Mb/sec in the new one.

After noticing this problem, we tried the user-mode nfs server instead of the kernel-mode server, and just installing the user-mode server helped improve throughput to 12 Mb/sec, but that is still far from the good old 60 Mb/sec.
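A side note on the `dd` test quoted above: `dd` copies `count` blocks of `bs` bytes, so `bs=1024 count=2000` writes about 2 MB, not 2 GB. The sketch below only does that arithmetic; the 2 GiB target is an assumption read off the output file name `2gbfile`, and the helper name is made up for illustration.

```python
def dd_bytes(bs, count):
    """Total bytes dd copies: count blocks of bs bytes each."""
    return bs * count


written = dd_bytes(bs=1024, count=2000)
print(written)             # 2048000 bytes
print(written / 1024**2)   # 1.953125 MiB

# Assumption: the file name "2gbfile" means the intended size was 2 GiB.
target = 2 * 1024**3
count_needed = target // 1024
print(count_needed)        # 2097152 blocks of 1024 bytes
```

A file that small most likely fits in the page cache on either side, so the large-file result may say little about sustained NFS throughput.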
After going through the "Optimizing NFS performance" section of the NFS-Howto and tweaking the rsize,wsize parameters (the optimum seems to be 2048, which seems kind of weird to me, especially compared to the 8192 used in the old cluster), throughput increased to 21 Mb/sec, but that is still too far from the old 60 Mb/sec.

We are stuck at this point. Any help/comment/suggestion will be greatly appreciated.

/P

**************************** OLD CLUSTER *****************************

SATA disks.

Filesystem: ext3.

* the version of nfs-utils you are using: I don't know. It's the most recent version in debian sarge (oldstable).

user-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.12
* the distribution of linux you are using: Debian sarge x386 on Intel Xeon processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts: Typical beowulf setup, with all servers connected to a switch, 1Gb network.

/etc/exports: /srv/homes 192.168.1.0/255.255.255.0 (rw,no_root_squash)

/etc/fstab: server:/srv/homes/user /mnt/user nfs rw,hard,intr,rsize=8192,wsize=8192 0 0

**************************** NEW CLUSTER *****************************

SAS 10k disks.

Filesystem: ext3 over LVM.

* the version of nfs-utils you are using: I don't know. It's the most recent version in debian etch (stable).

kernel-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.18-5-amd64
* the distribution of linux you are using: Debian etch AMD64 on Intel Xeon processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts: Typical beowulf setup, with all servers connected to a switch, 1Gb network.
/etc/exports:

/srv/homes 192.168.1.0/255.255.255.0 (no_root_squash)

mount options:

rsize=8192,wsize=8192

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Marcelo Leal"
Subject: Re: Performance question
Date: Thu, 14 Feb 2008 14:27:50 -0200
Message-ID: <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: linux-nfs@vger.kernel.org
To: "Font Bella"
Return-path:
Received: from wf-out-1314.google.com ([209.85.200.171]:57570 "EHLO wf-out-1314.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758317AbYBNQeg (ORCPT ); Thu, 14 Feb 2008 11:34:36 -0500
Received: by wf-out-1314.google.com with SMTP id 28so54414wff.4 for ; Thu, 14 Feb 2008 08:34:36 -0800 (PST)
In-Reply-To: <90d010000802140740y3ff2706ybc169728fbafbfb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

Hello all,
There is a big difference between accessing the raw discs and going through
LVM, some kind of RAID, etc. I think you should use NFS v3, and it's hard to
believe it would be using v2 unless you explicitly configured it to.
A big difference between v2 and v3 is that v2 is always "async", which is a
performance boost. Are you sure that the new environment is not v3?
In the new stable version (nfs-utils), debian is sync by default. I'm
used to "8192" transfer sizes, which gave the best performance in my
tests.
Would be nice if you could test another network service writing to
that server, like ftp or iscsi.
Another question: are the discs "local" or SAN? Is there no concurrency?

ps.: v2 has a 2GB file size limit AFAIK.

Leal.

2008/2/14, Font Bella :
> [...]

--
pOSix rules

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chuck Lever
Subject: Re: Performance question
Date: Thu, 14 Feb 2008 11:56:36 -0500
Message-ID: <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com>
Mime-Version: 1.0 (Apple Message framework v753)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Cc: NFS list , Marcelo Leal
To: Font Bella
Return-path:
Received: from rgminet01.oracle.com ([148.87.113.118]:39367 "EHLO rgminet01.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751919AbYBNQ5c (ORCPT ); Thu, 14 Feb 2008 11:57:32 -0500
In-Reply-To: <42996ba90802140827p533779c6o8ab404400be51fdc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Feb 14, 2008, at 11:27 AM, Marcelo Leal wrote:
> Hello all,
> There is a big difference between accessing the raw discs and going through
> LVM, some kind of RAID, etc. I think you should use NFS v3, and it's hard to
> believe it would be using v2 unless you explicitly configured it to.
> A big difference between v2 and v3 is that v2 is always "async", which is a
> performance boost. Are you sure that the new environment is not v3?
> In the new stable version (nfs-utils), debian is sync by default. I'm
> used to "8192" transfer sizes, which gave the best performance in my
> tests.

As Marcelo suggested, this could be nothing more than the change in
default export options (see exports(5) -- the description of the
sync/async option) between sarge and etch. This was a change in the
nfs-utils package done a while back to improve data integrity guarantees
during server instability.

You can test this easily by explicitly specifying sync or async in
your /etc/exports and trying your test.

It especially affects NFSv2, as all NFSv2 writes are FILE_SYNC (i.e.
they must be committed to permanent storage before the server
replies) -- the async export option breaks that guarantee to improve
performance. There is some further description in the NFS FAQ at
http://nfs.sourceforge.net/ .

The preferred way to get "async" write performance is to use NFSv3.

> [...]

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Font Bella"
Subject: Re: Performance question
Date: Fri, 15 Feb 2008 16:37:07 +0100
Message-ID: <90d010000802150737x2ad0739dmeaaa24dc2845e81a@mail.gmail.com>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com> <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain;
 charset=ISO-8859-1
Cc: "NFS list" , "Marcelo Leal"
To: "Chuck Lever"
Return-path:
Received: from fk-out-0910.google.com ([209.85.128.191]:1114 "EHLO fk-out-0910.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751645AbYBOPhM (ORCPT ); Fri, 15 Feb 2008 10:37:12 -0500
Received: by fk-out-0910.google.com with SMTP id z23so772791fkz.5 for ; Fri, 15 Feb 2008 07:37:08 -0800 (PST)
In-Reply-To: <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

Dear all,

I finally got it to work, after much pain/testing. Here are my config
notes (just for the record).
Thanks Marcelo and Chuck!

NFS setup
=========

Documentation
-------------

* http://billharlan.com/pub/papers/NFS_for_clusters.html
* http://nfs.sourceforge.net/nfs-howto/ar01s05.html#nfsd_daemon_instances

Setting
-------

We use package nfs-kernel-server, i.e. we use the kernel-space nfs server,
which is faster than nfs-user-server.

We use NFS version 3.

Configuration
-------------

Make sure we are using nfs version 3. This seems to be the default with
package nfs-kernel-server. Check from client side with::

  cat /proc/mounts

Use UDP for packet transmission, i.e. use option 'proto=udp' in your
/etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
command. Check from client side also with 'cat /proc/mounts'.

Make sure you have enough nfsd server threads. See if your server is receiving
too many overlapping requests with

  $ grep th /proc/net/rpc/nfsd

Ours isn't, so we increase the number of threads used by the server to
32 by changing
RPCNFSDCOUNT=32 in /etc/default/nfs-kernel-server (Debian configuration file
for startup scripts). Remember to restart nfs-kernel-server for changes to
take effect.

On the server side, use the 'async' option in /etc/exports. This was a crucial
step to get good performance.
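To read the output of the 'grep th /proc/net/rpc/nfsd' check in these notes: on kernels of this era the line holds the thread count, a counter commonly described as the number of times all threads were in use, and ten histogram values giving the seconds spent at each decile of thread usage (the NFS-Howto suggests adding threads when the top three deciles are large; later kernels may simply report zeros here). A small sketch that summarizes such a line. The sample values and the summarize_th name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample of the "th" line; on a real server one would use:
#   grep '^th' /proc/net/rpc/nfsd
sample='th 32 455 3733.140 83.850 96.660 52.370 28.640 16.780 10.590 4.320 2.750 33.200'

summarize_th() {
    # Field 2: number of nfsd threads.
    # Field 3: times all threads were busy (as commonly described).
    # Last three fields: seconds spent in the top three usage deciles.
    printf '%s\n' "$1" | awk '{
        printf "threads=%s all_busy=%s top_deciles=%s,%s,%s\n",
               $2, $3, $(NF-2), $(NF-1), $NF
    }'
}

summarize_th "$sample"
```

Large values in the last three fields, relative to the earlier ones, would be the hint that the server is short of threads.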
Finally, try different values of rsize and wsize in your
/etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
command. Check from client side also with 'cat /proc/mounts'.
Test your favourite benchmark with different rsize,wsize and look for an
optimal value.

ALL the steps above were necessary for me to get good performance, but
the last step was crucial, since I got very different performance depending
on the value of rsize/wsize.

On Thu, Feb 14, 2008 at 5:56 PM, Chuck Lever wrote:
> [...]

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Trond Myklebust
Subject: Re: Performance question
Date: Fri, 15 Feb 2008 11:13:50 -0500
Message-ID: <1203092030.11333.4.camel@heimdal.trondhjem.org>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com> <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com> <90d010000802150737x2ad0739dmeaaa24dc2845e81a@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Chuck Lever , NFS list , Marcelo Leal
To: Font Bella
Return-path:
Received: from pat.uio.no ([129.240.10.15]:56876 "EHLO pat.uio.no" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752604AbYBOQNz (ORCPT ); Fri, 15 Feb 2008 11:13:55 -0500
In-Reply-To: <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Fri, 2008-02-15 at 16:37 +0100, Font Bella wrote:
> Finally, try different values of rsize and wsize in your
> /etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
> command. Check from client side also with 'cat /proc/mounts'.
> Test your favourite benchmark with different rsize,wsize and look for an
> optimal value.
>
> ALL the steps above were necessary for me to get good performance, but
> the last step was
> crucial, since I got very different performances depending on the
> value of rsize/wsize.

That very likely implies that you have problems with UDP packet loss.
Switch to TCP.

Trond

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chuck Lever
Subject: Re: Performance question
Date: Fri, 15 Feb 2008 11:18:06 -0500
Message-ID: <12C4649C-537F-4850-AE77-78A130B54B37@oracle.com>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com> <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com> <90d010000802150737x2ad0739dmeaaa24dc2845e81a@mail.gmail.com>
Mime-Version: 1.0 (Apple Message framework v753)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Cc: "NFS list" , "Marcelo Leal"
To: "Font Bella"
Return-path:
Received: from agminet01.oracle.com ([141.146.126.228]:10349 "EHLO agminet01.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752734AbYBOQSw (ORCPT ); Fri, 15 Feb 2008 11:18:52 -0500
In-Reply-To: <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Feb 15, 2008, at 10:37 AM, Font Bella wrote:
> Dear all,
>
> I finally got it to work, after much pain/testing. Here are my config
> notes (just for the record).
> Thanks Marcelo and Chuck!
>
> [...]

I'm glad you were able to make progress.

32 server threads is actually fairly conservative; you might consider
128 or more if you have more than a few clients.

I want to make sure you understand the limitations and risks of using
UDP and the "async" export option, however.

1. "async" is no longer the default because it introduces a silent
data corruption risk. With NFSv3, data write operations are already
asynchronous, with a subsequent COMMIT, so that they are safe. The
client now knows when data has hit stable storage and can thus delete
its cached copy safely. I urge you to read the NFS FAQ discussion on
the "async" export option and reconsider its use in production.

2. UDP is no longer the default because it also introduces a silent
data corruption risk, since the IP ID field (which UDP depends on for
reassembling datagrams larger than a single link-layer frame) is only
16 bits wide. If this field should wrap, datagram reassembly is
compromised. The UDP datagram checksum is weak enough that the
receiving end probably won't detect the reassembly errors.

In addition, UDP will likely perform poorly in situations involving
more than a few clients. Its congestion control algorithm is unable
to handle large amounts of concurrent network traffic since it doesn't
have a packet ACK mechanism like TCP does.

The fact that your performance was best at such a small r/wsize (you
mentioned 2048 in your earlier e-mail) suggests you have a network
environment that would benefit enormously from using TCP.

So, our recommendation these days is to use the default "sync" export
setting, and use NFSv3 over TCP if at all possible. (The HOWTO may be
out of date in this regard). If you are not able to achieve good
performance results with these settings, you can e-mail the list again
and we can do further analysis.

> On Thu, Feb 14, 2008 at 5:56 PM, Chuck Lever wrote:
>> [...]

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Font Bella"
Subject: Re: Performance question
Date: Mon, 18 Feb 2008 10:39:46 +0100
Message-ID: <90d010000802180139x49ac1f49x976f11cec0e01fdf@mail.gmail.com>
References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com> <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com> <90d010000802150737x2ad0739dmeaaa24dc2845e81a@mail.gmail.com> <1203092030.11333.4.camel@heimdal.trondhjem.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: "Chuck Lever" , "NFS list" , "Marcelo Leal"
To: "Trond Myklebust"
Return-path:
Received: from fg-out-1718.google.com ([72.14.220.158]:9856 "EHLO fg-out-1718.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751368AbYBRJjs (ORCPT ); Mon, 18 Feb 2008 04:39:48 -0500
Received: by fg-out-1718.google.com with SMTP id e21so1420978fga.17 for ; Mon, 18 Feb 2008 01:39:47 -0800 (PST)
In-Reply-To: <1203092030.11333.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

I tried TCP and async options, but I get poor performance in my
benchmarks (a dbench run with 10 clients). Below I tabulated the
outcome of my tests, which show that in my setting there is a huge
difference between sync and async, and udp/tcp. Any
comments/suggestions are warmly welcome.

I also tried setting 128 server threads as Chuck suggested, but this
doesn't seem to affect performance. This makes sense, since we only
have a dozen clients.

About sync/async, I am not very concerned about corrupt data if the
cluster goes down, we do mostly computing, no crucial database
transactions or anything like that. Our users wouldn't mind some
degree of data corruption in case of power failure, but speed is
crucial.

Our network setting is just a dozen servers connected to a switch.
Everything (adapters/cables/switch) is 1gigabit. We use ethernet
bonding to double networking speed.

Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP
already gives me very poor performance. Admittedly, my test is very
simple, and I should probably try something more complete, like
IOzone. But the dbench run seems to reproduce the bottleneck we've
been observing in our cluster.
Thanks, /P

********************** ASYNC option in server ******************************

rsize,wsize   TCP (MB/s)   UDP (MB/s)
1024          24           34
2048          35           49
4096          37           75
8192          40.4         35
16386         40.2         19

********************** SYNC option in server ******************************

rsize,wsize   TCP (MB/s)   UDP (MB/s)
1024          6            ??
2048          7.44         ??
4096          7.33         ??
8192          7            ??
16386         7            ??

On Feb 15, 2008 5:13 PM, Trond Myklebust wrote: > > That very likely implies that you have problems with UDP packet loss. > Switch to TCP. > > Trond > > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Chuck Lever Subject: Re: Performance question Date: Mon, 18 Feb 2008 11:59:14 -0500 Message-ID: <06EE0C0B-F8AB-4ACA-9314-DF53F2B37E0D@oracle.com> References: <90d010000802140740y3ff2706ybc169728fbafbfb4@mail.gmail.com> <42996ba90802140827p533779c6o8ab404400be51fdc@mail.gmail.com> <80E378BD-86F7-4009-832A-2978A6FB4600@oracle.com> <90d010000802150737x2ad0739dmeaaa24dc2845e81a@mail.gmail.com> <1203092030.11333.4.camel@heimdal.trondhjem.org> <90d010000802180139x49ac1f49x976f11cec0e01fdf@mail.gmail.com> Mime-Version: 1.0 (Apple Message framework v753) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Cc: "Trond Myklebust" , "NFS list" , "Marcelo Leal" To: "Font Bella" Return-path: Received: from rgminet01.oracle.com ([148.87.113.118]:25527 "EHLO rgminet01.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753753AbYBRQ76 (ORCPT ); Mon, 18 Feb 2008 11:59:58 -0500 In-Reply-To: <90d010000802180139x49ac1f49x976f11cec0e01fdf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> Sender: linux-nfs-owner@vger.kernel.org List-ID: On Feb 18, 2008, at 4:39 AM, Font Bella wrote: > I tried TCP and async options, but I get poor performance in my > benchmarks (a dbench run with 10 clients). Below I tabulated the > outcome of my tests, which show that in my setting there is a huge > difference between sync and async, and udp/tcp. Any > comments/suggestions are warmly welcome.
> > I also tried setting 128 server threads as Chuck suggested, but this > doesn't seem to affect performance. This makes sense, since we only > have a dozen of clients. Each Linux client mount point can generate up to 16 server requests by default. A dozen clients each with a single mount point can generate 192 concurrent requests. So 128 server threads is not as outlandish as you might think. In this case, you are likely hitting some other bottleneck before the clients can utilize all the server threads. > About sync/async, I am not very concerned about corrupt data if the > cluster goes down, we do mostly computing, no crucial database > transactions or anything like that. Our users wouldn't mind some > degree of data corruption in case of power failure, but speed is > crucial. The data corruption is silent. If it weren't, you could simply restore from a backup as soon as you recover from a server crash. Silent corruption spreads into your backed up data, and starts causing strange application errors, sometimes a long time after the corruption first occurred. > Our network setting is just a dozen of servers connected to a switch. > Everything (adapters/cables/switch) is 1gigabit. We use ethernet > bonding to double networking speed. > > Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP > already gives me very poor performance. Admittedly, my test is very > simple, and I should probably try something more complete, like > IOzone. But the dbench run seems to reproduce the bottleneck we've > been observing in our cluster. I assume the dbench test is read and write only (little or no metadata activity like file creation and deletion). How closely does dbench reflect your production workload? I see from your initial e-mail that your server file system is: > SAS 10k disks. > > Filesystem: ext3 over LVM. Have you tried testing over NFS with a file system that resides on a single physical disk? 
If you have done a read-only test versus a write-only test, how do the numbers compare? Have you tested a range of write sizes, from small file writes v. writes to writing files larger than the server's memory? > ********************** ASYNC option in server > ****************************** > > rsize,wsize TCP UDP > > 1024 24 MB/s 34 MB/s > 2048 35 49 > 4096 37 75 > 8192 40.4 35 > 16386 40.2 19 As the size of the read and write requests increase, your UDP throughput decreases markedly. This does indicate some packet loss, so TCP is going to provide consistent performance and much lower risk to data integrity as your network and client workloads increase. You might try this test again and watch your clients' ethernet bandwidth and RPC retransmit rate to see what I mean. At the 16386 setting, the UDP test may be pumping significantly more packets onto the network, but is getting only about 20MB/s through. This will certainly have some effect on other traffic on the network. The first thing I check in these instances is that gigabit ethernet flow control is enabled in both directions on all interfaces (both host and switch). In addition, using larger r/wsize settings on your clients means the server can perform disk reads and writes more efficiently, which will help your server scale with increasing client workloads. By examining your current network carefully, you might be able to boost the performance of NFS over both UDP and TCP. With bonded gigabit, you should be able to push network throughput past 200 MB/s using a test like iPerf which doesn't touch disks. Thus, at least NFS reads from files already in the server's page cache ought to fly in this configuration. > ********************** SYNC option in server > ****************************** > > rsize,wsize TCP UDP > > 1024 6 MB/s ?? MB/s > 2048 7.44 ?? > 4096 7.33 ?? > 8192 7 ?? > 16386 7 ?? 
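To put the ASYNC figures next to the bonded-gigabit ceiling mentioned above, here is a small Python sketch. The throughput numbers are copied from the table in this thread; `raw_link_rate_mbps` and `peak` are hypothetical helper names, and the 250 MB/s it computes is only the raw line rate of two 1 Gbit/s links before any protocol overhead, so it is an upper bound rather than an expected result.

```python
# Benchmark figures copied from the ASYNC table in this thread (MB/s).
# The rsize/wsize keys are kept exactly as posted, including 16386.
ASYNC_MBPS = {
    "TCP": {1024: 24, 2048: 35, 4096: 37, 8192: 40.4, 16386: 40.2},
    "UDP": {1024: 34, 2048: 49, 4096: 75, 8192: 35, 16386: 19},
}


def raw_link_rate_mbps(links, gbit_per_link=1.0):
    """Raw line rate in MB/s (decimal units), ignoring all protocol overhead."""
    return links * gbit_per_link * 1000 / 8


def peak(results):
    """Return (rsize/wsize, MB/s) for the best-performing setting."""
    size = max(results, key=results.get)
    return size, results[size]


if __name__ == "__main__":
    ceiling = raw_link_rate_mbps(links=2)  # two bonded gigabit links (assumption)
    for transport, results in ASYNC_MBPS.items():
        size, mbps = peak(results)
        print(f"{transport}: peak {mbps} MB/s at rsize/wsize={size}, "
              f"{mbps / ceiling:.0%} of the raw {ceiling:.0f} MB/s")
```

Even the best UDP result stays well below the raw link rate, which fits the suggestion to look for a bottleneck elsewhere (packet loss, flow control, disks) rather than in the number of server threads.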
-- Chuck Lever chuck[dot]lever[at]oracle[dot]com From mboxrd@z Thu Jan 1 00:00:00 1970 From: Piergiorgio Sartor Subject: Performance question Date: Sat, 17 Jan 2009 18:18:06 +0100 Message-ID: <20090117171806.GA9432@lazy.lzy> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline Sender: linux-raid-owner@vger.kernel.org To: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi all, I'll have to setup some machines with two HDs (each) in order to get some redundancy. Reading the MD features I noticed there are several possibilities to create a mirror. I was wondering which one offer the best perfomances and/or what are the compromises to accept between the different solutions. One possibility is a classic RAID-1 mirror. Another is a RAID-10 far. There would also be the RAID-10 near, but I guess this is equivalent to RAID-1. Any suggestion on which method offers higher "speed"? Or there are other possibilities with 2 HDs (keeping the redundancy, of course)? Thanks a lot in advance, bye, -- piergiorgio From mboxrd@z Thu Jan 1 00:00:00 1970 From: "David Lethe" Subject: Re: Performance question Date: Sat, 17 Jan 2009 12:11:00 -0600 Message-ID: <2f8901c978ce$f6cb6bde$3d01a8c0@exchange.rackspace.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Return-path: Sender: linux-raid-owner@vger.kernel.org To: Piergiorgio Sartor , linux-raid@vger.kernel.org List-Id: linux-raid.ids All we know is that you use 2 disks and md. This is like posting to a TCP/IP architecture group and saying you have a network connection and want performance advice. Read up, supply full config info, run benchmarks, then ask specific questions. GI=GO. -----Original Message----- From: "Piergiorgio Sartor" Subj: Performance question Date: Sat Jan 17, 2009 11:18 am Size: 874 bytes To: "linux-raid@vger.kernel.org" Hi all, I'll have to setup some machines with two HDs (each) in order to get some redundancy. 
Reading the MD features I noticed there are several possibilities to create a mirror. I was wondering which one offer the best perfomances and/or what are the compromises to accept between the different solutions. One possibility is a classic RAID-1 mirror. Another is a RAID-10 far. There would also be the RAID-10 near, but I guess this is equivalent to RAID-1. Any suggestion on which method offers higher "speed"? Or there are other possibilities with 2 HDs (keeping the redundancy, of course)? Thanks a lot in advance, bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Piergiorgio Sartor Subject: Re: Performance question Date: Sat, 17 Jan 2009 19:20:40 +0100 Message-ID: <20090117182040.GA13355@lazy.lzy> References: <2f8901c978ce$f6cb6bde$3d01a8c0@exchange.rackspace.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <2f8901c978ce$f6cb6bde$3d01a8c0@exchange.rackspace.com> Sender: linux-raid-owner@vger.kernel.org To: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi, thanks for the answer. Well what I would like to have is exactly a configuration hint, eventually benchmarks and the like. The requirements are: two disks, redundacy. The question is: what configuration is reccommended in view of performances (or "what can be achieved"). Is that specific enough? Thanks again, bye, pg On Sat, Jan 17, 2009 at 12:11:00PM -0600, David Lethe wrote: > All we know is that you use 2 disks and md. This is like posting to a TCP/IP architecture group and saying you have a network connection and want performance advice. Read up, supply full config info, run benchmarks, then ask specific questions. GI=GO. 
> -----Original Message----- > > From: "Piergiorgio Sartor" > Subj: Performance question > Date: Sat Jan 17, 2009 11:18 am > Size: 874 bytes > To: "linux-raid@vger.kernel.org" > > Hi all, > > I'll have to setup some machines with two HDs (each) > in order to get some redundancy. > > Reading the MD features I noticed there are several > possibilities to create a mirror. > I was wondering which one offer the best perfomances > and/or what are the compromises to accept between > the different solutions. > > One possibility is a classic RAID-1 mirror. > Another is a RAID-10 far. > There would also be the RAID-10 near, but I guess > this is equivalent to RAID-1. > > Any suggestion on which method offers higher "speed"? > Or there are other possibilities with 2 HDs (keeping > the redundancy, of course)? > > Thanks a lot in advance, > > bye, > > -- > > piergiorgio > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- piergiorgio From mboxrd@z Thu Jan 1 00:00:00 1970 From: Bill Davidsen Subject: Re: Performance question Date: Sat, 17 Jan 2009 13:37:57 -0500 Message-ID: <49722585.4020400@tmr.com> References: <20090117171806.GA9432@lazy.lzy> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090117171806.GA9432@lazy.lzy> Sender: linux-raid-owner@vger.kernel.org To: Piergiorgio Sartor Cc: linux-raid@vger.kernel.org List-Id: linux-raid.ids Piergiorgio Sartor wrote: > Hi all, > > I'll have to setup some machines with two HDs (each) > in order to get some redundancy. 
> > Reading the MD features I noticed there are several > possibilities to create a mirror. > I was wondering which one offer the best perfomances > and/or what are the compromises to accept between > the different solutions. > > One possibility is a classic RAID-1 mirror. > Another is a RAID-10 far. > There would also be the RAID-10 near, but I guess > this is equivalent to RAID-1. > > Any suggestion on which method offers higher "speed"? > Or there are other possibilities with 2 HDs (keeping > the redundancy, of course)? > Mirrored array will offer slower write speed no matter how you do it, usually about the speed of a single drive. With raid10 far you should get about N times faster read than a single drive, where N is drives in the array. Clearly using three or more drives will help a LOT in typical performance. -- Bill Davidsen "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark From mboxrd@z Thu Jan 1 00:00:00 1970 From: Keld =?iso-8859-1?Q?J=F8rn?= Simonsen Subject: Re: Performance question Date: Sat, 17 Jan 2009 23:08:49 +0100 Message-ID: <20090117220849.GB29866@rap.rap.dk> References: <20090117171806.GA9432@lazy.lzy> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Return-path: Content-Disposition: inline In-Reply-To: <20090117171806.GA9432@lazy.lzy> Sender: linux-raid-owner@vger.kernel.org To: Piergiorgio Sartor Cc: linux-raid@vger.kernel.org List-Id: linux-raid.ids On Sat, Jan 17, 2009 at 06:18:06PM +0100, Piergiorgio Sartor wrote: > Hi all, > > I'll have to setup some machines with two HDs (each) > in order to get some redundancy. > > Reading the MD features I noticed there are several > possibilities to create a mirror. > I was wondering which one offer the best perfomances > and/or what are the compromises to accept between > the different solutions. > > One possibility is a classic RAID-1 mirror. > Another is a RAID-10 far. 
> There would also be the RAID-10 near, but I guess > this is equivalent to RAID-1. Yes, raid10,n2 is much the same as raid1 for 2 drives; that is, the disk layout is the same. There may be some differences due to the use of different drivers, though. It was reported at some point that one of the drivers handled certain errors better than the other; I am not sure which one was the better. Syncing, rebuilding etc. may also perform differently. > Any suggestion on which method offers higher "speed"? > Or there are other possibilities with 2 HDs (keeping > the redundancy, of course)? raid10,f2 offers something like double the speed for sequential reads, while probably being a little faster on random reads, and with a file system it is about equal in performance on writes. Degraded performance (in the case that one disk has failed) could be worse for raid10,f2, but in real life, with the fs elevator in operation, the penalty may be minimal. IMHO you could normally replace raid1, raid10,n2 and raid1+0 with raid10,f2, except for boot devices. Theoretically there is another possibility in raid5 with 2 drives, but I am not sure it even works out in practice, and there is IMHO no gain in it, except that you can expand the array with more disks. Furthermore there is raid10,o2, which is viable but does not perform as well as raid10,f2. For Linux raid performance have a look at http://linux-raid.osdl.org/index.php/Performance For setting up a system with 2 disks so that it survives the failure of one disk, see http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk I am the main author of both wiki pages, so I am interested in feedback.
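The sequential-read claim for raid10,f2 can be illustrated with a toy Python model of a "far 2" layout. This is only a sketch under a simplifying assumption taken from the md documentation: the first copy of the data is striped RAID0-style across the first part of every disk, and the second copy is striped across a later part, shifted by one disk. `far2_layout` and its region labels are made-up names, not md's actual code.

```python
def far2_layout(chunks, disks=2):
    """Simplified model of a raid10 'far 2' layout (illustration only).

    Returns {logical_chunk: [(disk, region, row), (disk, region, row)]},
    one tuple per copy of the chunk.
    """
    placement = {}
    for chunk in range(chunks):
        row = chunk // disks             # stripe row within a region
        first = chunk % disks            # RAID0-style striping for copy 1
        second = (first + 1) % disks     # copy 2 is shifted by one disk
        placement[chunk] = [
            (first, "first half", row),
            (second, "second half", row),
        ]
    return placement


if __name__ == "__main__":
    for chunk, copies in far2_layout(4).items():
        print(chunk, copies)
```

In this model a sequential read of chunks 0, 1, 2, 3 can be served from the first copies on disks 0, 1, 0, 1, so both spindles stream at once as in RAID0, while the two copies of every chunk still sit on different disks.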
Best regards Keld From mboxrd@z Thu Jan 1 00:00:00 1970 From: Piergiorgio Sartor Subject: Re: Performance question Date: Mon, 19 Jan 2009 19:12:53 +0100 Message-ID: <20090119181253.GA4290@lazy.lzy> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Content-Disposition: inline In-Reply-To: <20090117220849.GB29866@rap.rap.dk> Sender: linux-raid-owner@vger.kernel.org To: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi, thanks for the answer, that was exactly what I was looking for. Some feedback for you. About the performance & benchmarking I've nothing special to say. About the setup of two disks, I've some questions, in no particular order. The creation of "mdadm.conf" is done by: mdadm --detail --scan Somewhere else I found: mdadm --examine --scan The two produce different results and the Fedora installer seems to use the second one. Which one is really correct? Can we use one or the other interchangeably? Second question. The wiki page does not mention anything about metadata types. While it is clear that /boot must have the RAID header at the end, it is not clear if the RAID-10,f2 could or should have the metadata at the beginning. In this respect, it would also be nice to have some clarification about the recommended metadata version, i.e. is 0.90 or 1.x better? Why? One note. It may be worth mentioning that further "partitioning" could be done with LVM on top of the RAID, so only 3 md devices will be needed. Hope this helps. Thanks again, bye, pg On Sat, Jan 17, 2009 at 11:08:49PM +0100, Keld Jørn Simonsen wrote: > On Sat, Jan 17, 2009 at 06:18:06PM +0100, Piergiorgio Sartor wrote: > > Hi all, > > > > I'll have to setup some machines with two HDs (each) > > in order to get some redundancy. > > > > Reading the MD features I noticed there are several > > possibilities to create a mirror.
> > I was wondering which one offer the best perfomances > > and/or what are the compromises to accept between > > the different solutions. > > > > One possibility is a classic RAID-1 mirror. > > Another is a RAID-10 far. > > There would also be the RAID-10 near, but I guess > > this is equivalent to RAID-1. > > Yes, raid10,n2 is quite the same as raid1 for 2 drives, > That is the disk layout is the same. There may be some > differences due to the use of different drivers, tho. It was reported at > some time that there were some errors that one of the drivers handled > better than the other. I am not sure which one was the better. > Also syncing and rebuilding etc. may have different performance. > > > Any suggestion on which method offers higher "speed"? > > Or there are other possibilities with 2 HDs (keeping > > the redundancy, of course)? > > raid10,f2 offers something like double the speed for sequential read, > while probably being a little faster on random read, and with a file > system about equal in performance on writes. Degraded performance (in > tha case that one disk is failing) could be worse for raid10,f2, but in > real life, with the fs elevator in operation, the penalty may be > minimal. IMHO you could normally replace raid1 and raid10,n2, and > raid1+0 with raid10,f2, except for boot devices. > > Theoretically there is another possibility in raid5 with 2 drives, > but I am not sure it even works out in practice, and there is imho no > gain in it, except that you can expand the array with more disks. > Furthermore there is raid10,o2 which is viable, but does not > perform as well as raid10,f2.
> > For linux raid performance have a look at > http://linux-raid.osdl.org/index.php/Performance > > For setting up a system with 2 disks so you can survive that one disk > fails, see > http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk > > I am the main author of both wiki pages, so I am interested in feedback. > > Best regards > Keld -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Keld Jørn Simonsen Subject: Re: Performance question Date: Wed, 21 Jan 2009 01:15:03 +0100 Message-ID: <20090121001503.GA26587@rap.rap.dk> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> <20090119181253.GA4290@lazy.lzy> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Return-path: Content-Disposition: inline In-Reply-To: <20090119181253.GA4290@lazy.lzy> Sender: linux-raid-owner@vger.kernel.org To: Piergiorgio Sartor Cc: linux-raid@vger.kernel.org List-Id: linux-raid.ids On Mon, Jan 19, 2009 at 07:12:53PM +0100, Piergiorgio Sartor wrote: > Hi, > > thanks for the answer, that was exactly what I > was looking for. Good! > Some feedback for you. > About the performance & benchmarking I've nothing > special to say. > About the setup of two disks, I've some questions, > in no particular order. > > The creation of "mdadm.conf" is done by: > > mdadm --detail --scan > > Somewhere else I found: > > mdadm --examine --scan > > The two produce different results and the Fedora > installer seems to use the second one. > > Which one is really correct? Can we use one or the > other interchangeably? --detail looks at the running arrays, while --examine most likely (depending on mdadm.conf) looks at all partitions on the system.
Given that the arrays are just created in the installation process, and the active running arrays are most likely the ones you want your system to know of, I think --detail is the better. --examine does on two of my systems generate info that are in conflict and not suitable for a mdadm.conf file, such as two /dev/md1 with different UUIDs. > Second question. > The wiki page does not mention anything about > metadata types. > While it is clear that /boot must have the RAID > header at the end, it is not clear if the RAID-10,f2 > could or should have the metadata at the beginning. > In this respect, it would be nice also to have some > clarification about the reccommended metadata version, > i.e. is it better 0.90 or 1.x? Why? To me it does not matter that much, except for the booting device. Each partition in the booting device must look like a normal (ext3) partition, as grub and lilo does not know of raids, and just treats a booting partition as a standalone partition. So here you should use 0.90 metadata, which is put at the end of the array. For other arrays I think one important choice is if you have an array greater than 2 TiB to not use 0.90 metadata, as this has a limit of 2 TiB. > One note. Maybe it could be worth to mention that > further "partitioning" could be done with LVM on top > of the RAID, so only 3 md devices will be needed. yes, I have been looking into that. Maybe I will add some words on this. > Hope this helps. yes, thanks for your feedback! 
best regards keld From mboxrd@z Thu Jan 1 00:00:00 1970 From: Richard Scobie Subject: Re: Performance question Date: Wed, 21 Jan 2009 14:05:42 +1300 Message-ID: <497674E6.50405@sauce.co.nz> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> <20090119181253.GA4290@lazy.lzy> <20090121001503.GA26587@rap.rap.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <20090121001503.GA26587@rap.rap.dk> Sender: linux-raid-owner@vger.kernel.org To: Keld Jørn Simonsen Cc: Piergiorgio Sartor , linux-raid@vger.kernel.org List-Id: linux-raid.ids Keld Jørn Simonsen wrote: > For other arrays I think one important choice is if you have an array > greater than 2 TiB to not use 0.90 metadata, as this has a limit of 2 > TiB. This restriction only applies if the individual members of the array are larger than 2TB each. Regards, Richard -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Piergiorgio Sartor Subject: Re: Performance question Date: Wed, 21 Jan 2009 20:14:52 +0100 Message-ID: <20090121191452.GA4752@lazy.lzy> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> <20090119181253.GA4290@lazy.lzy> <20090121001503.GA26587@rap.rap.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20090121001503.GA26587@rap.rap.dk> Sender: linux-raid-owner@vger.kernel.org To: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi again, [--detail vs. --examine] > --detail looks at the running arrays, while --examine most > likely (depending on mdadm.conf) looks at all partitions > on the system.
> > Given that the arrays are just created in the installation process, and > the active running arrays are most likely the ones you want your system > to know of, I think --detail is the better. --examine does on two of my > systems generate info that are in conflict and not suitable for a > mdadm.conf file, such as two /dev/md1 with different UUIDs. Yes, but I noticed that with "--detail" and an array (RAID-1) resyncing, it reports "spares=1" too, while when the array is in sync, it prints the correct geometry. So, I was wondering, since I also noticed that "--examine" produces the arrays with /dev/md/"name", so if two arrays have the same name, it ends up with the same device. Is this maybe a bug in mdadm? [metadata position] > To me it does not matter that much, except for the booting device. > Each partition in the booting device must look like a normal (ext3) > partition, as grub and lilo does not know of raids, and just treats > a booting partition as a standalone partition. So here you should use > 0.90 metadata, which is put at the end of the array. Well, I was mixing things up a bit with this question. In the back of my head the question was: What about performance with RAID-10 f2, bitmap (important) and metadata 1.0 vs. 1.1? This could be a further performance test. It would be interesting to know if it is better to have the metadata at the beginning or at the end of a RAID-10 f2, with two HDs, having the bitmap enabled. Or if it does not matter at all. Reading around I found different "opinions" about bitmap and performance, but I did not find a "convincing" test. Thanks again. A different item of the wiki, which I ran into today. Maybe the "initrd" description could be updated, since it uses "mdassemble", while the "initrd" I have directly uses "mdadm -As --auto=yes ..." (I do not remember the full line).
Hope this helps, bye, -- piergiorgio From mboxrd@z Thu Jan 1 00:00:00 1970 From: Keld =?iso-8859-1?Q?J=F8rn?= Simonsen Subject: Re: Performance question Date: Wed, 21 Jan 2009 21:15:41 +0100 Message-ID: <20090121201541.GA20499@rap.rap.dk> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> <20090119181253.GA4290@lazy.lzy> <20090121001503.GA26587@rap.rap.dk> <20090121191452.GA4752@lazy.lzy> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Return-path: Content-Disposition: inline In-Reply-To: <20090121191452.GA4752@lazy.lzy> Sender: linux-raid-owner@vger.kernel.org To: Piergiorgio Sartor Cc: linux-raid@vger.kernel.org List-Id: linux-raid.ids On Wed, Jan 21, 2009 at 08:14:52PM +0100, Piergiorgio Sartor wrote: > Hi again, > > [--detail vs. --examine] > > --detail looks at the running arrays, while --examine most > > likely (depending on mdadm.conf) looks at all partitions > > on the system. > > > > Given that the arrays are just created in the installation process, and > > the active running arrays are most likely the ones you want your system > > to know of, I think --detail is the better. --examine does on two of my > > systems generate info that are in conflict and not suitable for a > > mdadm.conf file, such as two /dev/md1 with different UUIDs. > > yes, but I noticed that with "--detail" and an > array (RAID-1) resyincing, it reports "spares=1" > too, while when the array is in sync, it prints > the correct geometry. > So, I was wondering, since I also noticed that > "--examine" produces the arrays with /dev/md/"name", > so if two arrays have same name, it ends up with > the same device. > Is this maybe a bug of mdadm? I leave this to others to answer this one. I think it is strange for --detail to report "spares=1" if it is syncing. > [metadata position] > > To me it does not matter that much, except for the booting device. 
> > Each partition in the booting device must look like a normal (ext3) > > partition, as grub and lilo does not know of raids, and just treats > > a booting partition as a standalone partition. So here you should use > > 0.90 metadata, which is put at the end of the array. > > Well, I was a bit mixing up things with this question. > In the back of my head the question was: > > What about performances, RAID-10 f2, bitmap (important) > and metadata 1.0 vs. 1.1? > > This could be a further test for performances. It would > be interesting to know if it is better to have the > metadata at the beginning or at the end of a RAID-10 f2, > with two HDs, having the bitmap enabled. > Or if it does not matter at all. > > Reading around I found different "opinions" about bitmap > and performances, but I did not find a "convincing" test. I have not tested it. So yes, I think this is something to do a performance test on. I think it should not matter much whether it is at the beginning or at the end. However, if you make a test, then you most likely will do it on a newly created raid, and then files would tend to be allocated at the beginning of the file system, thus favouring a metadata block at the beginning of the raid. In real operation this will tend to even out. Another issue is that the sectors at the beginning of a disk are much faster, a factor of two perhaps, than the sectors at the end of the drive. > Thanks again. > > Different item of the wiki, I run into it today. > Maybe the "initrd" description could be updated, since > it uses "mdassemble", while the "initrd" I have uses > directly "mdadm -As --auto=yes ..." (I do not remember > the full line). mdassemble is specifically made for initrd, so why not use it here?
Best regards keld From mboxrd@z Thu Jan 1 00:00:00 1970 From: Piergiorgio Sartor Subject: Re: Performance question Date: Wed, 21 Jan 2009 21:26:49 +0100 Message-ID: <20090121202649.GA27298@lazy.lzy> References: <20090117171806.GA9432@lazy.lzy> <20090117220849.GB29866@rap.rap.dk> <20090119181253.GA4290@lazy.lzy> <20090121001503.GA26587@rap.rap.dk> <20090121191452.GA4752@lazy.lzy> <20090121201541.GA20499@rap.rap.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20090121201541.GA20499@rap.rap.dk> Sender: linux-raid-owner@vger.kernel.org To: linux-raid@vger.kernel.org List-Id: linux-raid.ids Hi, thanks for the explanation about metadata. > mdassemble is specifically made for initrd, so why not use it here? I do not know, I just noticed that, on Fedora, the initrd with RAID has /etc/mdadm.conf and it calls "mdadm -As ...". I found this annoying, since I do not know what will happen in case an array is changed (UUID change, /etc/mdadm.conf no longer consistent). Anyway, if you say "mdassemble" is OK, no problem. Thanks, bye, -- piergiorgio From mboxrd@z Thu Jan 1 00:00:00 1970 From: Moritz Gartenmeister Subject: performance question Date: Mon, 12 Sep 2005 21:06:27 +0200 Message-ID: <4325D1B3.5030606@uplink-verein.ch> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: netfilter-bounces@lists.netfilter.org Errors-To: netfilter-bounces@lists.netfilter.org Content-Type: text/plain; charset="us-ascii"; format="flowed" To: netfilter lists hi i'm just wondering if the performance i'm experiencing in my network is usual. setup: debian linux kernel 2.6.8.1 (patched with pom, especially l7-filter and ipp2p) linux-bridge everything is working so far (that's the good part).
but i measure different download rates:
on my machine (behind the bridge) ~70 Kbyte/s
on the bridge ~200 Kbyte/s

the linux-bridge has to forward ~500 clients and has to shape the
traffic transparently.

is this difference in download rates normal?

my assumption so far: i have 4 interfaces on the linux bridge. eth1 and
eth2 are doing the bridging, so they are heavily used. eth0 is rarely
used, so this may be an explanation. even if i stop iptables, there is
no increase.

i would just appreciate it if someone could confirm this as usual behavior.

greets
moritz

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Gühring
Subject: Performance question
Date: Sun, 5 May 2002 16:20:13 +0200
Message-ID: <200205051420.g45EKKo02315@linux1.futureware.at>
Reply-To: p.guehring@futureware.at
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="iso-8859-1"
To: reiserfs-list@namesys.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Let's say I have a directory with 100.000 files in it.
The filenames look like
name1_name2_name3_id
So I have
001_41052_50125_1
001_63216_1212_1
...
I have to create a search engine, that serves for example the 4th Block of 10
files that match the query "001_*_1212_1". The whole query would result in 100
files, that are spread across the directory.

Now my question:
Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which
should result in about 100 entries, and then take the entries 40 to 49 from
the resulting array?
(Is ReiserFS able to directly return 100 files out of 100000 with the
globbing function, or is it an iteration over all files in the directory?)

Or should I do 2 opendir-readdir loops, one to read over the first 39
results, that I do not need, and the second one to get the results 40 to 49?
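
For comparison, the paging described above can also be done in one opendir/readdir pass that stops as soon as the requested page is full. The following is only a sketch under assumptions: the names page_matches and PAGE_NAME_LEN are made up for illustration, matching is done in user space with POSIX fnmatch(), and readdir() gives no ordering guarantee, so "results 40 to 49" are only repeatable if the directory does not change between requests.

```c
#include <dirent.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

#define PAGE_NAME_LEN 256 /* assumed upper bound on a file name, incl. NUL */

/*
 * Hypothetical helper: scan dirpath once, skip the first `skip` names
 * that match `pattern`, copy up to `want` of the following matches into
 * `out`, and stop reading as soon as the page is full.
 * Returns the number of names stored, or -1 if the directory cannot be
 * opened.
 */
static int page_matches(const char *dirpath, const char *pattern,
                        size_t skip, size_t want,
                        char out[][PAGE_NAME_LEN])
{
    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    size_t seen = 0;   /* matches encountered so far */
    size_t stored = 0; /* matches copied into out */
    struct dirent *ent;

    while (stored < want && (ent = readdir(dir)) != NULL) {
        if (fnmatch(pattern, ent->d_name, 0) != 0)
            continue; /* name does not match the query */
        if (seen++ < skip)
            continue; /* match belongs to an earlier page */
        /* d_name may be overwritten by the next readdir(), so copy it */
        strncpy(out[stored], ent->d_name, PAGE_NAME_LEN - 1);
        out[stored][PAGE_NAME_LEN - 1] = '\0';
        stored++;
    }
    closedir(dir);
    return (int)stored;
}
```

With a caller-provided array such as char names[10][PAGE_NAME_LEN], a call like page_matches(dir, "001_*_1212_1", 40, 10, names) would fill in the fifth page of ten. The whole directory is still read in the worst case, since every name has to be tested against the pattern; only the memory for the unwanted matches is saved.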
The problem here is that I have to readdir about 50000 files (40000 to get
through the unneeded results, and 10000 to get the 10 results I need)
But on the other hand, I do not have to remember 100 files, from which I only
need 10.

If ReiserFS has to iterate over 100000 files (the whole directory) to do a
"001_*_1212_1" glob, because the binary tree only speeds up known files, but
not patterns, then opendir-readdir should be faster, I guess.

Another option would be to use subdirectories like
name1/name2/name3/id
So the glob would be "001/*/1212/1", which should be faster, anyway.
But on the other hand, I would have to do a lot more directory management,
creating and deleting directories ...
And implementing an opendir-readdir search through "001/*/1212/1" will be
more work too.

Thanks for all feedback in advance and many greetings,
- --
~ Philipp Gühring p.guehring@futureware.at ~ http://www.livingxml.net/ ICQ UIN: 6588261 ~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE81T+elqQ+F+0wB3oRAhw/AKCRH5CbdIMt2+ITpDkNBwcPKYpPqQCgmC2e
RrYDyo/GgzqJvnn1jy1HjiY=
=/ABd
-----END PGP SIGNATURE-----

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Oleg Drokin
Subject: Re: Performance question
Date: Sun, 5 May 2002 19:07:39 +0400
Message-ID: <20020505190739.A13452@namesys.com>
References: <200205051420.g45EKKo02315@linux1.futureware.at>
Mime-Version: 1.0
Content-Disposition: inline
In-Reply-To: <200205051420.g45EKKo02315@linux1.futureware.at>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Philipp Gühring
Cc: reiserfs-list@namesys.com

Hello!

On Sun, May 05, 2002 at 04:20:13PM +0200, Philipp Gühring wrote:
> Let's say I have a directory with 100.000 files in it.
> The filenames look like
> name1_name2_name3_id
> So I have
> 001_41052_50125_1
> 001_63216_1212_1
> I have to create a search engine, that serves for example the 4th Block of 10
> files that match the query "001_*_1212_1". The whole query would result in 100
> files, that are spread across the directory.
> Now my question:
> Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which
> should result in about 100 entries, and then take the entries 40 to 49 from
> the resulting array?
> (Is ReiserFS able to directly return 100 files out of 100000 with the
> globbing function, or is it an iteration over all files in the directory?)

*glob functions are implemented by various library functions, which do a full
readdir scan at least once, I believe.

> Or should I do 2 opendir-readdir loops, one to read over the first 39
> results, that I do not need, and the second one to get the results 40 to 49?

In fact I do not see why you need to do 2 opendir-readdir loops.
One loop should be enough.
You just compare each filename returned against your query, and if it matches,
remember it in a separate list. So at the end of the readdir loop you have a
list of all names in the directory that match your query. And you can apply any
additional check in place so as not to remember unnecessary files.

> The problem here is that I have to readdir about 50000 files (40000 to get
> through the unneeded results, and 10000 to get the 10 results I need)
> But on the other hand, I do not have to remember 100 files, from which I only
> need 10.

I am completely missing the idea of where these numbers are from. Can you
explain in more detail?

> If ReiserFS has to iterate over 100000 files (the whole directory) to do a
> "001_*_1212_1" glob, because the binary tree only speeds up known files, but
> not patterns, then opendir-readdir should be faster, I guess.

A binary tree only helps when you know the filename, I believe.
You calculate a hash, and from that hash you can quickly find the desired
location. If you come up with a hash that places all filenames like yours near
one another, this will help, then.

> Another option would be to use subdirectories like
> name1/name2/name3/id
> So the glob would be "001/*/1212/1", which should be faster, anyway.
> But on the other hand, I would have to do a lot more directory management,
> creating and deleting directories ...
> And implementing an opendir-readdir search through "001/*/1212/1" will be
> more work too.

Readdir would require fewer iterations through 001/*, because the number of
entries will be only 100 as you described above.
You get all these 100 entries and then loop 100 times trying to open
001/${next_name}/1212/1 and deciding whether you need this file or not.
(If it exists of course, or you might get -ENOENT and proceed to the next
directory).
Also deleting directories would be overkill.
I think this might be faster in many circumstances.
Also what you've described looks very much like what squid does. And squid
people went to the reiserfs-raw interface and are quite happy with it.

Bye,
    Oleg

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hans Reiser
Subject: Re: Performance question
Date: Mon, 06 May 2002 15:06:51 +0400
Message-ID: <3CD663CB.20703@namesys.com>
References: <200205051420.g45EKKo02315@linux1.futureware.at> <20020505190739.A13452@namesys.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Oleg Drokin
Cc: Philipp Gühring, reiserfs-list@namesys.com

glob is implemented by the shell, not the filesystem. This is not for
good reason, it just is. We could write something for you to do it in
the filesystem and it would be faster. Is your need for speed critical
enough to justify writing something special for it?

Hans

Oleg Drokin wrote:

>Hello!
>
>On Sun, May 05, 2002 at 04:20:13PM +0200, Philipp Gühring wrote:
>
>>Let's say I have a directory with 100.000 files in it.
>>The filenames look like
>>name1_name2_name3_id
>>So I have
>>001_41052_50125_1
>>001_63216_1212_1
>>I have to create a search engine, that serves for example the 4th Block of 10
>>files that match the query "001_*_1212_1". The whole query would result in 100
>>files, that are spread across the directory.
>>Now my question:
>>Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which
>>should result in about 100 entries, and then take the entries 40 to 49 from
>>the resulting array?
>>(Is ReiserFS able to directly return 100 files out of 100000 with the
>>globbing function, or is it an iteration over all files in the directory?)
>>
>
>*glob functions are implemented by various library functions, which do a full
>readdir scan at least once, I believe.
>
>>Or should I do 2 opendir-readdir loops, one to read over the first 39
>>results, that I do not need, and the second one to get the results 40 to 49?
>>
>
>In fact I do not see why you need to do 2 opendir-readdir loops.
>One loop should be enough.
>You just compare each filename returned against your query, and if it matches,
>remember it in a separate list. So at the end of the readdir loop you have a
>list of all names in the directory that match your query. And you can apply any
>additional check in place so as not to remember unnecessary files.
>
>>The problem here is that I have to readdir about 50000 files (40000 to get
>>through the unneeded results, and 10000 to get the 10 results I need)
>>But on the other hand, I do not have to remember 100 files, from which I only
>>need 10.
>>
>
>I am completely missing the idea of where these numbers are from. Can you
>explain in more detail?
>
>>If ReiserFS has to iterate over 100000 files (the whole directory) to do a
>>"001_*_1212_1" glob, because the binary tree only speeds up known files, but
>>not patterns, then opendir-readdir should be faster, I guess.
>>
>
>A binary tree only helps when you know the filename, I believe. You calculate
>a hash, and from that hash you can quickly find the desired location.
>If you come up with a hash that places all filenames like yours near one
>another, this will help, then.
>
>>Another option would be to use subdirectories like
>>name1/name2/name3/id
>>So the glob would be "001/*/1212/1", which should be faster, anyway.
>>But on the other hand, I would have to do a lot more directory management,
>>creating and deleting directories ...
>>And implementing an opendir-readdir search through "001/*/1212/1" will be
>>more work too.
>>
>
>Readdir would require fewer iterations through 001/*, because the number of
>entries will be only 100 as you described above.
>You get all these 100 entries and then loop 100 times trying to open
>001/${next_name}/1212/1 and deciding whether you need this file or not.
>(If it exists of course, or you might get -ENOENT and proceed to the next
>directory).
>Also deleting directories would be overkill.
>I think this might be faster in many circumstances.
>Also what you've described looks very much like what squid does. And squid
>people went to the reiserfs-raw interface and are quite happy with it.
>
>Bye,
>    Oleg
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Gühring
Subject: Re: Performance question
Date: Sun, 5 May 2002 18:43:45 +0200
Message-ID: <200205051644.g45GijA03908@linux1.futureware.at>
References: <200205051420.g45EKKo02315@linux1.futureware.at> <20020505190739.A13452@namesys.com>
Reply-To: p.guehring@futureware.at
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
In-Reply-To: <20020505190739.A13452@namesys.com>
Content-Type: text/plain; charset="us-ascii"
To: Oleg Drokin, reiserfs-list@namesys.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello!

Thank you Oleg for your answers.

> *glob functions are implemented by various library functions, which do a full
> readdir scan at least once, I believe.

I thought I had heard about a syscall that makes it possible to pass the glob
to the filesystem, so that the filesystem can optimize globbing as it likes
and pass the result back to the application, but ok.

> > Or should I do 2 opendir-readdir loops, one to read over the first 39
> > results, that I do not need, and the second one to get the results 40 to
> > 49?
>
> In fact I do not see why you need to do 2 opendir-readdir loops.
> One loop should be enough.

Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips
over unneeded results and the second one serves the data.

> You just compare each filename returned against your query, and if it
> matches, remember it in a separate list. So at the end of the readdir loop
> you have a list of all names in the directory that match your query. And you
> can apply any additional check in place so as not to remember unnecessary
> files.
>
> > The problem here is that I have to readdir about 50000 files (40000 to
> > get through the unneeded results, and 10000 to get the 10 results I need)
> > But on the other hand, I do not have to remember 100 files, from which I
> > only need 10.
>
> I am completely missing the idea of where these numbers are from. Can you
> explain in more detail?

I will try.
I have a table with 100000 files. A complete search would return, for example,
100 files, which are spread across the whole directory.
About every thousand files, there is one file that matches the query.
Since the client does not want to get 100 files at once, at first I return
only 10 results for the first page, and the user can navigate page-wise.
So I built up the scenario where the user now wants to see results 40-49
from the query "001_*_1212_1",
which I assume to be normal behaviour for my application.

> A binary tree only helps when you know the filename, I believe.

Ok.

> Readdir would require fewer iterations through 001/*, because the number of
> entries will be only 100 as you described above.
> You get all these 100 entries and then loop 100 times trying to open
> 001/${next_name}/1212/1 and deciding whether you need this file or not.
> (If it exists of course, or you might get -ENOENT and proceed to the next
> directory).
> Also deleting directories would be overkill.

So the question is, how big that overkill is.
Is there perhaps a benchmark that tested it already?

> I think this might be faster in many circumstances.
> Also what you've described looks very much like what squid does. And squid
> people went to the reiserfs-raw interface and are quite happy with it.

I think the difference to squid is that they only need one result, not a part
of a search with more than one result.
But I am thinking about using reiserfs-raw too ...
(At the moment flexibility still has more priority for me than raw performance)

Many greetings,
- --
~ Philipp Gühring p.guehring@futureware.at ~ http://www.livingxml.net/ ICQ UIN: 6588261 ~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE81WFGlqQ+F+0wB3oRAtYSAJsGgaHnsohasbrjnJEQWAhi4tatSwCfQXDB
dGlKoxKq0vcB0jHMOV6AEWQ=
=heIa
-----END PGP SIGNATURE-----

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Oleg Drokin
Subject: Re: Performance question
Date: Mon, 6 May 2002 17:01:03 +0400
Message-ID: <20020506170103.A954@namesys.com>
References: <200205051420.g45EKKo02315@linux1.futureware.at> <20020505190739.A13452@namesys.com> <200205051644.g45GijA03908@linux1.futureware.at>
Mime-Version: 1.0
Content-Disposition: inline
In-Reply-To: <200205051644.g45GijA03908@linux1.futureware.at>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Philipp Gühring
Cc: reiserfs-list@namesys.com

Hello!

On Sun, May 05, 2002 at 06:43:45PM +0200, Philipp Gühring wrote:

> > *glob functions are implemented by various library functions, which do a
> > full readdir scan at least once, I believe.
> I thought I had heard about a syscall that makes it possible to pass the glob
> to the filesystem, so that the filesystem can optimize globbing as it likes
> and pass the result back to the application, but ok.

I do not think something like that exists in Linux.
But if you come up with a man page from section 2...

> > > Or should I do 2 opendir-readdir loops, one to read over the first 39
> > > results, that I do not need, and the second one to get the results 40 to
> > > 49?
> > In fact I do not see why you need to do 2 opendir-readdir loops.
> > One loop should be enough.
> Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips
> over unneeded results and the second one serves the data.

No.
Still I think you need only one loop anyway, like this:

    #include <dirent.h>

    DIR *dir = opendir(name);
    struct dirent *result;

    /* single pass: remember every name that matches the query */
    while ((result = readdir(dir)) != NULL) {
        if (check_filename_criteria(result->d_name)) {
            /* the list must copy d_name; readdir may reuse the buffer */
            add_to_list_of_files_to_process(result->d_name);
        }
    }
    closedir(dir);

    /* pseudocode: serve the remembered names */
    for i in list_of_files_to_process {
        process_file(i);
    }

So only one loop, and the second one does not count because it serves
actual data.

> > > The problem here is that I have to readdir about 50000 files (40000 to
> > > get through the unneeded results, and 10000 to get the 10 results I need)
> > > But on the other hand, I do not have to remember 100 files, from which I
> > > only need 10.
> > I am completely missing the idea of where these numbers are from. Can you
> > explain in more detail?
> I will try.
> I have a table with 100000 files. A complete search would return, for example,
> 100 files, which are spread across the whole directory.
> About every thousand files, there is one file that matches the query.
> Since the client does not want to get 100 files at once, at first I return
> only 10 results for the first page, and the user can navigate page-wise.
> So I built up the scenario where the user now wants to see results 40-49
> from the query "001_*_1212_1",
> which I assume to be normal behaviour for my application.

Ah, I see what you mean.
If you have a lot of resources, you can set up a session and store all
the search results for that session on the server side. So when the second
request comes in, you just read the search results from the session.
Also you kill the session after 5 minutes of inactivity on it or so.
Hm... This requires cookies to be enabled, though. ;)

> > Readdir would require fewer iterations through 001/*, because the number of
> > entries will be only 100 as you described above.
> > You get all these 100 entries and then loop 100 times trying to open
> > 001/${next_name}/1212/1 and deciding whether you need this file or not.
> > (If it exists of course, or you might get -ENOENT and proceed to the next
> > directory).
> > Also deleting directories would be overkill.
> So the question is, how big that overkill is.

I mean that you do not need to delete directories when they are empty.
You only need to create the directory structure once.

> Is there perhaps a benchmark that tested it already?

No, I do not think so, but feel free to compose and run your own benchmark.

> > I think this might be faster in many circumstances.
> > Also what you've described looks very much like what squid does. And squid
> > people went to the reiserfs-raw interface and are quite happy with it.
> I think the difference to squid is that they only need one result, not a part
> of a search with more than one result.

Hm. This is true.

Bye,
    Oleg

From mboxrd@z Thu Jan 1 00:00:00 1970
From: david ahern
Subject: performance question
Date: Thu, 20 Mar 2008 12:01:26 -0600
Message-ID: <47E2A676.4020300@cisco.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: kvm-devel
Sender: kvm-devel-bounces@lists.sourceforge.net
Errors-To: kvm-devel-bounces@lists.sourceforge.net
List-Id: kvm.vger.kernel.org

I am trying to understand spikes in system time that I am seeing in a VM. The
guest OS is RHEL4, with 2 vcpus and 2.5GB RAM; the host is running a 2.6.24.2
kernel. The kvm version is kvm-63.

Using the stat scripts Christian Ehrhardt posted a few days ago (thanks,
Christian, very handy tool) I collected kvm_stat data as a function of time (I
added time to the output).
Comparing plots of guest system time to plots of kvm_stat, the spikes
in system time correlate most with the following kvm_stat variables:

mmu_cache_miss
mmu_flooded
mmu_pte_updated
mmu_pte_write
mmu_shadow_zapped
pf_fixed
pf_guest
remote_tlb_flush
tlb_flush

Can someone provide some guidance/hints on what would cause spikes in the above
and if there is anything I can do to improve it? The load on the VM is fairly
constant (network traffic of ~48kB/sec received and ~189kB/sec transmit) with
some moderate disk IO as well.

thanks,

david