From: Leen Besselink
Subject: Re: [ceph-users] Ceph write performance and my Dell R515's
Date: Sun, 22 Sep 2013 14:58:21 +0200
Message-ID: <20130922125820.GD19702@apia.perrit.net>
References: <523EE538.3020101@inktank.com>
In-Reply-To: <523EE538.3020101@inktank.com>
Reply-To: leen@consolejunkie.net
To: "ceph-devel@vger.kernel.org"

On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:
> On 09/22/2013 03:12 AM, Quenten Grasso wrote:
> >
> >Hi All,
> >
> >I’m finding my write performance is less than I would have
> >expected. After spending some considerable amount of time testing
> >several different configurations I can never seem to break over
> >~360 MB/s write even when using tmpfs for journaling.
> >
> >So I’ve purchased 3x Dell R515’s with 1 x AMD 6C CPU with 12 x 3TB
> >SAS & 2 x 100GB Intel DC S3700 SSD’s & 32GB Ram with the Perc
> >H710p Raid controller and Dual Port 10GBE Network Cards.
> >
> >So first up I realise the SSD’s were a mistake, I should have
> >bought the 200GB ones as they have considerably better write
> >throughput of ~375 MB/s vs 200 MB/s
> >
> >So to our Nodes Configuration,
> >
> >2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, 12 disks,
> >each in a single-disk Raid0 (JBOD-like fashion), with a 1MB
> >stripe size,
> >
> >(The stripe size was particularly important because I found
> >the stripe size matters considerably even on a single disk raid0,
> >contrary to what you might read on the internet)
> >
> >Also each disk is configured with write-back cache enabled
> >and read-ahead disabled.
> >
> >For Networking, all nodes are connected via LACP bond with L3
> >hashing and using iperf I can get up to 16 Gbit/s tx and rx between
> >the nodes.
> >
> >OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to
> >upgrade kernel due to 10Gbit Intel NIC’s driver issues)
> >
> >So this gives me 11 OSD’s & 2 SSD’s Per Node.
> >
>
> I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but
> you definitely will want to do some investigation to make sure that
> OSD isn't holding the other ones back. iostat or collectl might be
> useful, along with the ceph osd admin socket and the
> dump_ops_in_flight and dump_historic_ops commands.
>

I was wondering if latency on the network was OK. I wondered if there was
some kind of LACP bonding not working correctly or L3 hashing causing problems.

ifstat or iptraf or graphs of SNMP from the switch might show you where the
traffic went.

I did my last tests with nping echo-mode from nmap with --rate to do latency
tests under network load.

It does generate a lot of output, which slows it down a bit, so you might want
to redirect it somewhere.
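Something along these lines is what I mean. This is only a rough sketch from
memory: the passphrase, host name, rate, count and log path are made-up
placeholders, and nping needs to run as root on both ends.

```shell
#!/bin/sh
# Assumed values for illustration only; none of these come from the thread.
PASSPHRASE="cephtest"           # shared secret for nping echo mode
TARGET="rbd02"                  # node running the echo server
RATE=500                        # probes per second
COUNT=10000                     # total number of probes
LOG="/tmp/nping-${TARGET}.log"  # where the per-packet output goes

# On the target node, start the echo server first (as root):
#   nping --echo-server "$PASSPHRASE"

# Run the echo client against the target. All output is redirected to a
# file so that printing every packet does not slow the test down, and
# only the tail of the log (where the statistics end up) is shown.
run_latency_test() {
    nping --echo-client "$PASSPHRASE" "$TARGET" --icmp \
          --rate "$RATE" -c "$COUNT" > "$LOG" 2>&1
    tail -n 5 "$LOG"
}
```

Call run_latency_test while rados bench is running and compare the round
trip times with a run on an idle network.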
> >Next I’ve tried several different configurations, 2 of which I’ll
> >briefly describe below,
> >
> >1)Cluster Configuration 1,
> >
> >33 OSD’s with 6x SSD’s as Journals, w/ 15GB Journals on SSD.
> >
> ># ceph osd pool create benchmark1 1800 1800
> >
> ># rados bench -p benchmark1 180 write --no-cleanup
> >
> >--------------------------------------------------
> >Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> >seconds or 0 objects
> >Total time run: 180.250417
> >Total writes made: 10152
> >Write size: 4194304
> >Bandwidth (MB/sec): 225.287
> >Stddev Bandwidth: 35.0897
> >Max bandwidth (MB/sec): 312
> >Min bandwidth (MB/sec): 0
> >Average Latency: 0.284054
> >Stddev Latency: 0.199075
> >Max latency: 1.46791
> >Min latency: 0.038512
> >--------------------------------------------------
> >
>
> What was your pool replication set to?
>
> ># rados bench -p benchmark1 180 seq
> >
> >-------------------------------------------------
> >Total time run: 43.782554
> >Total reads made: 10120
> >Read size: 4194304
> >Bandwidth (MB/sec): 924.569
> >Average Latency: 0.0691903
> >Max latency: 0.262542
> >Min latency: 0.015756
> >-------------------------------------------------
> >
> >In this configuration I found my write performance suffers a lot,
> >as the SSD’s seem to be a bottleneck, and my write performance
> >using rados bench was around 224-230 MB/s
> >
> >2)Cluster Configuration 2,
> >
> >33 OSD’s with 1Gbyte Journals on tmpfs.
> >
> ># ceph osd pool create benchmark1 1800 1800
> >
> ># rados bench -p benchmark1 180 write --no-cleanup
> >
> >--------------------------------------------------
> >Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> >seconds or 0 objects
> >Total time run: 180.044669
> >Total writes made: 15328
> >Write size: 4194304
> >Bandwidth (MB/sec): 340.538
> >Stddev Bandwidth: 26.6096
> >Max bandwidth (MB/sec): 380
> >Min bandwidth (MB/sec): 0
> >Average Latency: 0.187916
> >Stddev Latency: 0.0102989
> >Max latency: 0.336581
> >Min latency: 0.034475
> >--------------------------------------------------
> >
>
> Definitely low, especially with journals on tmpfs. :( How are the

I'm no expert, but I did notice the tmpfs journals were only 1GB, which seems
kinda small. But the systems didn't have a lot more memory, so there
wasn't much choice.

Even if you make them slightly larger it will cut into the memory available
for the filesystem cache.

That might be a bad idea as well, I guess.

> CPUs doing at this point? We have some R515s in our lab, and they
> definitely are slow too. Ours have 7 OSD disks and 1 Dell branded
> SSD (usually unused) each and can do about ~150MB/s writes per
> system. It's actually a puzzle we've been trying to solve for quite
> some time.
>
> Some thoughts:
>
> Could the expander backplane be having issues due to having to
> tunnel STP for the SATA SSDs (or potentially be causing expander
> wide resets)? Could the H700 (and apparently H710) be doing
> something unusual that the stock LSI firmware handles better? We
> replaced the H700 with an Areca 1880 and definitely saw changes in
> performance (better large IO throughput and worse IOPS), but the
> performance was still much lower than in a supermicro node with no
> expanders in the backplane using either an LSI 2208 or Areca 1880.
>
> Things you might want to try:
>
> - single node tests, and if you have an alternate controller you can
> try, seeing if that works better.
> - removing the S3700s from the chassis entirely and retry the tmpfs
> journal tests.
> - Since the H710 is SAS2208 based, you may be able to use megacli to
> set it into JBOD mode and see if that works any better (it may if
> you are using SSD or tmpfs backed journals).
>
> MegaCli -AdpSetProp -EnableJBOD -val -aN|-a0,1,2|-aALL
> MegaCli -PDMakeJBOD -PhysDrv[E0:S0,E1:S1,...] -aN|-a0,1,2|-aALL
>

I think I remember seeing in a presentation from Dreamhost that, for their
Ceph installation, they replaced the Dell firmware with original LSI
firmware to solve some problems. Maybe that is a route that is also possible?

(That is at your own risk, obviously. I really don't know if that is possible
with this controller, so don't blame me if you brick your controller!)

> ># rados bench -p benchmark1 180 seq
> >
> >-------------------------------------------------
> >Total time run: 76.481303
> >Total reads made: 15328
> >Read size: 4194304
> >Bandwidth (MB/sec): 801.660
> >Average Latency: 0.079814
> >Max latency: 0.317827
> >Min latency: 0.016857
> >-------------------------------------------------
> >
> >Now it seems there is no bottleneck for journaling as we are using
> >tmpfs, however write speed is still less than what I would expect;
> >the SAS disks are barely busy according to iostat.
> >
> >So I thought it might be a disk bus throughput issue.
> >
> >Next I completed some dd tests…
> >
> >This below is in a script dd-x.sh which executes the 11 readers or
> >writers at once.
> >
> >dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=32k count=100k oflag=direct &
> >
> >this gives me aggregated write throughput of around 1,135 MB/s Write.
> >
> >Similar script now to test reads,
> >
> >dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >this gives me aggregated read throughput of around 1,382 MB/s Read.
> >
> >Next I’ll lower the block size to show the results,
> >
> >dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=4k count=100k oflag=direct &
> >dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=4k count=100k oflag=direct &
> >
> >this gives me aggregated write throughput of around 300 MB/s Write.
> >
> >dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >this gives me aggregated read throughput of around 430 MB/s Read,
> >
> >This is my ceph.conf, only difference between the configs is the
> >journal dio = false
> >
> >----------------
> >[global]
> >auth cluster required = cephx
> >auth service required = cephx
> >auth client required = cephx
> >public network = 10.100.96.0/24
> >cluster network = 10.100.128.0/24
> >journal dio = false
> >[mon]
> >mon data = /var/ceph/mon.$id
> >[mon.a]
> >host = rbd01
> >mon addr = 10.100.96.10:6789
> >[mon.b]
> >host = rbd02
> >mon addr = 10.100.96.11:6789
> >[mon.c]
> >host = rbd03
> >mon addr = 10.100.96.12:6789
> >[osd]
> >osd data = /srv/ceph/osd.$id
> >osd journal size = 1000
> >osd mkfs type = xfs
> >osd mkfs options xfs = "-f"
> >osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
> >[osd.0]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sda5
> >[osd.1]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdb2
> >[osd.2]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdc2
> >[osd.3]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdd2
> >[osd.4]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sde2
> >[osd.5]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdf2
> >[osd.6]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdg2
> >[osd.7]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdh2
> >[osd.8]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdi2
> >[osd.9]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdj2
> >[osd.10]
> >host = rbd01
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdk2
> >[osd.11]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sda5
> >[osd.12]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdb2
> >[osd.13]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdc2
> >[osd.14]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdd2
> >[osd.15]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sde2
> >[osd.16]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdf2
> >[osd.17]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdg2
> >[osd.18]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdh2
> >[osd.19]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdi2
> >[osd.20]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdj2
> >[osd.21]
> >host = rbd02
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdk2
> >[osd.22]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sda5
> >[osd.23]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdb2
> >[osd.24]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdc2
> >[osd.25]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdd2
> >[osd.26]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sde2
> >[osd.27]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdf2
> >[osd.28]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdg2
> >[osd.29]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdh2
> >[osd.30]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdi2
> >[osd.31]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdj2
> >[osd.32]
> >host = rbd03
> >osd journal = /tmp/tmpfs/osd.$id
> >devs = /dev/sdk2
> >---------------------
> >
> >Any Ideas?
> >
> >Cheers,
> >
> >Quenten
> >
> >_______________________________________________
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html