From: Leen Besselink <leen@consolejunkie.net>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: [ceph-users] Ceph write performance and my Dell R515's
Date: Sun, 22 Sep 2013 14:58:21 +0200
Message-ID: <20130922125820.GD19702@apia.perrit.net>
In-Reply-To: <523EE538.3020101@inktank.com>
On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:
> On 09/22/2013 03:12 AM, Quenten Grasso wrote:
> >
> >Hi All,
> >
> >I’m finding my write performance is less than I would have
> >expected. After spending a considerable amount of time testing
> >several different configurations, I can never seem to break
> >~360MB/s write, even when using tmpfs for journaling.
> >
> >So I’ve purchased 3x Dell R515’s with 1 x AMD 6C CPU with 12 x 3TB
> >SAS & 2 x 100GB Intel DC S3700 SSD’s & 32GB Ram with the Perc
> >H710p Raid controller and Dual Port 10GBE Network Cards.
> >
> >So first up, I realise the SSD’s were a mistake; I should have
> >bought the 200GB ones, as they have considerably better write
> >throughput of ~375 MB/s vs 200 MB/s.
> >
> >So to our Nodes Configuration,
> >
> >2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, and the
> >other disks each as a single-disk Raid0 (JBOD fashion) with a 1MB
> >stripe size.
> >
> >(The stripe size was particularly important because I found
> >that it matters considerably even on a single-disk raid0,
> >contrary to what you might read on the internet.)
> >
> >Also, each disk is configured with write-back cache enabled
> >and read-ahead disabled.
> >
> >For Networking, All nodes are connected via LACP bond with L3
> >hashing and using iperf I can get up to 16gbit/s tx and rx between
> >the nodes.
> >
> >OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to
> >upgrade kernel due to 10Gbit Intel NIC’s driver issues)
> >
> >So this gives me 11 OSD’s & 2 SSD’s Per Node.
> >
>
> I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but
> you definitely will want to do some investigation to make sure that
> OSD isn't holding the other ones back. iostat or collectl might be
> useful, along with the ceph osd admin socket and the
> dump_ops_in_flight and dump_historic_ops commands.
>
I was wondering whether the latency on the network is OK, and whether the
LACP bonding might not be working correctly or the L3 hashing might be causing problems.
ifstat, iptraf or SNMP graphs from the switch might show you where the traffic went.
For my last tests I used nping (from nmap) in echo mode with --rate to measure latency under network load.
It generates a lot of output, which slows it down a bit, so you might want to redirect it somewhere.
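To illustrate the L3 hashing concern: with a hash on IP addresses only, all traffic between a given pair of hosts typically lands on the same physical link, so one host pair cannot use more than one link's worth of bandwidth. A minimal sketch of that effect, assuming a toy XOR-and-modulo hash (this is not the actual algorithm of the Linux bonding driver or of any particular switch) and using the mon addresses from the quoted ceph.conf as the node addresses:

```python
import ipaddress

# Node addresses taken from the "mon addr" entries in the quoted ceph.conf.
NODES = {
    "rbd01": "10.100.96.10",
    "rbd02": "10.100.96.11",
    "rbd03": "10.100.96.12",
}


def pick_link(src_ip, dst_ip, n_links=2):
    """Toy L3 hash: XOR both addresses and reduce modulo the link count.

    Hypothetical helper for illustration only; real bonding drivers and
    switches use their own hash functions.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links


def link_usage(nodes, n_links=2):
    """Map every ordered (source, destination) host pair to a link index."""
    usage = {}
    for src_name, src_ip in nodes.items():
        for dst_name, dst_ip in nodes.items():
            if src_name != dst_name:
                usage[(src_name, dst_name)] = pick_link(src_ip, dst_ip, n_links)
    return usage


if __name__ == "__main__":
    for (src, dst), link in sorted(link_usage(NODES).items()):
        print(f"{src} -> {dst}: link {link}")
```

Whatever the real hash is, the point is the same: the mapping is fixed per address pair, so per-link counters on the switch are the place to see whether one link is carrying most of the traffic.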
> >Next, I’ve tried several different configurations, 2 of which I’ll
> >briefly describe below.
> >
> >1)Cluster Configuration 1,
> >
> >33 OSD’s with 6x SSD’s as Journals, w/ 15GB Journals on SSD.
> >
> ># ceph osd pool create benchmark1 1800 1800
> >
> ># rados bench -p benchmark1 180 write --no-cleanup
> >
> >--------------------------------------------------
> >
> >Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> >seconds or 0 objects
> >
> >Total time run: 180.250417
> >
> >Total writes made: 10152
> >
> >Write size: 4194304
> >
> >Bandwidth (MB/sec): 225.287
> >
> >Stddev Bandwidth: 35.0897
> >
> >Max bandwidth (MB/sec): 312
> >
> >Min bandwidth (MB/sec): 0
> >
> >Average Latency: 0.284054
> >
> >Stddev Latency: 0.199075
> >
> >Max latency: 1.46791
> >
> >Min latency: 0.038512
> >
> >--------------------------------------------------
> >
>
> What was your pool replication set to?
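The figures rados bench reports can be cross-checked from its totals, and the replication level matters because every client write is stored once per replica. A small sketch of that arithmetic for the two write runs quoted in this mail; the replica count of 2 used in the last line is only a placeholder, since the pool's replication level is not stated in this thread:

```python
# 4194304-byte writes, as reported by rados bench, expressed in MB (2**20 bytes).
OBJECT_MB = 4194304 / (1024 * 1024)


def bench_bandwidth(objects, seconds, object_mb=OBJECT_MB):
    """Client-side bandwidth in MB/s from total objects and run time."""
    return objects * object_mb / seconds


def backend_write_rate(client_mb_s, replicas):
    """Data rate the OSDs must store in total: one copy per replica.

    Journal writes come on top of this and are not counted here.
    """
    return client_mb_s * replicas


if __name__ == "__main__":
    ssd = bench_bandwidth(10152, 180.250417)    # configuration 1
    tmpfs = bench_bandwidth(15328, 180.044669)  # configuration 2
    print(f"SSD journals:   {ssd:.3f} MB/s")
    print(f"tmpfs journals: {tmpfs:.3f} MB/s")
    # Placeholder replica count, not taken from the thread.
    print(f"backend rate at 2 replicas: {backend_write_rate(tmpfs, 2):.1f} MB/s")
```

The first two values agree with the bandwidth lines in the quoted rados bench output, so the reported numbers are at least internally consistent.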
>
> ># rados bench -p benchmark1 180 seq
> >
> >-------------------------------------------------
> >
> >Total time run: 43.782554
> >
> >Total reads made: 10120
> >
> >Read size: 4194304
> >
> >Bandwidth (MB/sec): 924.569
> >
> >Average Latency: 0.0691903
> >
> >Max latency: 0.262542
> >
> >Min latency: 0.015756
> >
> >-------------------------------------------------
> >
> >In this configuration I found my write performance suffers a lot,
> >as the SSD’s seem to be a bottleneck; my write performance
> >using rados bench was around 224-230MB/s.
> >
> >2)Cluster Configuration 2,
> >
> >33 OSD’s with 1Gbyte Journals on tmpfs.
> >
> ># ceph osd pool create benchmark1 1800 1800
> >
> ># rados bench -p benchmark1 180 write --no-cleanup
> >
> >--------------------------------------------------
> >
> >Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> >seconds or 0 objects
> >
> >Total time run: 180.044669
> >
> >Total writes made: 15328
> >
> >Write size: 4194304
> >
> >Bandwidth (MB/sec): 340.538
> >
> >Stddev Bandwidth: 26.6096
> >
> >Max bandwidth (MB/sec): 380
> >
> >Min bandwidth (MB/sec): 0
> >
> >Average Latency: 0.187916
> >
> >Stddev Latency: 0.0102989
> >
> >Max latency: 0.336581
> >
> >Min latency: 0.034475
> >
> >--------------------------------------------------
> >
>
> Definitely low, especially with journals on tmpfs. :( How are the
I'm no expert, but I did notice the tmpfs journals were only 1GB, which
seems kind of small. But the systems don't have a lot more memory, so there
wasn't much choice.
Even making them slightly larger would cut into the memory available
for the filesystem cache. That might be a bad idea as well, I guess.
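A rough sketch of that trade-off, using the numbers from the original post (11 OSDs and 32GB RAM per node, osd journal size = 1000). The sizing helper follows the commonly cited rule of thumb from the Ceph documentation, journal size = 2 * (expected throughput * filestore max sync interval); the 100 MB/s and 5 s passed to it below are assumptions for illustration, not measured values:

```python
OSDS_PER_NODE = 11     # from the original post
RAM_MB = 32 * 1024     # 32GB RAM per node
JOURNAL_MB = 1000      # "osd journal size = 1000" in the quoted ceph.conf


def journal_ram_mb(journal_mb, osds=OSDS_PER_NODE):
    """RAM consumed on one node when every OSD journal lives on tmpfs."""
    return journal_mb * osds


def ram_left_mb(journal_mb, osds=OSDS_PER_NODE, ram_mb=RAM_MB):
    """RAM left for the OSD daemons and the page cache, ignoring the OS."""
    return ram_mb - journal_ram_mb(journal_mb, osds)


def rule_of_thumb_journal_mb(throughput_mb_s, sync_interval_s):
    """Journal size = 2 * (expected throughput * filestore max sync interval)."""
    return 2 * throughput_mb_s * sync_interval_s


if __name__ == "__main__":
    for size_mb in (1000, 2000, 4000):
        print(f"{size_mb} MB journals: {journal_ram_mb(size_mb)} MB in tmpfs, "
              f"{ram_left_mb(size_mb)} MB left")
    # Assumed inputs: ~100 MB/s per disk and a 5 s sync interval.
    print(f"rule of thumb: {rule_of_thumb_journal_mb(100, 5)} MB per journal")
```

With the current 1000 MB journals, about a third of each node's RAM is already tied up in tmpfs.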
> CPUs doing at this point? We have some R515s in our lab, and they
> definitely are slow too. Ours have 7 OSD disks and 1 Dell-branded
> SSD (usually unused) each and can do about 150MB/s writes per
> system. It's actually a puzzle we've been trying to solve for quite
> some time.
>
> Some thoughts:
>
> Could the expander backplane be having issues due to having to
> tunnel STP for the SATA SSDs (or potentially be causing expander
> wide resets)? Could the H700 (and apparently H710) be doing
> something unusual that the stock LSI firmware handles better? We
> replaced the H700 with an Areca 1880 and definitely saw changes in
> performance (better large IO throughput and worse IOPS), but the
> performance was still much lower than in a supermicro node with no
> expanders in the backplane using either an LSI 2208 or Areca 1880.
>
> Things you might want to try:
>
> - single node tests, and if you have an alternate controller you can
> try, seeing if that works better.
> - removing the S3700s from the chassis entirely and retry the tmpfs
> journal tests.
> - Since the H710 is SAS2208 based, you may be able to use megacli to
> set it into JBOD mode and see if that works any better (it may if
> you are using SSD or tmpfs backed journals).
>
> MegaCli -AdpSetProp -EnableJBOD -val -aN|-a0,1,2|-aALL
> MegaCli -PDMakeJBOD -PhysDrv[E0:S0,E1:S1,...] -aN|-a0,1,2|-aALL
>
I think I remember a presentation from Dreamhost in which they mentioned
that for their Ceph installation they replaced the Dell firmware with the original LSI
firmware to solve some problems. Maybe that is also a possible route?
(That is at your own risk, obviously. I really don't know if it is possible
with this controller, so don't blame me if you brick your controller!)
> ># rados bench -p benchmark1 180 seq
> >
> >-------------------------------------------------
> >
> >Total time run: 76.481303
> >
> >Total reads made: 15328
> >
> >Read size: 4194304
> >
> >Bandwidth (MB/sec): 801.660
> >
> >Average Latency: 0.079814
> >
> >Max latency: 0.317827
> >
> >Min latency: 0.016857
> >
> >-------------------------------------------------
> >
> >Now it seems there is no bottleneck for journaling, as we are using
> >tmpfs; however, the write speed is still less than what I would
> >expect, and the SAS disks are barely busy according to iostat.
> >
> >So I thought it might be a disk bus throughput issue.
> >
> >Next I completed some dd tests…
> >
> >This below is in a script dd-x.sh which executes the 11 readers or
> >writers at once.
> >
> >dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=32k count=100k oflag=direct &
> >
> >this gives me aggregated write throughput of around 1,135 MB/s Write.
> >
> >A similar script now to test reads:
> >
> >dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> >this gives me aggregated read throughput of around 1,382 MB/s Read.
> >
> >Next I’ll lower the block size to show the results,
> >
> >dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=4k count=100k oflag=direct &
> >
> >dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=4k count=100k oflag=direct &
> >
> >this gives me aggregated write throughput of around 300 MB/s Write.
> >
> >dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> >this gives me aggregated read throughput of around 430 MB/s Read,
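Broken down per disk, the aggregate dd figures above look like this. A quick sketch, assuming the 11 dd processes share the load evenly and that MB here means 10^6 bytes (as GNU dd reports it); both are assumptions, since the post only gives the aggregate numbers:

```python
DISKS = 11            # 11 parallel dd processes, one per OSD directory
MB = 1_000_000        # assumed: decimal megabytes, as GNU dd reports them


def per_disk_mb_s(aggregate_mb_s, disks=DISKS):
    """Average throughput per disk, assuming an even split."""
    return aggregate_mb_s / disks


def iops(mb_s, block_bytes, mb=MB):
    """Requests per second implied by a throughput at a given block size."""
    return mb_s * mb / block_bytes


if __name__ == "__main__":
    # (label, aggregate MB/s from the post, dd block size in bytes)
    runs = [
        ("32k write", 1135, 32 * 1024),
        ("32k read", 1382, 32 * 1024),
        ("4k write", 300, 4 * 1024),
        ("4k read", 430, 4 * 1024),
    ]
    for label, total, bs in runs:
        disk = per_disk_mb_s(total)
        print(f"{label}: {disk:.1f} MB/s and {iops(disk, bs):.0f} requests/s per disk")
```

Keep in mind these are sequential direct writes with the controller's write-back cache enabled, so the per-disk request rates say more about the controller path than about the raw disks.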
> >
> >This is my ceph.conf; the only difference between the configs is the
> >journal dio = false setting.
> >
> >----------------
> >
> >[global]
> >
> >auth cluster required = cephx
> >
> >auth service required = cephx
> >
> >auth client required = cephx
> >
> >public network = 10.100.96.0/24
> >
> >cluster network = 10.100.128.0/24
> >
> >journal dio = false
> >
> >[mon]
> >
> >mon data = /var/ceph/mon.$id
> >
> >[mon.a]
> >
> >host = rbd01
> >
> >mon addr = 10.100.96.10:6789
> >
> >[mon.b]
> >
> >host = rbd02
> >
> >mon addr = 10.100.96.11:6789
> >
> >[mon.c]
> >
> >host = rbd03
> >
> >mon addr = 10.100.96.12:6789
> >
> >[osd]
> >
> >osd data = /srv/ceph/osd.$id
> >
> >osd journal size = 1000
> >
> >osd mkfs type = xfs
> >
> >osd mkfs options xfs = "-f"
> >
> >osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
> >
> >[osd.0]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sda5
> >
> >[osd.1]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdb2
> >
> >[osd.2]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdc2
> >
> >[osd.3]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdd2
> >
> >[osd.4]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sde2
> >
> >[osd.5]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdf2
> >
> >[osd.6]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdg2
> >
> >[osd.7]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdh2
> >
> >[osd.8]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdi2
> >
> >[osd.9]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdj2
> >
> >[osd.10]
> >
> >host = rbd01
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdk2
> >
> >[osd.11]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sda5
> >
> >[osd.12]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdb2
> >
> >[osd.13]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdc2
> >
> >[osd.14]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdd2
> >
> >[osd.15]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sde2
> >
> >[osd.16]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdf2
> >
> >[osd.17]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdg2
> >
> >[osd.18]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdh2
> >
> >[osd.19]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdi2
> >
> >[osd.20]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdj2
> >
> >[osd.21]
> >
> >host = rbd02
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdk2
> >
> >[osd.22]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sda5
> >
> >[osd.23]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdb2
> >
> >[osd.24]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdc2
> >
> >[osd.25]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdd2
> >
> >[osd.26]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sde2
> >
> >[osd.27]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdf2
> >
> >[osd.28]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdg2
> >
> >[osd.29]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdh2
> >
> >[osd.30]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdi2
> >
> >[osd.31]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdj2
> >
> >[osd.32]
> >
> >host = rbd03
> >
> >osd journal = /tmp/tmpfs/osd.$id
> >
> >devs = /dev/sdk2
> >
> >---------------------
> >
> >Any Ideas?
> >
> >Cheers,
> >
> >Quenten
> >
> >
> >
> >_______________________________________________
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html