From: Josh Durgin <josh.durgin@inktank.com>
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: Sage Weil <sage@inktank.com>,
"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
Anirban Ray <Anirban.Ray@sandisk.com>,
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Scaling RBD module
Date: Wed, 18 Sep 2013 18:10:19 -0700
Message-ID: <523A4EFB.8040601@inktank.com>
In-Reply-To: <755F6B91B3BE364F9BCA11EA3F9E0C6F0FC4A040@SACMBXIP01.sdcorp.global.sandisk.com>
On 09/17/2013 03:30 PM, Somnath Roy wrote:
> Hi,
> I am running Ceph on a 3-node cluster, and each server node runs 10 OSDs, one per disk. I have one admin node, and all the nodes are connected by 2 x 10G networks: one is the cluster network and the other is configured as the public network.
>
> Here is the status of my cluster.
>
> ~/fio_test# ceph -s
>
> cluster b2e0b4db-6342-490e-9c28-0aadf0188023
> health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
> monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0, <server-name-2>=xxx.xxx.xxx.xxx:6789/0, <server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
> osdmap e391: 30 osds: 30 up, 30 in
> pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
> mdsmap e1: 0/0/1 up
>
>
> I started with rados bench command to benchmark the read performance of this Cluster on a large pool (~10K PGs) and found that each rados client has a limitation. Each client can only drive up to a certain mark. Each server node cpu utilization shows it is around 85-90% idle and the admin node (from where rados client is running) is around ~80-85% idle. I am trying with 4K object size.

Note that rados bench with 4k objects is different from rbd with
4k-sized I/Os: rados bench sends each request to a new object,
while rbd objects are 4M by default, so many 4k I/Os to an rbd
image land in the same object.

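As a rough illustration of that difference, here is a minimal sketch that maps byte offsets within an image to object indices. It assumes the default 4M object size and plain offset-divided-by-object-size placement; the function names are made up for this example and are not part of any Ceph tool.

```shell
#!/bin/sh
# Assumption: the default rbd object size of 4M (4 * 1024 * 1024 bytes).
OBJECT_SIZE=$((4 * 1024 * 1024))

# Hypothetical helper: print the index of the object holding a byte offset.
object_index() {
    echo $(( $1 / OBJECT_SIZE ))
}

# Hypothetical helper: print how many objects back a region of N bytes.
object_count() {
    echo $(( ($1 + OBJECT_SIZE - 1) / OBJECT_SIZE ))
}

object_index 0          # first 4k block of the image
object_index 4096       # second 4k block, still the same object
object_index 4194304    # first 4k block of the next object

# The 2G region from the fio job above, in objects:
object_count $((2 * 1024 * 1024 * 1024))
```

Under these assumptions the 2G fio region is spread over 512 objects, whereas rados bench with 4k objects touches a different object on every request.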
> Now, I started running more clients on the admin node and the performance is scaling till it hits the client cpu limit. Server still has the cpu of 30-35% idle. With small object size I must say that the ceph per osd cpu utilization is not promising!
>
> After this, I started testing the rados block interface with kernel rbd module from my admin node.
> I have created 8 images in the pool that has around 10K PGs and mapped them, but I am not able to scale up the performance by running fio (either on a software RAID or on individual /dev/rbd* devices). For example, when running multiple fio instances (one on /dev/rbd1 and the other on /dev/rbd2), the performance I get is half of what I get when running one instance. Here is my fio job script.
>
> [random-reads]
> ioengine=libaio
> iodepth=32
> filename=/dev/rbd1
> rw=randread
> bs=4k
> direct=1
> size=2G
> numjobs=64
>
> Let me know if I am following the proper procedure or not.
>
> But if my understanding is correct, the kernel rbd module acts as a client to the cluster, and on one admin node I can run only one such kernel instance.
> If so, I am limited by the client bottleneck that I described earlier. The server side is around 85-90% idle, so it is clear that the client is not driving it hard enough.
>
> My question is, is there any way to hit the cluster with more client from a single box while testing the rbd module ?

You can run multiple librbd instances easily (for example with
multiple runs of the rbd bench-write command).

The kernel rbd driver uses the same rados client instance for multiple
block devices by default. There's an option (noshare) to use a new
rados client instance for a newly mapped device, but it's not exposed
by the rbd cli. You need to use the sysfs interface that 'rbd map' uses
instead.

Once you've used rbd map once on a machine, the kernel will already
have the auth key stored, and you can use:

    echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname imagename' > /sys/bus/rbd/add

where 1.2.3.4:6789 is the address of a monitor, and you're connecting
as client.admin.

You can use 'rbd unmap' as usual.

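If you want to map several images this way, something along these lines might help. It is only a sketch: rbd_add_line is a hypothetical helper, not part of the rbd cli, and the monitor address, pool, and image names are placeholders. As written it is a dry run that only prints the lines.

```shell
#!/bin/sh
# Hypothetical helper: print the line that would be written to
# /sys/bus/rbd/add to map one image with its own rados client (noshare).
# Arguments: monitor address, client name (without "client."), pool, image.
rbd_add_line() {
    printf '%s name=%s,key=client.%s,noshare %s %s\n' "$1" "$2" "$2" "$3" "$4"
}

# Dry run: print the lines for two placeholder images.
for image in image1 image2; do
    rbd_add_line 1.2.3.4:6789 admin poolname "$image"
done

# To actually map an image (as root, after 'rbd map' has been used once
# so the key is already stored), redirect the line into sysfs instead:
#   rbd_add_line 1.2.3.4:6789 admin poolname image1 > /sys/bus/rbd/add
```

Each image mapped like this gets its own rados client instance, so fio jobs on different /dev/rbd* devices no longer share one client.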
Josh