From: Stefan Priebe - Profihost AG
Subject: Re: speedup ceph / scaling / find the bottleneck
Date: Mon, 02 Jul 2012 15:19:48 +0200
Message-ID: <4FF19FF4.10104@profihost.ag>
References: <4FED8792.1090905@profihost.ag> <4FED964D.3080201@inktank.com> <4FEDA777.1060309@profihost.ag> <4FEE1B91.8080404@profihost.ag> <4FF0BAB8.3070503@profihost.ag>
In-Reply-To: <4FF0BAB8.3070503@profihost.ag>
Sender: ceph-devel-owner@vger.kernel.org
To: Sage Weil
Cc: Mark Nelson, "ceph-devel@vger.kernel.org"

Hello,

I just want to report back some test results, including some results from a
sheepdog test using the same hardware.

Sheepdog:

1 VM:
  write: io=12544MB, bw=142678KB/s, iops=35669, runt= 90025msec
  read : io=14519MB, bw=165186KB/s, iops=41296, runt= 90003msec
  write: io=16520MB, bw=185842KB/s, iops=45, runt= 91026msec
  read : io=102936MB, bw=1135MB/s, iops=283, runt= 90684msec

2 VMs:
  write: io=7042MB, bw=80062KB/s, iops=20015, runt= 90062msec
  read : io=8672MB, bw=98661KB/s, iops=24665, runt= 90004msec
  write: io=14008MB, bw=157443KB/s, iops=38, runt= 91107msec
  read : io=43924MB, bw=498462KB/s, iops=121, runt= 90234msec

  write: io=6048MB, bw=68772KB/s, iops=17192, runt= 90055msec
  read : io=9151MB, bw=104107KB/s, iops=26026, runt= 90006msec
  write: io=12716MB, bw=142693KB/s, iops=34, runt= 91253msec
  read : io=59616MB, bw=675648KB/s, iops=164, runt= 90353msec

Ceph:

2 VMs:
  write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
  read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
  write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
  read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec

  write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
  read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
  write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
  read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec

So ceph has pretty good values for sequential I/O, but it would be really cool
to improve random I/O. Right now my test system has a theoretical 4k random
I/O throughput of 350,000 iops: 14 disks with 25,000 iops each (tested with
fio too).

Greets
Stefan

On 01.07.2012 23:01, Stefan Priebe wrote:
> Hello list,
> Hello Sage,
>
> I've made some further tests.
>
> Sequential 4k writes over 200GB: 300% CPU usage of kvm process, 34712 iops
>
> Random 4k writes over 200GB: 170% CPU usage of kvm process, 5500 iops
>
> When I make random 4k writes over 100MB: 450% CPU usage of kvm process
> and !! 25059 iops !!
>
> Random 4k writes over 1GB: 380% CPU usage of kvm process, 14387 iops
>
> So the range over which the random I/O happens seems to be important, and
> the CPU usage just seems to reflect the iops.
>
> So I'm not sure if the problem is really the client rbd driver. Mark, I
> hope you can make some tests next week.
>
> Greets
> Stefan
>
>
> On 29.06.2012 23:18, Stefan Priebe wrote:
>> On 29.06.2012 17:28, Sage Weil wrote:
>>> On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
>>>> On 29.06.2012 13:49, Mark Nelson wrote:
>>>>> I'll try to replicate your findings in house. I've got some other
>>>>> things I have to do today, but hopefully I can take a look next
>>>>> week. If I recall correctly, in the other thread you said that
>>>>> sequential writes are using much less CPU time on your systems?
>>>>
>>>> Random 4k writes: 10% idle
>>>> Seq 4k writes: !! 99.7% !! idle
>>>> Seq 4M writes: 90% idle
>>>
>>> I take it 'rbd cache = true'?
>> Yes
>>
>>> It sounds like librbd (or the guest file
>>> system) is coalescing the sequential writes into big writes.
>>> I'm a bit surprised that the 4k ones have lower CPU utilization, but
>>> there is lots of opportunity for noise there, so I wouldn't read too
>>> far into it yet.
>> 90 to 99.7 is OK; the 9% goes to flush, kworker and xfs processes. It was
>> the overall system load, not just ceph-osd.
>>
>>>>> Do you see better scaling in that case?
>>>>
>>>> 3 osd nodes:
>>>> 1 VM:
>>>> Rand 4k writes: 7000 iops
>> <-- this one is WRONG! Sorry, it is 14100 iops
>>
>>>> Seq 4k writes: 19900 iops
>>>>
>>>> 2 VMs:
>>>> Rand 4k writes: 6000 iops each
>>>> Seq 4k writes: 4000 iops VM 1
>>>> Seq 4k writes: 18500 iops VM 2
>>>>
>>>>
>>>> 4 osd nodes:
>>>> 1 VM:
>>>> Rand 4k writes: 14400 iops <------ ????
>>>
>>> Can you double-check this number?
>> Triple checked, BUT I see that the Rand 4k writes value with 3 osd nodes
>> was wrong. Sorry.
>>
>>>> Seq 4k writes: 19000 iops
>>>>
>>>> 2 VMs:
>>>> Rand 4k writes: 7000 iops each
>>>> Seq 4k writes: 18000 iops each
>>>
>>> With the exception of that one number above, it really sounds like the
>>> bottleneck is in the client (VM or librbd+librados) and not in the
>>> cluster. Performance won't improve when you add OSDs if the limiting
>>> factor is the client's ability to dispatch/stream/sustain IOs. That
>>> also seems consistent with the fact that limiting the # of CPUs on the
>>> OSDs doesn't affect much.
>> ACK
>>
>>> Above, with 2 VMs, for instance, your total iops for the cluster doubled
>>> (36000 total). Can you try with 4 VMs and see if it continues to
>>> scale in that dimension? At some point you will start to saturate the
>>> OSDs, and at that point adding more OSDs should show aggregate
>>> throughput going up.
>> Where did you get that value from? It scales with VMs at some points but
>> it does not scale with OSDs.
>>
>> Stefan
>
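As a rough cross-check of the numbers in this thread, here is a minimal Python
sketch. It does two things: it derives the block size implied by a fio result
line (bw divided by iops), and it compares the aggregate Ceph random-write iops
of the two VMs against the theoretical 350,000 iops (14 disks x 25,000 iops)
mentioned above. The function and variable names are made up for illustration,
and the reading of the first "write" line per VM as the 4k random-write result
is an assumption based on its bw/iops ratio, not something the fio output
states.

```python
# Values are copied from the fio output quoted in this thread.
# Assumption: the first "write" line per VM is the 4k random-write run,
# since bw / iops comes out at roughly 4 KB per I/O.

def implied_block_kb(bw_kb_s, iops):
    """Average I/O size in KB implied by a fio bandwidth/iops pair."""
    return bw_kb_s / iops

# Theoretical ceiling stated in the mail: 14 disks at 25,000 iops each.
THEORETICAL_IOPS = 14 * 25_000

# Ceph, 2 VMs: iops from the first "write" line of each VM.
ceph_rand_write_iops = [6351, 6318]
total_iops = sum(ceph_rand_write_iops)
fraction = total_iops / THEORETICAL_IOPS

print(f"implied block size (small writes): {implied_block_kb(25405, 6351):.1f} KB")
print(f"implied block size (large writes): {implied_block_kb(638402, 155):.0f} KB")
print(f"aggregate random-write iops: {total_iops}")
print(f"share of theoretical iops: {fraction:.1%}")
```

The same helper can be pointed at the sheepdog lines to compare both systems
on equal terms.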