From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: messaging/IO/radosbench results
Date: Wed, 12 Sep 2012 19:39:48 -0500
Message-ID: <50512B54.10805@inktank.com>
References: <20120910201539.GA5733@splice> <504E501E.5080108@inktank.com> <20120912200804.GA4993@oder.mch.fsc.net> <50510BE0.9060706@inktank.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-ie0-f174.google.com ([209.85.223.174]:47532 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751255Ab2IMAjv (ORCPT ); Wed, 12 Sep 2012 20:39:51 -0400
Received: by ieje11 with SMTP id e11so3974417iej.19 for ; Wed, 12 Sep 2012 17:39:50 -0700 (PDT)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Joseph Glanville
Cc: Dieter Kasper, Mike Ryan, "ceph-devel@vger.kernel.org"

On 09/12/2012 06:24 PM, Joseph Glanville wrote:
> On 13 September 2012 08:25, Mark Nelson wrote:
>> On 09/12/2012 03:08 PM, Dieter Kasper wrote:
>>>
>>> On Mon, Sep 10, 2012 at 10:39:58PM +0200, Mark Nelson wrote:
>>>>
>>>> On 09/10/2012 03:15 PM, Mike Ryan wrote:
>>>>>
>>>>> *Disclaimer*: these results are an investigation into potential
>>>>> bottlenecks in RADOS.
>>>
>>> I appreciate this investigation very much!
>>>
>>>>> The test setup is wholly unrealistic, and these
>>>>> numbers SHOULD NOT be used as an indication of the performance of OSDs,
>>>>> messaging, RADOS, or ceph in general.
>>>>>
>>>>>
>>>>> Executive summary: rados bench has some internal bottleneck. Once that's
>>>>> cleared up, we're still having some issues saturating a single
>>>>> connection to an OSD. Having 2-3 connections in parallel alleviates that
>>>>> (either by having > 1 OSD or having multiple bencher clients).
>>>>>
>>>>>
>>>>> I've run three separate tests: msbench, smalliobench, and rados bench.
>>>>> In all cases I was trying to determine where bottleneck(s) exist.
>>>>> All the tests were run on a machine with 192 GB of RAM. The backing
>>>>> stores for all OSDs and journals are RAMdisks. The stores are running XFS.
>>>>>
>>>>> smalliobench: I ran tests varying the number of OSDs and bencher
>>>>> clients. In all cases, the number of PG's per OSD is 100.
>>>>>
>>>>> OSD     Bencher     Throughput (mbyte/sec)
>>>>> 1       1           510
>>>>> 1       2           800
>>>>> 1       3           850
>>>>> 2       1           640
>>>>> 2       2           660
>>>>> 2       3           670
>>>>> 3       1           780
>>>>> 3       2           820
>>>>> 3       3           870
>>>>> 4       1           850
>>>>> 4       2           970
>>>>> 4       3           990
>>>>>
>>>>> Note: these numbers are fairly fuzzy. I eyeballed them and they're only
>>>>> really accurate to about 10 mbyte/sec. The small IO bencher was run with
>>>>> 100 ops in flight, 4 mbyte io's, 4 mbyte files.
>>>>>
>>>>> msbench: ran tests trying to determine max throughput of raw messaging
>>>>> layer. Varied the number of concurrently connected msbench clients and
>>>>> measured aggregate throughput. Take-away: a messaging client can very
>>>>> consistently push 400-500 mbytes/sec through a single socket.
>>>>>
>>>>> Clients     Throughput (mbyte/sec)
>>>>> 1           520
>>>>> 2           880
>>>>> 3           1300
>>>>> 4           1900
>>>>>
>>>>> Finally, rados bench, which seems to have its own bottleneck. Running
>>>>> varying numbers of these, each client seems to get 250 mbyte/sec up till
>>>>> the aggregate rate is around 1000 mbyte/sec (appx line speed as measured
>>>>> by iperf). These were run on a pool with 100 PGs/OSD.
>>>>>
>>>>> Clients     Throughput (mbyte/sec)
>>>>> 1           250
>>>>> 2           500
>>>>> 3           750
>>>>> 4           1000 (very fuzzy, probably 1000 +/- 75)
>>>>> 5           1000, seems to level out here
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>> Hi guys,
>>>>
>>>> Some background on all of this:
>>>>
>>>> We've been doing some performance testing at Inktank and noticed that
>>>> performance with a single rados bench instance was plateauing at between
>>>> 600-700MB/s.
>>>
>>>
>>> 4-nodes with 10GbE interconnect; journals in RAM-Disk; replica=2
>>>
>>> # rados bench -p pbench 20 write
>>> Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
>>> sec Cur ops started finished avg MB/s cur MB/s last lat   avg lat
>>>   0       0       0        0        0        0        -         0
>>>   1      16     288      272  1087.81     1088 0.051123 0.0571643
>>>   2      16     579      563  1125.85     1164 0.045729 0.0561784
>>>   3      16     863      847  1129.19     1136 0.042012 0.0560869
>>>   4      16    1150     1134  1133.87     1148  0.05466 0.0559281
>>>   5      16    1441     1425  1139.87     1164 0.036852 0.0556809
>>>   6      16    1733     1717  1144.54     1168 0.054594 0.0556124
>>>   7      16    2007     1991  1137.59     1096  0.04454 0.0556698
>>>   8      16    2290     2274  1136.88     1132 0.046777 0.0560103
>>>   9      16    2580     2564  1139.44     1160 0.073328 0.0559353
>>>  10      16    2871     2855  1141.88     1164 0.034091 0.0558576
>>>  11      16    3158     3142  1142.43     1148 0.250688 0.0558404
>>>  12      16    3445     3429  1142.88     1148 0.046941 0.0558071
>>>  13      16    3726     3710  1141.42     1124 0.054092    0.0559
>>>  14      16    4014     3998  1142.17     1152  0.03531 0.0558533
>>>  15      16    4298     4282  1141.75     1136 0.040005 0.0559383
>>>  16      16    4582     4566  1141.39     1136 0.048431 0.0559162
>>>  17      16    4859     4843  1139.42     1108 0.045805 0.0559891
>>>  18      16    5145     5129  1139.66     1144 0.046805 0.0560177
>>>  19      16    5422     5406  1137.99     1108 0.037295 0.0561341
>>> 2012-09-08 14:36:32.460311min
>>> lat: 0.029503 max lat: 0.47757 avg lat: 0.0561424
>>> sec Cur ops started finished avg MB/s cur MB/s last lat   avg lat
>>>  20      16    5701     5685  1136.89     1116 0.041493 0.0561424
>>> Total time run:         20.197129
>>> Total writes made:      5702
>>> Write size:             4194304
>>> Bandwidth (MB/sec):     1129.269
>>>
>>> Stddev Bandwidth:       23.7487
>>> Max bandwidth (MB/sec): 1168
>>> Min bandwidth (MB/sec): 1088
>>> Average Latency:        0.0564675
>>> Stddev Latency:         0.0327582
>>> Max latency:            0.47757
>>> Min latency:            0.029503
>>>
>>>
>>> Best Regards,
>>> -Dieter
>>>
>>
>> Well look at that! :) Now I've gotta figure out what the difference is.
>> How fast are the CPUs in your rados bench machine there?
>>
>> Also, I should mention that at these speeds, we noticed that crc32c
>> calculations were actually having a pretty big effect. Turning them off
>> gave us a 10% performance boost. We're looking at faster implementations
>> now.
>>
>>
>> Mark
>>
>
> Hi Mark
>
> If using primarily Intel machines that are Nehalem or better (I would
> imagine most boxes running Ceph would fit this category) then consider
> using the Intel CRC32 instructions.
> Most of the work to use them is laid out here:
> http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411
>

Hi Dieter,

Yes, I've been looking at that for Nehalem. We actually have a number of
machines using last gen AMD processors so we'll need to consider options
for that as well. Earlier today I was reading through the whitepaper here:

http://code.google.com/p/crcutil/

Mark