From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mark Nelson Subject: Re: perf counters from a performance discrepancy Date: Wed, 23 Sep 2015 13:42:32 -0500 Message-ID: <5602F298.7060609@redhat.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from mx1.redhat.com ([209.132.183.28]:39034 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753291AbbIWSmg (ORCPT ); Wed, 23 Sep 2015 14:42:36 -0400 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Gregory Farnum , Sage Weil Cc: "Deneau, Tom" , "ceph-devel@vger.kernel.org" On 09/23/2015 01:25 PM, Gregory Farnum wrote: > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil wrote= : >> On Wed, 23 Sep 2015, Deneau, Tom wrote: >>> Hi all -- >>> >>> Looking for guidance with perf counters... >>> I am trying to see whether the perf counters can tell me anything a= bout the following discrepancy >>> >>> I populate a number of 40k size objects in each of two pools, poolA= and poolB. >>> Both pools cover osds on a single node, 5 osds total. >>> >>> * Config 1 (1p): >>> * use single rados bench client with 32 threads to do seq re= ad of 20000 objects from poolA. >>> >>> * Config 2 (2p): >>> * use two concurrent rados bench clients (running on same cl= ient node) with 16 threads each, >>> one reading 10000 objects from poolA, >>> one reading 10000 objects from poolB, >>> >>> So in both configs, we have 32 threads total and the number of obje= cts read is the same. >>> Note: in all cases, we drop the caches before doing the seq reads >>> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is ab= out 1/3 of the bandwidth for >>> the single client in config 1. >> >> How were the object written? I assume the cluster is backed by spin= ning >> disks? >> >> I wonder if this is a disk layout issue. If the 20,000 objects are >> written in order, they willb e roughly sequential on disk, and the 3= 2 >> thread case will read them in order. In the 2x 10,000 case, the two >> clients are reading two sequences of objects written at different >> times, and the disk arms will be swinging around more. >> >> My guess is that if the reads were reading the objects in a random o= rder >> the performance would be the same... I'm not sure that rados bench d= oes >> that though? >> >> sage >> >>> >>> >>> I gathered perf counters before and after each run and looked at th= e difference of >>> the before and after counters for both the 1p and 2p cases. Here a= re some things I noticed >>> that are different between the two runs. Can someone take a look a= nd let me know >>> whether any of these differences are significant. In particular, f= or the >>> throttle-msgr_dispatch_throttler ones, since I don't know the detai= led definitions of these fields. >>> Note: these are the numbers for one of the 5 osds, the other osds a= re similar... >>> >>> * The field osd/loadavg is always about 3 times higher on the 2p c >>> >>> some latency-related counters >>> ------------------------------ >>> osd/op_latency/sum 1p=3D6.24801117205061, 2p=3D579.722513078945 >>> osd/op_process_latency/sum 1p=3D3.48506945394911, 2p=3D42.627849454= 9915 >>> osd/op_r_latency/sum 1p=3D6.2480111719924, 2p=3D579.722513079003 >>> osd/op_r_process_latency/sum 1p=3D3.48506945399276, 2p=3D42.6278494= 550061 > > So, yep, the individual read ops are taking much longer in the > two-client case. Naively that's pretty odd. > >>> >>> >>> and some throttle-msgr_dispatch_throttler related counters >>> ---------------------------------------------------------- >>> throttle-msgr_dispatch_throttler-client/get 1p=3D1337, 2p=3D1339, d= iff=3D2 >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=3D222877, 2p=3D2= 23088, diff=3D211 >>> throttle-msgr_dispatch_throttler-client/put 1p=3D1337, 2p=3D1339, d= iff=3D2 >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=3D222877, 2p=3D2= 23088, diff=3D211 >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=3D58, 2p=3D1= 34, diff=3D76 >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=3D2726, = 2p=3D6298, diff=3D3572 >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=3D58, 2p=3D1= 34, diff=3D76 >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=3D2726, = 2p=3D6298, diff=3D3572 >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=3D58, 2p=3D= 134, diff=3D76 >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=3D2726,= 2p=3D6298, diff=3D3572 >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=3D58, 2p=3D= 134, diff=3D76 >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=3D2726,= 2p=3D6298, diff=3D3572 >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=3D168, 2p=3D252, d= iff=3D84 >>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=3D7896, 2p=3D1= 1844, diff=3D3948 >>> throttle-msgr_dispatch_throttler-hbclient/put 1p=3D168, 2p=3D252, d= iff=3D84 >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=3D7896, 2p=3D1= 1844, diff=3D3948 > > IIRC these are just saying how many times the dispatch throttler was > accessed on each messenger =E2=80=94 nothing here is surprising, you'= re doing > basically the same number of messages on the client messengers, and > the heartbeat messengers are passing more because the test takes > longer. > > I'd go with Sage's idea for what is actually causing this, or try and > look at how the latency changes over time =E2=80=94 if you're going t= o two > pools instead of one, presumably you're doubling the amount of > metadata that needs to be read into memory during the run? Perhaps > that's just a significant enough effect with your settings that you'r= e > seeing a bunch of extra directory lookups impact your throughput more > than expected... :/ =46WIW, typically if I've seen an effect, it's been the opposite where=20 multiple rados bench processes are slightly faster (maybe simply relate= d=20 to the client side implementation). Running collectl or iostat would=20 show various interval statistics that would help diagnose if this is=20 related to slower accesses on the disks. blktrace of course would give= =20 a more nuanced view. Might be worth doing if there are extra metadata=20 accesses. A couple of other random thoughts: - Are any of the reads coming from buffer cache? - Readahead not working well? - Were the pools recreated between tests? - If not, what were the total number of objects and PGs for each pool=20 (Is it possible that the per-pg directory hirearchy was deeper for the=20 2nd set of tests?) - If the pools still exist, what does the following script say about th= em? https://github.com/ceph/cbt/blob/master/tools/readpgdump.py > -Greg > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel"= in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html