From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mark Nelson <mnelson@redhat.com>
Subject: Re: perf counters from a performance discrepancy
Date: Wed, 23 Sep 2015 13:42:32 -0500
Message-ID: <5602F298.7060609@redhat.com>
References: <BC97738F8E7C8742BABED7F06FB9DF9192500992@SATLEXDAG01.amd.com>	<alpine.DEB.2.00.1509231116580.13265@cobra.newdream.net> <CAJ4mKGavY4ZdutpjnKduQ1AcpCpHDeWST8NyOa_RT19W+DvrNQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:39034 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753291AbbIWSmg (ORCPT <rfc822;ceph-devel@vger.kernel.org>);
	Wed, 23 Sep 2015 14:42:36 -0400
In-Reply-To: <CAJ4mKGavY4ZdutpjnKduQ1AcpCpHDeWST8NyOa_RT19W+DvrNQ@mail.gmail.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Gregory Farnum <gfarnum@redhat.com>, Sage Weil <sage@newdream.net>
Cc: "Deneau, Tom" <tom.deneau@amd.com>, "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>


On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote=
:
>> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>>> Hi all --
>>>
>>> Looking for guidance with perf counters...
>>> I am trying to see whether the perf counters can tell me anything a=
bout the following discrepancy
>>>
>>> I populate a number of 40k size objects in each of two pools, poolA=
 and poolB.
>>> Both pools cover osds on a single node, 5 osds total.
>>>
>>>     * Config 1 (1p):
>>>        * use single rados bench client with 32 threads to do seq re=
ad of 20000 objects from poolA.
>>>
>>>     * Config 2 (2p):
>>>        * use two concurrent rados bench clients (running on same cl=
ient node) with 16 threads each,
>>>             one reading 10000 objects from poolA,
>>>             one reading 10000 objects from poolB,
>>>
>>> So in both configs, we have 32 threads total and the number of obje=
cts read is the same.
>>> Note: in all cases, we drop the caches before doing the seq reads
>>>
>>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is ab=
out 1/3 of the bandwidth for
>>> the single client in config 1.
>>
>> How were the object written?  I assume the cluster is backed by spin=
ning
>> disks?
>>
>> I wonder if this is a disk layout issue.  If the 20,000 objects are
>> written in order, they willb e roughly sequential on disk, and the 3=
2
>> thread case will read them in order.  In the 2x 10,000 case, the two
>> clients are reading two sequences of objects written at different
>> times, and the disk arms will be swinging around more.
>>
>> My guess is that if the reads were reading the objects in a random o=
rder
>> the performance would be the same... I'm not sure that rados bench d=
oes
>> that though?
>>
>> sage
>>
>>>
>>>
>>> I gathered perf counters before and after each run and looked at th=
e difference of
>>> the before and after counters for both the 1p and 2p cases.  Here a=
re some things I noticed
>>> that are different between the two runs.  Can someone take a look a=
nd let me know
>>> whether any of these differences are significant.  In particular, f=
or the
>>> throttle-msgr_dispatch_throttler ones, since I don't know the detai=
led definitions of these fields.
>>> Note: these are the numbers for one of the 5 osds, the other osds a=
re similar...
>>>
>>> * The field osd/loadavg is always about 3 times higher on the 2p c
>>>
>>> some latency-related counters
>>> ------------------------------
>>> osd/op_latency/sum 1p=3D6.24801117205061, 2p=3D579.722513078945
>>> osd/op_process_latency/sum 1p=3D3.48506945394911, 2p=3D42.627849454=
9915
>>> osd/op_r_latency/sum 1p=3D6.2480111719924, 2p=3D579.722513079003
>>> osd/op_r_process_latency/sum 1p=3D3.48506945399276, 2p=3D42.6278494=
550061
>
> So, yep, the individual read ops are taking much longer in the
> two-client case. Naively that's pretty odd.
>
>>>
>>>
>>> and some throttle-msgr_dispatch_throttler related counters
>>> ----------------------------------------------------------
>>> throttle-msgr_dispatch_throttler-client/get 1p=3D1337, 2p=3D1339, d=
iff=3D2
>>> throttle-msgr_dispatch_throttler-client/get_sum 1p=3D222877, 2p=3D2=
23088, diff=3D211
>>> throttle-msgr_dispatch_throttler-client/put 1p=3D1337, 2p=3D1339, d=
iff=3D2
>>> throttle-msgr_dispatch_throttler-client/put_sum 1p=3D222877, 2p=3D2=
23088, diff=3D211
>>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=3D58, 2p=3D1=
34, diff=3D76
>>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=3D2726, =
2p=3D6298, diff=3D3572
>>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=3D58, 2p=3D1=
34, diff=3D76
>>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=3D2726, =
2p=3D6298, diff=3D3572
>>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=3D58, 2p=3D=
134, diff=3D76
>>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=3D2726,=
 2p=3D6298, diff=3D3572
>>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=3D58, 2p=3D=
134, diff=3D76
>>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=3D2726,=
 2p=3D6298, diff=3D3572
>>> throttle-msgr_dispatch_throttler-hbclient/get 1p=3D168, 2p=3D252, d=
iff=3D84
>>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=3D7896, 2p=3D1=
1844, diff=3D3948
>>> throttle-msgr_dispatch_throttler-hbclient/put 1p=3D168, 2p=3D252, d=
iff=3D84
>>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=3D7896, 2p=3D1=
1844, diff=3D3948
>
> IIRC these are just saying how many times the dispatch throttler was
> accessed on each messenger =E2=80=94 nothing here is surprising, you'=
re doing
> basically the same number of messages on the client messengers, and
> the heartbeat messengers are passing more because the test takes
> longer.
>
> I'd go with Sage's idea for what is actually causing this, or try and
> look at how the latency changes over time =E2=80=94 if you're going t=
o two
> pools instead of one, presumably you're doubling the amount of
> metadata that needs to be read into memory during the run? Perhaps
> that's just a significant enough effect with your settings that you'r=
e
> seeing a bunch of extra directory lookups impact your throughput more
> than expected... :/

=46WIW, typically if I've seen an effect, it's been the opposite where=20
multiple rados bench processes are slightly faster (maybe simply relate=
d=20
to the client side implementation).  Running collectl or iostat would=20
show various interval statistics that would help diagnose if this is=20
related to slower accesses on the disks.  blktrace of course would give=
=20
a more nuanced view.  Might be worth doing if there are extra metadata=20
accesses.

A couple of other random thoughts:

- Are any of the reads coming from buffer cache?
- Readahead not working well?
- Were the pools recreated between tests?
- If not, what were the total number of objects and PGs for each pool=20
(Is it possible that the per-pg directory hirearchy was deeper for the=20
2nd set of tests?)
- If the pools still exist, what does the following script say about th=
em?

https://github.com/ceph/cbt/blob/master/tools/readpgdump.py


> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"=
 in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html