perf counters from a performance discrepancy

All of lore.kernel.org
 help / color / mirror / Atom feed

* perf counters from a performance discrepancy
@ 2015-09-23 16:33 Deneau, Tom
  2015-09-23 18:19 ` Sage Weil
  2015-09-23 20:39 ` Gregory Farnum
  0 siblings, 2 replies; 15+ messages in thread
From: Deneau, Tom @ 2015-09-23 16:33 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi all --

Looking for guidance with perf counters...
I am trying to see whether the perf counters can tell me anything about the following discrepancy

I populate a number of 40k size objects in each of two pools, poolA and poolB.
Both pools cover osds on a single node, 5 osds total.

   * Config 1 (1p): 
      * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.

   * Config 2 (2p):
      * use two concurrent rados bench clients (running on same client node) with 16 threads each,
           one reading 10000 objects from poolA,
           one reading 10000 objects from poolB,

So in both configs, we have 32 threads total and the number of objects read is the same.
Note: in all cases, we drop the caches before doing the seq reads

The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
the single client in config 1.

I gathered perf counters before and after each run and looked at the difference of
the before and after counters for both the 1p and 2p cases.  Here are some things I noticed
that are different between the two runs.  Can someone take a look and let me know
whether any of these differences are significant.  In particular, for the
throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of these fields.
Note: these are the numbers for one of the 5 osds, the other osds are similar...

* The field osd/loadavg is always about 3 times higher on the 2p c

some latency-related counters
------------------------------
osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

and some throttle-msgr_dispatch_throttler related counters
----------------------------------------------------------
throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948

-- Tom Deneau, AMD

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 16:33 perf counters from a performance discrepancy Deneau, Tom
@ 2015-09-23 18:19 ` Sage Weil
  2015-09-23 18:25   ` Gregory Farnum
  2015-09-23 20:39 ` Gregory Farnum
  1 sibling, 1 reply; 15+ messages in thread
From: Sage Weil @ 2015-09-23 18:19 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel@vger.kernel.org

On Wed, 23 Sep 2015, Deneau, Tom wrote:
> Hi all --
> 
> Looking for guidance with perf counters...
> I am trying to see whether the perf counters can tell me anything about the following discrepancy
> 
> I populate a number of 40k size objects in each of two pools, poolA and poolB.
> Both pools cover osds on a single node, 5 osds total.
> 
>    * Config 1 (1p): 
>       * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.
> 
>    * Config 2 (2p):
>       * use two concurrent rados bench clients (running on same client node) with 16 threads each,
>            one reading 10000 objects from poolA,
>            one reading 10000 objects from poolB,
> 
> So in both configs, we have 32 threads total and the number of objects read is the same.
> Note: in all cases, we drop the caches before doing the seq reads
> 
> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
> the single client in config 1.

How were the object written?  I assume the cluster is backed by spinning 
disks?

I wonder if this is a disk layout issue.  If the 20,000 objects are 
written in order, they willb e roughly sequential on disk, and the 32 
thread case will read them in order.  In the 2x 10,000 case, the two 
clients are reading two sequences of objects written at different 
times, and the disk arms will be swinging around more.

My guess is that if the reads were reading the objects in a random order 
the performance would be the same... I'm not sure that rados bench does 
that though?

sage

> 
> 
> I gathered perf counters before and after each run and looked at the difference of
> the before and after counters for both the 1p and 2p cases.  Here are some things I noticed
> that are different between the two runs.  Can someone take a look and let me know
> whether any of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of these fields.
> Note: these are the numbers for one of the 5 osds, the other osds are similar...
> 
> * The field osd/loadavg is always about 3 times higher on the 2p c
> 
> some latency-related counters
> ------------------------------
> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
> 
> 
> and some throttle-msgr_dispatch_throttler related counters
> ----------------------------------------------------------
> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948
> 
> -- Tom Deneau, AMD
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 18:19 ` Sage Weil
@ 2015-09-23 18:25   ` Gregory Farnum
  2015-09-23 18:42     ` Mark Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2015-09-23 18:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Deneau, Tom, ceph-devel@vger.kernel.org

On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>> Hi all --
>>
>> Looking for guidance with perf counters...
>> I am trying to see whether the perf counters can tell me anything about the following discrepancy
>>
>> I populate a number of 40k size objects in each of two pools, poolA and poolB.
>> Both pools cover osds on a single node, 5 osds total.
>>
>>    * Config 1 (1p):
>>       * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.
>>
>>    * Config 2 (2p):
>>       * use two concurrent rados bench clients (running on same client node) with 16 threads each,
>>            one reading 10000 objects from poolA,
>>            one reading 10000 objects from poolB,
>>
>> So in both configs, we have 32 threads total and the number of objects read is the same.
>> Note: in all cases, we drop the caches before doing the seq reads
>>
>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
>> the single client in config 1.
>
> How were the object written?  I assume the cluster is backed by spinning
> disks?
>
> I wonder if this is a disk layout issue.  If the 20,000 objects are
> written in order, they willb e roughly sequential on disk, and the 32
> thread case will read them in order.  In the 2x 10,000 case, the two
> clients are reading two sequences of objects written at different
> times, and the disk arms will be swinging around more.
>
> My guess is that if the reads were reading the objects in a random order
> the performance would be the same... I'm not sure that rados bench does
> that though?
>
> sage
>
>>
>>
>> I gathered perf counters before and after each run and looked at the difference of
>> the before and after counters for both the 1p and 2p cases.  Here are some things I noticed
>> that are different between the two runs.  Can someone take a look and let me know
>> whether any of these differences are significant.  In particular, for the
>> throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of these fields.
>> Note: these are the numbers for one of the 5 osds, the other osds are similar...
>>
>> * The field osd/loadavg is always about 3 times higher on the 2p c
>>
>> some latency-related counters
>> ------------------------------
>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
>> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

So, yep, the individual read ops are taking much longer in the
two-client case. Naively that's pretty odd.

>>
>>
>> and some throttle-msgr_dispatch_throttler related counters
>> ----------------------------------------------------------
>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
>> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
>> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948

IIRC these are just saying how many times the dispatch throttler was
accessed on each messenger — nothing here is surprising, you're doing
basically the same number of messages on the client messengers, and
the heartbeat messengers are passing more because the test takes
longer.

I'd go with Sage's idea for what is actually causing this, or try and
look at how the latency changes over time — if you're going to two
pools instead of one, presumably you're doubling the amount of
metadata that needs to be read into memory during the run? Perhaps
that's just a significant enough effect with your settings that you're
seeing a bunch of extra directory lookups impact your throughput more
than expected... :/
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 18:25   ` Gregory Farnum
@ 2015-09-23 18:42     ` Mark Nelson
  2015-09-23 20:05       ` Deneau, Tom
  2015-10-07 22:39       ` Deneau, Tom
  0 siblings, 2 replies; 15+ messages in thread
From: Mark Nelson @ 2015-09-23 18:42 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil; +Cc: Deneau, Tom, ceph-devel@vger.kernel.org



On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote:
>> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>>> Hi all --
>>>
>>> Looking for guidance with perf counters...
>>> I am trying to see whether the perf counters can tell me anything about the following discrepancy
>>>
>>> I populate a number of 40k size objects in each of two pools, poolA and poolB.
>>> Both pools cover osds on a single node, 5 osds total.
>>>
>>>     * Config 1 (1p):
>>>        * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.
>>>
>>>     * Config 2 (2p):
>>>        * use two concurrent rados bench clients (running on same client node) with 16 threads each,
>>>             one reading 10000 objects from poolA,
>>>             one reading 10000 objects from poolB,
>>>
>>> So in both configs, we have 32 threads total and the number of objects read is the same.
>>> Note: in all cases, we drop the caches before doing the seq reads
>>>
>>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
>>> the single client in config 1.
>>
>> How were the object written?  I assume the cluster is backed by spinning
>> disks?
>>
>> I wonder if this is a disk layout issue.  If the 20,000 objects are
>> written in order, they willb e roughly sequential on disk, and the 32
>> thread case will read them in order.  In the 2x 10,000 case, the two
>> clients are reading two sequences of objects written at different
>> times, and the disk arms will be swinging around more.
>>
>> My guess is that if the reads were reading the objects in a random order
>> the performance would be the same... I'm not sure that rados bench does
>> that though?
>>
>> sage
>>
>>>
>>>
>>> I gathered perf counters before and after each run and looked at the difference of
>>> the before and after counters for both the 1p and 2p cases.  Here are some things I noticed
>>> that are different between the two runs.  Can someone take a look and let me know
>>> whether any of these differences are significant.  In particular, for the
>>> throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of these fields.
>>> Note: these are the numbers for one of the 5 osds, the other osds are similar...
>>>
>>> * The field osd/loadavg is always about 3 times higher on the 2p c
>>>
>>> some latency-related counters
>>> ------------------------------
>>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
>>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
>>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
>>> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
>
> So, yep, the individual read ops are taking much longer in the
> two-client case. Naively that's pretty odd.
>
>>>
>>>
>>> and some throttle-msgr_dispatch_throttler related counters
>>> ----------------------------------------------------------
>>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
>>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
>>> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
>>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
>>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
>>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
>>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
>>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
>>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
>>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
>>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
>>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
>>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
>>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
>>> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
>>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948
>
> IIRC these are just saying how many times the dispatch throttler was
> accessed on each messenger — nothing here is surprising, you're doing
> basically the same number of messages on the client messengers, and
> the heartbeat messengers are passing more because the test takes
> longer.
>
> I'd go with Sage's idea for what is actually causing this, or try and
> look at how the latency changes over time — if you're going to two
> pools instead of one, presumably you're doubling the amount of
> metadata that needs to be read into memory during the run? Perhaps
> that's just a significant enough effect with your settings that you're
> seeing a bunch of extra directory lookups impact your throughput more
> than expected... :/

FWIW, typically if I've seen an effect, it's been the opposite where 
multiple rados bench processes are slightly faster (maybe simply related 
to the client side implementation).  Running collectl or iostat would 
show various interval statistics that would help diagnose if this is 
related to slower accesses on the disks.  blktrace of course would give 
a more nuanced view.  Might be worth doing if there are extra metadata 
accesses.

A couple of other random thoughts:

- Are any of the reads coming from buffer cache?
- Readahead not working well?
- Were the pools recreated between tests?
- If not, what were the total number of objects and PGs for each pool 
(Is it possible that the per-pg directory hirearchy was deeper for the 
2nd set of tests?)
- If the pools still exist, what does the following script say about them?

https://github.com/ceph/cbt/blob/master/tools/readpgdump.py


> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-09-23 18:42     ` Mark Nelson
@ 2015-09-23 20:05       ` Deneau, Tom
  2015-09-23 20:28         ` Samuel Just
  2015-10-07 22:39       ` Deneau, Tom
  1 sibling, 1 reply; 15+ messages in thread
From: Deneau, Tom @ 2015-09-23 20:05 UTC (permalink / raw)
  To: Mark Nelson, Gregory Farnum, Sage Weil; +Cc: ceph-devel@vger.kernel.org



> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Wednesday, September 23, 2015 1:43 PM
> To: Gregory Farnum; Sage Weil
> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> 
> 
> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote:
> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >>> Hi all --
> >>>
> >>> Looking for guidance with perf counters...
> >>> I am trying to see whether the perf counters can tell me anything
> >>> about the following discrepancy
> >>>
> >>> I populate a number of 40k size objects in each of two pools, poolA
> and poolB.
> >>> Both pools cover osds on a single node, 5 osds total.
> >>>
> >>>     * Config 1 (1p):
> >>>        * use single rados bench client with 32 threads to do seq read
> of 20000 objects from poolA.
> >>>
> >>>     * Config 2 (2p):
> >>>        * use two concurrent rados bench clients (running on same
> client node) with 16 threads each,
> >>>             one reading 10000 objects from poolA,
> >>>             one reading 10000 objects from poolB,
> >>>
> >>> So in both configs, we have 32 threads total and the number of objects
> read is the same.
> >>> Note: in all cases, we drop the caches before doing the seq reads
> >>>
> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >>> about 1/3 of the bandwidth for the single client in config 1.
> >>
> >> How were the object written?  I assume the cluster is backed by
> >> spinning disks?
> >>
> >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> >> written in order, they willb e roughly sequential on disk, and the 32
> >> thread case will read them in order.  In the 2x 10,000 case, the two
> >> clients are reading two sequences of objects written at different
> >> times, and the disk arms will be swinging around more.
> >>
> >> My guess is that if the reads were reading the objects in a random
> >> order the performance would be the same... I'm not sure that rados
> >> bench does that though?
> >>
> >> sage
> >>
> >>>
> >>>
> >>> I gathered perf counters before and after each run and looked at the
> >>> difference of the before and after counters for both the 1p and 2p
> >>> cases.  Here are some things I noticed that are different between
> >>> the two runs.  Can someone take a look and let me know whether any
> >>> of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed
> definitions of these fields.
> >>> Note: these are the numbers for one of the 5 osds, the other osds are
> similar...
> >>>
> >>> * The field osd/loadavg is always about 3 times higher on the 2p c
> >>>
> >>> some latency-related counters
> >>> ------------------------------
> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> >>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> >>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> >>> osd/op_r_process_latency/sum 1p=3.48506945399276,
> >>> 2p=42.6278494550061
> >
> > So, yep, the individual read ops are taking much longer in the
> > two-client case. Naively that's pretty odd.
> >
> >>>
> >>>
> >>> and some throttle-msgr_dispatch_throttler related counters
> >>> ----------------------------------------------------------
> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
> >>> 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
> >>> 2p=223088, diff=211
> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
> >>> diff=84 throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896,
> >>> 2p=11844, diff=3948 throttle-msgr_dispatch_throttler-hbclient/put
> >>> 1p=168, 2p=252, diff=84
> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844,
> >>> diff=3948
> >
> > IIRC these are just saying how many times the dispatch throttler was
> > accessed on each messenger — nothing here is surprising, you're doing
> > basically the same number of messages on the client messengers, and
> > the heartbeat messengers are passing more because the test takes
> > longer.
> >
> > I'd go with Sage's idea for what is actually causing this, or try and
> > look at how the latency changes over time — if you're going to two
> > pools instead of one, presumably you're doubling the amount of
> > metadata that needs to be read into memory during the run? Perhaps
> > that's just a significant enough effect with your settings that you're
> > seeing a bunch of extra directory lookups impact your throughput more
> > than expected... :/
> 
> FWIW, typically if I've seen an effect, it's been the opposite where
> multiple rados bench processes are slightly faster (maybe simply related
> to the client side implementation).  Running collectl or iostat would show
> various interval statistics that would help diagnose if this is related to
> slower accesses on the disks.  blktrace of course would give a more
> nuanced view.  Might be worth doing if there are extra metadata accesses.
> 
> A couple of other random thoughts:
> 
> - Are any of the reads coming from buffer cache?
> - Readahead not working well?
> - Were the pools recreated between tests?
> - If not, what were the total number of objects and PGs for each pool (Is
> it possible that the per-pg directory hirearchy was deeper for the 2nd set
> of tests?)
> - If the pools still exist, what does the following script say about them?
> 
> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> 
> 

Thanks Sage, Greg and Mark for the suggestions --

The objects in the pools were created by running two simultaneous
client rados bench write processes, one to each pool.  I started with
empty osd disks.  The pools both cover the same osds and since
there was a replication factor of 2, so I would have expected that not
all the primary objects in each pool would be sequential on the disk.

But when I changed the read test to do rados bench rand rather than
seq, both the 1p and 2p configs went down in bandwidth, so clearly the
disk layout had something to do with it.  With random reads, 1p is now
"only" 40% faster than 2p.

Mark had mentioned that he's often seen the opposite effect, ie, more
client processes increasing the bandwidth, and we have seen the same,
which is why I started using multiple clients.  In fact if I do the
same configuration tests as above but with 4M objects rather than 40K,
I see the 2p beats 1p by 30% or so. (and I've seen 4p help even more).

So there's something about smaller object sizes that gets hurt by
multiple clients vs. 1 client.

I will try to look into this with some of the low level stat tools
that Mark mentioned.

-- Tom

> > -Greg
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 20:05       ` Deneau, Tom
@ 2015-09-23 20:28         ` Samuel Just
  2015-09-23 20:46           ` Deneau, Tom
  0 siblings, 1 reply; 15+ messages in thread
From: Samuel Just @ 2015-09-23 20:28 UTC (permalink / raw)
  To: Deneau, Tom
  Cc: Mark Nelson, Gregory Farnum, Sage Weil,
	ceph-devel@vger.kernel.org

Just to eliminate a variable, can you reproduce this on master, first
with the simple messenger, and then with the async messenger? (make
sure to switch the messengers on all daemons and clients, just put it
in the [global] section on all configs).
-Sam

On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom <tom.deneau@amd.com> wrote:
>
>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Wednesday, September 23, 2015 1:43 PM
>> To: Gregory Farnum; Sage Weil
>> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
>> Subject: Re: perf counters from a performance discrepancy
>>
>>
>>
>> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
>> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote:
>> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>> >>> Hi all --
>> >>>
>> >>> Looking for guidance with perf counters...
>> >>> I am trying to see whether the perf counters can tell me anything
>> >>> about the following discrepancy
>> >>>
>> >>> I populate a number of 40k size objects in each of two pools, poolA
>> and poolB.
>> >>> Both pools cover osds on a single node, 5 osds total.
>> >>>
>> >>>     * Config 1 (1p):
>> >>>        * use single rados bench client with 32 threads to do seq read
>> of 20000 objects from poolA.
>> >>>
>> >>>     * Config 2 (2p):
>> >>>        * use two concurrent rados bench clients (running on same
>> client node) with 16 threads each,
>> >>>             one reading 10000 objects from poolA,
>> >>>             one reading 10000 objects from poolB,
>> >>>
>> >>> So in both configs, we have 32 threads total and the number of objects
>> read is the same.
>> >>> Note: in all cases, we drop the caches before doing the seq reads
>> >>>
>> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
>> >>> about 1/3 of the bandwidth for the single client in config 1.
>> >>
>> >> How were the object written?  I assume the cluster is backed by
>> >> spinning disks?
>> >>
>> >> I wonder if this is a disk layout issue.  If the 20,000 objects are
>> >> written in order, they willb e roughly sequential on disk, and the 32
>> >> thread case will read them in order.  In the 2x 10,000 case, the two
>> >> clients are reading two sequences of objects written at different
>> >> times, and the disk arms will be swinging around more.
>> >>
>> >> My guess is that if the reads were reading the objects in a random
>> >> order the performance would be the same... I'm not sure that rados
>> >> bench does that though?
>> >>
>> >> sage
>> >>
>> >>>
>> >>>
>> >>> I gathered perf counters before and after each run and looked at the
>> >>> difference of the before and after counters for both the 1p and 2p
>> >>> cases.  Here are some things I noticed that are different between
>> >>> the two runs.  Can someone take a look and let me know whether any
>> >>> of these differences are significant.  In particular, for the
>> throttle-msgr_dispatch_throttler ones, since I don't know the detailed
>> definitions of these fields.
>> >>> Note: these are the numbers for one of the 5 osds, the other osds are
>> similar...
>> >>>
>> >>> * The field osd/loadavg is always about 3 times higher on the 2p c
>> >>>
>> >>> some latency-related counters
>> >>> ------------------------------
>> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
>> >>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
>> >>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
>> >>> osd/op_r_process_latency/sum 1p=3.48506945399276,
>> >>> 2p=42.6278494550061
>> >
>> > So, yep, the individual read ops are taking much longer in the
>> > two-client case. Naively that's pretty odd.
>> >
>> >>>
>> >>>
>> >>> and some throttle-msgr_dispatch_throttler related counters
>> >>> ----------------------------------------------------------
>> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
>> >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
>> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
>> >>> 1p=1337, 2p=1339, diff=2
>> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
>> >>> 2p=223088, diff=211
>> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
>> >>> diff=84 throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896,
>> >>> 2p=11844, diff=3948 throttle-msgr_dispatch_throttler-hbclient/put
>> >>> 1p=168, 2p=252, diff=84
>> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844,
>> >>> diff=3948
>> >
>> > IIRC these are just saying how many times the dispatch throttler was
>> > accessed on each messenger — nothing here is surprising, you're doing
>> > basically the same number of messages on the client messengers, and
>> > the heartbeat messengers are passing more because the test takes
>> > longer.
>> >
>> > I'd go with Sage's idea for what is actually causing this, or try and
>> > look at how the latency changes over time — if you're going to two
>> > pools instead of one, presumably you're doubling the amount of
>> > metadata that needs to be read into memory during the run? Perhaps
>> > that's just a significant enough effect with your settings that you're
>> > seeing a bunch of extra directory lookups impact your throughput more
>> > than expected... :/
>>
>> FWIW, typically if I've seen an effect, it's been the opposite where
>> multiple rados bench processes are slightly faster (maybe simply related
>> to the client side implementation).  Running collectl or iostat would show
>> various interval statistics that would help diagnose if this is related to
>> slower accesses on the disks.  blktrace of course would give a more
>> nuanced view.  Might be worth doing if there are extra metadata accesses.
>>
>> A couple of other random thoughts:
>>
>> - Are any of the reads coming from buffer cache?
>> - Readahead not working well?
>> - Were the pools recreated between tests?
>> - If not, what were the total number of objects and PGs for each pool (Is
>> it possible that the per-pg directory hirearchy was deeper for the 2nd set
>> of tests?)
>> - If the pools still exist, what does the following script say about them?
>>
>> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
>>
>>
>
> Thanks Sage, Greg and Mark for the suggestions --
>
> The objects in the pools were created by running two simultaneous
> client rados bench write processes, one to each pool.  I started with
> empty osd disks.  The pools both cover the same osds and since
> there was a replication factor of 2, so I would have expected that not
> all the primary objects in each pool would be sequential on the disk.
>
> But when I changed the read test to do rados bench rand rather than
> seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> disk layout had something to do with it.  With random reads, 1p is now
> "only" 40% faster than 2p.
>
> Mark had mentioned that he's often seen the opposite effect, ie, more
> client processes increasing the bandwidth, and we have seen the same,
> which is why I started using multiple clients.  In fact if I do the
> same configuration tests as above but with 4M objects rather than 40K,
> I see the 2p beats 1p by 30% or so. (and I've seen 4p help even more).
>
> So there's something about smaller object sizes that gets hurt by
> multiple clients vs. 1 client.
>
> I will try to look into this with some of the low level stat tools
> that Mark mentioned.
>
> -- Tom
>
>> > -Greg
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> > in the body of a message to majordomo@vger.kernel.org More majordomo
>> > info at  http://vger.kernel.org/majordomo-info.html
>> >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-09-23 20:28         ` Samuel Just
@ 2015-09-23 20:46           ` Deneau, Tom
  0 siblings, 0 replies; 15+ messages in thread
From: Deneau, Tom @ 2015-09-23 20:46 UTC (permalink / raw)
  To: Samuel Just
  Cc: Mark Nelson, Gregory Farnum, Sage Weil,
	ceph-devel@vger.kernel.org

I will be out of office for a week but will put this on the list of things to try when I get back.

-- Tom

> -----Original Message-----
> From: Samuel Just [mailto:sjust@redhat.com]
> Sent: Wednesday, September 23, 2015 3:28 PM
> To: Deneau, Tom
> Cc: Mark Nelson; Gregory Farnum; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> Just to eliminate a variable, can you reproduce this on master, first with
> the simple messenger, and then with the async messenger? (make sure to
> switch the messengers on all daemons and clients, just put it in the
> [global] section on all configs).
> -Sam
> 
> On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom <tom.deneau@amd.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@redhat.com]
> >> Sent: Wednesday, September 23, 2015 1:43 PM
> >> To: Gregory Farnum; Sage Weil
> >> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> >> Subject: Re: perf counters from a performance discrepancy
> >>
> >>
> >>
> >> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> >> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net>
> wrote:
> >> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >> >>> Hi all --
> >> >>>
> >> >>> Looking for guidance with perf counters...
> >> >>> I am trying to see whether the perf counters can tell me anything
> >> >>> about the following discrepancy
> >> >>>
> >> >>> I populate a number of 40k size objects in each of two pools,
> >> >>> poolA
> >> and poolB.
> >> >>> Both pools cover osds on a single node, 5 osds total.
> >> >>>
> >> >>>     * Config 1 (1p):
> >> >>>        * use single rados bench client with 32 threads to do seq
> >> >>> read
> >> of 20000 objects from poolA.
> >> >>>
> >> >>>     * Config 2 (2p):
> >> >>>        * use two concurrent rados bench clients (running on same
> >> client node) with 16 threads each,
> >> >>>             one reading 10000 objects from poolA,
> >> >>>             one reading 10000 objects from poolB,
> >> >>>
> >> >>> So in both configs, we have 32 threads total and the number of
> >> >>> objects
> >> read is the same.
> >> >>> Note: in all cases, we drop the caches before doing the seq reads
> >> >>>
> >> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >> >>> about 1/3 of the bandwidth for the single client in config 1.
> >> >>
> >> >> How were the object written?  I assume the cluster is backed by
> >> >> spinning disks?
> >> >>
> >> >> I wonder if this is a disk layout issue.  If the 20,000 objects
> >> >> are written in order, they willb e roughly sequential on disk, and
> >> >> the 32 thread case will read them in order.  In the 2x 10,000
> >> >> case, the two clients are reading two sequences of objects written
> >> >> at different times, and the disk arms will be swinging around more.
> >> >>
> >> >> My guess is that if the reads were reading the objects in a random
> >> >> order the performance would be the same... I'm not sure that rados
> >> >> bench does that though?
> >> >>
> >> >> sage
> >> >>
> >> >>>
> >> >>>
> >> >>> I gathered perf counters before and after each run and looked at
> >> >>> the difference of the before and after counters for both the 1p
> >> >>> and 2p cases.  Here are some things I noticed that are different
> >> >>> between the two runs.  Can someone take a look and let me know
> >> >>> whether any of these differences are significant.  In particular,
> >> >>> for the
> >> throttle-msgr_dispatch_throttler ones, since I don't know the
> >> detailed definitions of these fields.
> >> >>> Note: these are the numbers for one of the 5 osds, the other osds
> >> >>> are
> >> similar...
> >> >>>
> >> >>> * The field osd/loadavg is always about 3 times higher on the 2p
> >> >>> c
> >> >>>
> >> >>> some latency-related counters
> >> >>> ------------------------------
> >> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> >> >>> osd/op_process_latency/sum 1p=3.48506945394911,
> >> >>> 2p=42.6278494549915 osd/op_r_latency/sum 1p=6.2480111719924,
> >> >>> 2p=579.722513079003 osd/op_r_process_latency/sum
> >> >>> 1p=3.48506945399276,
> >> >>> 2p=42.6278494550061
> >> >
> >> > So, yep, the individual read ops are taking much longer in the
> >> > two-client case. Naively that's pretty odd.
> >> >
> >> >>>
> >> >>>
> >> >>> and some throttle-msgr_dispatch_throttler related counters
> >> >>> ----------------------------------------------------------
> >> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339,
> >> >>> diff=2 throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
> >> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
> >> >>> 1p=1337, 2p=1339, diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
> >> >>> 2p=223088, diff=211
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58,
> >> >>> 2p=134,
> >> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
> >> >>> 1p=2726, 2p=6298, diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58,
> >> >>> 2p=134,
> >> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
> >> >>> 1p=2726, 2p=6298, diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58,
> >> >>> 2p=134,
> >> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
> >> >>> 1p=2726, 2p=6298, diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58,
> >> >>> 2p=134,
> >> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
> >> >>> 1p=2726, 2p=6298, diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
> >> >>> diff=84 throttle-msgr_dispatch_throttler-hbclient/get_sum
> >> >>> 1p=7896, 2p=11844, diff=3948
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put
> >> >>> 1p=168, 2p=252, diff=84
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896,
> >> >>> 2p=11844,
> >> >>> diff=3948
> >> >
> >> > IIRC these are just saying how many times the dispatch throttler
> >> > was accessed on each messenger — nothing here is surprising, you're
> >> > doing basically the same number of messages on the client
> >> > messengers, and the heartbeat messengers are passing more because
> >> > the test takes longer.
> >> >
> >> > I'd go with Sage's idea for what is actually causing this, or try
> >> > and look at how the latency changes over time — if you're going to
> >> > two pools instead of one, presumably you're doubling the amount of
> >> > metadata that needs to be read into memory during the run? Perhaps
> >> > that's just a significant enough effect with your settings that
> >> > you're seeing a bunch of extra directory lookups impact your
> >> > throughput more than expected... :/
> >>
> >> FWIW, typically if I've seen an effect, it's been the opposite where
> >> multiple rados bench processes are slightly faster (maybe simply
> >> related to the client side implementation).  Running collectl or
> >> iostat would show various interval statistics that would help
> >> diagnose if this is related to slower accesses on the disks.
> >> blktrace of course would give a more nuanced view.  Might be worth
> doing if there are extra metadata accesses.
> >>
> >> A couple of other random thoughts:
> >>
> >> - Are any of the reads coming from buffer cache?
> >> - Readahead not working well?
> >> - Were the pools recreated between tests?
> >> - If not, what were the total number of objects and PGs for each pool
> >> (Is it possible that the per-pg directory hirearchy was deeper for
> >> the 2nd set of tests?)
> >> - If the pools still exist, what does the following script say about
> them?
> >>
> >> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> >>
> >>
> >
> > Thanks Sage, Greg and Mark for the suggestions --
> >
> > The objects in the pools were created by running two simultaneous
> > client rados bench write processes, one to each pool.  I started with
> > empty osd disks.  The pools both cover the same osds and since there
> > was a replication factor of 2, so I would have expected that not all
> > the primary objects in each pool would be sequential on the disk.
> >
> > But when I changed the read test to do rados bench rand rather than
> > seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> > disk layout had something to do with it.  With random reads, 1p is now
> > "only" 40% faster than 2p.
> >
> > Mark had mentioned that he's often seen the opposite effect, ie, more
> > client processes increasing the bandwidth, and we have seen the same,
> > which is why I started using multiple clients.  In fact if I do the
> > same configuration tests as above but with 4M objects rather than 40K,
> > I see the 2p beats 1p by 30% or so. (and I've seen 4p help even more).
> >
> > So there's something about smaller object sizes that gets hurt by
> > multiple clients vs. 1 client.
> >
> > I will try to look into this with some of the low level stat tools
> > that Mark mentioned.
> >
> > -- Tom
> >
> >> > -Greg
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> > in the body of a message to majordomo@vger.kernel.org More
> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-09-23 18:42     ` Mark Nelson
  2015-09-23 20:05       ` Deneau, Tom
@ 2015-10-07 22:39       ` Deneau, Tom
  2015-10-08  2:47         ` Sage Weil
  1 sibling, 1 reply; 15+ messages in thread
From: Deneau, Tom @ 2015-10-07 22:39 UTC (permalink / raw)
  To: Mark Nelson, Gregory Farnum, Sage Weil; +Cc: ceph-devel@vger.kernel.org

> -----Original Message-----
> From: Deneau, Tom
> Sent: Wednesday, September 23, 2015 3:05 PM
> To: 'Mark Nelson'; Gregory Farnum; Sage Weil
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: perf counters from a performance discrepancy
> 
> 
> 
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Wednesday, September 23, 2015 1:43 PM
> > To: Gregory Farnum; Sage Weil
> > Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> > Subject: Re: perf counters from a performance discrepancy
> >
> >
> >
> > On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@newdream.net> wrote:
> > >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> > >>> Hi all --
> > >>>
> > >>> Looking for guidance with perf counters...
> > >>> I am trying to see whether the perf counters can tell me anything
> > >>> about the following discrepancy
> > >>>
> > >>> I populate a number of 40k size objects in each of two pools,
> > >>> poolA
> > and poolB.
> > >>> Both pools cover osds on a single node, 5 osds total.
> > >>>
> > >>>     * Config 1 (1p):
> > >>>        * use single rados bench client with 32 threads to do seq
> > >>> read
> > of 20000 objects from poolA.
> > >>>
> > >>>     * Config 2 (2p):
> > >>>        * use two concurrent rados bench clients (running on same
> > client node) with 16 threads each,
> > >>>             one reading 10000 objects from poolA,
> > >>>             one reading 10000 objects from poolB,
> > >>>
> > >>> So in both configs, we have 32 threads total and the number of
> > >>> objects
> > read is the same.
> > >>> Note: in all cases, we drop the caches before doing the seq reads
> > >>>
> > >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> > >>> about 1/3 of the bandwidth for the single client in config 1.
> > >>
> > >> How were the object written?  I assume the cluster is backed by
> > >> spinning disks?
> > >>
> > >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> > >> written in order, they willb e roughly sequential on disk, and the
> > >> 32 thread case will read them in order.  In the 2x 10,000 case, the
> > >> two clients are reading two sequences of objects written at
> > >> different times, and the disk arms will be swinging around more.
> > >>
> > >> My guess is that if the reads were reading the objects in a random
> > >> order the performance would be the same... I'm not sure that rados
> > >> bench does that though?
> > >>
> > >> sage
> > >>
> > >>>
> > >>>
> > >>> I gathered perf counters before and after each run and looked at
> > >>> the difference of the before and after counters for both the 1p
> > >>> and 2p cases.  Here are some things I noticed that are different
> > >>> between the two runs.  Can someone take a look and let me know
> > >>> whether any of these differences are significant.  In particular,
> > >>> for the
> > throttle-msgr_dispatch_throttler ones, since I don't know the detailed
> > definitions of these fields.
> > >>> Note: these are the numbers for one of the 5 osds, the other osds
> > >>> are
> > similar...
> > >>>
> > >>> * The field osd/loadavg is always about 3 times higher on the 2p c
> > >>>
> > >>> some latency-related counters
> > >>> ------------------------------
> > >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> > >>> osd/op_process_latency/sum 1p=3.48506945394911,
> > >>> 2p=42.6278494549915 osd/op_r_latency/sum 1p=6.2480111719924,
> > >>> 2p=579.722513079003 osd/op_r_process_latency/sum
> > >>> 1p=3.48506945399276,
> > >>> 2p=42.6278494550061
> > >
> > > So, yep, the individual read ops are taking much longer in the
> > > two-client case. Naively that's pretty odd.
> > >
> > >>>
> > >>>
> > >>> and some throttle-msgr_dispatch_throttler related counters
> > >>> ----------------------------------------------------------
> > >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339,
> > >>> diff=2 throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
> > >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
> > >>> 1p=1337, 2p=1339, diff=2
> > >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
> > >>> 2p=223088, diff=211
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134,
> > >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
> > >>> 1p=2726, 2p=6298, diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134,
> > >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
> > >>> 1p=2726, 2p=6298, diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58,
> > >>> 2p=134,
> > >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
> > >>> 1p=2726, 2p=6298, diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58,
> > >>> 2p=134,
> > >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
> > >>> 1p=2726, 2p=6298, diff=3572
> > >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
> > >>> diff=84 throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896,
> > >>> 2p=11844, diff=3948 throttle-msgr_dispatch_throttler-hbclient/put
> > >>> 1p=168, 2p=252, diff=84
> > >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896,
> > >>> 2p=11844,
> > >>> diff=3948
> > >
> > > IIRC these are just saying how many times the dispatch throttler was
> > > accessed on each messenger — nothing here is surprising, you're
> > > doing basically the same number of messages on the client
> > > messengers, and the heartbeat messengers are passing more because
> > > the test takes longer.
> > >
> > > I'd go with Sage's idea for what is actually causing this, or try
> > > and look at how the latency changes over time — if you're going to
> > > two pools instead of one, presumably you're doubling the amount of
> > > metadata that needs to be read into memory during the run? Perhaps
> > > that's just a significant enough effect with your settings that
> > > you're seeing a bunch of extra directory lookups impact your
> > > throughput more than expected... :/
> >
> > FWIW, typically if I've seen an effect, it's been the opposite where
> > multiple rados bench processes are slightly faster (maybe simply
> > related to the client side implementation).  Running collectl or
> > iostat would show various interval statistics that would help diagnose
> > if this is related to slower accesses on the disks.  blktrace of
> > course would give a more nuanced view.  Might be worth doing if there
> are extra metadata accesses.
> >
> > A couple of other random thoughts:
> >
> > - Are any of the reads coming from buffer cache?
> > - Readahead not working well?
> > - Were the pools recreated between tests?
> > - If not, what were the total number of objects and PGs for each pool
> > (Is it possible that the per-pg directory hirearchy was deeper for the
> > 2nd set of tests?)
> > - If the pools still exist, what does the following script say about
> them?
> >
> > https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> >
> >
> 
> Thanks Sage, Greg and Mark for the suggestions --
> 
> The objects in the pools were created by running two simultaneous client
> rados bench write processes, one to each pool.  I started with empty osd
> disks.  The pools both cover the same osds and since there was a
> replication factor of 2, so I would have expected that not all the primary
> objects in each pool would be sequential on the disk.
> 
> But when I changed the read test to do rados bench rand rather than seq,
> both the 1p and 2p configs went down in bandwidth, so clearly the disk
> layout had something to do with it.  With random reads, 1p is now "only"
> 40% faster than 2p.
> 
> Mark had mentioned that he's often seen the opposite effect, ie, more
> client processes increasing the bandwidth, and we have seen the same,
> which is why I started using multiple clients.  In fact if I do the same
> configuration tests as above but with 4M objects rather than 40K, I see
> the 2p beats 1p by 30% or so. (and I've seen 4p help even more).
> 
> So there's something about smaller object sizes that gets hurt by multiple
> clients vs. 1 client.
> 
> I will try to look into this with some of the low level stat tools that
> Mark mentioned.
> 
> -- Tom
> 
> > > -Greg
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >

I finally got around to looking at the dump_historic_ops output for the 1-client and 2-client cases.
As you recall these are all read-ops.  so the events in the dump are
   initiated
   reached_pg
   started
   done

The pattern I see for most of the slow ops recorded in the dump is:

   * In the 1-client case the typical slow op has duration between 50-65 ms
     and usually most of this is the interval between reached_pg and started.

   * In the 2-client case the typical slow op has duration between 95-120 ms
     and again usually most of this is the interval between reached_pg and started.

Could someone describe what the interval between reached_pg and started means?

I did see this earlier mail thread on the definition of the events, but
maybe this discussion only applies to writes (or else things have changed for v9.0.3)
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12783.html

-- Tom

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-10-07 22:39       ` Deneau, Tom
@ 2015-10-08  2:47         ` Sage Weil
  2015-10-08 15:49           ` Deneau, Tom
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2015-10-08  2:47 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: Mark Nelson, Gregory Farnum, ceph-devel@vger.kernel.org

> I finally got around to looking at the dump_historic_ops output for the 
> 1-client and 2-client cases.
> As you recall these are all read-ops.  so the events in the dump are
>    initiated
>    reached_pg
>    started
>    done
> 
> The pattern I see for most of the slow ops recorded in the dump is:
> 
>    * In the 1-client case the typical slow op has duration between 50-65 ms
>      and usually most of this is the interval between reached_pg and started.
> 
>    * In the 2-client case the typical slow op has duration between 95-120 ms
>      and again usually most of this is the interval between reached_pg 
>      and started.
> 
> Could someone describe what the interval between reached_pg and started 
> means?

I think the slow part is probably find_object_context() (although to be 
fair tons of stuff happens here, see do_op()).  You could test this 
theory or otherwise narrow this down with additional event 
markers lke

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index d6f3084..6faccc2 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1691,10 +1691,12 @@ void ReplicatedPG::do_op(OpRequestRef& op)
     return;
   }
 
+  op->mark_event("about to find");
   int r = find_object_context(
     oid, &obc, can_create,
     m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
     &missing_oid);
+  op->mark_event("found");
 
   if (r == -EAGAIN) {
     // If we're not the primary of this OSD, and we have


sage

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-10-08  2:47         ` Sage Weil
@ 2015-10-08 15:49           ` Deneau, Tom
  2015-10-08 15:55             ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Deneau, Tom @ 2015-10-08 15:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Gregory Farnum, ceph-devel@vger.kernel.org



> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, October 07, 2015 9:48 PM
> To: Deneau, Tom
> Cc: Mark Nelson; Gregory Farnum; ceph-devel@vger.kernel.org
> Subject: RE: perf counters from a performance discrepancy
> 
> > I finally got around to looking at the dump_historic_ops output for
> > the 1-client and 2-client cases.
> > As you recall these are all read-ops.  so the events in the dump are
> >    initiated
> >    reached_pg
> >    started
> >    done
> >
> > The pattern I see for most of the slow ops recorded in the dump is:
> >
> >    * In the 1-client case the typical slow op has duration between 50-65
> ms
> >      and usually most of this is the interval between reached_pg and
> started.
> >
> >    * In the 2-client case the typical slow op has duration between 95-
> 120 ms
> >      and again usually most of this is the interval between reached_pg
> >      and started.
> >
> > Could someone describe what the interval between reached_pg and
> > started means?
> 
> I think the slow part is probably find_object_context() (although to be
> fair tons of stuff happens here, see do_op()).  You could test this theory
> or otherwise narrow this down with additional event markers lke
> 
> diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index
> d6f3084..6faccc2 100644
> --- a/src/osd/ReplicatedPG.cc
> +++ b/src/osd/ReplicatedPG.cc
> @@ -1691,10 +1691,12 @@ void ReplicatedPG::do_op(OpRequestRef& op)
>      return;
>    }
> 
> +  op->mark_event("about to find");
>    int r = find_object_context(
>      oid, &obc, can_create,
>      m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
>      &missing_oid);
> +  op->mark_event("found");
> 
>    if (r == -EAGAIN) {
>      // If we're not the primary of this OSD, and we have
> 
> 
> sage

Sage --

Is it likely that find_object_context would take longer when there are two clients
each using their own pool (compared to one client using one pool)?

And would two clients using the same pool spend less time in find_object_context?

-- Tom

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-10-08 15:49           ` Deneau, Tom
@ 2015-10-08 15:55             ` Sage Weil
  2015-10-08 17:58               ` Somnath Roy
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2015-10-08 15:55 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: Mark Nelson, Gregory Farnum, ceph-devel@vger.kernel.org

On Thu, 8 Oct 2015, Deneau, Tom wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, October 07, 2015 9:48 PM
> > To: Deneau, Tom
> > Cc: Mark Nelson; Gregory Farnum; ceph-devel@vger.kernel.org
> > Subject: RE: perf counters from a performance discrepancy
> > 
> > > I finally got around to looking at the dump_historic_ops output for
> > > the 1-client and 2-client cases.
> > > As you recall these are all read-ops.  so the events in the dump are
> > >    initiated
> > >    reached_pg
> > >    started
> > >    done
> > >
> > > The pattern I see for most of the slow ops recorded in the dump is:
> > >
> > >    * In the 1-client case the typical slow op has duration between 50-65
> > ms
> > >      and usually most of this is the interval between reached_pg and
> > started.
> > >
> > >    * In the 2-client case the typical slow op has duration between 95-
> > 120 ms
> > >      and again usually most of this is the interval between reached_pg
> > >      and started.
> > >
> > > Could someone describe what the interval between reached_pg and
> > > started means?
> > 
> > I think the slow part is probably find_object_context() (although to be
> > fair tons of stuff happens here, see do_op()).  You could test this theory
> > or otherwise narrow this down with additional event markers lke
> > 
> > diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index
> > d6f3084..6faccc2 100644
> > --- a/src/osd/ReplicatedPG.cc
> > +++ b/src/osd/ReplicatedPG.cc
> > @@ -1691,10 +1691,12 @@ void ReplicatedPG::do_op(OpRequestRef& op)
> >      return;
> >    }
> > 
> > +  op->mark_event("about to find");
> >    int r = find_object_context(
> >      oid, &obc, can_create,
> >      m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
> >      &missing_oid);
> > +  op->mark_event("found");
> > 
> >    if (r == -EAGAIN) {
> >      // If we're not the primary of this OSD, and we have
> > 
> > 
> > sage
> 
> Sage --
> 
> Is it likely that find_object_context would take longer when there are two clients
> each using their own pool (compared to one client using one pool)?
> 
> And would two clients using the same pool spend less time in find_object_context?

Maybe.. find_object_context is where we lookup the file, and if there 
are caching effects that would be one place where we'd see it.

sage


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-10-08 15:55             ` Sage Weil
@ 2015-10-08 17:58               ` Somnath Roy
  0 siblings, 0 replies; 15+ messages in thread
From: Somnath Roy @ 2015-10-08 17:58 UTC (permalink / raw)
  To: Sage Weil, Deneau, Tom
  Cc: Mark Nelson, Gregory Farnum, ceph-devel@vger.kernel.org

If I remember correctly, Nick faced similar issue and we debugged down to the xattr access issue in the find_object_context(). I am not sure if it is resolved though for him or not.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, October 08, 2015 8:56 AM
To: Deneau, Tom
Cc: Mark Nelson; Gregory Farnum; ceph-devel@vger.kernel.org
Subject: RE: perf counters from a performance discrepancy

On Thu, 8 Oct 2015, Deneau, Tom wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, October 07, 2015 9:48 PM
> > To: Deneau, Tom
> > Cc: Mark Nelson; Gregory Farnum; ceph-devel@vger.kernel.org
> > Subject: RE: perf counters from a performance discrepancy
> >
> > > I finally got around to looking at the dump_historic_ops output
> > > for the 1-client and 2-client cases.
> > > As you recall these are all read-ops.  so the events in the dump are
> > >    initiated
> > >    reached_pg
> > >    started
> > >    done
> > >
> > > The pattern I see for most of the slow ops recorded in the dump is:
> > >
> > >    * In the 1-client case the typical slow op has duration between
> > > 50-65
> > ms
> > >      and usually most of this is the interval between reached_pg
> > > and
> > started.
> > >
> > >    * In the 2-client case the typical slow op has duration between
> > > 95-
> > 120 ms
> > >      and again usually most of this is the interval between reached_pg
> > >      and started.
> > >
> > > Could someone describe what the interval between reached_pg and
> > > started means?
> >
> > I think the slow part is probably find_object_context() (although to
> > be fair tons of stuff happens here, see do_op()).  You could test
> > this theory or otherwise narrow this down with additional event
> > markers lke
> >
> > diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index
> > d6f3084..6faccc2 100644
> > --- a/src/osd/ReplicatedPG.cc
> > +++ b/src/osd/ReplicatedPG.cc
> > @@ -1691,10 +1691,12 @@ void ReplicatedPG::do_op(OpRequestRef& op)
> >      return;
> >    }
> >
> > +  op->mark_event("about to find");
> >    int r = find_object_context(
> >      oid, &obc, can_create,
> >      m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
> >      &missing_oid);
> > +  op->mark_event("found");
> >
> >    if (r == -EAGAIN) {
> >      // If we're not the primary of this OSD, and we have
> >
> >
> > sage
>
> Sage --
>
> Is it likely that find_object_context would take longer when there are
> two clients each using their own pool (compared to one client using one pool)?
>
> And would two clients using the same pool spend less time in find_object_context?

Maybe.. find_object_context is where we lookup the file, and if there are caching effects that would be one place where we'd see it.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 16:33 perf counters from a performance discrepancy Deneau, Tom
  2015-09-23 18:19 ` Sage Weil
@ 2015-09-23 20:39 ` Gregory Farnum
  2015-09-23 20:51   ` Deneau, Tom
  1 sibling, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2015-09-23 20:39 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel@vger.kernel.org

On Wed, Sep 23, 2015 at 9:33 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> Hi all --
>
> Looking for guidance with perf counters...
> I am trying to see whether the perf counters can tell me anything about the following discrepancy
>
> I populate a number of 40k size objects in each of two pools, poolA and poolB.
> Both pools cover osds on a single node, 5 osds total.
>
>    * Config 1 (1p):
>       * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.
>
>    * Config 2 (2p):
>       * use two concurrent rados bench clients (running on same client node) with 16 threads each,
>            one reading 10000 objects from poolA,
>            one reading 10000 objects from poolB,
>
> So in both configs, we have 32 threads total and the number of objects read is the same.
> Note: in all cases, we drop the caches before doing the seq reads
>
> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
> the single client in config 1.
>
>
> I gathered perf counters before and after each run and looked at the difference of
> the before and after counters for both the 1p and 2p cases.  Here are some things I noticed
> that are different between the two runs.  Can someone take a look and let me know
> whether any of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of these fields.
> Note: these are the numbers for one of the 5 osds, the other osds are similar...
>
> * The field osd/loadavg is always about 3 times higher on the 2p c
>
> some latency-related counters
> ------------------------------
> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k
reads during this test. Which if I'm reading these properly means
OSD-side latency is something like 1.5 milliseconds for the single
client and...144 milliseconds for the two-client case! You might try
dumping some of the historic ops out of the admin socket and seeing
where the time is getting spent (is it all on disk accesses?). And
trying to reproduce something like this workload on your disks without
Ceph involved.
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: perf counters from a performance discrepancy
  2015-09-23 20:39 ` Gregory Farnum
@ 2015-09-23 20:51   ` Deneau, Tom
  2015-09-23 20:52     ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Deneau, Tom @ 2015-09-23 20:51 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org



> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 23, 2015 3:39 PM
> To: Deneau, Tom
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> On Wed, Sep 23, 2015 at 9:33 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> > Hi all --
> >
> > Looking for guidance with perf counters...
> > I am trying to see whether the perf counters can tell me anything
> > about the following discrepancy
> >
> > I populate a number of 40k size objects in each of two pools, poolA and
> poolB.
> > Both pools cover osds on a single node, 5 osds total.
> >
> >    * Config 1 (1p):
> >       * use single rados bench client with 32 threads to do seq read of
> 20000 objects from poolA.
> >
> >    * Config 2 (2p):
> >       * use two concurrent rados bench clients (running on same client
> node) with 16 threads each,
> >            one reading 10000 objects from poolA,
> >            one reading 10000 objects from poolB,
> >
> > So in both configs, we have 32 threads total and the number of objects
> read is the same.
> > Note: in all cases, we drop the caches before doing the seq reads
> >
> > The combined bandwidth (MB/sec) for the 2 clients in config 2 is about
> > 1/3 of the bandwidth for the single client in config 1.
> >
> >
> > I gathered perf counters before and after each run and looked at the
> > difference of the before and after counters for both the 1p and 2p
> > cases.  Here are some things I noticed that are different between the
> > two runs.  Can someone take a look and let me know whether any of
> > these differences are significant.  In particular, for the throttle-
> msgr_dispatch_throttler ones, since I don't know the detailed definitions
> of these fields.
> > Note: these are the numbers for one of the 5 osds, the other osds are
> similar...
> >
> > * The field osd/loadavg is always about 3 times higher on the 2p c
> >
> > some latency-related counters
> > ------------------------------
> > osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> > osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> > osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> > osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
> 
> So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k reads
> during this test. Which if I'm reading these properly means OSD-side
> latency is something like 1.5 milliseconds for the single client and...144
> milliseconds for the two-client case! You might try dumping some of the
> historic ops out of the admin socket and seeing where the time is getting
> spent (is it all on disk accesses?). And trying to reproduce something
> like this workload on your disks without Ceph involved.
> -Greg

Greg --

Not sure how much it matters but in looking at the pools more closely I
was getting mixed up with an earlier experiment with pools that just used 5 osds.
The pools for this example actually distributed across 15 osds on 3 nodes.

What is the recommended command for dumping historic ops out of the admin socket?

-- Tom





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: perf counters from a performance discrepancy
  2015-09-23 20:51   ` Deneau, Tom
@ 2015-09-23 20:52     ` Gregory Farnum
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2015-09-23 20:52 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel@vger.kernel.org

On Wed, Sep 23, 2015 at 1:51 PM, Deneau, Tom <tom.deneau@amd.com> wrote:
>
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k reads
>> during this test. Which if I'm reading these properly means OSD-side
>> latency is something like 1.5 milliseconds for the single client and...144
>> milliseconds for the two-client case! You might try dumping some of the
>> historic ops out of the admin socket and seeing where the time is getting
>> spent (is it all on disk accesses?). And trying to reproduce something
>> like this workload on your disks without Ceph involved.
>> -Greg
>
> Greg --
>
> Not sure how much it matters but in looking at the pools more closely I
> was getting mixed up with an earlier experiment with pools that just used 5 osds.
> The pools for this example actually distributed across 15 osds on 3 nodes.

Okay, so this is running on hard drives, not SSDs. The speed
differential is a lot more plausibly in the drive/filesystem/total
layout in that case, then...

>
> What is the recommended command for dumping historic ops out of the admin socket?

"ceph daemon osd.<N> dump_historic_ops", I think. "ceph daemon osd.<N>
help" will include it in the list, though.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-10-08 17:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-23 16:33 perf counters from a performance discrepancy Deneau, Tom
2015-09-23 18:19 ` Sage Weil
2015-09-23 18:25   ` Gregory Farnum
2015-09-23 18:42     ` Mark Nelson
2015-09-23 20:05       ` Deneau, Tom
2015-09-23 20:28         ` Samuel Just
2015-09-23 20:46           ` Deneau, Tom
2015-10-07 22:39       ` Deneau, Tom
2015-10-08  2:47         ` Sage Weil
2015-10-08 15:49           ` Deneau, Tom
2015-10-08 15:55             ` Sage Weil
2015-10-08 17:58               ` Somnath Roy
2015-09-23 20:39 ` Gregory Farnum
2015-09-23 20:51   ` Deneau, Tom
2015-09-23 20:52     ` Gregory Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.