ceph osd commit latency increase over time, until restart


* ceph osd commit latency increase over time, until restart
       [not found] ` <395511117.2665.1548405853447.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-25  9:14   ` Alexandre DERUMIER
       [not found]     ` <387140705.12275.1548407699184.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-25  9:14 UTC (permalink / raw)
  To: ceph-users, ceph-devel

Hi, 

I'm seeing strange behaviour on my OSDs, on multiple clusters.

All clusters are running mimic 13.2.1 with bluestore, on SSD or NVMe drives.
The workload is rbd only, with qemu-kvm VMs running with librbd, plus snapshot / rbd export-diff / snapshot delete each day for backup.

When the OSDs are freshly started, the commit latency is between 0.5 and 1 ms.

But over time this latency slowly increases (maybe around 1 ms per day), until it reaches crazy
values like 20-200 ms.

Some example graphs:

http://odisoweb1.odiso.net/osdlatency1.png
http://odisoweb1.odiso.net/osdlatency2.png

All osds have this behaviour, in all clusters. 

The latency of the physical disks is OK. (The clusters are far from fully loaded.)

And if I restart the OSD, the latency comes back to 0.5-1 ms.

This reminds me of an old tcmalloc bug, but could it be a bluestore memory bug?

Any hints on counters/logs to check?

Regards, 

Alexandre 
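
As a rough way to quantify the drift described above, a least-squares line can be fitted to (time, latency) samples. This is only a sketch: the function name `drift_ms_per_day` and the sample values are hypothetical, not measurements from these clusters.

```python
from typing import Sequence


def drift_ms_per_day(days: Sequence[float], latency_ms: Sequence[float]) -> float:
    """Least-squares slope of latency (ms) against elapsed time (days)."""
    if len(days) != len(latency_ms) or len(days) < 2:
        raise ValueError("need at least two (day, latency) samples of equal length")
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(latency_ms) / n
    var_t = sum((t - mean_t) ** 2 for t in days)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, latency_ms))
    return cov / var_t


# Synthetic illustration only: 1 ms at start, growing by 1 ms each day.
samples_days = [0, 1, 2, 3, 4]
samples_ms = [1.0, 2.0, 3.0, 4.0, 5.0]
slope = drift_ms_per_day(samples_days, samples_ms)
print(f"estimated drift: {slope:.2f} ms/day")
```

A slope that stays clearly positive across several restarts would support the "grows until restart" pattern rather than ordinary load variation.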

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <387140705.12275.1548407699184.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]     ` <387140705.12275.1548407699184.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-25  9:49       ` Sage Weil
       [not found]         ` <alpine.DEB.2.11.1901250948390.1384-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Sage Weil @ 2019-01-25  9:49 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel

Can you capture a perf top or perf record to see where the CPU time is 
going on one of the OSDs with a high latency?

Thanks!
sage
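
For reference, one common way to do this is to attach `perf record` to the OSD process for a fixed window. The sketch below only builds the command lines; the PID, duration and file name are placeholders, and it assumes perf and debug symbols for ceph-osd are available on the host.

```python
from typing import List


def perf_record_cmd(pid: int, seconds: int, out_file: str) -> List[str]:
    """Command that samples call graphs of one process for `seconds`."""
    # -g: record call graphs, -p: attach to an existing PID, -o: output file.
    # The trailing "sleep" bounds the recording window.
    return ["perf", "record", "-g", "-p", str(pid), "-o", out_file,
            "--", "sleep", str(seconds)]


def perf_report_cmd(in_file: str) -> List[str]:
    """Command that prints a text report from a recorded profile."""
    return ["perf", "report", "-i", in_file, "--stdio"]


# Hypothetical values: replace with the real ceph-osd PID and file name.
record = perf_record_cmd(pid=12345, seconds=60, out_file="osd-slow.perfdata")
report = perf_report_cmd("osd-slow.perfdata")
print(" ".join(record))
print(" ".join(report))
```

The lists can be passed to `subprocess.run(cmd, check=True)` on the OSD host; recording once on a slow OSD and once just after a restart gives two profiles to compare.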



^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <alpine.DEB.2.11.1901250948390.1384-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]         ` <alpine.DEB.2.11.1901250948390.1384-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2019-01-25 10:06           ` Alexandre DERUMIER
       [not found]             ` <837655257.15253.1548410811958.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  2019-01-30  7:33           ` Alexandre DERUMIER
  1 sibling, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-25 10:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

>>Can you capture a perf top or perf record to see where the CPU time is 
>>going on one of the OSDs with a high latency?

Yes, sure. I'll do it next week and send the results to the mailing list.

Thanks, Sage!
 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <837655257.15253.1548410811958.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]             ` <837655257.15253.1548410811958.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-25 16:32               ` Alexandre DERUMIER
       [not found]                 ` <787014196.28895.1548433922173.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-25 16:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

Hi again,

I was able to run perf on it today.

Before the restart, commit latency was between 3 and 5 ms.

After the restart at 17:11, latency is around 1 ms.

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png


Here are some perf reports:

with 3ms latency:
-----------------
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt


with 1ms latency
-----------------
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt



I'll retry next week and try to capture a bigger latency difference.

Alexandre


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <787014196.28895.1548433922173.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                 ` <787014196.28895.1548433922173.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-25 16:40                   ` Alexandre DERUMIER
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-25 16:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

Also, here is the result of "perf diff 1mslatency.perfdata  3mslatency.perfdata":

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]         ` <alpine.DEB.2.11.1901250948390.1384-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  2019-01-25 10:06           ` Alexandre DERUMIER
@ 2019-01-30  7:33           ` Alexandre DERUMIER
       [not found]             ` <1548181710.219518.1548833599717.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-30  7:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

Hi,

Here are some new results,
from a different OSD on a different cluster.

Before the OSD restart, latency was between 2 and 5 ms;
after the restart it is around 1-1.5 ms.

http://odisoweb1.odiso.net/cephperf2/bad.txt  (2-5ms)
http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
http://odisoweb1.odiso.net/cephperf2/diff.txt


From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.

(I'm using tcmalloc 2.5-2.2)
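
To make "biggest difference" concrete, the per-symbol overhead percentages of the two profiles can be subtracted and sorted. A minimal sketch, assuming the percentages were already extracted into dictionaries; the symbol names and numbers below are placeholders, not values from the reports linked above.

```python
from typing import Dict, List, Tuple


def rank_overhead_deltas(ok: Dict[str, float],
                         bad: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (symbol, bad - ok) pairs, largest absolute change first."""
    symbols = set(ok) | set(bad)
    # A symbol missing from one profile counts as 0% there.
    deltas = [(s, bad.get(s, 0.0) - ok.get(s, 0.0)) for s in symbols]
    # Ties on magnitude are broken by symbol name to keep the order stable.
    return sorted(deltas, key=lambda item: (-abs(item[1]), item[0]))


# Placeholder data only.
ok_profile = {"symbol_a": 2.0, "symbol_b": 5.0, "symbol_c": 1.0}
bad_profile = {"symbol_a": 9.5, "symbol_b": 4.0, "symbol_d": 0.5}
ranked = rank_overhead_deltas(ok_profile, bad_profile)
for symbol, delta in ranked:
    print(f"{delta:+6.2f}%  {symbol}")
```

Sorting by absolute change keeps symbols that shrank visible as well, which helps when time simply moved from one function to another.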



^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1548181710.219518.1548833599717.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]             ` <1548181710.219518.1548833599717.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-30  7:45               ` Stefan Priebe - Profihost AG
       [not found]                 ` <e81456d6-8361-5ca5-2b98-7a90948c0218-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
  2019-01-30 13:33               ` Sage Weil
  1 sibling, 1 reply; 42+ messages in thread
From: Stefan Priebe - Profihost AG @ 2019-01-30  7:45 UTC (permalink / raw)
  To: Alexandre DERUMIER, Sage Weil; +Cc: ceph-users, ceph-devel

Hi,

On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
> Hi,
> 
> here some new results,
> different osd/ different cluster
> 
> before osd restart latency was between 2-5ms
> after osd restart is around 1-1.5ms
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt  (2-5ms)
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
> http://odisoweb1.odiso.net/cephperf2/diff.txt
> 
> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
> (I'm using tcmalloc 2.5-2.2)

I'm currently in the process of switching back from jemalloc to tcmalloc,
as suggested. This report makes me a little nervous about my change.

Also, I'm currently only monitoring latency for filestore OSDs. Which
exact values from the daemon do you use for bluestore?

I would like to check whether I see the same behaviour.

Greets,
Stefan


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <e81456d6-8361-5ca5-2b98-7a90948c0218-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                 ` <e81456d6-8361-5ca5-2b98-7a90948c0218-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
@ 2019-01-30 13:59                   ` Alexandre DERUMIER
       [not found]                     ` <317086845.245472.1548856741512.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-30 13:59 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG; +Cc: ceph-users, ceph-devel

Hi Stefan,

>>currently i'm in the process of switching back from jemalloc to tcmalloc 
>>like suggested. This report makes me a little nervous about my change. 
Well, I'm really not sure that it's a tcmalloc bug.
It may be bluestore related (I don't have filestore anymore to compare).
I need to compare with bigger latencies.

Here is an example where all OSDs were at 20-50 ms before the restart, then at 1 ms after the restart (at 21:15):
http://odisoweb1.odiso.net/latencybad.png

I also see the latency in my guest VMs, as disk iowait.

http://odisoweb1.odiso.net/latencybadvm.png

>>Also i'm currently only monitoring latency for filestore osds. Which
>>exact values out of the daemon do you use for bluestore?

Here are my InfluxDB queries.

They take op_latency.sum / op_latency.avgcount over the last second.


SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~  /^([[host]])$/  AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)


SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)


SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
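
The same calculation can be done outside InfluxDB from two successive samples of a latency counter. A minimal sketch, assuming each sample is a dict with cumulative `sum` (taken here to be seconds) and `avgcount` (operations) fields, matching the counter names used in the queries above; the function name and sample numbers are illustrative only.

```python
from typing import Mapping, Optional


def interval_avg_latency(prev: Mapping[str, float],
                         curr: Mapping[str, float]) -> Optional[float]:
    """Average latency per op between two cumulative counter samples.

    Returns None when no ops completed in the interval or when the
    counters went backwards (for example after an OSD restart), similar
    in spirit to what non_negative_derivative() filters out.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Illustrative samples: 500 ops added 1.0 s of total latency.
before = {"sum": 100.0, "avgcount": 40000}
after = {"sum": 101.0, "avgcount": 40500}
avg_s = interval_avg_latency(before, after)
if avg_s is not None:
    print(f"average latency: {avg_s * 1000:.2f} ms")
```

Dividing the two deltas, rather than the raw cumulative values, is what makes the figure reflect only the most recent interval instead of the average since the OSD started.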






^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <317086845.245472.1548856741512.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                     ` <317086845.245472.1548856741512.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-01-30 18:50                       ` Stefan Priebe - Profihost AG
       [not found]                         ` <85320911-75f8-0e9d-af71-151391839153-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Stefan Priebe - Profihost AG @ 2019-01-30 18:50 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel

Hi,

On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
> Hi Stefan,
> 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
> Well,I'm really not sure that it's a tcmalloc bug. 
> maybe bluestore related (don't have filestore anymore to compare)
> I need to compare with bigger latencies
> 
> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms
> http://odisoweb1.odiso.net/latencybad.png
> 
> I observe the latency in my guest vm too, on disks iowait.
> 
> http://odisoweb1.odiso.net/latencybadvm.png
> 
>>> Also i'm currently only monitoring latency for filestore osds. Which
>>> exact values out of the daemon do you use for bluestore?
> 
> here my influxdb queries:
> 
> It take op_latency.sum/op_latency.avgcount on last second.
> 
> 
> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~  /^([[host]])$/  AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
> 
> 
> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
> 
> 
> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s)   FROM "ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)

Thanks. Is there any reason you monitor op_latency and op_w_latency,
but not op_r_latency?

Also, why do you monitor op_w_process_latency but not op_r_process_latency?

greets,
Stefan


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <85320911-75f8-0e9d-af71-151391839153-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                         ` <85320911-75f8-0e9d-af71-151391839153-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
@ 2019-01-30 18:58                           ` Alexandre DERUMIER
       [not found]                             ` <1814646360.255765.1548874695212.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-30 18:58 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG; +Cc: ceph-users, ceph-devel

>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>>
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.)

I just don't see a latency difference on reads (or it is very, very small compared with the write latency increase).



----- Original message -----
From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
To: "aderumier" <aderumier@odiso.com>
Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday 30 January 2019 19:50:20
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi, 

On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
> Hi Stefan, 
> 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
> Well, I'm really not sure that it's a tcmalloc bug. 
> Maybe it's bluestore related (I don't have filestore anymore to compare). 
> I need to compare with bigger latencies. 
> 
> Here is an example, with all OSDs at 20-50ms before restart, then 1ms after restart (at 21:15): 
> http://odisoweb1.odiso.net/latencybad.png 
> 
> I observe the latency in my guest VMs too, as disk iowait. 
> 
> http://odisoweb1.odiso.net/latencybadvm.png 
> 
>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>> exact values out of the daemon do you use for bluestore? 
> 
> Here are my influxdb queries: 
> 
> They compute op_latency.sum/op_latency.avgcount over the last second. 
> 
> 
> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 

Thanks. Is there any reason you monitor op_w_latency and op_latency, 
but not op_r_latency? 

Also, why do you monitor op_w_process_latency but not op_r_process_latency? 

greets, 
Stefan 

> 
> 
> 
> 
> 
> ----- Original message ----- 
> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Wednesday 30 January 2019 08:45:33 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Hi, 
> 
> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>> Hi, 
>> 
>> Here are some new results, 
>> from a different OSD on a different cluster. 
>> 
>> Before the OSD restart, latency was between 2-5ms; 
>> after the OSD restart, it is around 1-1.5ms. 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>> 
>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>> (I'm using tcmalloc 2.5-2.2) 
> 
> currently i'm in the process of switching back from jemalloc to tcmalloc 
> like suggested. This report makes me a little nervous about my change. 
> 
> Also i'm currently only monitoring latency for filestore osds. Which 
> exact values out of the daemon do you use for bluestore? 
> 
> I would like to check if i see the same behaviour. 
> 
> Greets, 
> Stefan 
> 
>> 
>> ----- Original message ----- 
>> From: "Sage Weil" <sage@newdream.net> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday 25 January 2019 10:49:02 
>> Subject: Re: ceph osd commit latency increase over time, until restart 
>> 
>> Can you capture a perf top or perf record to see where the CPU time is 
>> going on one of the OSDs with a high latency? 
>> 
>> Thanks! 
>> sage 
>> 
>> 
>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>> 
>>> 
>>> Hi, 
>>> 
>>> I see a strange behaviour on my OSDs, on multiple clusters. 
>>> 
>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives. 
>>> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>> 
>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>> 
>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>> values like 20-200ms. 
>>> 
>>> Some example graphs: 
>>> 
>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>> 
>>> All OSDs show this behaviour, in all clusters. 
>>> 
>>> The latency of the physical disks is fine. (The clusters are far from fully loaded.) 
>>> 
>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>> 
>>> This reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>> 
>>> Any hints for counters/logs to check? 
>>> 
>>> 
>>> Regards, 
>>> 
>>> Alexandre 
>>> 
>>> 
>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1814646360.255765.1548874695212.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                             ` <1814646360.255765.1548874695212.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-04  8:38                               ` Alexandre DERUMIER
       [not found]                                 ` <494474215.139609.1549269491013.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-04  8:38 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG; +Cc: ceph-users, ceph-devel

Hi,

some news:

I have tried different transparent hugepage values (madvise, never): no change.

I have tried increasing bluestore_cache_size_ssd to 8G: no change.

I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.)
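
When repeating the transparent hugepage test above, it can help to confirm which mode is actually active on the OSD host before drawing conclusions. On Linux the active mode is the bracketed word in /sys/kernel/mm/transparent_hugepage/enabled. The following is a small Python sketch of that check; the function names are illustrative only and are not part of Ceph.

```python
def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs THP 'enabled' file content."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode from sysfs (Linux only)."""
    with open(path) as f:
        return active_thp_mode(f.read())


# Example with the file content as a literal string.
print(active_thp_mode("always [madvise] never\n"))  # prints "madvise"
```

Calling read_thp_mode() on each OSD host before and after the change shows whether the setting really took effect.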


Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB);
my other clusters use 1.6TB SSDs.

Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe, to see if it helps.


BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?


Regards,

Alexandre


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <494474215.139609.1549269491013.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                 ` <494474215.139609.1549269491013.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-04 14:17                                   ` Alexandre DERUMIER
       [not found]                                     ` <229754897.167048.1549289833437.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-04 14:17 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG, Mark Nelson; +Cc: ceph-users, ceph-devel

Hi again,

I spoke too fast: the problem has occurred again, so it's not related to the tcmalloc cache size.


I have noticed something using a simple "perf top".

Each time I have this problem (I have seen exactly the same behaviour 4 times),

when latency is bad, perf top shows:

StupidAllocator::_aligned_len
and
btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()

(around 10-20% of the time for both)


When latency is good, I don't see them at all.


I have used Mark's wallclock profiler; here are the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:


+ 100.00% clone
  + 100.00% start_thread
    + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
      + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
        + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
          + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
          | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
          |   + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
          |     + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
          |     | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
          |     |   + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
          |     |     + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
          |     |     | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
          |     |     |   + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
          |     |     |   | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
          |     |     |   |   + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
          |     |     |   |     + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
          |     |     |   |     | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
          |     |     |   |     | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
          |     |     |   |     | |   + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
          |     |     |   |     | |   + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
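
One possible reading of the trace above is that the allocator walks many free extents before finding one that fits, so time accumulates in the iterator increment and the aligned-length check. The Python toy model below illustrates that effect only. It is not Ceph's StupidAllocator (which keeps free extents in binned btree interval sets); the first-fit scan, the 64 KiB unit and the extent layouts are all assumptions chosen for the illustration.

```python
def aligned_len(offset, length, alloc_unit):
    """Usable length of a free extent once its start is rounded up to alloc_unit."""
    skew = offset % alloc_unit
    if skew:
        skew = alloc_unit - skew
    return 0 if skew > length else length - skew


def first_fit(free_extents, want, alloc_unit):
    """Scan (offset, length) extents in order.

    Returns (aligned_offset, extents_visited), or (None, extents_visited)
    when no extent can hold `want` bytes after alignment.
    """
    visited = 0
    for offset, length in free_extents:
        visited += 1
        if aligned_len(offset, length, alloc_unit) >= want:
            skew = (-offset) % alloc_unit
            return offset + skew, visited
    return None, visited


UNIT = 0x10000  # 64 KiB allocation unit (arbitrary choice for the toy model)

# Fresh state: one large free extent.
fresh = [(0, 1024 * UNIT)]

# Aged state: many small, unaligned holes ahead of a single usable extent.
aged = [(i * UNIT + 0x1000, 0x2000) for i in range(1000)]
aged.append((1000 * UNIT, 4 * UNIT))

print(first_fit(fresh, UNIT, UNIT))  # (0, 1)
print(first_fit(aged, UNIT, UNIT))   # (65536000, 1001)
```

In the fresh layout the request is satisfied after visiting one extent; in the fragmented layout the same request visits 1001 extents, each paying for one increment and one alignment check.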



^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <229754897.167048.1549289833437.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                     ` <229754897.167048.1549289833437.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-04 14:51                                       ` Igor Fedotov
       [not found]                                         ` <0ab7d2b9-3611-c380-cbf6-c39cec0e673d-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-04 14:51 UTC (permalink / raw)
  To: Alexandre DERUMIER, Stefan Priebe, Profihost AG, Mark Nelson
  Cc: ceph-users, ceph-devel

Hi Alexandre,

this looks like a bug in StupidAllocator.

Could you please collect the BlueStore performance counters right after OSD 
startup and again once you see high latency?

Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
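
As a rough aid for the comparison requested above, two perf dumps (one taken right after startup, one at high latency) can be diffed to list the scalar counters that changed. This Python sketch assumes the dumps were saved as JSON, for example from `ceph daemon osd.N perf dump`, and that the counters of interest sit in a "bluestore" section. The counter names in the sample data are placeholders; the name under which l_bluestore_fragmentation appears should be looked up in the real output.

```python
import json


def changed_counters(before, after, section="bluestore"):
    """Return {name: (old, new)} for scalar counters that differ between two perf dumps."""
    old, new = before.get(section, {}), after.get(section, {})
    changes = {}
    for name, new_value in new.items():
        old_value = old.get(name)
        # Latency counters are nested dicts (avgcount/sum); only plain numbers are compared here.
        if isinstance(new_value, (int, float)) and new_value != old_value:
            changes[name] = (old_value, new_value)
    return changes


# Invented excerpts standing in for two saved perf dump outputs.
startup = json.loads('{"bluestore": {"example_fragmentation": 12, "example_static": 7}}')
degraded = json.loads('{"bluestore": {"example_fragmentation": 640, "example_static": 7}}')

print(changed_counters(startup, degraded))
# {'example_fragmentation': (12, 640)}
```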

Also, if you're able to rebuild the code, I can probably make a simple 
patch to track latency and some other internal allocator parameters, to 
make sure it has degraded and to learn more details.


A more radical fix would be to backport the bitmap allocator from Nautilus 
and see whether it makes a difference...


Thanks,

Igor


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
> Hi again,
>
> I spoke too fast: the problem has occurred again, so it's not related to the tcmalloc cache size.
>
>
> I have noticed something using a simple "perf top".
>
> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>
> when latency is bad, perf top shows:
>
> StupidAllocator::_aligned_len
> and
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>
> (around 10-20% of the time for both)
>
>
> When latency is good, I don't see them at all.
>
>
> I have used Mark's wallclock profiler; here are the results:
>
> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>
> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>
>
> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
>
>
> + 100.00% clone
>    + 100.00% start_thread
>      + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>        + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>          + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>            + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>            | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>            |   + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>            |     + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>            |     | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>            |     |   + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>            |     |     + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>            |     |     | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>            |     |     |   + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>            |     |     |   | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>            |     |     |   |   + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>            |     |     |   |     + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>            |     |     |   |     | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>            |     |     |   |     | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>            |     |     |   |     | |   + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>            |     |     |   |     | |   + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>
>
>
> ----- Mail original -----
> De: "Alexandre Derumier" <aderumier@odiso.com>
> À: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Lundi 4 Février 2019 09:38:11
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> Hi,
>
> some news:
>
> I have tried different transparent hugepage values (madvise, never): no change.
>
> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>
> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.)
>
>
> Note that this behaviour seems to appear much faster (< 2 days) on my big nvme drives (6TB);
> my other clusters use 1.6TB ssds.
>
> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try 2 osds per nvme this week, to see if it helps.
>
>
> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>
>
> Regards,
>
> Alexandre
>
>
> ----- Original Message -----
> From: "aderumier" <aderumier@odiso.com>
> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Wednesday, 30 January 2019 19:58:15
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
>>> Thanks. Is there any reason you monitor op_w_latency but not
>>> op_r_latency but instead op_latency?
>>>
>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.)
>
> I just don't see a latency difference on reads (or it is very small compared with the write latency increase).
>
>
>
> ----- Original Message -----
> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Wednesday, 30 January 2019 19:50:20
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> Hi,
>
> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>> Hi Stefan,
>>
>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>> like suggested. This report makes me a little nervous about my change.
>> Well, I'm really not sure that it's a tcmalloc bug.
>> Maybe it is bluestore related (I don't have filestore anymore to compare).
>> I need to compare with bigger latencies.
>>
>> Here is an example: all osds were at 20-50ms before the restart, then at 1ms after the restart (at 21:15).
>> http://odisoweb1.odiso.net/latencybad.png
>>
>> I observe the latency in my guest vms too, as disk iowait.
>>
>> http://odisoweb1.odiso.net/latencybadvm.png
>>
>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>> exact values out of the daemon do you use for bluestore?
>> Here are my influxdb queries.
>>
>> They take op_latency.sum/op_latency.avgcount over the last second.
>>
>>
>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>
>>
>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>
>>
>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
> Thanks. Is there any reason you monitor op_w_latency but not
> op_r_latency but instead op_latency?
>
> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>
> greets,
> Stefan
>
>>
>>
>>
>>
>> ----- Original Message -----
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday, 30 January 2019 08:45:33
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi,
>>
>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>> Hi,
>>>
>>> Here are some new results,
>>> from a different osd on a different cluster.
>>>
>>> Before the osd restart, latency was between 2-5ms;
>>> after the osd restart, it is around 1-1.5ms.
>>>
>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>
>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>> (I'm using tcmalloc 2.5-2.2)
>> currently i'm in the process of switching back from jemalloc to tcmalloc
>> like suggested. This report makes me a little nervous about my change.
>>
>> Also i'm currently only monitoring latency for filestore osds. Which
>> exact values out of the daemon do you use for bluestore?
>>
>> I would like to check if i see the same behaviour.
>>
>> Greets,
>> Stefan
>>
>>> ----- Original Message -----
>>> From: "Sage Weil" <sage@newdream.net>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday, 25 January 2019 10:49:02
>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>
>>> Can you capture a perf top or perf record to see where the CPU time is
>>> going on one of the OSDs with a high latency?
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>
>>>> Hi,
>>>>
>>>> I see a strange behaviour on my osds, on multiple clusters.
>>>>
>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives;
>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup.
>>>>
>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>
>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>> values like 20-200ms.
>>>>
>>>> Some example graphs:
>>>>
>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>
>>>> All osds have this behaviour, in all clusters.
>>>>
>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>
>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>
>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug?
>>>>
>>>> Any hints for counters/logs to check?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>>
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


* Re: ceph osd commit latency increase over time, until restart
       [not found]                                         ` <0ab7d2b9-3611-c380-cbf6-c39cec0e673d-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-04 15:04                                           ` Alexandre DERUMIER
       [not found]                                             ` <1323366475.173629.1549292678511.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-04 15:04 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Thanks Igor,

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>>
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with
"ceph daemon osd.x perf dump" (I have 2 months of history with all counters),

but I don't see an l_bluestore_fragmentation counter.

(I do have bluestore_fragmentation_micros, though.)
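
For anyone who wants to make the same startup vs. high-latency comparison by hand, here is a minimal sketch, assuming the output of "ceph daemon osd.x perf dump" has been saved as JSON text. The two sample dumps are hypothetical and heavily trimmed, the section names in them are an assumption, and find_counter / interval_average are illustrative helper names, not part of any ceph tool. The latency helper uses the same sum/avgcount ratio as the influxdb queries quoted further down in this thread.

```python
import json


def find_counter(perf_dump, name):
    """Return the first value stored under `name` anywhere in a parsed dump.

    The search is recursive because which section holds a given counter
    is an assumption here, not something stated in this thread.
    """
    if isinstance(perf_dump, dict):
        if name in perf_dump:
            return perf_dump[name]
        for value in perf_dump.values():
            found = find_counter(value, name)
            if found is not None:
                return found
    return None


def interval_average(before, after):
    """Average latency between two {"sum", "avgcount"} samples of one counter."""
    ops = after["avgcount"] - before["avgcount"]
    if ops <= 0:
        # No op completed in the interval, or the counters were reset
        # (for example by an osd restart).
        return None
    return (after["sum"] - before["sum"]) / ops


# Hypothetical, trimmed samples; a real dump contains many more counters.
at_startup = json.loads(
    '{"bluestore": {"bluestore_fragmentation_micros": 12},'
    ' "osd": {"op_w_latency": {"avgcount": 1000, "sum": 1.0}}}'
)
later = json.loads(
    '{"bluestore": {"bluestore_fragmentation_micros": 640},'
    ' "osd": {"op_w_latency": {"avgcount": 3000, "sum": 5.0}}}'
)

for label, dump in (("startup", at_startup), ("later", later)):
    print(label, find_counter(dump, "bluestore_fragmentation_micros"))

print(
    "write latency over the interval:",
    interval_average(
        find_counter(at_startup, "op_w_latency"),
        find_counter(later, "op_w_latency"),
    ),
)
```

Searching by counter name rather than by a fixed path keeps the sketch independent of how a given release groups its counters.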


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator parameters to 
>>make sure it's degraded and learn more details. 

Sorry, it's a critical production cluster, so I can't test on it :(
But I have a test cluster; maybe I can put some load on it and try to reproduce the problem there.



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.



----- Original Message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Monday, 4 February 2019 15:51:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD
startup and again once you get high latency?

Specifically, the 'l_bluestore_fragmentation' parameter is of interest.

Also, if you're able to rebuild the code, I can probably make a simple
patch to track latency and some other internal allocator parameters to
make sure it's degraded and learn more details.


A more vigorous fix would be to backport the bitmap allocator from Nautilus
and try the difference...


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>
>
> I have noticed something using a simple "perf top".
>
> Each time I have this problem (I have seen exactly the same behaviour 4 times),
> when latency is bad, perf top gives me:
>
> StupidAllocator::_aligned_len
> and
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>
> (around 10-20% of the time for both)
>
>
> When latency is good, I don't see them at all.
>
>
> I have used Mark's wallclock profiler; here are the results:
>
> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>
> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>
>
> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
> 
> 
> 


* Re: ceph osd commit latency increase over time, until restart
       [not found]                                             ` <1323366475.173629.1549292678511.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-04 15:40                                               ` Alexandre DERUMIER
       [not found]                                                 ` <2062110719.174905.1549294821422.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-04 15:40 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

>>but I don't see l_bluestore_fragmentation counter.
>>(but I have bluestore_fragmentation_micros)

OK, this is the same counter:

  b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
            "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");


Here is a graph of the last month, showing bluestore_fragmentation_micros and latency:

http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
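
To make the counter description quoted above concrete, here is a small worked sketch of the formula exactly as that description string states it: (free extents / max possible number of free extents) * 1000. The function name and the input numbers are hypothetical, and how BlueStore actually derives the maximum possible number of free extents is not covered in this thread, so both inputs are simply taken as given.

```python
def fragmentation_score(free_extents, max_free_extents):
    """Apply the formula from the counter description:
    (free extents / max possible number of free extents) * 1000.

    Both arguments are illustrative inputs; this sketch does not model
    how the allocator counts them.
    """
    if max_free_extents <= 0:
        raise ValueError("max_free_extents must be positive")
    return free_extents / max_free_extents * 1000


# Hypothetical numbers: free space split into 250 extents, where at most
# 1000 free extents would be possible.
print(fragmentation_score(250, 1000))
```

Under this reading, a value near 0 means the free space sits in a few large extents, and a value near 1000 means it is split into about as many extents as possible.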

----- Mail original -----
De: "Alexandre Derumier" <aderumier@odiso.com>
À: "Igor Fedotov" <ifedotov@suse.de>
Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Envoyé: Lundi 4 Février 2019 16:04:38
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Thanks Igor, 

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>> 
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with 
"ceph daemon osd.x perf dump ", (I have 2months history will all counters) 

but I don't see l_bluestore_fragmentation counter. 

(but I have bluestore_fragmentation_micros) 


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator's paramter to 
>>make sure it's degraded and learn more details. 

Sorry, It's a critical production cluster, I can't test on it :( 
But I have a test cluster, maybe I can try to put some load on it, and try to reproduce. 



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Any plan to backport it to mimic ? (But I can wait for Nautilus) 
perf results of new bitmap allocator seem very promising from what I've seen in PR. 



----- Mail original ----- 
De: "Igor Fedotov" <ifedotov@suse.de> 
À: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Lundi 4 Février 2019 15:51:30 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency. 

Specifically 'l_bluestore_fragmentation' parameter is of interest. 

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details. 


More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference... 


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I speak too fast, the problem has occured again, so it's not tcmalloc cache size related. 
> 
> 
> I have notice something using a simple "perf top", 
> 
> each time I have this problem (I have seen exactly 4 times the same behaviour), 
> 
> when latency is bad, perf top give me : 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long 
> const, unsigned long>*>::increment_slow() 
> 
> (around 10-20% time for both) 
> 
> 
> when latency is good, I don't see them at all. 
> 
> 
> I have used the Mark wallclock profiler, here the results: 
> 
> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
> 
> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
> 
> 
> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len 
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
> 
> 
> 
> ----- Mail original ----- 
> De: "Alexandre Derumier" <aderumier@odiso.com> 
> À: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Lundi 4 Février 2019 09:38:11 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Hi, 
> 
> some news: 
> 
> I have tried with different transparent hugepage values (madvise, never) : no change 
> 
> I have tried to increase bluestore_cache_size_ssd to 8G: no change 
> 
> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) 
> 
> 
> Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), 
> my others clusters user 1,6TB ssd. 
> 
> Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. 
> 
> 
> BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? 
> 
> 
> Regards, 
> 
> Alexandre 
> 
> 
> ----- Mail original ----- 
> De: "aderumier" <aderumier@odiso.com> 
> À: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mercredi 30 Janvier 2019 19:58:15 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>> op_r_latency but instead op_latency? 
>>> 
>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 
> 
> I just don't see latency difference on reads. (or they are very very small vs the write latency increase) 
> 
> 
> 
> ----- Mail original ----- 
> De: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
> À: "aderumier" <aderumier@odiso.com> 
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mercredi 30 Janvier 2019 19:50:20 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Hi, 
> 
> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
>> Hi Stefan, 
>> 
>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>> like suggested. This report makes me a little nervous about my change. 
>> Well,I'm really not sure that it's a tcmalloc bug. 
>> maybe bluestore related (don't have filestore anymore to compare) 
>> I need to compare with bigger latencies 
>> 
>> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms 
>> http://odisoweb1.odiso.net/latencybad.png 
>> 
>> I observe the latency in my guest vm too, on disks iowait. 
>> 
>> http://odisoweb1.odiso.net/latencybadvm.png 
>> 
>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>> exact values out of the daemon do you use for bluestore? 
>> here my influxdb queries: 
>> 
>> It take op_latency.sum/op_latency.avgcount on last second. 
>> 
>> 
>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>> 
>> 
>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>> 
>> 
>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> Thanks. Is there any reason you monitor op_w_latency but not 
> op_r_latency but instead op_latency? 
> 
> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
> 
> greets, 
> Stefan 
> 
>> 
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 08:45:33
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi,
>>
>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>> Hi, 
>>> 
>>> here some new results, 
>>> different osd/ different cluster 
>>> 
>>> before osd restart latency was between 2-5ms 
>>> after osd restart is around 1-1.5ms 
>>> 
>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>> 
>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>> (I'm using tcmalloc 2.5-2.2) 
>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>> like suggested. This report makes me a little nervous about my change. 
>> 
>> Also i'm currently only monitoring latency for filestore osds. Which 
>> exact values out of the daemon do you use for bluestore? 
>> 
>> I would like to check if i see the same behaviour. 
>> 
>> Greets, 
>> Stefan 
>> 
>>> ----- Original Message -----
>>> From: "Sage Weil" <sage@newdream.net>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday 25 January 2019 10:49:02
>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>> 
>>> Can you capture a perf top or perf record to see where the CPU time is
>>> going on one of the OSDs with a high latency?
>>> 
>>> Thanks! 
>>> sage 
>>> 
>>> 
>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>> 
>>>> Hi, 
>>>> 
>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>> 
>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>> 
>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>>>> 
>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy 
>>>> values like 20-200ms. 
>>>> 
>>>> Some example graphs: 
>>>> 
>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>> 
>>>> All osds have this behaviour, in all clusters. 
>>>> 
>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>>>> 
>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>> 
>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? 
>>>> 
>>>> Any Hints for counters/logs to check ? 
>>>> 
>>>> 
>>>> Regards, 
>>>> 
>>>> Alexandre 
>>>> 
>>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[parent not found: <2062110719.174905.1549294821422.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                 ` <2062110719.174905.1549294821422.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-05 17:56                                                   ` Igor Fedotov
       [not found]                                                     ` <d4558d4b-b1c9-211a-626a-0c14df3e29b9-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-05 17:56 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel


On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>> but I don't see l_bluestore_fragmentation counter.
>>> (but I have bluestore_fragmentation_micros)
> ok, this is the same
>
>    b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>              "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>
>
> Here a graph on last month, with bluestore_fragmentation_micros and latency,
>
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png

Hmm, so fragmentation grows over time and drops on OSD restart, doesn't
it? Is it the same for the other OSDs?

This proves there is some issue with the allocator: fragmentation may
generally grow, but it shouldn't reset on restart. It looks like some
intervals aren't properly merged at run time.

On the other hand, I'm not completely sure that the latency degradation
is caused by that. The fragmentation growth is relatively small, and I
don't see how it could impact performance that much.
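
To put the size of a fragmentation change in perspective, the formula in
the counter description quoted above can be written out as a small
calculation. This is only a sketch of the formula as stated in that help
text; the function name and the sample numbers are hypothetical and are
not taken from any OSD.

```python
def fragmentation_score(free_extents: int, max_possible_free_extents: int) -> float:
    """Formula from the counter help text:
    (free extents / max possible number of free extents) * 1000."""
    if max_possible_free_extents <= 0:
        return 0.0  # nothing free at all: report no fragmentation
    return free_extents * 1000 / max_possible_free_extents


# Hypothetical numbers, for illustration only.
print(fragmentation_score(2_000, 1_000_000))
print(fragmentation_score(50_000, 1_000_000))
```

With these made-up inputs the free-extent count has to grow 25-fold
before the score moves from 2 to 50, which is why a small change in the
counter can be hard to read on its own.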

Do you have OSD mempool monitoring reports (output of the dump_mempools
command on the admin socket)? Do you have any historic data?

If not, may I have the current output and, say, a couple more samples
taken at 8-12 hour intervals?
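
A minimal sketch of how such samples could be collected, assuming the
script runs on the OSD host with access to the admin socket. The helper
names (read_mempools, collect_samples) and the sample record layout are
hypothetical; only the "ceph daemon osd.N dump_mempools" command comes
from this thread.

```python
import json
import subprocess
import time


def read_mempools(osd_id: int) -> dict:
    """Run `ceph daemon osd.<id> dump_mempools` and parse its JSON output."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"]
    )
    return json.loads(out)


def collect_samples(osd_id, count, interval_s,
                    reader=read_mempools, sleep=time.sleep, clock=time.time):
    """Take `count` timestamped samples, `interval_s` seconds apart.

    reader/sleep/clock are injectable so the loop can be tried out
    without a running OSD.
    """
    samples = []
    for i in range(count):
        samples.append({"ts": clock(), "osd": osd_id, "mempools": reader(osd_id)})
        if i < count - 1:  # no need to wait after the last sample
            sleep(interval_s)
    return samples
```

For example, collect_samples(0, count=3, interval_s=8 * 3600) would take
three samples from osd.0 eight hours apart; each record can then be
written out with json.dumps.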


With regard to backporting the bitmap allocator to mimic: we haven't had
such plans so far, but I'll discuss this at the BlueStore meeting shortly.


Thanks,

Igor

> ----- Original Message -----
> From: "Alexandre Derumier" <aderumier@odiso.com>
> To: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 16:04:38
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> Thanks Igor,
>
>>> Could you please collect BlueStore performance counters right after OSD
>>> startup and once you get high latency.
>>>
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
> I'm already monitoring with
> "ceph daemon osd.x perf dump ", (I have 2months history will all counters)
>
> but I don't see l_bluestore_fragmentation counter.
>
> (but I have bluestore_fragmentation_micros)
>
>
>>> Also if you're able to rebuild the code I can probably make a simple
>>> patch to track latency and some other internal allocator's parameter to
>>> make sure it's degraded and learn more details.
> Sorry, It's a critical production cluster, I can't test on it :(
> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce.
>
>
>
>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>> and try the difference...
> Any plan to backport it to mimic ? (But I can wait for Nautilus)
> perf results of new bitmap allocator seem very promising from what I've seen in PR.
>
>
>
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 15:51:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> Hi Alexandre,
>
> looks like a bug in StupidAllocator.
>
> Could you please collect BlueStore performance counters right after OSD
> startup and once you get high latency.
>
> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>
> Also if you're able to rebuild the code I can probably make a simple
> patch to track latency and some other internal allocator's parameter to
> make sure it's degraded and learn more details.
>
>
> More vigorous fix would be to backport bitmap allocator from Nautilus
> and try the difference...
>
>
> Thanks,
>
> Igor
>
>
> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>> Hi again,
>>
>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>
>>
>> I have notice something using a simple "perf top",
>>
>> each time I have this problem (I have seen exactly 4 times the same behaviour),
>>
>> when latency is bad, perf top give me :
>>
>> StupidAllocator::_aligned_len
>> and
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long
>> const, unsigned long>*>::increment_slow()
>>
>> (around 10-20% time for both)
>>
>>
>> when latency is good, I don't see them at all.
>>
>>
>> I have used the Mark wallclock profiler, here the results:
>>
>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>
>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>
>>
>> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len
>>
>>
>> + 100.00% clone
>> + 100.00% start_thread
>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>
>>
>>
>> ----- Original Message -----
>> From: "Alexandre Derumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday 4 February 2019 09:38:11
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi,
>>
>> some news:
>>
>> I have tried with different transparent hugepage values (madvise, never) : no change
>>
>> I have tried to increase bluestore_cache_size_ssd to 8G: no change
>>
>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure)
>>
>>
>> Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB),
>> my others clusters user 1,6TB ssd.
>>
>> Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping.
>>
>>
>> BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ?
>>
>>
>> Regards,
>>
>> Alexandre
>>
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 19:58:15
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>> op_r_latency but instead op_latency?
>>>>
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs).
>>
>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase)
>>
>>
>>
>> ----- Original Message -----
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 19:50:20
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi,
>>
>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>> Hi Stefan,
>>>
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>> like suggested. This report makes me a little nervous about my change.
>>> Well,I'm really not sure that it's a tcmalloc bug.
>>> maybe bluestore related (don't have filestore anymore to compare)
>>> I need to compare with bigger latencies
>>>
>>> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms
>>> http://odisoweb1.odiso.net/latencybad.png
>>>
>>> I observe the latency in my guest vm too, on disks iowait.
>>>
>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>
>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>> exact values out of the daemon do you use for bluestore?
>>> here my influxdb queries:
>>>
>>> It take op_latency.sum/op_latency.avgcount on last second.
>>>
>>>
>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>
>>>
>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>
>>>
>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>> Thanks. Is there any reason you monitor op_w_latency but not
>> op_r_latency but instead op_latency?
>>
>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>
>> greets,
>> Stefan
>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Wednesday 30 January 2019 08:45:33
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi,
>>>
>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>> Hi,
>>>>
>>>> here some new results,
>>>> different osd/ different cluster
>>>>
>>>> before osd restart latency was between 2-5ms
>>>> after osd restart is around 1-1.5ms
>>>>
>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>
>>>>  From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>> (I'm using tcmalloc 2.5-2.2)
>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>> like suggested. This report makes me a little nervous about my change.
>>>
>>> Also i'm currently only monitoring latency for filestore osds. Which
>>> exact values out of the daemon do you use for bluestore?
>>>
>>> I would like to check if i see the same behaviour.
>>>
>>> Greets,
>>> Stefan
>>>
>>>> ----- Original Message -----
>>>> From: "Sage Weil" <sage@newdream.net>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Friday 25 January 2019 10:49:02
>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>
>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>> going on one of the OSDs with a high latency?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a strange behaviour of my osd, on multiple clusters,
>>>>>
>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers,
>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup
>>>>>
>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms.
>>>>>
>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy
>>>>> values like 20-200ms.
>>>>>
>>>>> Some example graphs:
>>>>>
>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>
>>>>> All osds have this behaviour, in all clusters.
>>>>>
>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded)
>>>>>
>>>>> And if I restart the osd, the latency come back to 0,5-1ms.
>>>>>
>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ?
>>>>>
>>>>> Any Hints for counters/logs to check ?
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Alexandre
>>>>>
>>>>>


[parent not found: <d4558d4b-b1c9-211a-626a-0c14df3e29b9-l3A5Bk7waGM@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                     ` <d4558d4b-b1c9-211a-626a-0c14df3e29b9-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-08 15:08                                                       ` Alexandre DERUMIER
  2019-02-08 15:14                                                       ` Alexandre DERUMIER
  1 sibling, 0 replies; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-08 15:08 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

>>hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
>>it? 
yes
>>The same for other OSDs? 
yes



>>Wondering if you have OSD mempool monitoring (dump_mempools command 
>>output on admin socket) reports? Do you have any historic data? 

Not currently (I only have perf dump). I'll add them to my monitoring stats.
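
A minimal sketch of how the dump_mempools JSON could be turned into flat
metric names for a time-series database. The flatten helper and the
sample document are hypothetical; the real dump_mempools layout may
differ between releases, so the code only assumes nested JSON objects
with numeric leaves.

```python
def flatten(node: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b.c": number} pairs, skipping non-numeric leaves."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


# Illustrative input only, not real OSD output.
sample = {
    "mempool": {
        "by_pool": {"bluestore_alloc": {"items": 10, "bytes": 800}},
        "total": {"items": 10, "bytes": 800},
    }
}
for metric, value in sorted(flatten(sample).items()):
    print(metric, value)
```

Each resulting name/value pair can then be stored next to the existing
perf dump counters, so that mempool growth can be graphed against latency.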


>>If not may I have current output and say a couple more samples with 
>>8-12 hours interval? 

I'll do it next week.

Thanks again for helping.


----- Original Message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "aderumier" <aderumier@odiso.com>
Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday 5 February 2019 18:56:51
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>> but I don't see l_bluestore_fragmentation counter. 
>>> (but I have bluestore_fragmentation_micros) 
> ok, this is the same 
> 
> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
> 
> 
> Here a graph on last month, with bluestore_fragmentation_micros and latency, 
> 
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 

hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
it? The same for other OSDs? 

This proves some issue with the allocator - generally fragmentation 
might grow but it shouldn't reset on restart. Looks like some intervals 
aren't properly merged in run-time. 

On the other side I'm not completely sure that latency degradation is 
caused by that - fragmentation growth is relatively small - I don't see 
how this might impact performance that high. 

Wondering if you have OSD mempool monitoring (dump_mempools command 
output on admin socket) reports? Do you have any historic data? 

If not may I have current output and say a couple more samples with 
8-12 hours interval? 


Wrt to backporting bitmap allocator to mimic - we haven't had such plans 
before that but I'll discuss this at BlueStore meeting shortly. 


Thanks, 

Igor 

> ----- Original Message -----
> From: "Alexandre Derumier" <aderumier@odiso.com>
> To: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 16:04:38
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> Thanks Igor, 
> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> I'm already monitoring with 
> "ceph daemon osd.x perf dump ", (I have 2months history will all counters) 
> 
> but I don't see l_bluestore_fragmentation counter. 
> 
> (but I have bluestore_fragmentation_micros) 
> 
> 
>>> Also if you're able to rebuild the code I can probably make a simple 
>>> patch to track latency and some other internal allocator's parameter to
>>> make sure it's degraded and learn more details. 
> Sorry, It's a critical production cluster, I can't test on it :( 
> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce. 
> 
> 
> 
>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>> and try the difference... 
> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
> perf results of new bitmap allocator seem very promising from what I've seen in PR. 
> 
> 
> 
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 15:51:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> Hi Alexandre, 
> 
> looks like a bug in StupidAllocator. 
> 
> Could you please collect BlueStore performance counters right after OSD 
> startup and once you get high latency. 
> 
> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> 
> Also if you're able to rebuild the code I can probably make a simple 
> patch to track latency and some other internal allocator's parameter to
> make sure it's degraded and learn more details. 
> 
> 
> More vigorous fix would be to backport bitmap allocator from Nautilus 
> and try the difference... 
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>> Hi again, 
>> 
>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>> 
>> 
>> I have notice something using a simple "perf top", 
>> 
>> each time I have this problem (I have seen exactly 4 times the same behaviour), 
>> 
>> when latency is bad, perf top give me : 
>> 
>> StupidAllocator::_aligned_len 
>> and 
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long 
>> const, unsigned long>*>::increment_slow() 
>> 
>> (around 10-20% time for both) 
>> 
>> 
>> when latency is good, I don't see them at all. 
>> 
>> 
>> I have used the Mark wallclock profiler, here the results: 
>> 
>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>> 
>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>> 
>> 
>> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len 
>> 
>> 
>> + 100.00% clone 
>> + 100.00% start_thread 
>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "Alexandre Derumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday 4 February 2019 09:38:11
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> Hi, 
>> 
>> some news: 
>> 
>> I have tried with different transparent hugepage values (madvise, never) : no change 
>> 
>> I have tried to increase bluestore_cache_size_ssd to 8G: no change 
>> 
>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) 
>> 
>> 
>> Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), 
>> my others clusters user 1,6TB ssd. 
>> 
>> Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. 
>> 
>> 
>> BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? 
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> 
>> 
>> ----- Original Message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 19:58:15
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>> op_r_latency but instead op_latency? 
>>>> 
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>> 
>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase). 
>> 
>> 
>> 
>> ----- Original message ----- 
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Wednesday 30 January 2019 19:50:20 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> Hi, 
>> 
>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>> Hi Stefan, 
>>> 
>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc, 
>>>>> as suggested. This report makes me a little nervous about my change. 
>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>> Maybe it is BlueStore related (I don't have FileStore anymore to compare). 
>>> I need to compare with bigger latencies. 
>>> 
>>> Here is an example where all OSDs were at 20-50 ms before the restart, then at 1 ms after the restart (at 21:15): 
>>> http://odisoweb1.odiso.net/latencybad.png 
>>> 
>>> I observe the latency in my guest VMs too, as disk iowait. 
>>> 
>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>> 
>>>>> Also, I'm currently only monitoring latency for FileStore OSDs. Which 
>>>>> exact values out of the daemon do you use for BlueStore? 
>>> Here are my InfluxDB queries. 
>>> 
>>> They divide the per-second change of op_latency.sum by the per-second change of op_latency.avgcount. 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>> Thanks. Is there any reason you monitor op_w_latency and op_latency, 
>> but not op_r_latency? 
>> 
>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>> 
>> greets, 
>> Stefan 
>> 
>>> 
>>> 
>>> 
>>> ----- Original message ----- 
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Wednesday 30 January 2019 08:45:33 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi, 
>>> 
>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>> Hi, 
>>>> 
>>>> Here are some new results, 
>>>> from a different OSD on a different cluster. 
>>>> 
>>>> Before the OSD restart, latency was between 2 and 5 ms; 
>>>> after the OSD restart it is around 1-1.5 ms. 
>>>> 
>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>> 
>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>> (I'm using tcmalloc 2.5-2.2) 
>>> Currently I'm in the process of switching back from jemalloc to tcmalloc, 
>>> as suggested. This report makes me a little nervous about my change. 
>>> 
>>> Also, I'm currently only monitoring latency for FileStore OSDs. Which 
>>> exact values out of the daemon do you use for BlueStore? 
>>> 
>>> I would like to check if I see the same behaviour. 
>>> 
>>> Greets, 
>>> Stefan 
>>> 
>>>> ----- Original message ----- 
>>>> From: "Sage Weil" <sage@newdream.net> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Friday 25 January 2019 10:49:02 
>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>> going on one of the OSDs with a high latency? 
>>>> 
>>>> Thanks! 
>>>> sage 
>>>> 
>>>> 
>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>> 
>>>>> Hi, 
>>>>> 
>>>>> I see strange behaviour on my OSDs, on multiple clusters. 
>>>>> 
>>>>> All clusters are running Mimic 13.2.1 with BlueStore, on SSD or NVMe drives. 
>>>>> The workload is RBD only: qemu-kvm VMs running with librbd, plus snapshot / rbd export-diff / snapshot delete each day for backup. 
>>>>> 
>>>>> When the OSDs are freshly started, the commit latency is between 0.5 and 1 ms. 
>>>>> 
>>>>> But over time this latency increases slowly (maybe around 1 ms per day), until it reaches crazy 
>>>>> values like 20-200 ms. 
>>>>> 
>>>>> Some example graphs: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>> 
>>>>> All OSDs show this behaviour, in all clusters. 
>>>>> 
>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>> 
>>>>> And if I restart the OSD, the latency comes back to 0.5-1 ms. 
>>>>> 
>>>>> This reminds me of the old tcmalloc bug, but could it be a BlueStore memory bug? 
>>>>> 
>>>>> Any hints for counters/logs to check? 
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>> _______________________________________________ 
>>>> ceph-users mailing list 
>>>> ceph-users@lists.ceph.com 
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                     ` <d4558d4b-b1c9-211a-626a-0c14df3e29b9-l3A5Bk7waGM@public.gmane.org>
  2019-02-08 15:08                                                       ` Alexandre DERUMIER
@ 2019-02-08 15:14                                                       ` Alexandre DERUMIER
       [not found]                                                         ` <825077993.841032.1549638894023.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-08 15:14 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

I'm just seeing 

StupidAllocator::_aligned_len 
and 
btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 

on one OSD, both at 10%.

Here is the dump_mempools output:

{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 210243456,
                "bytes": 210243456
            },
            "bluestore_cache_data": {
                "items": 54,
                "bytes": 643072
            },
            "bluestore_cache_onode": {
                "items": 105637,
                "bytes": 70988064
            },
            "bluestore_cache_other": {
                "items": 48661920,
                "bytes": 1539544228
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 12,
                "bytes": 8928
            },
            "bluestore_writing_deferred": {
                "items": 406,
                "bytes": 4792868
            },
            "bluestore_writing": {
                "items": 66,
                "bytes": 1085440
            },
            "bluefs": {
                "items": 1882,
                "bytes": 93600
            },
            "buffer_anon": {
                "items": 138986,
                "bytes": 24983701
            },
            "buffer_meta": {
                "items": 544,
                "bytes": 34816
            },
            "osd": {
                "items": 243,
                "bytes": 3089016
            },
            "osd_mapbl": {
                "items": 36,
                "bytes": 179308
            },
            "osd_pglog": {
                "items": 952564,
                "bytes": 372459684
            },
            "osdmap": {
                "items": 3639,
                "bytes": 224664
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 260109445,
            "bytes": 2228370845
        }
    }
}
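
To see quickly which pools dominate a dump like the one above, the output can be sorted by bytes. This is only a minimal sketch and not something taken from the thread: it assumes the JSON keeps the "mempool" / "by_pool" / "total" layout shown above, and the helper name top_mempools is hypothetical. The sample embeds a few values copied from the dump.

```python
import json


def top_mempools(dump, n=3):
    """Return the n largest pools as (name, bytes, share_of_total) tuples."""
    mempool = dump["mempool"]
    total = mempool["total"]["bytes"]
    pools = sorted(
        mempool["by_pool"].items(),
        key=lambda kv: kv[1]["bytes"],
        reverse=True,
    )
    return [
        (name, pool["bytes"], pool["bytes"] / total if total else 0.0)
        for name, pool in pools[:n]
    ]


# A few entries copied from the dump above. In practice, load the full
# output of `ceph daemon osd.N dump_mempools` with json.load() instead.
SAMPLE = json.loads("""
{
    "mempool": {
        "by_pool": {
            "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
            "bluestore_cache_onode": {"items": 105637, "bytes": 70988064},
            "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
            "osd_pglog": {"items": 952564, "bytes": 372459684}
        },
        "total": {"items": 260109445, "bytes": 2228370845}
    }
}
""")

if __name__ == "__main__":
    for name, size, share in top_mempools(SAMPLE):
        print(f"{name:24s} {size:>14d} bytes  {share:6.1%}")
```

On the values above, bluestore_cache_other comes first (about 1.5 GB of the 2.2 GB total), followed by osd_pglog and bluestore_alloc.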


And here is the perf dump:

root@ceph5-2:~# ceph daemon osd.4 perf dump
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 22948570,
        "msgr_send_messages": 22561570,
        "msgr_recv_bytes": 333085080271,
        "msgr_send_bytes": 261798871204,
        "msgr_created_connections": 6152,
        "msgr_active_connections": 2701,
        "msgr_running_total_time": 1055.197867330,
        "msgr_running_send_time": 352.764480121,
        "msgr_running_recv_time": 499.206831955,
        "msgr_running_fast_dispatch_time": 130.982201607
    },
    "AsyncMessenger::Worker-1": {
        "msgr_recv_messages": 18801593,
        "msgr_send_messages": 18430264,
        "msgr_recv_bytes": 306871760934,
        "msgr_send_bytes": 192789048666,
        "msgr_created_connections": 5773,
        "msgr_active_connections": 2721,
        "msgr_running_total_time": 816.821076305,
        "msgr_running_send_time": 261.353228926,
        "msgr_running_recv_time": 394.035587911,
        "msgr_running_fast_dispatch_time": 104.012155720
    },
    "AsyncMessenger::Worker-2": {
        "msgr_recv_messages": 18463400,
        "msgr_send_messages": 18105856,
        "msgr_recv_bytes": 187425453590,
        "msgr_send_bytes": 220735102555,
        "msgr_created_connections": 5897,
        "msgr_active_connections": 2605,
        "msgr_running_total_time": 807.186854324,
        "msgr_running_send_time": 296.834435839,
        "msgr_running_recv_time": 351.364389691,
        "msgr_running_fast_dispatch_time": 101.215776792
    },
    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 256050724864,
        "db_used_bytes": 12413042688,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 0,
        "slow_used_bytes": 0,
        "num_files": 209,
        "log_bytes": 10383360,
        "log_compactions": 14,
        "logged_bytes": 336498688,
        "files_written_wal": 2,
        "files_written_sst": 4499,
        "bytes_written_wal": 417989099783,
        "bytes_written_sst": 213188750209
    },
    "bluestore": {
        "kv_flush_lat": {
            "avgcount": 26371957,
            "sum": 26.734038497,
            "avgtime": 0.000001013
        },
        "kv_commit_lat": {
            "avgcount": 26371957,
            "sum": 3397.491150603,
            "avgtime": 0.000128829
        },
        "kv_lat": {
            "avgcount": 26371957,
            "sum": 3424.225189100,
            "avgtime": 0.000129843
        },
        "state_prepare_lat": {
            "avgcount": 30484924,
            "sum": 3689.542105337,
            "avgtime": 0.000121028
        },
        "state_aio_wait_lat": {
            "avgcount": 30484924,
            "sum": 509.864546111,
            "avgtime": 0.000016725
        },
        "state_io_done_lat": {
            "avgcount": 30484924,
            "sum": 24.534052953,
            "avgtime": 0.000000804
        },
        "state_kv_queued_lat": {
            "avgcount": 30484924,
            "sum": 3488.338424238,
            "avgtime": 0.000114428
        },
        "state_kv_commiting_lat": {
            "avgcount": 30484924,
            "sum": 5660.437003432,
            "avgtime": 0.000185679
        },
        "state_kv_done_lat": {
            "avgcount": 30484924,
            "sum": 7.763511500,
            "avgtime": 0.000000254
        },
        "state_deferred_queued_lat": {
            "avgcount": 26346134,
            "sum": 666071.296856696,
            "avgtime": 0.025281557
        },
        "state_deferred_aio_wait_lat": {
            "avgcount": 26346134,
            "sum": 1755.660547071,
            "avgtime": 0.000066638
        },
        "state_deferred_cleanup_lat": {
            "avgcount": 26346134,
            "sum": 185465.151653703,
            "avgtime": 0.007039558
        },
        "state_finishing_lat": {
            "avgcount": 30484920,
            "sum": 3.046847481,
            "avgtime": 0.000000099
        },
        "state_done_lat": {
            "avgcount": 30484920,
            "sum": 13193.362685280,
            "avgtime": 0.000432783
        },
        "throttle_lat": {
            "avgcount": 30484924,
            "sum": 14.634269979,
            "avgtime": 0.000000480
        },
        "submit_lat": {
            "avgcount": 30484924,
            "sum": 3873.883076148,
            "avgtime": 0.000127075
        },
        "commit_lat": {
            "avgcount": 30484924,
            "sum": 13376.492317331,
            "avgtime": 0.000438790
        },
        "read_lat": {
            "avgcount": 5873923,
            "sum": 1817.167582057,
            "avgtime": 0.000309361
        },
        "read_onode_meta_lat": {
            "avgcount": 19608201,
            "sum": 146.770464482,
            "avgtime": 0.000007485
        },
        "read_wait_aio_lat": {
            "avgcount": 13734278,
            "sum": 2532.578077242,
            "avgtime": 0.000184398
        },
        "compress_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "decompress_lat": {
            "avgcount": 1346945,
            "sum": 26.227575896,
            "avgtime": 0.000019471
        },
        "csum_lat": {
            "avgcount": 28020392,
            "sum": 149.587819041,
            "avgtime": 0.000005338
        },
        "compress_success_count": 0,
        "compress_rejected_count": 0,
        "write_pad_bytes": 352923605,
        "deferred_write_ops": 24373340,
        "deferred_write_bytes": 216791842816,
        "write_penalty_read_ops": 8062366,
        "bluestore_allocated": 3765566013440,
        "bluestore_stored": 4186255221852,
        "bluestore_compressed": 39981379040,
        "bluestore_compressed_allocated": 73748348928,
        "bluestore_compressed_original": 165041381376,
        "bluestore_onodes": 104232,
        "bluestore_onode_hits": 71206874,
        "bluestore_onode_misses": 1217914,
        "bluestore_onode_shard_hits": 260183292,
        "bluestore_onode_shard_misses": 22851573,
        "bluestore_extents": 3394513,
        "bluestore_blobs": 2773587,
        "bluestore_buffers": 0,
        "bluestore_buffer_bytes": 0,
        "bluestore_buffer_hit_bytes": 62026011221,
        "bluestore_buffer_miss_bytes": 995233669922,
        "bluestore_write_big": 5648815,
        "bluestore_write_big_bytes": 552502214656,
        "bluestore_write_big_blobs": 12440992,
        "bluestore_write_small": 35883770,
        "bluestore_write_small_bytes": 223436965719,
        "bluestore_write_small_unused": 408125,
        "bluestore_write_small_deferred": 34961455,
        "bluestore_write_small_pre_read": 34961455,
        "bluestore_write_small_new": 514190,
        "bluestore_txc": 30484924,
        "bluestore_onode_reshard": 5144189,
        "bluestore_blob_split": 60104,
        "bluestore_extent_compress": 53347252,
        "bluestore_gc_merged": 21142528,
        "bluestore_read_eio": 0,
        "bluestore_fragmentation_micros": 67
    },
    "finisher-defered_finisher": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "finisher-finisher-0": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 26625163,
            "sum": 1057.506990951,
            "avgtime": 0.000039718
        }
    },
    "finisher-objecter-finisher-0": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.0::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.0::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.1::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.1::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.2::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.2::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.3::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.3::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.4::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.4::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.5::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.5::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.6::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.6::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.7::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.7::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "objecter": {
        "op_active": 0,
        "op_laggy": 0,
        "op_send": 0,
        "op_send_bytes": 0,
        "op_resend": 0,
        "op_reply": 0,
        "op": 0,
        "op_r": 0,
        "op_w": 0,
        "op_rmw": 0,
        "op_pg": 0,
        "osdop_stat": 0,
        "osdop_create": 0,
        "osdop_read": 0,
        "osdop_write": 0,
        "osdop_writefull": 0,
        "osdop_writesame": 0,
        "osdop_append": 0,
        "osdop_zero": 0,
        "osdop_truncate": 0,
        "osdop_delete": 0,
        "osdop_mapext": 0,
        "osdop_sparse_read": 0,
        "osdop_clonerange": 0,
        "osdop_getxattr": 0,
        "osdop_setxattr": 0,
        "osdop_cmpxattr": 0,
        "osdop_rmxattr": 0,
        "osdop_resetxattrs": 0,
        "osdop_tmap_up": 0,
        "osdop_tmap_put": 0,
        "osdop_tmap_get": 0,
        "osdop_call": 0,
        "osdop_watch": 0,
        "osdop_notify": 0,
        "osdop_src_cmpxattr": 0,
        "osdop_pgls": 0,
        "osdop_pgls_filter": 0,
        "osdop_other": 0,
        "linger_active": 0,
        "linger_send": 0,
        "linger_resend": 0,
        "linger_ping": 0,
        "poolop_active": 0,
        "poolop_send": 0,
        "poolop_resend": 0,
        "poolstat_active": 0,
        "poolstat_send": 0,
        "poolstat_resend": 0,
        "statfs_active": 0,
        "statfs_send": 0,
        "statfs_resend": 0,
        "command_active": 0,
        "command_send": 0,
        "command_resend": 0,
        "map_epoch": 105913,
        "map_full": 0,
        "map_inc": 828,
        "osd_sessions": 0,
        "osd_session_open": 0,
        "osd_session_close": 0,
        "osd_laggy": 0,
        "omap_wr": 0,
        "omap_rd": 0,
        "omap_del": 0
    },
    "osd": {
        "op_wip": 0,
        "op": 16758102,
        "op_in_bytes": 238398820586,
        "op_out_bytes": 165484999463,
        "op_latency": {
            "avgcount": 16758102,
            "sum": 38242.481640842,
            "avgtime": 0.002282029
        },
        "op_process_latency": {
            "avgcount": 16758102,
            "sum": 28644.906310687,
            "avgtime": 0.001709316
        },
        "op_prepare_latency": {
            "avgcount": 16761367,
            "sum": 3489.856599934,
            "avgtime": 0.000208208
        },
        "op_r": 6188565,
        "op_r_out_bytes": 165484999463,
        "op_r_latency": {
            "avgcount": 6188565,
            "sum": 4507.365756792,
            "avgtime": 0.000728337
        },
        "op_r_process_latency": {
            "avgcount": 6188565,
            "sum": 942.363063429,
            "avgtime": 0.000152274
        },
        "op_r_prepare_latency": {
            "avgcount": 6188644,
            "sum": 982.866710389,
            "avgtime": 0.000158817
        },
        "op_w": 10546037,
        "op_w_in_bytes": 238334329494,
        "op_w_latency": {
            "avgcount": 10546037,
            "sum": 33160.719998316,
            "avgtime": 0.003144377
        },
        "op_w_process_latency": {
            "avgcount": 10546037,
            "sum": 27668.702029030,
            "avgtime": 0.002623611
        },
        "op_w_prepare_latency": {
            "avgcount": 10548652,
            "sum": 2499.688609173,
            "avgtime": 0.000236967
        },
        "op_rw": 23500,
        "op_rw_in_bytes": 64491092,
        "op_rw_out_bytes": 0,
        "op_rw_latency": {
            "avgcount": 23500,
            "sum": 574.395885734,
            "avgtime": 0.024442378
        },
        "op_rw_process_latency": {
            "avgcount": 23500,
            "sum": 33.841218228,
            "avgtime": 0.001440051
        },
        "op_rw_prepare_latency": {
            "avgcount": 24071,
            "sum": 7.301280372,
            "avgtime": 0.000303322
        },
        "op_before_queue_op_lat": {
            "avgcount": 57892986,
            "sum": 1502.117718889,
            "avgtime": 0.000025946
        },
        "op_before_dequeue_op_lat": {
            "avgcount": 58091683,
            "sum": 45194.453254037,
            "avgtime": 0.000777984
        },
        "subop": 19784758,
        "subop_in_bytes": 547174969754,
        "subop_latency": {
            "avgcount": 19784758,
            "sum": 13019.714424060,
            "avgtime": 0.000658067
        },
        "subop_w": 19784758,
        "subop_w_in_bytes": 547174969754,
        "subop_w_latency": {
            "avgcount": 19784758,
            "sum": 13019.714424060,
            "avgtime": 0.000658067
        },
        "subop_pull": 0,
        "subop_pull_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "subop_push": 0,
        "subop_push_in_bytes": 0,
        "subop_push_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "pull": 0,
        "push": 2003,
        "push_out_bytes": 5560009728,
        "recovery_ops": 1940,
        "loadavg": 118,
        "buffer_bytes": 0,
        "history_alloc_Mbytes": 0,
        "history_alloc_num": 0,
        "cached_crc": 0,
        "cached_crc_adjusted": 0,
        "missed_crc": 0,
        "numpg": 243,
        "numpg_primary": 82,
        "numpg_replica": 161,
        "numpg_stray": 0,
        "numpg_removing": 0,
        "heartbeat_to_peers": 10,
        "map_messages": 7013,
        "map_message_epochs": 7143,
        "map_message_epoch_dups": 6315,
        "messages_delayed_for_map": 0,
        "osd_map_cache_hit": 203309,
        "osd_map_cache_miss": 33,
        "osd_map_cache_miss_low": 0,
        "osd_map_cache_miss_low_avg": {
            "avgcount": 0,
            "sum": 0
        },
        "osd_map_bl_cache_hit": 47012,
        "osd_map_bl_cache_miss": 1681,
        "stat_bytes": 6401248198656,
        "stat_bytes_used": 3777979072512,
        "stat_bytes_avail": 2623269126144,
        "copyfrom": 0,
        "tier_promote": 0,
        "tier_flush": 0,
        "tier_flush_fail": 0,
        "tier_try_flush": 0,
        "tier_try_flush_fail": 0,
        "tier_evict": 0,
        "tier_whiteout": 1631,
        "tier_dirty": 22360,
        "tier_clean": 0,
        "tier_delay": 0,
        "tier_proxy_read": 0,
        "tier_proxy_write": 0,
        "agent_wake": 0,
        "agent_skip": 0,
        "agent_flush": 0,
        "agent_evict": 0,
        "object_ctx_cache_hit": 16311156,
        "object_ctx_cache_total": 17426393,
        "op_cache_hit": 0,
        "osd_tier_flush_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_tier_promote_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_tier_r_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_pg_info": 30483113,
        "osd_pg_fastinfo": 29619885,
        "osd_pg_biginfo": 81703
    },
    "recoverystate_perf": {
        "initial_latency": {
            "avgcount": 243,
            "sum": 6.869296500,
            "avgtime": 0.028268709
        },
        "started_latency": {
            "avgcount": 1125,
            "sum": 13551384.917335850,
            "avgtime": 12045.675482076
        },
        "reset_latency": {
            "avgcount": 1368,
            "sum": 1101.727799040,
            "avgtime": 0.805356578
        },
        "start_latency": {
            "avgcount": 1368,
            "sum": 0.002014799,
            "avgtime": 0.000001472
        },
        "primary_latency": {
            "avgcount": 507,
            "sum": 4575560.638823428,
            "avgtime": 9024.774435549
        },
        "peering_latency": {
            "avgcount": 550,
            "sum": 499.372283616,
            "avgtime": 0.907949606
        },
        "backfilling_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "waitremotebackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "waitlocalbackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "notbackfilling_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "repnotrecovering_latency": {
            "avgcount": 1009,
            "sum": 8975301.082274411,
            "avgtime": 8895.243887288
        },
        "repwaitrecoveryreserved_latency": {
            "avgcount": 420,
            "sum": 99.846056520,
            "avgtime": 0.237728706
        },
        "repwaitbackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "reprecovering_latency": {
            "avgcount": 420,
            "sum": 241.682764382,
            "avgtime": 0.575435153
        },
        "activating_latency": {
            "avgcount": 507,
            "sum": 16.893347339,
            "avgtime": 0.033320211
        },
        "waitlocalrecoveryreserved_latency": {
            "avgcount": 199,
            "sum": 672.335512769,
            "avgtime": 3.378570415
        },
        "waitremoterecoveryreserved_latency": {
            "avgcount": 199,
            "sum": 213.536439363,
            "avgtime": 1.073047433
        },
        "recovering_latency": {
            "avgcount": 199,
            "sum": 79.007696479,
            "avgtime": 0.397023600
        },
        "recovered_latency": {
            "avgcount": 507,
            "sum": 14.000732748,
            "avgtime": 0.027614857
        },
        "clean_latency": {
            "avgcount": 395,
            "sum": 4574325.900371083,
            "avgtime": 11580.571899673
        },
        "active_latency": {
            "avgcount": 425,
            "sum": 4575107.630123680,
            "avgtime": 10764.959129702
        },
        "replicaactive_latency": {
            "avgcount": 589,
            "sum": 8975184.499049954,
            "avgtime": 15238.004242869
        },
        "stray_latency": {
            "avgcount": 818,
            "sum": 800.729455666,
            "avgtime": 0.978886865
        },
        "getinfo_latency": {
            "avgcount": 550,
            "sum": 15.085667048,
            "avgtime": 0.027428485
        },
        "getlog_latency": {
            "avgcount": 546,
            "sum": 3.482175693,
            "avgtime": 0.006377611
        },
        "waitactingchange_latency": {
            "avgcount": 39,
            "sum": 35.444551284,
            "avgtime": 0.908834648
        },
        "incomplete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "down_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "getmissing_latency": {
            "avgcount": 507,
            "sum": 6.702129624,
            "avgtime": 0.013219190
        },
        "waitupthru_latency": {
            "avgcount": 507,
            "sum": 474.098261727,
            "avgtime": 0.935105052
        },
        "notrecovering_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "rocksdb": {
        "get": 28320977,
        "submit_transaction": 30484924,
        "submit_transaction_sync": 26371957,
        "get_latency": {
            "avgcount": 28320977,
            "sum": 325.900908733,
            "avgtime": 0.000011507
        },
        "submit_latency": {
            "avgcount": 30484924,
            "sum": 1835.888692371,
            "avgtime": 0.000060222
        },
        "submit_sync_latency": {
            "avgcount": 26371957,
            "sum": 1431.555230628,
            "avgtime": 0.000054283
        },
        "compact": 0,
        "compact_range": 0,
        "compact_queue_merge": 0,
        "compact_queue_len": 0,
        "rocksdb_write_wal_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_memtable_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_delay_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_pre_and_post_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}
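
Each latency entry in the dump above is a cumulative pair: "sum" is the total time in seconds and "avgcount" the number of operations since the OSD started, so "avgtime" is a lifetime average that hides recent degradation. A minimal Python sketch of deriving the lifetime average and a windowed average between two samples follows. The helper names are made up for illustration, and the sample values are copied from the "rocksdb" section above.

```python
def avg_latency(counter):
    """Lifetime average in seconds: sum / avgcount (0.0 if no samples)."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0


def windowed_latency(older, newer):
    """Average latency of only the operations between two samples."""
    d_count = newer["avgcount"] - older["avgcount"]
    d_sum = newer["sum"] - older["sum"]
    # Counters restart from zero when the OSD restarts, which shows up
    # as a negative delta; report "no data" instead of a bogus value.
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Values copied from "submit_latency" in the "rocksdb" section above.
submit_latency = {"avgcount": 30484924, "sum": 1835.888692371}
print(f"lifetime average: {avg_latency(submit_latency) * 1e6:.1f} us")
```

Feeding windowed_latency two dumps taken some minutes apart gives the latency of just that interval, which is the quantity that matters when latency drifts over days.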

----- Original Message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "aderumier" <aderumier@odiso.com>
Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, 5 February 2019 18:56:51
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>> but I don't see l_bluestore_fragmentation counter. 
>>> (but I have bluestore_fragmentation_micros) 
> ok, this is the same 
> 
> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
> 
> 
> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
> 
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 

Hmm, so fragmentation grows over time and drops on OSD restart, doesn't
it? Is it the same for other OSDs?

This indicates some issue with the allocator: fragmentation may
generally grow, but it shouldn't reset on restart. It looks like some
intervals aren't properly merged at run time.

On the other hand, I'm not completely sure that the latency degradation
is caused by that: the fragmentation growth is relatively small, and I
don't see how it could impact performance that much.
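
For reference, the counter description quoted above can be read as a simple ratio. The Python sketch below is only an illustration of that description, not the allocator's code. It assumes that the "max possible number of free extents" means one free extent per allocation unit of free space, and the function and parameter names are hypothetical.

```python
def fragmentation_score(free_extents, free_bytes, alloc_unit):
    """(free extents / max possible number of free extents) * 1000."""
    # Assumption: the worst case is one separate free extent per
    # allocation unit of free space (ceiling division).
    max_extents = -(-free_bytes // alloc_unit)
    if max_extents == 0:
        return 0
    # Integer arithmetic avoids floating-point rounding in the scaling.
    return free_extents * 1000 // max_extents


# Hypothetical numbers: 1 GiB free, 64 KiB allocation unit,
# free space split into 1100 separate extents.
print(fragmentation_score(1100, 1 << 30, 64 * 1024))
```

On this reading, 0 means the free space is one contiguous run and 1000 means it is split into the largest possible number of pieces.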

I wonder whether you have OSD mempool monitoring reports (the output of
the dump_mempools command on the admin socket). Do you have any
historical data?

If not, may I have the current output and, say, a couple more samples at
8-12 hour intervals?


With regard to backporting the bitmap allocator to mimic: we haven't had
such plans so far, but I'll discuss this at the BlueStore meeting shortly.


Thanks, 

Igor 

> ----- Original Message ----- 
> From: "Alexandre Derumier" <aderumier@odiso.com> 
> To: "Igor Fedotov" <ifedotov@suse.de> 
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Monday, 4 February 2019 16:04:38 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Thanks Igor, 
> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> I'm already monitoring with 
> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
> 
> but I don't see l_bluestore_fragmentation counter. 
> 
> (but I have bluestore_fragmentation_micros) 
> 
> 
>>> Also if you're able to rebuild the code I can probably make a simple 
>>> patch to track latency and some other internal allocator's parameter to 
>>> make sure it's degraded and learn more details. 
> Sorry, it's a critical production cluster, I can't test on it :( 
> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
> 
> 
> 
>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>> and try the difference... 
> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
> Perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
> 
> 
> 
> ----- Original Message ----- 
> From: "Igor Fedotov" <ifedotov@suse.de> 
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Monday, 4 February 2019 15:51:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Hi Alexandre, 
> 
> looks like a bug in StupidAllocator. 
> 
> Could you please collect BlueStore performance counters right after OSD 
> startup and once you get high latency. 
> 
> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> 
> Also if you're able to rebuild the code I can probably make a simple 
> patch to track latency and some other internal allocator's parameter to 
> make sure it's degraded and learn more details. 
> 
> 
> More vigorous fix would be to backport bitmap allocator from Nautilus 
> and try the difference... 
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>> Hi again, 
>> 
>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>> 
>> 
>> I have noticed something using a simple "perf top". 
>> 
>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>> 
>> when latency is bad, perf top shows me: 
>> 
>> StupidAllocator::_aligned_len 
>> and 
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>> 
>> (around 10-20% of the time for both) 
>> 
>> 
>> when latency is good, I don't see them at all. 
>> 
>> 
>> I have used Mark's wallclock profiler; here are the results: 
>> 
>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>> 
>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>> 
>> 
>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>> 
>> 
>> + 100.00% clone 
>> + 100.00% start_thread 
>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Monday, 4 February 2019 09:38:11 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> Hi, 
>> 
>> Some news: 
>> 
>> I have tried different transparent hugepage values (madvise, never): no change. 
>> 
>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>> 
>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms (I need to wait a few more days to be sure). 
>> 
>> 
>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>> my other clusters use 1.6TB SSDs. 
>> 
>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe drive to see if it helps. 
>> 
>> 
>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> 
>> 
>> ----- Original Message ----- 
>> From: "aderumier" <aderumier@odiso.com> 
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Wednesday, 30 January 2019 19:58:15 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>> op_r_latency but instead op_latency? 
>>>> 
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>> 
>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Wednesday, 30 January 2019 19:50:20 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> Hi, 
>> 
>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>> Hi Stefan, 
>>> 
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>> like suggested. This report makes me a little nervous about my change. 
>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>> I need to compare with bigger latencies. 
>>> 
>>> Here is an example where all OSDs are at 20-50ms before the restart, then at 1ms after the restart (at 21:15): 
>>> http://odisoweb1.odiso.net/latencybad.png 
>>> 
>>> I observe the latency in my guest VMs too, as disk iowait. 
>>> 
>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>> 
>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>> exact values out of the daemon do you use for bluestore? 
>>> Here are my influxdb queries: 
>>> 
>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>> Thanks. Is there any reason you monitor op_w_latency but not 
>> op_r_latency but instead op_latency? 
>> 
>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>> 
>> greets, 
>> Stefan 
>> 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi, 
>>> 
>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>> Hi, 
>>>> 
>>>> Here are some new results, 
>>>> from a different OSD on a different cluster. 
>>>> 
>>>> Before the OSD restart, latency was between 2-5ms; 
>>>> after the OSD restart, it is around 1-1.5ms. 
>>>> 
>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>> 
>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>> (I'm using tcmalloc 2.5-2.2) 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
>>> 
>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>> exact values out of the daemon do you use for bluestore? 
>>> 
>>> I would like to check if i see the same behaviour. 
>>> 
>>> Greets, 
>>> Stefan 
>>> 
>>>> ----- Original Message ----- 
>>>> From: "Sage Weil" <sage@newdream.net> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>> going on one of the OSDs with a high latency? 
>>>> 
>>>> Thanks! 
>>>> sage 
>>>> 
>>>> 
>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>> 
>>>>> Hi, 
>>>>> 
>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>> 
>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>> 
>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>>>>> 
>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy 
>>>>> values like 20-200ms. 
>>>>> 
>>>>> Some example graphs: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>> 
>>>>> All osds have this behaviour, in all clusters. 
>>>>> 
>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>>>>> 
>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>> 
>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? 
>>>>> 
>>>>> Any Hints for counters/logs to check ? 
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>> _______________________________________________ 
>>>> ceph-users mailing list 
>>>> ceph-users@lists.ceph.com 
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> 




^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <825077993.841032.1549638894023.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                         ` <825077993.841032.1549638894023.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-08 15:57                                                           ` Alexandre DERUMIER
       [not found]                                                             ` <2132634351.842536.1549641461010.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-08 15:57 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Another mempool dump after a 1h run (latency OK).

Biggest difference:

Before restart
--------------
"bluestore_cache_other": { 
"items": 48661920, 
"bytes": 1539544228 
}, 
"bluestore_cache_data": { 
"items": 54, 
"bytes": 643072 
}, 
(the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)


After restart
-------------
"bluestore_cache_other": {
    "items": 12432298,
    "bytes": 500834899
},
"bluestore_cache_data": {
    "items": 40084,
    "bytes": 1056235520
},
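
The before/after comparison above can be reproduced for every pool with a short script. This is a minimal Python sketch: the function name is made up for illustration, and the two dictionaries contain only the two pools quoted above. In practice they would be loaded with json.load from two saved dump_mempools outputs.

```python
def pool_deltas(before, after):
    """Return {pool: bytes_after - bytes_before}, biggest change first."""
    old = before["mempool"]["by_pool"]
    new = after["mempool"]["by_pool"]
    deltas = {
        name: new[name]["bytes"] - old.get(name, {"bytes": 0})["bytes"]
        for name in new
    }
    # Sort by magnitude so the pools that moved the most come first.
    return dict(sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Values copied from the two excerpts above.
before_restart = {"mempool": {"by_pool": {
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
}}}

for name, delta in pool_deltas(before_restart, after_restart).items():
    print(f"{name:24s} {delta / 2**20:+10.1f} MiB")
```

Run against samples taken several hours apart, the same function shows which pools grow as latency degrades.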


full mempool dump after restart
-------------------------------

{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 165053952,
                "bytes": 165053952
            },
            "bluestore_cache_data": {
                "items": 40084,
                "bytes": 1056235520
            },
            "bluestore_cache_onode": {
                "items": 22225,
                "bytes": 14935200
            },
            "bluestore_cache_other": {
                "items": 12432298,
                "bytes": 500834899
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 11,
                "bytes": 8184
            },
            "bluestore_writing_deferred": {
                "items": 5047,
                "bytes": 22673736
            },
            "bluestore_writing": {
                "items": 91,
                "bytes": 1662976
            },
            "bluefs": {
                "items": 1907,
                "bytes": 95600
            },
            "buffer_anon": {
                "items": 19664,
                "bytes": 25486050
            },
            "buffer_meta": {
                "items": 46189,
                "bytes": 2956096
            },
            "osd": {
                "items": 243,
                "bytes": 3089016
            },
            "osd_mapbl": {
                "items": 17,
                "bytes": 214366
            },
            "osd_pglog": {
                "items": 889673,
                "bytes": 367160400
            },
            "osdmap": {
                "items": 3803,
                "bytes": 224552
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 178515204,
            "bytes": 2160630547
        }
    }
}

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Friday, 8 February 2019 16:14:54
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

I'm just seeing 

StupidAllocator::_aligned_len 
and 
btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 

on 1 osd, both 10%. 

Here is the dump_mempools output: 

{ 
"mempool": { 
"by_pool": { 
"bloom_filter": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_alloc": { 
"items": 210243456, 
"bytes": 210243456 
}, 
"bluestore_cache_data": { 
"items": 54, 
"bytes": 643072 
}, 
"bluestore_cache_onode": { 
"items": 105637, 
"bytes": 70988064 
}, 
"bluestore_cache_other": { 
"items": 48661920, 
"bytes": 1539544228 
}, 
"bluestore_fsck": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_txc": { 
"items": 12, 
"bytes": 8928 
}, 
"bluestore_writing_deferred": { 
"items": 406, 
"bytes": 4792868 
}, 
"bluestore_writing": { 
"items": 66, 
"bytes": 1085440 
}, 
"bluefs": { 
"items": 1882, 
"bytes": 93600 
}, 
"buffer_anon": { 
"items": 138986, 
"bytes": 24983701 
}, 
"buffer_meta": { 
"items": 544, 
"bytes": 34816 
}, 
"osd": { 
"items": 243, 
"bytes": 3089016 
}, 
"osd_mapbl": { 
"items": 36, 
"bytes": 179308 
}, 
"osd_pglog": { 
"items": 952564, 
"bytes": 372459684 
}, 
"osdmap": { 
"items": 3639, 
"bytes": 224664 
}, 
"osdmap_mapping": { 
"items": 0, 
"bytes": 0 
}, 
"pgmap": { 
"items": 0, 
"bytes": 0 
}, 
"mds_co": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_1": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_2": { 
"items": 0, 
"bytes": 0 
} 
}, 
"total": { 
"items": 260109445, 
"bytes": 2228370845 
} 
} 
} 


And here is the perf dump: 

root@ceph5-2:~# ceph daemon osd.4 perf dump 
{ 
"AsyncMessenger::Worker-0": { 
"msgr_recv_messages": 22948570, 
"msgr_send_messages": 22561570, 
"msgr_recv_bytes": 333085080271, 
"msgr_send_bytes": 261798871204, 
"msgr_created_connections": 6152, 
"msgr_active_connections": 2701, 
"msgr_running_total_time": 1055.197867330, 
"msgr_running_send_time": 352.764480121, 
"msgr_running_recv_time": 499.206831955, 
"msgr_running_fast_dispatch_time": 130.982201607 
}, 
"AsyncMessenger::Worker-1": { 
"msgr_recv_messages": 18801593, 
"msgr_send_messages": 18430264, 
"msgr_recv_bytes": 306871760934, 
"msgr_send_bytes": 192789048666, 
"msgr_created_connections": 5773, 
"msgr_active_connections": 2721, 
"msgr_running_total_time": 816.821076305, 
"msgr_running_send_time": 261.353228926, 
"msgr_running_recv_time": 394.035587911, 
"msgr_running_fast_dispatch_time": 104.012155720 
}, 
"AsyncMessenger::Worker-2": { 
"msgr_recv_messages": 18463400, 
"msgr_send_messages": 18105856, 
"msgr_recv_bytes": 187425453590, 
"msgr_send_bytes": 220735102555, 
"msgr_created_connections": 5897, 
"msgr_active_connections": 2605, 
"msgr_running_total_time": 807.186854324, 
"msgr_running_send_time": 296.834435839, 
"msgr_running_recv_time": 351.364389691, 
"msgr_running_fast_dispatch_time": 101.215776792 
}, 
"bluefs": { 
"gift_bytes": 0, 
"reclaim_bytes": 0, 
"db_total_bytes": 256050724864, 
"db_used_bytes": 12413042688, 
"wal_total_bytes": 0, 
"wal_used_bytes": 0, 
"slow_total_bytes": 0, 
"slow_used_bytes": 0, 
"num_files": 209, 
"log_bytes": 10383360, 
"log_compactions": 14, 
"logged_bytes": 336498688, 
"files_written_wal": 2, 
"files_written_sst": 4499, 
"bytes_written_wal": 417989099783, 
"bytes_written_sst": 213188750209 
}, 
"bluestore": { 
"kv_flush_lat": { 
"avgcount": 26371957, 
"sum": 26.734038497, 
"avgtime": 0.000001013 
}, 
"kv_commit_lat": { 
"avgcount": 26371957, 
"sum": 3397.491150603, 
"avgtime": 0.000128829 
}, 
"kv_lat": { 
"avgcount": 26371957, 
"sum": 3424.225189100, 
"avgtime": 0.000129843 
}, 
"state_prepare_lat": { 
"avgcount": 30484924, 
"sum": 3689.542105337, 
"avgtime": 0.000121028 
}, 
"state_aio_wait_lat": { 
"avgcount": 30484924, 
"sum": 509.864546111, 
"avgtime": 0.000016725 
}, 
"state_io_done_lat": { 
"avgcount": 30484924, 
"sum": 24.534052953, 
"avgtime": 0.000000804 
}, 
"state_kv_queued_lat": { 
"avgcount": 30484924, 
"sum": 3488.338424238, 
"avgtime": 0.000114428 
}, 
"state_kv_commiting_lat": { 
"avgcount": 30484924, 
"sum": 5660.437003432, 
"avgtime": 0.000185679 
}, 
"state_kv_done_lat": { 
"avgcount": 30484924, 
"sum": 7.763511500, 
"avgtime": 0.000000254 
}, 
"state_deferred_queued_lat": { 
"avgcount": 26346134, 
"sum": 666071.296856696, 
"avgtime": 0.025281557 
}, 
"state_deferred_aio_wait_lat": { 
"avgcount": 26346134, 
"sum": 1755.660547071, 
"avgtime": 0.000066638 
}, 
"state_deferred_cleanup_lat": { 
"avgcount": 26346134, 
"sum": 185465.151653703, 
"avgtime": 0.007039558 
}, 
"state_finishing_lat": { 
"avgcount": 30484920, 
"sum": 3.046847481, 
"avgtime": 0.000000099 
}, 
"state_done_lat": { 
"avgcount": 30484920, 
"sum": 13193.362685280, 
"avgtime": 0.000432783 
}, 
"throttle_lat": { 
"avgcount": 30484924, 
"sum": 14.634269979, 
"avgtime": 0.000000480 
}, 
"submit_lat": { 
"avgcount": 30484924, 
"sum": 3873.883076148, 
"avgtime": 0.000127075 
}, 
"commit_lat": { 
"avgcount": 30484924, 
"sum": 13376.492317331, 
"avgtime": 0.000438790 
}, 
"read_lat": { 
"avgcount": 5873923, 
"sum": 1817.167582057, 
"avgtime": 0.000309361 
}, 
"read_onode_meta_lat": { 
"avgcount": 19608201, 
"sum": 146.770464482, 
"avgtime": 0.000007485 
}, 
"read_wait_aio_lat": { 
"avgcount": 13734278, 
"sum": 2532.578077242, 
"avgtime": 0.000184398 
}, 
"compress_lat": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"decompress_lat": { 
"avgcount": 1346945, 
"sum": 26.227575896, 
"avgtime": 0.000019471 
}, 
"csum_lat": { 
"avgcount": 28020392, 
"sum": 149.587819041, 
"avgtime": 0.000005338 
}, 
"compress_success_count": 0, 
"compress_rejected_count": 0, 
"write_pad_bytes": 352923605, 
"deferred_write_ops": 24373340, 
"deferred_write_bytes": 216791842816, 
"write_penalty_read_ops": 8062366, 
"bluestore_allocated": 3765566013440, 
"bluestore_stored": 4186255221852, 
"bluestore_compressed": 39981379040, 
"bluestore_compressed_allocated": 73748348928, 
"bluestore_compressed_original": 165041381376, 
"bluestore_onodes": 104232, 
"bluestore_onode_hits": 71206874, 
"bluestore_onode_misses": 1217914, 
"bluestore_onode_shard_hits": 260183292, 
"bluestore_onode_shard_misses": 22851573, 
"bluestore_extents": 3394513, 
"bluestore_blobs": 2773587, 
"bluestore_buffers": 0, 
"bluestore_buffer_bytes": 0, 
"bluestore_buffer_hit_bytes": 62026011221, 
"bluestore_buffer_miss_bytes": 995233669922, 
"bluestore_write_big": 5648815, 
"bluestore_write_big_bytes": 552502214656, 
"bluestore_write_big_blobs": 12440992, 
"bluestore_write_small": 35883770, 
"bluestore_write_small_bytes": 223436965719, 
"bluestore_write_small_unused": 408125, 
"bluestore_write_small_deferred": 34961455, 
"bluestore_write_small_pre_read": 34961455, 
"bluestore_write_small_new": 514190, 
"bluestore_txc": 30484924, 
"bluestore_onode_reshard": 5144189, 
"bluestore_blob_split": 60104, 
"bluestore_extent_compress": 53347252, 
"bluestore_gc_merged": 21142528, 
"bluestore_read_eio": 0, 
"bluestore_fragmentation_micros": 67 
}, 
"finisher-defered_finisher": { 
"queue_len": 0, 
"complete_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"finisher-finisher-0": { 
"queue_len": 0, 
"complete_latency": { 
"avgcount": 26625163, 
"sum": 1057.506990951, 
"avgtime": 0.000039718 
} 
}, 
"finisher-objecter-finisher-0": { 
"queue_len": 0, 
"complete_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.0::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.0::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.1::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.1::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.2::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.2::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.3::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.3::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.4::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.4::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.5::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.5::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.6::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.6::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.7::sdata_wait_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"mutex-OSDShard.7::shard_lock": { 
"wait": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"objecter": { 
"op_active": 0, 
"op_laggy": 0, 
"op_send": 0, 
"op_send_bytes": 0, 
"op_resend": 0, 
"op_reply": 0, 
"op": 0, 
"op_r": 0, 
"op_w": 0, 
"op_rmw": 0, 
"op_pg": 0, 
"osdop_stat": 0, 
"osdop_create": 0, 
"osdop_read": 0, 
"osdop_write": 0, 
"osdop_writefull": 0, 
"osdop_writesame": 0, 
"osdop_append": 0, 
"osdop_zero": 0, 
"osdop_truncate": 0, 
"osdop_delete": 0, 
"osdop_mapext": 0, 
"osdop_sparse_read": 0, 
"osdop_clonerange": 0, 
"osdop_getxattr": 0, 
"osdop_setxattr": 0, 
"osdop_cmpxattr": 0, 
"osdop_rmxattr": 0, 
"osdop_resetxattrs": 0, 
"osdop_tmap_up": 0, 
"osdop_tmap_put": 0, 
"osdop_tmap_get": 0, 
"osdop_call": 0, 
"osdop_watch": 0, 
"osdop_notify": 0, 
"osdop_src_cmpxattr": 0, 
"osdop_pgls": 0, 
"osdop_pgls_filter": 0, 
"osdop_other": 0, 
"linger_active": 0, 
"linger_send": 0, 
"linger_resend": 0, 
"linger_ping": 0, 
"poolop_active": 0, 
"poolop_send": 0, 
"poolop_resend": 0, 
"poolstat_active": 0, 
"poolstat_send": 0, 
"poolstat_resend": 0, 
"statfs_active": 0, 
"statfs_send": 0, 
"statfs_resend": 0, 
"command_active": 0, 
"command_send": 0, 
"command_resend": 0, 
"map_epoch": 105913, 
"map_full": 0, 
"map_inc": 828, 
"osd_sessions": 0, 
"osd_session_open": 0, 
"osd_session_close": 0, 
"osd_laggy": 0, 
"omap_wr": 0, 
"omap_rd": 0, 
"omap_del": 0 
}, 
"osd": { 
"op_wip": 0, 
"op": 16758102, 
"op_in_bytes": 238398820586, 
"op_out_bytes": 165484999463, 
"op_latency": { 
"avgcount": 16758102, 
"sum": 38242.481640842, 
"avgtime": 0.002282029 
}, 
"op_process_latency": { 
"avgcount": 16758102, 
"sum": 28644.906310687, 
"avgtime": 0.001709316 
}, 
"op_prepare_latency": { 
"avgcount": 16761367, 
"sum": 3489.856599934, 
"avgtime": 0.000208208 
}, 
"op_r": 6188565, 
"op_r_out_bytes": 165484999463, 
"op_r_latency": { 
"avgcount": 6188565, 
"sum": 4507.365756792, 
"avgtime": 0.000728337 
}, 
"op_r_process_latency": { 
"avgcount": 6188565, 
"sum": 942.363063429, 
"avgtime": 0.000152274 
}, 
"op_r_prepare_latency": { 
"avgcount": 6188644, 
"sum": 982.866710389, 
"avgtime": 0.000158817 
}, 
"op_w": 10546037, 
"op_w_in_bytes": 238334329494, 
"op_w_latency": { 
"avgcount": 10546037, 
"sum": 33160.719998316, 
"avgtime": 0.003144377 
}, 
"op_w_process_latency": { 
"avgcount": 10546037, 
"sum": 27668.702029030, 
"avgtime": 0.002623611 
}, 
"op_w_prepare_latency": { 
"avgcount": 10548652, 
"sum": 2499.688609173, 
"avgtime": 0.000236967 
}, 
"op_rw": 23500, 
"op_rw_in_bytes": 64491092, 
"op_rw_out_bytes": 0, 
"op_rw_latency": { 
"avgcount": 23500, 
"sum": 574.395885734, 
"avgtime": 0.024442378 
}, 
"op_rw_process_latency": { 
"avgcount": 23500, 
"sum": 33.841218228, 
"avgtime": 0.001440051 
}, 
"op_rw_prepare_latency": { 
"avgcount": 24071, 
"sum": 7.301280372, 
"avgtime": 0.000303322 
}, 
"op_before_queue_op_lat": { 
"avgcount": 57892986, 
"sum": 1502.117718889, 
"avgtime": 0.000025946 
}, 
"op_before_dequeue_op_lat": { 
"avgcount": 58091683, 
"sum": 45194.453254037, 
"avgtime": 0.000777984 
}, 
"subop": 19784758, 
"subop_in_bytes": 547174969754, 
"subop_latency": { 
"avgcount": 19784758, 
"sum": 13019.714424060, 
"avgtime": 0.000658067 
}, 
"subop_w": 19784758, 
"subop_w_in_bytes": 547174969754, 
"subop_w_latency": { 
"avgcount": 19784758, 
"sum": 13019.714424060, 
"avgtime": 0.000658067 
}, 
"subop_pull": 0, 
"subop_pull_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"subop_push": 0, 
"subop_push_in_bytes": 0, 
"subop_push_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"pull": 0, 
"push": 2003, 
"push_out_bytes": 5560009728, 
"recovery_ops": 1940, 
"loadavg": 118, 
"buffer_bytes": 0, 
"history_alloc_Mbytes": 0, 
"history_alloc_num": 0, 
"cached_crc": 0, 
"cached_crc_adjusted": 0, 
"missed_crc": 0, 
"numpg": 243, 
"numpg_primary": 82, 
"numpg_replica": 161, 
"numpg_stray": 0, 
"numpg_removing": 0, 
"heartbeat_to_peers": 10, 
"map_messages": 7013, 
"map_message_epochs": 7143, 
"map_message_epoch_dups": 6315, 
"messages_delayed_for_map": 0, 
"osd_map_cache_hit": 203309, 
"osd_map_cache_miss": 33, 
"osd_map_cache_miss_low": 0, 
"osd_map_cache_miss_low_avg": { 
"avgcount": 0, 
"sum": 0 
}, 
"osd_map_bl_cache_hit": 47012, 
"osd_map_bl_cache_miss": 1681, 
"stat_bytes": 6401248198656, 
"stat_bytes_used": 3777979072512, 
"stat_bytes_avail": 2623269126144, 
"copyfrom": 0, 
"tier_promote": 0, 
"tier_flush": 0, 
"tier_flush_fail": 0, 
"tier_try_flush": 0, 
"tier_try_flush_fail": 0, 
"tier_evict": 0, 
"tier_whiteout": 1631, 
"tier_dirty": 22360, 
"tier_clean": 0, 
"tier_delay": 0, 
"tier_proxy_read": 0, 
"tier_proxy_write": 0, 
"agent_wake": 0, 
"agent_skip": 0, 
"agent_flush": 0, 
"agent_evict": 0, 
"object_ctx_cache_hit": 16311156, 
"object_ctx_cache_total": 17426393, 
"op_cache_hit": 0, 
"osd_tier_flush_lat": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"osd_tier_promote_lat": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"osd_tier_r_lat": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"osd_pg_info": 30483113, 
"osd_pg_fastinfo": 29619885, 
"osd_pg_biginfo": 81703 
}, 
"recoverystate_perf": { 
"initial_latency": { 
"avgcount": 243, 
"sum": 6.869296500, 
"avgtime": 0.028268709 
}, 
"started_latency": { 
"avgcount": 1125, 
"sum": 13551384.917335850, 
"avgtime": 12045.675482076 
}, 
"reset_latency": { 
"avgcount": 1368, 
"sum": 1101.727799040, 
"avgtime": 0.805356578 
}, 
"start_latency": { 
"avgcount": 1368, 
"sum": 0.002014799, 
"avgtime": 0.000001472 
}, 
"primary_latency": { 
"avgcount": 507, 
"sum": 4575560.638823428, 
"avgtime": 9024.774435549 
}, 
"peering_latency": { 
"avgcount": 550, 
"sum": 499.372283616, 
"avgtime": 0.907949606 
}, 
"backfilling_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"waitremotebackfillreserved_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"waitlocalbackfillreserved_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"notbackfilling_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"repnotrecovering_latency": { 
"avgcount": 1009, 
"sum": 8975301.082274411, 
"avgtime": 8895.243887288 
}, 
"repwaitrecoveryreserved_latency": { 
"avgcount": 420, 
"sum": 99.846056520, 
"avgtime": 0.237728706 
}, 
"repwaitbackfillreserved_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"reprecovering_latency": { 
"avgcount": 420, 
"sum": 241.682764382, 
"avgtime": 0.575435153 
}, 
"activating_latency": { 
"avgcount": 507, 
"sum": 16.893347339, 
"avgtime": 0.033320211 
}, 
"waitlocalrecoveryreserved_latency": { 
"avgcount": 199, 
"sum": 672.335512769, 
"avgtime": 3.378570415 
}, 
"waitremoterecoveryreserved_latency": { 
"avgcount": 199, 
"sum": 213.536439363, 
"avgtime": 1.073047433 
}, 
"recovering_latency": { 
"avgcount": 199, 
"sum": 79.007696479, 
"avgtime": 0.397023600 
}, 
"recovered_latency": { 
"avgcount": 507, 
"sum": 14.000732748, 
"avgtime": 0.027614857 
}, 
"clean_latency": { 
"avgcount": 395, 
"sum": 4574325.900371083, 
"avgtime": 11580.571899673 
}, 
"active_latency": { 
"avgcount": 425, 
"sum": 4575107.630123680, 
"avgtime": 10764.959129702 
}, 
"replicaactive_latency": { 
"avgcount": 589, 
"sum": 8975184.499049954, 
"avgtime": 15238.004242869 
}, 
"stray_latency": { 
"avgcount": 818, 
"sum": 800.729455666, 
"avgtime": 0.978886865 
}, 
"getinfo_latency": { 
"avgcount": 550, 
"sum": 15.085667048, 
"avgtime": 0.027428485 
}, 
"getlog_latency": { 
"avgcount": 546, 
"sum": 3.482175693, 
"avgtime": 0.006377611 
}, 
"waitactingchange_latency": { 
"avgcount": 39, 
"sum": 35.444551284, 
"avgtime": 0.908834648 
}, 
"incomplete_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"down_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"getmissing_latency": { 
"avgcount": 507, 
"sum": 6.702129624, 
"avgtime": 0.013219190 
}, 
"waitupthru_latency": { 
"avgcount": 507, 
"sum": 474.098261727, 
"avgtime": 0.935105052 
}, 
"notrecovering_latency": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
}, 
"rocksdb": { 
"get": 28320977, 
"submit_transaction": 30484924, 
"submit_transaction_sync": 26371957, 
"get_latency": { 
"avgcount": 28320977, 
"sum": 325.900908733, 
"avgtime": 0.000011507 
}, 
"submit_latency": { 
"avgcount": 30484924, 
"sum": 1835.888692371, 
"avgtime": 0.000060222 
}, 
"submit_sync_latency": { 
"avgcount": 26371957, 
"sum": 1431.555230628, 
"avgtime": 0.000054283 
}, 
"compact": 0, 
"compact_range": 0, 
"compact_queue_merge": 0, 
"compact_queue_len": 0, 
"rocksdb_write_wal_time": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"rocksdb_write_memtable_time": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"rocksdb_write_delay_time": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
}, 
"rocksdb_write_pre_and_post_time": { 
"avgcount": 0, 
"sum": 0.000000000, 
"avgtime": 0.000000000 
} 
} 
} 
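
A note for reading the dump above: each latency counter is a triple of "avgcount", "sum" (in seconds) and "avgtime", where "avgtime" is simply sum / avgcount since the counters were last reset. A slow drift is therefore easier to see from the difference between two samples than from the lifetime average. A minimal Python sketch of both calculations follows; the function names are illustrative and not part of Ceph.

```python
def lifetime_avg(counter):
    """Average latency in seconds since the counters were last reset."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0


def interval_avg(earlier, later):
    """Average latency in seconds for ops completed between two samples."""
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        # No new ops, or the counters went backwards (e.g. OSD restart).
        return None
    return (later["sum"] - earlier["sum"]) / ops


# op_w_latency exactly as it appears in the dump above.
op_w_latency = {"avgcount": 10546037, "sum": 33160.719998316}
print(f"{lifetime_avg(op_w_latency) * 1000:.3f} ms")  # prints 3.144 ms
```

Feeding interval_avg() two consecutive samples of the same counter gives the per-interval latency, which is the quantity a graph of latency over time needs.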

----- Original message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "aderumier" <aderumier@odiso.com>
Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday 5 February 2019 18:56:51
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>> but I don't see l_bluestore_fragmentation counter. 
>>> (but I have bluestore_fragmentation_micros) 
> ok, this is the same 
> 
> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
> 
> 
> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
> 
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 

Hmm, so fragmentation grows over time and drops on OSD restart, doesn't
it? Is it the same for the other OSDs?

This indicates some issue with the allocator: fragmentation may generally
grow, but it shouldn't reset on restart. It looks like some intervals
aren't properly merged at run time.
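
To illustrate the hypothesis, here is a toy model in Python. It is not StupidAllocator's code, and the extents are made up. If freed extents that touch each other are not coalesced at run time, the number of free intervals grows although the free space is unchanged; rebuilding the free list from scratch, as happens on OSD start, merges them again. The score function follows the counter description quoted above; what counts as the "max possible number of free extents" is left to the caller as an assumption.

```python
def merge_adjacent(extents):
    """Coalesce (offset, length) extents that touch or overlap."""
    merged = []
    for offset, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] >= offset:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


def fragmentation_score(free_extents, max_free_extents):
    """(free extents / max possible free extents) * 1000, per the counter text."""
    if max_free_extents <= 0:
        return 0.0
    return free_extents / max_free_extents * 1000


# Hypothetical free list: four 64 KiB extents, three of them contiguous.
free = [(0, 65536), (65536, 65536), (131072, 65536), (1048576, 65536)]
print(len(free), len(merge_adjacent(free)))  # prints 4 2
```

With the same free space, the unmerged list reports twice as many intervals, so a score based on the interval count would drop after a rebuild even though nothing was allocated or freed.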

On the other hand, I'm not completely sure that the latency degradation is
caused by that: the fragmentation growth is relatively small, and I don't
see how it could impact performance that much.

I wonder whether you have OSD mempool monitoring reports (dump_mempools
command output on the admin socket)? Do you have any historical data?

If not, may I have the current output and, say, a couple more samples at
8-12 hour intervals?
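
Such samples could be collected with something like the following Python sketch. It assumes the "ceph" CLI is available on the OSD host and that "ceph daemon osd.<id> dump_mempools" prints JSON; the helper names and the JSON-lines output file are illustrative choices, not an existing tool.

```python
import json
import subprocess
import time


def dump_mempools(osd_id, run=subprocess.check_output):
    """Return the parsed dump_mempools output from one OSD's admin socket."""
    raw = run(["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"])
    return json.loads(raw)


def take_sample(osd_id, path, run=subprocess.check_output, now=time.time):
    """Append one timestamped sample to a JSON-lines file and return it."""
    sample = {"ts": now(), "osd": osd_id, "mempools": dump_mempools(osd_id, run)}
    with open(path, "a") as out:
        out.write(json.dumps(sample) + "\n")
    return sample
```

Calling take_sample(4, "osd4-mempools.jsonl") from cron every 8 hours, for example, would build up the history one line per sample. The "run" and "now" parameters exist only so the function can be exercised without a live cluster.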


With regard to backporting the bitmap allocator to mimic: we haven't had
such plans so far, but I'll discuss this at the BlueStore meeting shortly.


Thanks, 

Igor 

> ----- Original message -----
> From: "Alexandre Derumier" <aderumier@odiso.com>
> To: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 16:04:38
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> Thanks Igor, 
> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> I'm already monitoring with
> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
> 
> but I don't see l_bluestore_fragmentation counter. 
> 
> (but I have bluestore_fragmentation_micros) 
> 
> 
>>> Also if you're able to rebuild the code I can probably make a simple
>>> patch to track latency and some other internal allocator parameters to
>>> make sure it's degraded and learn more details.
> Sorry, it's a critical production cluster, I can't test on it :(
> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
> 
> 
> 
>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>> and try the difference...
> Any plan to backport it to mimic? (But I can wait for Nautilus.)
> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
> 
> 
> 
> ----- Original message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday 4 February 2019 15:51:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> Hi Alexandre, 
> 
> Looks like a bug in StupidAllocator.
>
> Could you please collect BlueStore performance counters right after OSD
> startup and once you get high latency?
>
> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
>
> Also if you're able to rebuild the code I can probably make a simple
> patch to track latency and some other internal allocator parameters to
> make sure it's degraded and learn more details.
>
>
> A more vigorous fix would be to backport the bitmap allocator from Nautilus
> and try the difference...
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>> Hi again,
>>
>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>
>>
>> I have noticed something using a simple "perf top".
>>
>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>
>> when latency is bad, perf top shows me:
>>
>> StupidAllocator::_aligned_len
>> and
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>
>> (around 10-20% of the time for both)
>>
>>
>> When latency is good, I don't see them at all.
>>
>>
>> I have used Mark's wallclock profiler; here are the results:
>>
>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>
>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>
>>
>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
>> 
>> 
>> + 100.00% clone 
>> + 100.00% start_thread 
>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>> 
>> 
>> 
>> ----- Original message -----
>> From: "Alexandre Derumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday 4 February 2019 09:38:11
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> Hi, 
>> 
>> Some news:
>>
>> I have tried different transparent hugepage values (madvise, never): no change.
>>
>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>
>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait a few more days to be sure).
>>
>>
>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB);
>> my other clusters use 1.6TB SSDs.
>>
>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe, to see if it helps.
>>
>>
>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> 
>> 
>> ----- Original message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 19:58:15
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
>>>> but not op_r_latency?
>>>>
>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.)
>>
>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase).
>> 
>> 
>> 
>> ----- Original message -----
>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Wednesday 30 January 2019 19:50:20
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> Hi, 
>> 
>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>> Hi Stefan,
>>>
>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
>>>>> as suggested. This report makes me a little nervous about my change.
>>> Well, I'm really not sure that it's a tcmalloc bug.
>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>> I need to compare with bigger latencies.
>>>
>>> Here is an example, with all OSDs at 20-50ms before restart, then 1ms after restart (at 21:15):
>>> http://odisoweb1.odiso.net/latencybad.png
>>>
>>> I observe the latency in my guest VMs too, in disk iowait.
>>>
>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>
>>>>> Also I'm currently only monitoring latency for filestore OSDs. Which
>>>>> exact values out of the daemon do you use for bluestore?
>>> Here are my InfluxDB queries.
>>>
>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> 
>>> 
>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
>> but not op_r_latency?
>>
>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>
>> Greets,
>> Stefan 
>> 
>>> 
>>> 
>>> 
>>> ----- Original message -----
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Wednesday 30 January 2019 08:45:33
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>> 
>>> Hi, 
>>> 
>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>> Hi,
>>>>
>>>> Here are some new results,
>>>> from a different OSD on a different cluster.
>>>>
>>>> Before the OSD restart, latency was between 2-5ms;
>>>> after the OSD restart it is around 1-1.5ms.
>>>>
>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>
>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>> (I'm using tcmalloc 2.5-2.2)
>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
>>> as suggested. This report makes me a little nervous about my change.
>>>
>>> Also I'm currently only monitoring latency for filestore OSDs. Which
>>> exact values out of the daemon do you use for bluestore?
>>>
>>> I would like to check if I see the same behaviour.
>>> 
>>> Greets, 
>>> Stefan 
>>> 
>>>> ----- Original message -----
>>>> From: "Sage Weil" <sage@newdream.net>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Friday 25 January 2019 10:49:02
>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>> 
>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>> going on one of the OSDs with a high latency?
>>>> 
>>>> Thanks! 
>>>> sage 
>>>> 
>>>> 
>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>> 
>>>>> Hi, 
>>>>> 
>>>>> I am seeing strange behaviour of my OSDs, on multiple clusters.
>>>>>
>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives;
>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>
>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms.
>>>>>
>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>> values like 20-200ms.
>>>>>
>>>>> Some example graphs:
>>>>>
>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>
>>>>> All OSDs have this behaviour, in all clusters.
>>>>>
>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.)
>>>>>
>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms.
>>>>>
>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>
>>>>> Any hints for counters/logs to check?
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>> _______________________________________________ 
>>>> ceph-users mailing list 
>>>> ceph-users@lists.ceph.com 
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> 




^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <2132634351.842536.1549641461010.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                             ` <2132634351.842536.1549641461010.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-11 11:03                                                               ` Igor Fedotov
       [not found]                                                                 ` <c26e0eca-1a1c-3354-bff6-4560e3aea4c5-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-11 11:03 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel


On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
> Another mempool dump after a 1h run (latency OK).
>
> Biggest difference:
>
> before restart
> -------------
> "bluestore_cache_other": {
> "items": 48661920,
> "bytes": 1539544228
> },
> "bluestore_cache_data": {
> "items": 54,
> "bytes": 643072
> },
> (the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)
>
>
> After restart
> -------------
> "bluestore_cache_other": {
>   "items": 12432298,
>    "bytes": 500834899
> },
> "bluestore_cache_data": {
>   "items": 40084,
>   "bytes": 1056235520
> },
>
This is fine, as the cache is warming up after the restart and some
rebalancing between data and metadata might occur.

What does relate to the allocator, and most probably to fragmentation growth, is:

             "bluestore_alloc": {
                 "items": 165053952,
                 "bytes": 165053952
             },

which had been higher before the restart (if I got the order of these
dumps right):

             "bluestore_alloc": {
                 "items": 210243456,
                 "bytes": 210243456
             },

But as I mentioned, I'm not 100% sure this could cause such a huge
latency increase...

Do you have a perf counters dump from after the restart?

Could you collect some more dumps, for both mempool and perf counters?

Ideally I'd like to have:

1) mempool/perf counters dumps after the restart (1 hour is OK)

2) mempool/perf counters dumps 24+ hours after the restart

3) reset the perf counters after 2), wait for 1 hour (without an OSD
restart) and dump mempool/perf counters again.

That way we'll be able to learn both the allocator memory usage growth and
the operation latency distribution for the following periods:

a) the 1st hour after restart

b) the 25th hour.
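
The mempool side of such a comparison can be sketched in Python as follows. The helper names are made up for illustration; the two inputs are trimmed to the bluestore_alloc, bluestore_cache_other and bluestore_cache_data byte counts quoted in this message.

```python
def mempool_byte_deltas(before, after):
    """Per-pool change in bytes between two dump_mempools outputs, largest first."""
    old = before["mempool"]["by_pool"]
    new = after["mempool"]["by_pool"]
    deltas = {
        pool: new.get(pool, {}).get("bytes", 0) - old.get(pool, {}).get("bytes", 0)
        for pool in set(old) | set(new)
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


def by_pool(**pools):
    """Build a minimal dump_mempools-shaped dict from pool=bytes pairs."""
    return {"mempool": {"by_pool": {name: {"bytes": size} for name, size in pools.items()}}}


before_restart = by_pool(
    bluestore_alloc=210243456,
    bluestore_cache_other=1539544228,
    bluestore_cache_data=643072,
)
after_restart = by_pool(
    bluestore_alloc=165053952,
    bluestore_cache_other=500834899,
    bluestore_cache_data=1056235520,
)

for pool, delta in mempool_byte_deltas(before_restart, after_restart):
    print(f"{pool:24s} {delta:+d}")
```

For these two dumps the bluestore_alloc entry comes out as -45189504 bytes, i.e. the allocator pool was about 45 MB larger before the restart than one hour after it.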


Thanks,

Igor


> full mempool dump after restart
> -------------------------------
>
> {
>      "mempool": {
>          "by_pool": {
>              "bloom_filter": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "bluestore_alloc": {
>                  "items": 165053952,
>                  "bytes": 165053952
>              },
>              "bluestore_cache_data": {
>                  "items": 40084,
>                  "bytes": 1056235520
>              },
>              "bluestore_cache_onode": {
>                  "items": 22225,
>                  "bytes": 14935200
>              },
>              "bluestore_cache_other": {
>                  "items": 12432298,
>                  "bytes": 500834899
>              },
>              "bluestore_fsck": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "bluestore_txc": {
>                  "items": 11,
>                  "bytes": 8184
>              },
>              "bluestore_writing_deferred": {
>                  "items": 5047,
>                  "bytes": 22673736
>              },
>              "bluestore_writing": {
>                  "items": 91,
>                  "bytes": 1662976
>              },
>              "bluefs": {
>                  "items": 1907,
>                  "bytes": 95600
>              },
>              "buffer_anon": {
>                  "items": 19664,
>                  "bytes": 25486050
>              },
>              "buffer_meta": {
>                  "items": 46189,
>                  "bytes": 2956096
>              },
>              "osd": {
>                  "items": 243,
>                  "bytes": 3089016
>              },
>              "osd_mapbl": {
>                  "items": 17,
>                  "bytes": 214366
>              },
>              "osd_pglog": {
>                  "items": 889673,
>                  "bytes": 367160400
>              },
>              "osdmap": {
>                  "items": 3803,
>                  "bytes": 224552
>              },
>              "osdmap_mapping": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "pgmap": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "mds_co": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "unittest_1": {
>                  "items": 0,
>                  "bytes": 0
>              },
>              "unittest_2": {
>                  "items": 0,
>                  "bytes": 0
>              }
>          },
>          "total": {
>              "items": 178515204,
>              "bytes": 2160630547
>          }
>      }
> }
>
> ----- Original message -----
> From: "aderumier" <aderumier@odiso.com>
> To: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday 8 February 2019 16:14:54
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> I'm just seeing
>
> StupidAllocator::_aligned_len
> and
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>
> on 1 OSD, both at 10%.
>
> Here is the dump_mempools output:
>
> {
> "mempool": {
> "by_pool": {
> "bloom_filter": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_alloc": {
> "items": 210243456,
> "bytes": 210243456
> },
> "bluestore_cache_data": {
> "items": 54,
> "bytes": 643072
> },
> "bluestore_cache_onode": {
> "items": 105637,
> "bytes": 70988064
> },
> "bluestore_cache_other": {
> "items": 48661920,
> "bytes": 1539544228
> },
> "bluestore_fsck": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_txc": {
> "items": 12,
> "bytes": 8928
> },
> "bluestore_writing_deferred": {
> "items": 406,
> "bytes": 4792868
> },
> "bluestore_writing": {
> "items": 66,
> "bytes": 1085440
> },
> "bluefs": {
> "items": 1882,
> "bytes": 93600
> },
> "buffer_anon": {
> "items": 138986,
> "bytes": 24983701
> },
> "buffer_meta": {
> "items": 544,
> "bytes": 34816
> },
> "osd": {
> "items": 243,
> "bytes": 3089016
> },
> "osd_mapbl": {
> "items": 36,
> "bytes": 179308
> },
> "osd_pglog": {
> "items": 952564,
> "bytes": 372459684
> },
> "osdmap": {
> "items": 3639,
> "bytes": 224664
> },
> "osdmap_mapping": {
> "items": 0,
> "bytes": 0
> },
> "pgmap": {
> "items": 0,
> "bytes": 0
> },
> "mds_co": {
> "items": 0,
> "bytes": 0
> },
> "unittest_1": {
> "items": 0,
> "bytes": 0
> },
> "unittest_2": {
> "items": 0,
> "bytes": 0
> }
> },
> "total": {
> "items": 260109445,
> "bytes": 2228370845
> }
> }
> }
>
>
> and the perf dump
>
> root@ceph5-2:~# ceph daemon osd.4 perf dump
> {
> "AsyncMessenger::Worker-0": {
> "msgr_recv_messages": 22948570,
> "msgr_send_messages": 22561570,
> "msgr_recv_bytes": 333085080271,
> "msgr_send_bytes": 261798871204,
> "msgr_created_connections": 6152,
> "msgr_active_connections": 2701,
> "msgr_running_total_time": 1055.197867330,
> "msgr_running_send_time": 352.764480121,
> "msgr_running_recv_time": 499.206831955,
> "msgr_running_fast_dispatch_time": 130.982201607
> },
> "AsyncMessenger::Worker-1": {
> "msgr_recv_messages": 18801593,
> "msgr_send_messages": 18430264,
> "msgr_recv_bytes": 306871760934,
> "msgr_send_bytes": 192789048666,
> "msgr_created_connections": 5773,
> "msgr_active_connections": 2721,
> "msgr_running_total_time": 816.821076305,
> "msgr_running_send_time": 261.353228926,
> "msgr_running_recv_time": 394.035587911,
> "msgr_running_fast_dispatch_time": 104.012155720
> },
> "AsyncMessenger::Worker-2": {
> "msgr_recv_messages": 18463400,
> "msgr_send_messages": 18105856,
> "msgr_recv_bytes": 187425453590,
> "msgr_send_bytes": 220735102555,
> "msgr_created_connections": 5897,
> "msgr_active_connections": 2605,
> "msgr_running_total_time": 807.186854324,
> "msgr_running_send_time": 296.834435839,
> "msgr_running_recv_time": 351.364389691,
> "msgr_running_fast_dispatch_time": 101.215776792
> },
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 256050724864,
> "db_used_bytes": 12413042688,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 0,
> "slow_used_bytes": 0,
> "num_files": 209,
> "log_bytes": 10383360,
> "log_compactions": 14,
> "logged_bytes": 336498688,
> "files_written_wal": 2,
> "files_written_sst": 4499,
> "bytes_written_wal": 417989099783,
> "bytes_written_sst": 213188750209
> },
> "bluestore": {
> "kv_flush_lat": {
> "avgcount": 26371957,
> "sum": 26.734038497,
> "avgtime": 0.000001013
> },
> "kv_commit_lat": {
> "avgcount": 26371957,
> "sum": 3397.491150603,
> "avgtime": 0.000128829
> },
> "kv_lat": {
> "avgcount": 26371957,
> "sum": 3424.225189100,
> "avgtime": 0.000129843
> },
> "state_prepare_lat": {
> "avgcount": 30484924,
> "sum": 3689.542105337,
> "avgtime": 0.000121028
> },
> "state_aio_wait_lat": {
> "avgcount": 30484924,
> "sum": 509.864546111,
> "avgtime": 0.000016725
> },
> "state_io_done_lat": {
> "avgcount": 30484924,
> "sum": 24.534052953,
> "avgtime": 0.000000804
> },
> "state_kv_queued_lat": {
> "avgcount": 30484924,
> "sum": 3488.338424238,
> "avgtime": 0.000114428
> },
> "state_kv_commiting_lat": {
> "avgcount": 30484924,
> "sum": 5660.437003432,
> "avgtime": 0.000185679
> },
> "state_kv_done_lat": {
> "avgcount": 30484924,
> "sum": 7.763511500,
> "avgtime": 0.000000254
> },
> "state_deferred_queued_lat": {
> "avgcount": 26346134,
> "sum": 666071.296856696,
> "avgtime": 0.025281557
> },
> "state_deferred_aio_wait_lat": {
> "avgcount": 26346134,
> "sum": 1755.660547071,
> "avgtime": 0.000066638
> },
> "state_deferred_cleanup_lat": {
> "avgcount": 26346134,
> "sum": 185465.151653703,
> "avgtime": 0.007039558
> },
> "state_finishing_lat": {
> "avgcount": 30484920,
> "sum": 3.046847481,
> "avgtime": 0.000000099
> },
> "state_done_lat": {
> "avgcount": 30484920,
> "sum": 13193.362685280,
> "avgtime": 0.000432783
> },
> "throttle_lat": {
> "avgcount": 30484924,
> "sum": 14.634269979,
> "avgtime": 0.000000480
> },
> "submit_lat": {
> "avgcount": 30484924,
> "sum": 3873.883076148,
> "avgtime": 0.000127075
> },
> "commit_lat": {
> "avgcount": 30484924,
> "sum": 13376.492317331,
> "avgtime": 0.000438790
> },
> "read_lat": {
> "avgcount": 5873923,
> "sum": 1817.167582057,
> "avgtime": 0.000309361
> },
> "read_onode_meta_lat": {
> "avgcount": 19608201,
> "sum": 146.770464482,
> "avgtime": 0.000007485
> },
> "read_wait_aio_lat": {
> "avgcount": 13734278,
> "sum": 2532.578077242,
> "avgtime": 0.000184398
> },
> "compress_lat": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "decompress_lat": {
> "avgcount": 1346945,
> "sum": 26.227575896,
> "avgtime": 0.000019471
> },
> "csum_lat": {
> "avgcount": 28020392,
> "sum": 149.587819041,
> "avgtime": 0.000005338
> },
> "compress_success_count": 0,
> "compress_rejected_count": 0,
> "write_pad_bytes": 352923605,
> "deferred_write_ops": 24373340,
> "deferred_write_bytes": 216791842816,
> "write_penalty_read_ops": 8062366,
> "bluestore_allocated": 3765566013440,
> "bluestore_stored": 4186255221852,
> "bluestore_compressed": 39981379040,
> "bluestore_compressed_allocated": 73748348928,
> "bluestore_compressed_original": 165041381376,
> "bluestore_onodes": 104232,
> "bluestore_onode_hits": 71206874,
> "bluestore_onode_misses": 1217914,
> "bluestore_onode_shard_hits": 260183292,
> "bluestore_onode_shard_misses": 22851573,
> "bluestore_extents": 3394513,
> "bluestore_blobs": 2773587,
> "bluestore_buffers": 0,
> "bluestore_buffer_bytes": 0,
> "bluestore_buffer_hit_bytes": 62026011221,
> "bluestore_buffer_miss_bytes": 995233669922,
> "bluestore_write_big": 5648815,
> "bluestore_write_big_bytes": 552502214656,
> "bluestore_write_big_blobs": 12440992,
> "bluestore_write_small": 35883770,
> "bluestore_write_small_bytes": 223436965719,
> "bluestore_write_small_unused": 408125,
> "bluestore_write_small_deferred": 34961455,
> "bluestore_write_small_pre_read": 34961455,
> "bluestore_write_small_new": 514190,
> "bluestore_txc": 30484924,
> "bluestore_onode_reshard": 5144189,
> "bluestore_blob_split": 60104,
> "bluestore_extent_compress": 53347252,
> "bluestore_gc_merged": 21142528,
> "bluestore_read_eio": 0,
> "bluestore_fragmentation_micros": 67
> },
> "finisher-defered_finisher": {
> "queue_len": 0,
> "complete_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "finisher-finisher-0": {
> "queue_len": 0,
> "complete_latency": {
> "avgcount": 26625163,
> "sum": 1057.506990951,
> "avgtime": 0.000039718
> }
> },
> "finisher-objecter-finisher-0": {
> "queue_len": 0,
> "complete_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.0::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.0::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.1::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.1::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.2::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.2::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.3::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.3::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.4::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.4::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.5::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.5::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.6::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.6::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.7::sdata_wait_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "mutex-OSDShard.7::shard_lock": {
> "wait": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "objecter": {
> "op_active": 0,
> "op_laggy": 0,
> "op_send": 0,
> "op_send_bytes": 0,
> "op_resend": 0,
> "op_reply": 0,
> "op": 0,
> "op_r": 0,
> "op_w": 0,
> "op_rmw": 0,
> "op_pg": 0,
> "osdop_stat": 0,
> "osdop_create": 0,
> "osdop_read": 0,
> "osdop_write": 0,
> "osdop_writefull": 0,
> "osdop_writesame": 0,
> "osdop_append": 0,
> "osdop_zero": 0,
> "osdop_truncate": 0,
> "osdop_delete": 0,
> "osdop_mapext": 0,
> "osdop_sparse_read": 0,
> "osdop_clonerange": 0,
> "osdop_getxattr": 0,
> "osdop_setxattr": 0,
> "osdop_cmpxattr": 0,
> "osdop_rmxattr": 0,
> "osdop_resetxattrs": 0,
> "osdop_tmap_up": 0,
> "osdop_tmap_put": 0,
> "osdop_tmap_get": 0,
> "osdop_call": 0,
> "osdop_watch": 0,
> "osdop_notify": 0,
> "osdop_src_cmpxattr": 0,
> "osdop_pgls": 0,
> "osdop_pgls_filter": 0,
> "osdop_other": 0,
> "linger_active": 0,
> "linger_send": 0,
> "linger_resend": 0,
> "linger_ping": 0,
> "poolop_active": 0,
> "poolop_send": 0,
> "poolop_resend": 0,
> "poolstat_active": 0,
> "poolstat_send": 0,
> "poolstat_resend": 0,
> "statfs_active": 0,
> "statfs_send": 0,
> "statfs_resend": 0,
> "command_active": 0,
> "command_send": 0,
> "command_resend": 0,
> "map_epoch": 105913,
> "map_full": 0,
> "map_inc": 828,
> "osd_sessions": 0,
> "osd_session_open": 0,
> "osd_session_close": 0,
> "osd_laggy": 0,
> "omap_wr": 0,
> "omap_rd": 0,
> "omap_del": 0
> },
> "osd": {
> "op_wip": 0,
> "op": 16758102,
> "op_in_bytes": 238398820586,
> "op_out_bytes": 165484999463,
> "op_latency": {
> "avgcount": 16758102,
> "sum": 38242.481640842,
> "avgtime": 0.002282029
> },
> "op_process_latency": {
> "avgcount": 16758102,
> "sum": 28644.906310687,
> "avgtime": 0.001709316
> },
> "op_prepare_latency": {
> "avgcount": 16761367,
> "sum": 3489.856599934,
> "avgtime": 0.000208208
> },
> "op_r": 6188565,
> "op_r_out_bytes": 165484999463,
> "op_r_latency": {
> "avgcount": 6188565,
> "sum": 4507.365756792,
> "avgtime": 0.000728337
> },
> "op_r_process_latency": {
> "avgcount": 6188565,
> "sum": 942.363063429,
> "avgtime": 0.000152274
> },
> "op_r_prepare_latency": {
> "avgcount": 6188644,
> "sum": 982.866710389,
> "avgtime": 0.000158817
> },
> "op_w": 10546037,
> "op_w_in_bytes": 238334329494,
> "op_w_latency": {
> "avgcount": 10546037,
> "sum": 33160.719998316,
> "avgtime": 0.003144377
> },
> "op_w_process_latency": {
> "avgcount": 10546037,
> "sum": 27668.702029030,
> "avgtime": 0.002623611
> },
> "op_w_prepare_latency": {
> "avgcount": 10548652,
> "sum": 2499.688609173,
> "avgtime": 0.000236967
> },
> "op_rw": 23500,
> "op_rw_in_bytes": 64491092,
> "op_rw_out_bytes": 0,
> "op_rw_latency": {
> "avgcount": 23500,
> "sum": 574.395885734,
> "avgtime": 0.024442378
> },
> "op_rw_process_latency": {
> "avgcount": 23500,
> "sum": 33.841218228,
> "avgtime": 0.001440051
> },
> "op_rw_prepare_latency": {
> "avgcount": 24071,
> "sum": 7.301280372,
> "avgtime": 0.000303322
> },
> "op_before_queue_op_lat": {
> "avgcount": 57892986,
> "sum": 1502.117718889,
> "avgtime": 0.000025946
> },
> "op_before_dequeue_op_lat": {
> "avgcount": 58091683,
> "sum": 45194.453254037,
> "avgtime": 0.000777984
> },
> "subop": 19784758,
> "subop_in_bytes": 547174969754,
> "subop_latency": {
> "avgcount": 19784758,
> "sum": 13019.714424060,
> "avgtime": 0.000658067
> },
> "subop_w": 19784758,
> "subop_w_in_bytes": 547174969754,
> "subop_w_latency": {
> "avgcount": 19784758,
> "sum": 13019.714424060,
> "avgtime": 0.000658067
> },
> "subop_pull": 0,
> "subop_pull_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "subop_push": 0,
> "subop_push_in_bytes": 0,
> "subop_push_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "pull": 0,
> "push": 2003,
> "push_out_bytes": 5560009728,
> "recovery_ops": 1940,
> "loadavg": 118,
> "buffer_bytes": 0,
> "history_alloc_Mbytes": 0,
> "history_alloc_num": 0,
> "cached_crc": 0,
> "cached_crc_adjusted": 0,
> "missed_crc": 0,
> "numpg": 243,
> "numpg_primary": 82,
> "numpg_replica": 161,
> "numpg_stray": 0,
> "numpg_removing": 0,
> "heartbeat_to_peers": 10,
> "map_messages": 7013,
> "map_message_epochs": 7143,
> "map_message_epoch_dups": 6315,
> "messages_delayed_for_map": 0,
> "osd_map_cache_hit": 203309,
> "osd_map_cache_miss": 33,
> "osd_map_cache_miss_low": 0,
> "osd_map_cache_miss_low_avg": {
> "avgcount": 0,
> "sum": 0
> },
> "osd_map_bl_cache_hit": 47012,
> "osd_map_bl_cache_miss": 1681,
> "stat_bytes": 6401248198656,
> "stat_bytes_used": 3777979072512,
> "stat_bytes_avail": 2623269126144,
> "copyfrom": 0,
> "tier_promote": 0,
> "tier_flush": 0,
> "tier_flush_fail": 0,
> "tier_try_flush": 0,
> "tier_try_flush_fail": 0,
> "tier_evict": 0,
> "tier_whiteout": 1631,
> "tier_dirty": 22360,
> "tier_clean": 0,
> "tier_delay": 0,
> "tier_proxy_read": 0,
> "tier_proxy_write": 0,
> "agent_wake": 0,
> "agent_skip": 0,
> "agent_flush": 0,
> "agent_evict": 0,
> "object_ctx_cache_hit": 16311156,
> "object_ctx_cache_total": 17426393,
> "op_cache_hit": 0,
> "osd_tier_flush_lat": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "osd_tier_promote_lat": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "osd_tier_r_lat": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "osd_pg_info": 30483113,
> "osd_pg_fastinfo": 29619885,
> "osd_pg_biginfo": 81703
> },
> "recoverystate_perf": {
> "initial_latency": {
> "avgcount": 243,
> "sum": 6.869296500,
> "avgtime": 0.028268709
> },
> "started_latency": {
> "avgcount": 1125,
> "sum": 13551384.917335850,
> "avgtime": 12045.675482076
> },
> "reset_latency": {
> "avgcount": 1368,
> "sum": 1101.727799040,
> "avgtime": 0.805356578
> },
> "start_latency": {
> "avgcount": 1368,
> "sum": 0.002014799,
> "avgtime": 0.000001472
> },
> "primary_latency": {
> "avgcount": 507,
> "sum": 4575560.638823428,
> "avgtime": 9024.774435549
> },
> "peering_latency": {
> "avgcount": 550,
> "sum": 499.372283616,
> "avgtime": 0.907949606
> },
> "backfilling_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "waitremotebackfillreserved_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "waitlocalbackfillreserved_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "notbackfilling_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "repnotrecovering_latency": {
> "avgcount": 1009,
> "sum": 8975301.082274411,
> "avgtime": 8895.243887288
> },
> "repwaitrecoveryreserved_latency": {
> "avgcount": 420,
> "sum": 99.846056520,
> "avgtime": 0.237728706
> },
> "repwaitbackfillreserved_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "reprecovering_latency": {
> "avgcount": 420,
> "sum": 241.682764382,
> "avgtime": 0.575435153
> },
> "activating_latency": {
> "avgcount": 507,
> "sum": 16.893347339,
> "avgtime": 0.033320211
> },
> "waitlocalrecoveryreserved_latency": {
> "avgcount": 199,
> "sum": 672.335512769,
> "avgtime": 3.378570415
> },
> "waitremoterecoveryreserved_latency": {
> "avgcount": 199,
> "sum": 213.536439363,
> "avgtime": 1.073047433
> },
> "recovering_latency": {
> "avgcount": 199,
> "sum": 79.007696479,
> "avgtime": 0.397023600
> },
> "recovered_latency": {
> "avgcount": 507,
> "sum": 14.000732748,
> "avgtime": 0.027614857
> },
> "clean_latency": {
> "avgcount": 395,
> "sum": 4574325.900371083,
> "avgtime": 11580.571899673
> },
> "active_latency": {
> "avgcount": 425,
> "sum": 4575107.630123680,
> "avgtime": 10764.959129702
> },
> "replicaactive_latency": {
> "avgcount": 589,
> "sum": 8975184.499049954,
> "avgtime": 15238.004242869
> },
> "stray_latency": {
> "avgcount": 818,
> "sum": 800.729455666,
> "avgtime": 0.978886865
> },
> "getinfo_latency": {
> "avgcount": 550,
> "sum": 15.085667048,
> "avgtime": 0.027428485
> },
> "getlog_latency": {
> "avgcount": 546,
> "sum": 3.482175693,
> "avgtime": 0.006377611
> },
> "waitactingchange_latency": {
> "avgcount": 39,
> "sum": 35.444551284,
> "avgtime": 0.908834648
> },
> "incomplete_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "down_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "getmissing_latency": {
> "avgcount": 507,
> "sum": 6.702129624,
> "avgtime": 0.013219190
> },
> "waitupthru_latency": {
> "avgcount": 507,
> "sum": 474.098261727,
> "avgtime": 0.935105052
> },
> "notrecovering_latency": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> },
> "rocksdb": {
> "get": 28320977,
> "submit_transaction": 30484924,
> "submit_transaction_sync": 26371957,
> "get_latency": {
> "avgcount": 28320977,
> "sum": 325.900908733,
> "avgtime": 0.000011507
> },
> "submit_latency": {
> "avgcount": 30484924,
> "sum": 1835.888692371,
> "avgtime": 0.000060222
> },
> "submit_sync_latency": {
> "avgcount": 26371957,
> "sum": 1431.555230628,
> "avgtime": 0.000054283
> },
> "compact": 0,
> "compact_range": 0,
> "compact_queue_merge": 0,
> "compact_queue_len": 0,
> "rocksdb_write_wal_time": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "rocksdb_write_memtable_time": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "rocksdb_write_delay_time": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "rocksdb_write_pre_and_post_time": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> }
> }
> }
>
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, 5 February 2019 18:56:51
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>>> but I don't see l_bluestore_fragmentation counter.
>>>> (but I have bluestore_fragmentation_micros)
>> ok, this is the same
>>
>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>
>>
>> Here is a graph of the last month, showing bluestore_fragmentation_micros and latency:
>>
>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
> it? Is it the same for other OSDs?
>
> This indicates an issue with the allocator: fragmentation may grow in
> general, but it shouldn't reset on restart. It looks like some intervals
> aren't properly merged at run time.
>
> On the other hand, I'm not completely sure that the latency degradation is
> caused by that. The fragmentation growth is relatively small, and I don't see
> how it could impact performance that much.
>
> Do you have OSD mempool monitoring reports (dump_mempools command
> output on the admin socket)? Do you have any historic data?
>
> If not, may I have the current output and, say, a couple more samples at
> 8-12 hour intervals?
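A minimal sketch of how two such samples could be compared, assuming each sample has already been reduced to a mapping of pool name to its `items` and `bytes` counters (the shape of the per-pool entries in the dump). The function name `mempool_growth` is hypothetical and not part of Ceph.

```python
def mempool_growth(before, after):
    """Return (pool, bytes_delta, items_delta) tuples, largest byte growth first.

    Both arguments map a pool name to {"items": int, "bytes": int}.
    A pool missing from one sample is treated as empty in that sample.
    """
    empty = {"items": 0, "bytes": 0}
    rows = []
    for pool in sorted(set(before) | set(after)):
        old = before.get(pool, empty)
        new = after.get(pool, empty)
        rows.append((pool, new["bytes"] - old["bytes"], new["items"] - old["items"]))
    # Pools that grew the most in bytes come first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Calling it on a sample taken right after startup and another taken 8-12 hours later would put the pools that grew the most at the top, which is one way to spot a pool that keeps growing between restarts.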
>
>
> With regard to backporting the bitmap allocator to mimic: we haven't had such
> plans so far, but I'll discuss this at the BlueStore meeting shortly.
>
>
> Thanks,
>
> Igor
>
>> ----- Original Message -----
>> From: "Alexandre Derumier" <aderumier@odiso.com>
>> To: "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 4 February 2019 16:04:38
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Thanks Igor,
>>
>>>> Could you please collect BlueStore performance counters right after OSD
>>>> startup and once you get high latency.
>>>>
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>> I'm already monitoring with
>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>
>> but I don't see l_bluestore_fragmentation counter.
>>
>> (but I have bluestore_fragmentation_micros)
>>
>>
>>>> Also if you're able to rebuild the code I can probably make a simple
>>>> patch to track latency and some other internal allocator parameters to
>>>> make sure it's degraded and learn more details.
>> Sorry, it's a critical production cluster, so I can't test on it :(
>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce the problem.
>>
>>
>>
>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>> and try the difference...
>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
>> The perf results of the new bitmap allocator seem very promising, from what I've seen in the PR.
>>
>>
>>
>> ----- Original Message -----
>> From: "Igor Fedotov" <ifedotov@suse.de>
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 4 February 2019 15:51:30
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi Alexandre,
>>
>> looks like a bug in StupidAllocator.
>>
>> Could you please collect BlueStore performance counters right after OSD
>> startup and once you get high latency.
>>
>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>
>> Also, if you're able to rebuild the code, I can probably make a simple
>> patch to track latency and some other internal allocator parameters to
>> make sure it's degraded and learn more details.
>>
>>
>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>> and try the difference...
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>>> Hi again,
>>>
>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>
>>>
>>> I have noticed something using a simple "perf top".
>>>
>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>
>>> when latency is bad, perf top shows:
>>>
>>> StupidAllocator::_aligned_len
>>> and
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>
>>> (around 10-20% of the time for both)
>>>
>>>
>>> When latency is good, I don't see them at all.
>>>
>>>
>>> I have used Mark's wallclock profiler; here are the results:
>>>
>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>
>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>
>>>
>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>
>>>
>>> + 100.00% clone
>>> + 100.00% start_thread
>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 4 February 2019 09:38:11
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi,
>>>
>>> Some news:
>>>
>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>
>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>
>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.)
>>>
>>>
>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB);
>>> my other clusters use 1.6TB SSDs.
>>>
>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps.
>>>
>>>
>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>
>>>
>>> Regards,
>>>
>>> Alexandre
>>>
>>>
>>> ----- Original Message -----
>>> From: "aderumier" <aderumier@odiso.com>
>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Wednesday, 30 January 2019 19:58:15
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>>> op_r_latency but instead op_latency?
>>>>>
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.)
>>>
>>> I just don't see a latency difference on reads (or it is very, very small compared to the write latency increase).
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Wednesday, 30 January 2019 19:50:20
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi,
>>>
>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>> Hi Stefan,
>>>>
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>> like suggested. This report makes me a little nervous about my change.
>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>> I need to compare with bigger latencies.
>>>>
>>>> Here is an example where all OSDs are at 20-50ms before the restart, then at 1ms after the restart (at 21:15):
>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>
>>>> I observe the latency in my guest VMs too, in disk iowait.
>>>>
>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>> exact values out of the daemon do you use for bluestore?
>>>> Here are my influxdb queries:
>>>>
>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>
>>>>
>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>
>>>>
>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>
>>>>
>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
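The queries above divide the rate of change of a counter's `sum` by the rate of change of its `avgcount`. The `avgtime` field in a perf dump is the average since OSD start, so it smooths out a slow drift; the difference between two samples gives the average over just that interval. A minimal sketch of the same idea, assuming two samples of one latency counter taken from successive perf dumps (the function name `interval_avg_latency` is hypothetical):

```python
def interval_avg_latency(prev, curr):
    """Average seconds per op between two samples of one latency counter.

    prev and curr are {"avgcount": int, "sum": float} entries, for example
    the "op_w_latency" object from two successive perf dumps.
    Returns None when no op completed in the interval or when the counters
    went backwards (the OSD restarted between the samples).
    """
    delta_count = curr["avgcount"] - prev["avgcount"]
    delta_sum = curr["sum"] - prev["sum"]
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count
```

Returning None for a counter reset plays a role similar to `non_negative_derivative` in the queries, which drops negative rates instead of plotting them.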
>>> Thanks. Is there any reason you monitor op_w_latency but not
>>> op_r_latency but instead op_latency?
>>>
>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>
>>> greets,
>>> Stefan
>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>> Hi,
>>>>>
>>>>> Here are some new results,
>>>>> from a different OSD on a different cluster.
>>>>>
>>>>> Before the OSD restart, latency was between 2-5ms;
>>>>> after the OSD restart, it is around 1-1.5ms.
>>>>>
>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>
>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>> (I'm using tcmalloc 2.5-2.2)
>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>> like suggested. This report makes me a little nervous about my change.
>>>>
>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>> exact values out of the daemon do you use for bluestore?
>>>>
>>>> I would like to check if i see the same behaviour.
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>>> ----- Original Message -----
>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>> going on one of the OSDs with a high latency?
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange behaviour of my osd, on multiple clusters,
>>>>>>
>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers,
>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup
>>>>>>
>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms.
>>>>>>
>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy
>>>>>> values like 20-200ms.
>>>>>>
>>>>>> Some example graphs:
>>>>>>
>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>
>>>>>> All OSDs show this behaviour, in all clusters.
>>>>>>
>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.)
>>>>>>
>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms.
>>>>>>
>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>
>>>>>> Any hints for counters/logs to check?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Alexandre
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <c26e0eca-1a1c-3354-bff6-4560e3aea4c5-l3A5Bk7waGM@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                 ` <c26e0eca-1a1c-3354-bff6-4560e3aea4c5-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-13  8:42                                                                   ` Alexandre DERUMIER
       [not found]                                                                     ` <1554220830.1076801.1550047328269.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-13  8:42 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Hi Igor, 

Thanks again for helping !



I upgraded to the latest mimic this weekend, and with the new memory autotuning,
I have set osd_memory_target to 8G (my NVMe drives are 6TB).


I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process to see the RSS memory at different hours;
here are the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


The OSD was started on 12-02-2019 at 08:00.

First report, after 1h of running:
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt



Report after 24h, before the counter reset:

http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt

Report 1h after the counter reset:
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt




I'm seeing the bluestore buffer bytes memory increase up to 4G around 14:00 on 12-02-2019:
http://odisoweb1.odiso.net/perfanalysis/graphs2.png
After that, it slowly decreases.


Another strange thing:
I'm seeing total bytes at 5G at 13:30 on 12-02-2019:
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
It then decreases over time (around 3.7G this morning), but RSS is still at 8G.


I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
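
The gap between RSS and the mempool total can be put into a number. This is only a rough sketch, assuming RSS is read with `ps -o rss=` (reported in KiB on Linux) and the dump is the parsed JSON of dump_mempools. The helper name is hypothetical and the figures are rounded from this message, not exact:

```python
def unaccounted_bytes(rss_kib, mempool_dump):
    """RSS (KiB, as printed by `ps -o rss=`) minus the mempool total, in bytes."""
    tracked = mempool_dump["mempool"]["total"]["bytes"]
    return rss_kib * 1024 - tracked

# Rounded, illustrative numbers: RSS of 8 GiB, mempool total of about 3.7 GB.
sample_dump = {"mempool": {"total": {"items": 0, "bytes": 3_700_000_000}}}
gap = unaccounted_bytes(8 * 1024 * 1024, sample_dump)
print(f"untracked by mempools: {gap / 2**30:.2f} GiB")
```

A large gap only says that the memory is held by something the mempools do not track; it does not say what that is.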

----- Original message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "Alexandre Derumier" <aderumier@odiso.com>
Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Monday 11 February 2019 12:03:17
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
> another mempool dump after 1h run. (latency ok) 
> 
> Biggest difference: 
> 
> before restart 
> ------------- 
> "bluestore_cache_other": { 
> "items": 48661920, 
> "bytes": 1539544228 
> }, 
> "bluestore_cache_data": { 
> "items": 54, 
> "bytes": 643072 
> }, 
> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory)
> 
> 
> After restart 
> ------------- 
> "bluestore_cache_other": { 
> "items": 12432298, 
> "bytes": 500834899 
> }, 
> "bluestore_cache_data": { 
> "items": 40084, 
> "bytes": 1056235520 
> }, 
> 
This is fine as cache is warming after restart and some rebalancing 
between data and metadata might occur. 

What does relate to the allocator, and most probably to fragmentation growth, is:

"bluestore_alloc": { 
"items": 165053952, 
"bytes": 165053952 
}, 

which was higher before the restart (if I got the order of these dumps
right)

"bluestore_alloc": { 
"items": 210243456, 
"bytes": 210243456 
}, 

But as I mentioned, I'm not 100% sure this could cause such a huge
latency increase...
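
For comparing two dump_mempools outputs such as the ones quoted in this thread, a per-pool difference makes the shifts easier to see. A minimal sketch, assuming both dumps have been parsed from JSON; the helper names are illustrative, and only three pools from the quoted dumps are included:

```python
def mempool_bytes(dump):
    """Return {pool_name: bytes} from a parsed dump_mempools document."""
    return {name: pool["bytes"] for name, pool in dump["mempool"]["by_pool"].items()}

def mempool_delta(before, after):
    """Per-pool byte change (after - before), largest absolute change first."""
    b, a = mempool_bytes(before), mempool_bytes(after)
    deltas = {name: a.get(name, 0) - b.get(name, 0) for name in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Values copied from the two dumps quoted in this thread (subset of pools).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for name, delta in mempool_delta(before_restart, after_restart):
    print(f"{name:24s} {delta / 2**20:+10.1f} MiB")
```

On these numbers the cache pools move by about 1 GiB each in opposite directions, while bluestore_alloc changes by roughly 43 MiB.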

Do you have a perf counters dump from after the restart?

Could you collect some more dumps - for both mempool and perf counters? 

So ideally I'd like to have: 

1) mempool/perf counters dumps after the restart (1hour is OK) 

2) mempool/perf counters dumps 24+ hours after restart

3) reset perf counters after 2), wait for 1 hour (and without OSD 
restart) and dump mempool/perf counters again. 

So we'll be able to learn both allocator mem usage growth and operation 
latency distribution for the following periods: 

a) 1st hour after restart 

b) 25th hour. 
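
The latency counters in the perf dumps are cumulative: each entry carries avgcount and sum, and avgtime matches sum / avgcount in the dumps quoted in this thread. An alternative to resetting the counters is therefore to subtract two samples. A rough sketch, assuming two perf dump outputs parsed from JSON; interval_avg is a hypothetical helper and the second pair of samples is made up for illustration:

```python
def interval_avg(earlier, later):
    """Average latency in seconds for the ops completed between two samples
    of the same cumulative counter (dicts with "avgcount" and "sum")."""
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        # No new ops, or the counters were reset or the OSD restarted in between.
        return None
    return (later["sum"] - earlier["sum"]) / ops

# commit_lat as quoted in this thread: the lifetime average reproduces "avgtime".
zero = {"avgcount": 0, "sum": 0.0}
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}
lifetime = interval_avg(zero, commit_lat)
print(f"lifetime commit_lat: {lifetime * 1000:.3f} ms")

# Made-up samples taken one hour apart, for illustration only.
sample_t0 = {"avgcount": 1000, "sum": 0.5}
sample_t1 = {"avgcount": 3000, "sum": 4.5}
hourly = interval_avg(sample_t0, sample_t1)
print(f"interval commit_lat: {hourly * 1000:.3f} ms")
```

Returning None when the op count does not grow avoids reporting a bogus average across a counter reset or an OSD restart.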


Thanks, 

Igor 


> full mempool dump after restart 
> ------------------------------- 
> 
> { 
> "mempool": { 
> "by_pool": { 
> "bloom_filter": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes": 165053952 
> }, 
> "bluestore_cache_data": { 
> "items": 40084, 
> "bytes": 1056235520 
> }, 
> "bluestore_cache_onode": { 
> "items": 22225, 
> "bytes": 14935200 
> }, 
> "bluestore_cache_other": { 
> "items": 12432298, 
> "bytes": 500834899 
> }, 
> "bluestore_fsck": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_txc": { 
> "items": 11, 
> "bytes": 8184 
> }, 
> "bluestore_writing_deferred": { 
> "items": 5047, 
> "bytes": 22673736 
> }, 
> "bluestore_writing": { 
> "items": 91, 
> "bytes": 1662976 
> }, 
> "bluefs": { 
> "items": 1907, 
> "bytes": 95600 
> }, 
> "buffer_anon": { 
> "items": 19664, 
> "bytes": 25486050 
> }, 
> "buffer_meta": { 
> "items": 46189, 
> "bytes": 2956096 
> }, 
> "osd": { 
> "items": 243, 
> "bytes": 3089016 
> }, 
> "osd_mapbl": { 
> "items": 17, 
> "bytes": 214366 
> }, 
> "osd_pglog": { 
> "items": 889673, 
> "bytes": 367160400 
> }, 
> "osdmap": { 
> "items": 3803, 
> "bytes": 224552 
> }, 
> "osdmap_mapping": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "pgmap": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "mds_co": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_1": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_2": { 
> "items": 0, 
> "bytes": 0 
> } 
> }, 
> "total": { 
> "items": 178515204, 
> "bytes": 2160630547 
> } 
> } 
> } 
> 
> ----- Original message -----
> From: "aderumier" <aderumier@odiso.com>
> To: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday 8 February 2019 16:14:54
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> I'm just seeing 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
> 
> on 1 OSD, both at 10%.
>
> Here is the dump_mempools output:
> 
> { 
> "mempool": { 
> "by_pool": { 
> "bloom_filter": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_alloc": { 
> "items": 210243456, 
> "bytes": 210243456 
> }, 
> "bluestore_cache_data": { 
> "items": 54, 
> "bytes": 643072 
> }, 
> "bluestore_cache_onode": { 
> "items": 105637, 
> "bytes": 70988064 
> }, 
> "bluestore_cache_other": { 
> "items": 48661920, 
> "bytes": 1539544228 
> }, 
> "bluestore_fsck": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_txc": { 
> "items": 12, 
> "bytes": 8928 
> }, 
> "bluestore_writing_deferred": { 
> "items": 406, 
> "bytes": 4792868 
> }, 
> "bluestore_writing": { 
> "items": 66, 
> "bytes": 1085440 
> }, 
> "bluefs": { 
> "items": 1882, 
> "bytes": 93600 
> }, 
> "buffer_anon": { 
> "items": 138986, 
> "bytes": 24983701 
> }, 
> "buffer_meta": { 
> "items": 544, 
> "bytes": 34816 
> }, 
> "osd": { 
> "items": 243, 
> "bytes": 3089016 
> }, 
> "osd_mapbl": { 
> "items": 36, 
> "bytes": 179308 
> }, 
> "osd_pglog": { 
> "items": 952564, 
> "bytes": 372459684 
> }, 
> "osdmap": { 
> "items": 3639, 
> "bytes": 224664 
> }, 
> "osdmap_mapping": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "pgmap": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "mds_co": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_1": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_2": { 
> "items": 0, 
> "bytes": 0 
> } 
> }, 
> "total": { 
> "items": 260109445, 
> "bytes": 2228370845 
> } 
> } 
> } 
> 
> 
> and the perf dump 
> 
> root@ceph5-2:~# ceph daemon osd.4 perf dump 
> { 
> "AsyncMessenger::Worker-0": { 
> "msgr_recv_messages": 22948570, 
> "msgr_send_messages": 22561570, 
> "msgr_recv_bytes": 333085080271, 
> "msgr_send_bytes": 261798871204, 
> "msgr_created_connections": 6152, 
> "msgr_active_connections": 2701, 
> "msgr_running_total_time": 1055.197867330, 
> "msgr_running_send_time": 352.764480121, 
> "msgr_running_recv_time": 499.206831955, 
> "msgr_running_fast_dispatch_time": 130.982201607 
> }, 
> "AsyncMessenger::Worker-1": { 
> "msgr_recv_messages": 18801593, 
> "msgr_send_messages": 18430264, 
> "msgr_recv_bytes": 306871760934, 
> "msgr_send_bytes": 192789048666, 
> "msgr_created_connections": 5773, 
> "msgr_active_connections": 2721, 
> "msgr_running_total_time": 816.821076305, 
> "msgr_running_send_time": 261.353228926, 
> "msgr_running_recv_time": 394.035587911, 
> "msgr_running_fast_dispatch_time": 104.012155720 
> }, 
> "AsyncMessenger::Worker-2": { 
> "msgr_recv_messages": 18463400, 
> "msgr_send_messages": 18105856, 
> "msgr_recv_bytes": 187425453590, 
> "msgr_send_bytes": 220735102555, 
> "msgr_created_connections": 5897, 
> "msgr_active_connections": 2605, 
> "msgr_running_total_time": 807.186854324, 
> "msgr_running_send_time": 296.834435839, 
> "msgr_running_recv_time": 351.364389691, 
> "msgr_running_fast_dispatch_time": 101.215776792 
> }, 
> "bluefs": { 
> "gift_bytes": 0, 
> "reclaim_bytes": 0, 
> "db_total_bytes": 256050724864, 
> "db_used_bytes": 12413042688, 
> "wal_total_bytes": 0, 
> "wal_used_bytes": 0, 
> "slow_total_bytes": 0, 
> "slow_used_bytes": 0, 
> "num_files": 209, 
> "log_bytes": 10383360, 
> "log_compactions": 14, 
> "logged_bytes": 336498688, 
> "files_written_wal": 2, 
> "files_written_sst": 4499, 
> "bytes_written_wal": 417989099783, 
> "bytes_written_sst": 213188750209 
> }, 
> "bluestore": { 
> "kv_flush_lat": { 
> "avgcount": 26371957, 
> "sum": 26.734038497, 
> "avgtime": 0.000001013 
> }, 
> "kv_commit_lat": { 
> "avgcount": 26371957, 
> "sum": 3397.491150603, 
> "avgtime": 0.000128829 
> }, 
> "kv_lat": { 
> "avgcount": 26371957, 
> "sum": 3424.225189100, 
> "avgtime": 0.000129843 
> }, 
> "state_prepare_lat": { 
> "avgcount": 30484924, 
> "sum": 3689.542105337, 
> "avgtime": 0.000121028 
> }, 
> "state_aio_wait_lat": { 
> "avgcount": 30484924, 
> "sum": 509.864546111, 
> "avgtime": 0.000016725 
> }, 
> "state_io_done_lat": { 
> "avgcount": 30484924, 
> "sum": 24.534052953, 
> "avgtime": 0.000000804 
> }, 
> "state_kv_queued_lat": { 
> "avgcount": 30484924, 
> "sum": 3488.338424238, 
> "avgtime": 0.000114428 
> }, 
> "state_kv_commiting_lat": { 
> "avgcount": 30484924, 
> "sum": 5660.437003432, 
> "avgtime": 0.000185679 
> }, 
> "state_kv_done_lat": { 
> "avgcount": 30484924, 
> "sum": 7.763511500, 
> "avgtime": 0.000000254 
> }, 
> "state_deferred_queued_lat": { 
> "avgcount": 26346134, 
> "sum": 666071.296856696, 
> "avgtime": 0.025281557 
> }, 
> "state_deferred_aio_wait_lat": { 
> "avgcount": 26346134, 
> "sum": 1755.660547071, 
> "avgtime": 0.000066638 
> }, 
> "state_deferred_cleanup_lat": { 
> "avgcount": 26346134, 
> "sum": 185465.151653703, 
> "avgtime": 0.007039558 
> }, 
> "state_finishing_lat": { 
> "avgcount": 30484920, 
> "sum": 3.046847481, 
> "avgtime": 0.000000099 
> }, 
> "state_done_lat": { 
> "avgcount": 30484920, 
> "sum": 13193.362685280, 
> "avgtime": 0.000432783 
> }, 
> "throttle_lat": { 
> "avgcount": 30484924, 
> "sum": 14.634269979, 
> "avgtime": 0.000000480 
> }, 
> "submit_lat": { 
> "avgcount": 30484924, 
> "sum": 3873.883076148, 
> "avgtime": 0.000127075 
> }, 
> "commit_lat": { 
> "avgcount": 30484924, 
> "sum": 13376.492317331, 
> "avgtime": 0.000438790 
> }, 
> "read_lat": { 
> "avgcount": 5873923, 
> "sum": 1817.167582057, 
> "avgtime": 0.000309361 
> }, 
> "read_onode_meta_lat": { 
> "avgcount": 19608201, 
> "sum": 146.770464482, 
> "avgtime": 0.000007485 
> }, 
> "read_wait_aio_lat": { 
> "avgcount": 13734278, 
> "sum": 2532.578077242, 
> "avgtime": 0.000184398 
> }, 
> "compress_lat": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "decompress_lat": { 
> "avgcount": 1346945, 
> "sum": 26.227575896, 
> "avgtime": 0.000019471 
> }, 
> "csum_lat": { 
> "avgcount": 28020392, 
> "sum": 149.587819041, 
> "avgtime": 0.000005338 
> }, 
> "compress_success_count": 0, 
> "compress_rejected_count": 0, 
> "write_pad_bytes": 352923605, 
> "deferred_write_ops": 24373340, 
> "deferred_write_bytes": 216791842816, 
> "write_penalty_read_ops": 8062366, 
> "bluestore_allocated": 3765566013440, 
> "bluestore_stored": 4186255221852, 
> "bluestore_compressed": 39981379040, 
> "bluestore_compressed_allocated": 73748348928, 
> "bluestore_compressed_original": 165041381376, 
> "bluestore_onodes": 104232, 
> "bluestore_onode_hits": 71206874, 
> "bluestore_onode_misses": 1217914, 
> "bluestore_onode_shard_hits": 260183292, 
> "bluestore_onode_shard_misses": 22851573, 
> "bluestore_extents": 3394513, 
> "bluestore_blobs": 2773587, 
> "bluestore_buffers": 0, 
> "bluestore_buffer_bytes": 0, 
> "bluestore_buffer_hit_bytes": 62026011221, 
> "bluestore_buffer_miss_bytes": 995233669922, 
> "bluestore_write_big": 5648815, 
> "bluestore_write_big_bytes": 552502214656, 
> "bluestore_write_big_blobs": 12440992, 
> "bluestore_write_small": 35883770, 
> "bluestore_write_small_bytes": 223436965719, 
> "bluestore_write_small_unused": 408125, 
> "bluestore_write_small_deferred": 34961455, 
> "bluestore_write_small_pre_read": 34961455, 
> "bluestore_write_small_new": 514190, 
> "bluestore_txc": 30484924, 
> "bluestore_onode_reshard": 5144189, 
> "bluestore_blob_split": 60104, 
> "bluestore_extent_compress": 53347252, 
> "bluestore_gc_merged": 21142528, 
> "bluestore_read_eio": 0, 
> "bluestore_fragmentation_micros": 67 
> }, 
> "finisher-defered_finisher": { 
> "queue_len": 0, 
> "complete_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "finisher-finisher-0": { 
> "queue_len": 0, 
> "complete_latency": { 
> "avgcount": 26625163, 
> "sum": 1057.506990951, 
> "avgtime": 0.000039718 
> } 
> }, 
> "finisher-objecter-finisher-0": { 
> "queue_len": 0, 
> "complete_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.0::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.0::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.1::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.1::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.2::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.2::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.3::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.3::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.4::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.4::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.5::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.5::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.6::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.6::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.7::sdata_wait_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "mutex-OSDShard.7::shard_lock": { 
> "wait": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "objecter": { 
> "op_active": 0, 
> "op_laggy": 0, 
> "op_send": 0, 
> "op_send_bytes": 0, 
> "op_resend": 0, 
> "op_reply": 0, 
> "op": 0, 
> "op_r": 0, 
> "op_w": 0, 
> "op_rmw": 0, 
> "op_pg": 0, 
> "osdop_stat": 0, 
> "osdop_create": 0, 
> "osdop_read": 0, 
> "osdop_write": 0, 
> "osdop_writefull": 0, 
> "osdop_writesame": 0, 
> "osdop_append": 0, 
> "osdop_zero": 0, 
> "osdop_truncate": 0, 
> "osdop_delete": 0, 
> "osdop_mapext": 0, 
> "osdop_sparse_read": 0, 
> "osdop_clonerange": 0, 
> "osdop_getxattr": 0, 
> "osdop_setxattr": 0, 
> "osdop_cmpxattr": 0, 
> "osdop_rmxattr": 0, 
> "osdop_resetxattrs": 0, 
> "osdop_tmap_up": 0, 
> "osdop_tmap_put": 0, 
> "osdop_tmap_get": 0, 
> "osdop_call": 0, 
> "osdop_watch": 0, 
> "osdop_notify": 0, 
> "osdop_src_cmpxattr": 0, 
> "osdop_pgls": 0, 
> "osdop_pgls_filter": 0, 
> "osdop_other": 0, 
> "linger_active": 0, 
> "linger_send": 0, 
> "linger_resend": 0, 
> "linger_ping": 0, 
> "poolop_active": 0, 
> "poolop_send": 0, 
> "poolop_resend": 0, 
> "poolstat_active": 0, 
> "poolstat_send": 0, 
> "poolstat_resend": 0, 
> "statfs_active": 0, 
> "statfs_send": 0, 
> "statfs_resend": 0, 
> "command_active": 0, 
> "command_send": 0, 
> "command_resend": 0, 
> "map_epoch": 105913, 
> "map_full": 0, 
> "map_inc": 828, 
> "osd_sessions": 0, 
> "osd_session_open": 0, 
> "osd_session_close": 0, 
> "osd_laggy": 0, 
> "omap_wr": 0, 
> "omap_rd": 0, 
> "omap_del": 0 
> }, 
> "osd": { 
> "op_wip": 0, 
> "op": 16758102, 
> "op_in_bytes": 238398820586, 
> "op_out_bytes": 165484999463, 
> "op_latency": { 
> "avgcount": 16758102, 
> "sum": 38242.481640842, 
> "avgtime": 0.002282029 
> }, 
> "op_process_latency": { 
> "avgcount": 16758102, 
> "sum": 28644.906310687, 
> "avgtime": 0.001709316 
> }, 
> "op_prepare_latency": { 
> "avgcount": 16761367, 
> "sum": 3489.856599934, 
> "avgtime": 0.000208208 
> }, 
> "op_r": 6188565, 
> "op_r_out_bytes": 165484999463, 
> "op_r_latency": { 
> "avgcount": 6188565, 
> "sum": 4507.365756792, 
> "avgtime": 0.000728337 
> }, 
> "op_r_process_latency": { 
> "avgcount": 6188565, 
> "sum": 942.363063429, 
> "avgtime": 0.000152274 
> }, 
> "op_r_prepare_latency": { 
> "avgcount": 6188644, 
> "sum": 982.866710389, 
> "avgtime": 0.000158817 
> }, 
> "op_w": 10546037, 
> "op_w_in_bytes": 238334329494, 
> "op_w_latency": { 
> "avgcount": 10546037, 
> "sum": 33160.719998316, 
> "avgtime": 0.003144377 
> }, 
> "op_w_process_latency": { 
> "avgcount": 10546037, 
> "sum": 27668.702029030, 
> "avgtime": 0.002623611 
> }, 
> "op_w_prepare_latency": { 
> "avgcount": 10548652, 
> "sum": 2499.688609173, 
> "avgtime": 0.000236967 
> }, 
> "op_rw": 23500, 
> "op_rw_in_bytes": 64491092, 
> "op_rw_out_bytes": 0, 
> "op_rw_latency": { 
> "avgcount": 23500, 
> "sum": 574.395885734, 
> "avgtime": 0.024442378 
> }, 
> "op_rw_process_latency": { 
> "avgcount": 23500, 
> "sum": 33.841218228, 
> "avgtime": 0.001440051 
> }, 
> "op_rw_prepare_latency": { 
> "avgcount": 24071, 
> "sum": 7.301280372, 
> "avgtime": 0.000303322 
> }, 
> "op_before_queue_op_lat": { 
> "avgcount": 57892986, 
> "sum": 1502.117718889, 
> "avgtime": 0.000025946 
> }, 
> "op_before_dequeue_op_lat": { 
> "avgcount": 58091683, 
> "sum": 45194.453254037, 
> "avgtime": 0.000777984 
> }, 
> "subop": 19784758, 
> "subop_in_bytes": 547174969754, 
> "subop_latency": { 
> "avgcount": 19784758, 
> "sum": 13019.714424060, 
> "avgtime": 0.000658067 
> }, 
> "subop_w": 19784758, 
> "subop_w_in_bytes": 547174969754, 
> "subop_w_latency": { 
> "avgcount": 19784758, 
> "sum": 13019.714424060, 
> "avgtime": 0.000658067 
> }, 
> "subop_pull": 0, 
> "subop_pull_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "subop_push": 0, 
> "subop_push_in_bytes": 0, 
> "subop_push_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "pull": 0, 
> "push": 2003, 
> "push_out_bytes": 5560009728, 
> "recovery_ops": 1940, 
> "loadavg": 118, 
> "buffer_bytes": 0, 
> "history_alloc_Mbytes": 0, 
> "history_alloc_num": 0, 
> "cached_crc": 0, 
> "cached_crc_adjusted": 0, 
> "missed_crc": 0, 
> "numpg": 243, 
> "numpg_primary": 82, 
> "numpg_replica": 161, 
> "numpg_stray": 0, 
> "numpg_removing": 0, 
> "heartbeat_to_peers": 10, 
> "map_messages": 7013, 
> "map_message_epochs": 7143, 
> "map_message_epoch_dups": 6315, 
> "messages_delayed_for_map": 0, 
> "osd_map_cache_hit": 203309, 
> "osd_map_cache_miss": 33, 
> "osd_map_cache_miss_low": 0, 
> "osd_map_cache_miss_low_avg": { 
> "avgcount": 0, 
> "sum": 0 
> }, 
> "osd_map_bl_cache_hit": 47012, 
> "osd_map_bl_cache_miss": 1681, 
> "stat_bytes": 6401248198656, 
> "stat_bytes_used": 3777979072512, 
> "stat_bytes_avail": 2623269126144, 
> "copyfrom": 0, 
> "tier_promote": 0, 
> "tier_flush": 0, 
> "tier_flush_fail": 0, 
> "tier_try_flush": 0, 
> "tier_try_flush_fail": 0, 
> "tier_evict": 0, 
> "tier_whiteout": 1631, 
> "tier_dirty": 22360, 
> "tier_clean": 0, 
> "tier_delay": 0, 
> "tier_proxy_read": 0, 
> "tier_proxy_write": 0, 
> "agent_wake": 0, 
> "agent_skip": 0, 
> "agent_flush": 0, 
> "agent_evict": 0, 
> "object_ctx_cache_hit": 16311156, 
> "object_ctx_cache_total": 17426393, 
> "op_cache_hit": 0, 
> "osd_tier_flush_lat": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "osd_tier_promote_lat": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "osd_tier_r_lat": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "osd_pg_info": 30483113, 
> "osd_pg_fastinfo": 29619885, 
> "osd_pg_biginfo": 81703 
> }, 
> "recoverystate_perf": { 
> "initial_latency": { 
> "avgcount": 243, 
> "sum": 6.869296500, 
> "avgtime": 0.028268709 
> }, 
> "started_latency": { 
> "avgcount": 1125, 
> "sum": 13551384.917335850, 
> "avgtime": 12045.675482076 
> }, 
> "reset_latency": { 
> "avgcount": 1368, 
> "sum": 1101.727799040, 
> "avgtime": 0.805356578 
> }, 
> "start_latency": { 
> "avgcount": 1368, 
> "sum": 0.002014799, 
> "avgtime": 0.000001472 
> }, 
> "primary_latency": { 
> "avgcount": 507, 
> "sum": 4575560.638823428, 
> "avgtime": 9024.774435549 
> }, 
> "peering_latency": { 
> "avgcount": 550, 
> "sum": 499.372283616, 
> "avgtime": 0.907949606 
> }, 
> "backfilling_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "waitremotebackfillreserved_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "waitlocalbackfillreserved_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "notbackfilling_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "repnotrecovering_latency": { 
> "avgcount": 1009, 
> "sum": 8975301.082274411, 
> "avgtime": 8895.243887288 
> }, 
> "repwaitrecoveryreserved_latency": { 
> "avgcount": 420, 
> "sum": 99.846056520, 
> "avgtime": 0.237728706 
> }, 
> "repwaitbackfillreserved_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "reprecovering_latency": { 
> "avgcount": 420, 
> "sum": 241.682764382, 
> "avgtime": 0.575435153 
> }, 
> "activating_latency": { 
> "avgcount": 507, 
> "sum": 16.893347339, 
> "avgtime": 0.033320211 
> }, 
> "waitlocalrecoveryreserved_latency": { 
> "avgcount": 199, 
> "sum": 672.335512769, 
> "avgtime": 3.378570415 
> }, 
> "waitremoterecoveryreserved_latency": { 
> "avgcount": 199, 
> "sum": 213.536439363, 
> "avgtime": 1.073047433 
> }, 
> "recovering_latency": { 
> "avgcount": 199, 
> "sum": 79.007696479, 
> "avgtime": 0.397023600 
> }, 
> "recovered_latency": { 
> "avgcount": 507, 
> "sum": 14.000732748, 
> "avgtime": 0.027614857 
> }, 
> "clean_latency": { 
> "avgcount": 395, 
> "sum": 4574325.900371083, 
> "avgtime": 11580.571899673 
> }, 
> "active_latency": { 
> "avgcount": 425, 
> "sum": 4575107.630123680, 
> "avgtime": 10764.959129702 
> }, 
> "replicaactive_latency": { 
> "avgcount": 589, 
> "sum": 8975184.499049954, 
> "avgtime": 15238.004242869 
> }, 
> "stray_latency": { 
> "avgcount": 818, 
> "sum": 800.729455666, 
> "avgtime": 0.978886865 
> }, 
> "getinfo_latency": { 
> "avgcount": 550, 
> "sum": 15.085667048, 
> "avgtime": 0.027428485 
> }, 
> "getlog_latency": { 
> "avgcount": 546, 
> "sum": 3.482175693, 
> "avgtime": 0.006377611 
> }, 
> "waitactingchange_latency": { 
> "avgcount": 39, 
> "sum": 35.444551284, 
> "avgtime": 0.908834648 
> }, 
> "incomplete_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "down_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "getmissing_latency": { 
> "avgcount": 507, 
> "sum": 6.702129624, 
> "avgtime": 0.013219190 
> }, 
> "waitupthru_latency": { 
> "avgcount": 507, 
> "sum": 474.098261727, 
> "avgtime": 0.935105052 
> }, 
> "notrecovering_latency": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> }, 
> "rocksdb": { 
> "get": 28320977, 
> "submit_transaction": 30484924, 
> "submit_transaction_sync": 26371957, 
> "get_latency": { 
> "avgcount": 28320977, 
> "sum": 325.900908733, 
> "avgtime": 0.000011507 
> }, 
> "submit_latency": { 
> "avgcount": 30484924, 
> "sum": 1835.888692371, 
> "avgtime": 0.000060222 
> }, 
> "submit_sync_latency": { 
> "avgcount": 26371957, 
> "sum": 1431.555230628, 
> "avgtime": 0.000054283 
> }, 
> "compact": 0, 
> "compact_range": 0, 
> "compact_queue_merge": 0, 
> "compact_queue_len": 0, 
> "rocksdb_write_wal_time": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "rocksdb_write_memtable_time": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "rocksdb_write_delay_time": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> }, 
> "rocksdb_write_pre_and_post_time": { 
> "avgcount": 0, 
> "sum": 0.000000000, 
> "avgtime": 0.000000000 
> } 
> } 
> } 
> 
> ----- Original message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Tuesday 5 February 2019 18:56:51
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>> (but I have bluestore_fragmentation_micros) 
>> ok, this is the same 
>> 
>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>> 
>> 
>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>> 
>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
> it? Is it the same for other OSDs?
>
> This proves there is some issue with the allocator - generally fragmentation
> might grow but it shouldn't reset on restart. Looks like some intervals
> aren't properly merged at run time.
>
> On the other hand, I'm not completely sure that the latency degradation is
> caused by that - the fragmentation growth is relatively small - I don't see
> how this could impact performance that much.
>
> I wonder if you have OSD mempool monitoring (dump_mempools command
> output on the admin socket) reports? Do you have any historic data?
>
> If not, may I have the current output and, say, a couple more samples at
> 8-12 hour intervals?
>
>
> With regard to backporting the bitmap allocator to mimic - we haven't had such plans
> before, but I'll discuss this at the BlueStore meeting shortly.
> 
> 
> Thanks, 
> 
> Igor 
> 
>> ----- Original message -----
>> From: "Alexandre Derumier" <aderumier@odiso.com>
>> To: "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday 4 February 2019 16:04:38
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> Thanks Igor, 
>> 
>>>> Could you please collect BlueStore performance counters right after OSD 
>>>> startup and once you get high latency. 
>>>> 
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>> I'm already monitoring with
>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>
>> but I don't see the l_bluestore_fragmentation counter.
>>
>> (But I do have bluestore_fragmentation_micros.)
>> 
>> 
>>>> Also if you're able to rebuild the code I can probably make a simple
>>>> patch to track latency and some other internal allocator's parameter to
>>>> make sure it's degraded and learn more details.
>> Sorry, it's a critical production cluster, so I can't test on it :(
>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
>> 
>> 
>> 
>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>> and try the difference...
>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>> 
>> 
>> 
>> ----- Original message -----
>> From: "Igor Fedotov" <ifedotov@suse.de>
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday 4 February 2019 15:51:30
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> Hi Alexandre, 
>> 
>> This looks like a bug in StupidAllocator.
>>
>> Could you please collect BlueStore performance counters right after OSD
>> startup and once you get high latency?
>>
>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
>>
>> Also if you're able to rebuild the code I can probably make a simple
>> patch to track latency and some other internal allocator's parameter to
>> make sure it's degraded and learn more details.
>>
>>
>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>> and try the difference...
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> 
>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>> Hi again, 
>>> 
>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>
>>>
>>> I have noticed something using a simple "perf top".
>>>
>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>
>>> when latency is bad, perf top gives me:
>>> 
>>> StupidAllocator::_aligned_len 
>>> and 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>> 
>>> (around 10-20% time for both) 
>>> 
>>> 
>>> when latency is good, I don't see them at all. 
>>> 
>>> 
>>> I have used Mark's wallclock profiler; here are the results: 
>>> 
>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>> 
>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>> 
>>> 
>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>> 
>>> 
>>> + 100.00% clone 
>>> + 100.00% start_thread 
>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 4 February 2019 09:38:11 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi, 
>>> 
>>> some news: 
>>> 
>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>> 
>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>> 
>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.) 
>>> 
>>> 
>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB); 
>>> my other clusters use 1.6TB SSDs. 
>>> 
>>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe drive to see if it helps. 
>>> 
>>> 
>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>> 
>>> 
>>> Regards, 
>>> 
>>> Alexandre 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "aderumier" <aderumier@odiso.com> 
>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>> 
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.) 
>>> 
>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase). 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>> To: "aderumier" <aderumier@odiso.com> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi, 
>>> 
>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>> Hi Stefan, 
>>>> 
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>> Maybe it's BlueStore related (I don't have filestore anymore to compare). 
>>>> I need to compare with bigger latencies. 
>>>> 
>>>> Here is an example where all OSDs were at 20-50ms before the restart, then at 1ms after the restart (at 21:15): 
>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>> 
>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>> 
>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>> 
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>> Here are my InfluxDB queries. 
>>>> 
>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>> 
>>>> 
>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>> 
>>>> 
>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>> 
>>>> 
>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>> op_r_latency but instead op_latency? 
>>> 
>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>> 
>>> greets, 
>>> Stefan 
>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Hi, 
>>>> 
>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>> Hi, 
>>>>> 
>>>>> here some new results, 
>>>>> different osd/ different cluster 
>>>>> 
>>>>> before osd restart latency was between 2-5ms 
>>>>> after osd restart is around 1-1.5ms 
>>>>> 
>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>> 
>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>> like suggested. This report makes me a little nervous about my change. 
>>>> 
>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>> exact values out of the daemon do you use for bluestore? 
>>>> 
>>>> I would like to check if i see the same behaviour. 
>>>> 
>>>> Greets, 
>>>> Stefan 
>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>> going on one of the OSDs with a high latency? 
>>>>> 
>>>>> Thanks! 
>>>>> sage 
>>>>> 
>>>>> 
>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>> 
>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>> 
>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>>>>>> 
>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy 
>>>>>> values like 20-200ms. 
>>>>>> 
>>>>>> Some example graphs: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>> 
>>>>>> All osds have this behaviour, in all clusters. 
>>>>>> 
>>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>>>>>> 
>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>>> 
>>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? 
>>>>>> 
>>>>>> Any Hints for counters/logs to check ? 
>>>>>> 
>>>>>> 
>>>>>> Regards, 
>>>>>> 
>>>>>> Alexandre 
>>>>>> 
>>>>>> 
>>>>> _______________________________________________ 
>>>>> ceph-users mailing list 
>>>>> ceph-users@lists.ceph.com 
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1554220830.1076801.1550047328269.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                     ` <1554220830.1076801.1550047328269.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2019-02-15 12:46                                                                       ` Igor Fedotov
  2019-02-15 12:47                                                                       ` Igor Fedotov
  1 sibling, 0 replies; 42+ messages in thread
From: Igor Fedotov @ 2019-02-15 12:46 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 44887 bytes --]

Hi Alexandre,

I've read through your reports; nothing obvious so far.

The only thing I can see is a severalfold increase in average latency for OSD write ops 
(in seconds):

0.002040060 (first hour) vs.
0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, op_r_latency:

0.000408595 (first hour) vs.
0.000709031 (last 24 hours) vs.
0.004979540 (last hour)
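
As a rough check of how large these increases are, the figures above can be turned into ratios. This is only a worked-arithmetic sketch: the dictionary labels and the slowdown() helper are illustrative names, and the numbers are copied from the list above.

```python
# Average latencies in seconds, copied from the figures above:
# (first hour, last 24 hours, last hour).
latencies = {
    "OSD write ops": (0.002040060, 0.002483516, 0.008382087),
    "subop_w_latency": (0.000478934, 0.000537956, 0.003073475),
    "OSD read ops": (0.000408595, 0.000709031, 0.004979540),
}


def slowdown(first_hour, last_hour):
    """Return how many times slower the last hour is than the first hour."""
    return last_hour / first_hour


ratios = {
    name: slowdown(first, last)
    for name, (first, _last_24h, last) in latencies.items()
}

for name, ratio in ratios.items():
    print(f"{name}: last hour is about {ratio:.1f}x the first hour")
```

By this measure the write ops are roughly 4 times slower in the last hour, the sub-op writes roughly 6 times, and the read ops roughly 12 times.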
   
What's interesting is that such latency differences are observed neither at the BlueStore level (any _lat params under the "bluestore" section) nor at the RocksDB level.

Which probably means that the issue is rather somewhere above BlueStore.

I suggest continuing to collect perf dumps to see whether the picture stays the same.
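
For comparing successive perf dumps, one possible approach is to compute the average latency over the interval between two snapshots from the "sum" and "avgcount" fields, in the same spirit as the sum/avgcount derivative queries quoted elsewhere in this thread. The sketch below is an assumption about how that could be scripted, not an existing Ceph tool: interval_avg() is a hypothetical helper, and the two snapshots are made-up values that only mimic the JSON layout of "ceph daemon osd.N perf dump".

```python
def interval_avg(before, after, section, counter):
    """Average latency (seconds) for ops completed between two perf dumps.

    Both arguments are parsed perf dump dictionaries. Returns None when no
    ops completed in the interval or the counters were reset in between.
    """
    b = before[section][counter]
    a = after[section][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops <= 0:
        return None
    return (a["sum"] - b["sum"]) / ops


# Hypothetical snapshots, shaped like the "osd" section of a perf dump.
snap_before = {"osd": {"op_w_latency": {"avgcount": 100, "sum": 0.2}}}
snap_after = {"osd": {"op_w_latency": {"avgcount": 300, "sum": 1.2}}}

avg = interval_avg(snap_before, snap_after, "osd", "op_w_latency")
print(f"op_w_latency over the interval: {avg:.6f} s")
```

With real data, each snapshot would be loaded with json.load() from a saved perf dump file; the same helper works for any counter that exposes "avgcount" and "sum", including those under the "bluestore" section.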

Regarding the memory usage you observed, I see nothing suspicious so far: RSS not decreasing is a known artifact that seems to be harmless.
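
To put numbers on the gap between RSS and what the mempools account for, a small script along the following lines could be used. It is a sketch under assumptions: the helper names are hypothetical, the dump is a trimmed-down example in the dump_mempools JSON layout with pool sizes taken from a dump quoted in this thread, and the 8 GiB RSS value is only a stand-in for whatever ps reports.

```python
def unaccounted_bytes(mempool_dump, rss_bytes):
    """Bytes of RSS that are not tracked by any mempool."""
    tracked = mempool_dump["mempool"]["total"]["bytes"]
    return rss_bytes - tracked


def largest_pools(mempool_dump, n=3):
    """Return the n largest pools as (name, bytes) pairs, biggest first."""
    pools = mempool_dump["mempool"]["by_pool"]
    sizes = [(name, stats["bytes"]) for name, stats in pools.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:n]


# Trimmed example in the dump_mempools layout; only three pools are kept,
# so "total" here is the total of the full dump, not of these three.
example_dump = {
    "mempool": {
        "by_pool": {
            "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
            "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
            "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
        },
        "total": {"items": 178515204, "bytes": 2160630547},
    }
}

rss = 8 * 1024**3  # assumed RSS of 8 GiB, for illustration only
print("untracked bytes:", unaccounted_bytes(example_dump, rss))
print("largest pools:", largest_pools(example_dump, 2))
```

A large untracked remainder is consistent with memory that has been freed by the OSD but not returned to the operating system, which is the artifact mentioned above; it does not by itself indicate a leak.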

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
> Hi Igor,
>
> Thanks again for helping !
>
>
>
> I upgraded to the latest mimic this weekend and, with the new memory autotuning,
> I have set osd_memory_target to 8G.  (My NVMe drives are 6TB.)
>
>
> I have taken many perf dumps, mempool dumps, and ps snapshots of the process to track RSS memory at different hours;
> here are the reports for osd.0:
>
> http://odisoweb1.odiso.net/perfanalysis/
>
>
> The OSD was started on 12-02-2019 at 08:00.
>
> First report, after 1h of running:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>
>
>
> Report after 24h, before the counter reset:
>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>
> report 1h after counter reset
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>
>
>
>
> I see the bluestore buffer bytes increasing up to 4G around 14:00 on 12-02-2019:
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
> After that, they slowly decrease.
>
>
> Another strange thing:
> I see total bytes at 5G at 12-02-2018.13:30
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
> It then decreases over time (around 3.7G this morning), but RSS is still at 8G.
>
>
> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov-l3A5Bk7waGM@public.gmane.org>
> To: "Alexandre Derumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> Cc: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Sent: Monday, 11 February 2019 12:03:17
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
>> another mempool dump after 1h run. (latency ok)
>>
>> Biggest difference:
>>
>> before restart
>> -------------
>> "bluestore_cache_other": {
>> "items": 48661920,
>> "bytes": 1539544228
>> },
>> "bluestore_cache_data": {
>> "items": 54,
>> "bytes": 643072
>> },
>> (the other caches seem quite low too, as if bluestore_cache_other were taking all the memory)
>>
>>
>> After restart
>> -------------
>> "bluestore_cache_other": {
>> "items": 12432298,
>> "bytes": 500834899
>> },
>> "bluestore_cache_data": {
>> "items": 40084,
>> "bytes": 1056235520
>> },
>>
> This is fine, as the cache is warming up after the restart and some rebalancing
> between data and metadata might occur.
>
> What relates to the allocator, and most probably to fragmentation growth, is:
>
> "bluestore_alloc": {
> "items": 165053952,
> "bytes": 165053952
> },
>
> which had been higher before the reset (if I have the order of these dumps
> right)
>
> "bluestore_alloc": {
> "items": 210243456,
> "bytes": 210243456
> },
>
> But as I mentioned, I'm not 100% sure this could cause such a huge
> latency increase...
>
> Do you have perf counters dump after the restart?
>
> Could you collect some more dumps - for both mempool and perf counters?
>
> So ideally I'd like to have:
>
> 1) mempool/perf counters dumps after the restart (1hour is OK)
>
> 2) mempool/perf counters dumps in 24+ hours after restart
>
> 3) reset perf counters after 2), wait for 1 hour (and without OSD
> restart) and dump mempool/perf counters again.
>
> So we'll be able to learn both allocator mem usage growth and operation
> latency distribution for the following periods:
>
> a) 1st hour after restart
>
> b) 25th hour.
>
>
> Thanks,
>
> Igor
>
>
>> full mempool dump after restart
>> -------------------------------
>>
>> {
>> "mempool": {
>> "by_pool": {
>> "bloom_filter": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_alloc": {
>> "items": 165053952,
>> "bytes": 165053952
>> },
>> "bluestore_cache_data": {
>> "items": 40084,
>> "bytes": 1056235520
>> },
>> "bluestore_cache_onode": {
>> "items": 22225,
>> "bytes": 14935200
>> },
>> "bluestore_cache_other": {
>> "items": 12432298,
>> "bytes": 500834899
>> },
>> "bluestore_fsck": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_txc": {
>> "items": 11,
>> "bytes": 8184
>> },
>> "bluestore_writing_deferred": {
>> "items": 5047,
>> "bytes": 22673736
>> },
>> "bluestore_writing": {
>> "items": 91,
>> "bytes": 1662976
>> },
>> "bluefs": {
>> "items": 1907,
>> "bytes": 95600
>> },
>> "buffer_anon": {
>> "items": 19664,
>> "bytes": 25486050
>> },
>> "buffer_meta": {
>> "items": 46189,
>> "bytes": 2956096
>> },
>> "osd": {
>> "items": 243,
>> "bytes": 3089016
>> },
>> "osd_mapbl": {
>> "items": 17,
>> "bytes": 214366
>> },
>> "osd_pglog": {
>> "items": 889673,
>> "bytes": 367160400
>> },
>> "osdmap": {
>> "items": 3803,
>> "bytes": 224552
>> },
>> "osdmap_mapping": {
>> "items": 0,
>> "bytes": 0
>> },
>> "pgmap": {
>> "items": 0,
>> "bytes": 0
>> },
>> "mds_co": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_1": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_2": {
>> "items": 0,
>> "bytes": 0
>> }
>> },
>> "total": {
>> "items": 178515204,
>> "bytes": 2160630547
>> }
>> }
>> }
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>> To: "Igor Fedotov" <ifedotov-l3A5Bk7waGM@public.gmane.org>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>, "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>> Sent: Friday, 8 February 2019 16:14:54
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> I'm just seeing
>>
>> StupidAllocator::_aligned_len
>> and
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>
>> on 1 OSD, both at 10%.
>>
>> Here is the dump_mempools output:
>>
>> {
>> "mempool": {
>> "by_pool": {
>> "bloom_filter": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_alloc": {
>> "items": 210243456,
>> "bytes": 210243456
>> },
>> "bluestore_cache_data": {
>> "items": 54,
>> "bytes": 643072
>> },
>> "bluestore_cache_onode": {
>> "items": 105637,
>> "bytes": 70988064
>> },
>> "bluestore_cache_other": {
>> "items": 48661920,
>> "bytes": 1539544228
>> },
>> "bluestore_fsck": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_txc": {
>> "items": 12,
>> "bytes": 8928
>> },
>> "bluestore_writing_deferred": {
>> "items": 406,
>> "bytes": 4792868
>> },
>> "bluestore_writing": {
>> "items": 66,
>> "bytes": 1085440
>> },
>> "bluefs": {
>> "items": 1882,
>> "bytes": 93600
>> },
>> "buffer_anon": {
>> "items": 138986,
>> "bytes": 24983701
>> },
>> "buffer_meta": {
>> "items": 544,
>> "bytes": 34816
>> },
>> "osd": {
>> "items": 243,
>> "bytes": 3089016
>> },
>> "osd_mapbl": {
>> "items": 36,
>> "bytes": 179308
>> },
>> "osd_pglog": {
>> "items": 952564,
>> "bytes": 372459684
>> },
>> "osdmap": {
>> "items": 3639,
>> "bytes": 224664
>> },
>> "osdmap_mapping": {
>> "items": 0,
>> "bytes": 0
>> },
>> "pgmap": {
>> "items": 0,
>> "bytes": 0
>> },
>> "mds_co": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_1": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_2": {
>> "items": 0,
>> "bytes": 0
>> }
>> },
>> "total": {
>> "items": 260109445,
>> "bytes": 2228370845
>> }
>> }
>> }
>>
>>
>> and the perf dump
>>
>> root@ceph5-2:~# ceph daemon osd.4 perf dump
>> {
>> "AsyncMessenger::Worker-0": {
>> "msgr_recv_messages": 22948570,
>> "msgr_send_messages": 22561570,
>> "msgr_recv_bytes": 333085080271,
>> "msgr_send_bytes": 261798871204,
>> "msgr_created_connections": 6152,
>> "msgr_active_connections": 2701,
>> "msgr_running_total_time": 1055.197867330,
>> "msgr_running_send_time": 352.764480121,
>> "msgr_running_recv_time": 499.206831955,
>> "msgr_running_fast_dispatch_time": 130.982201607
>> },
>> "AsyncMessenger::Worker-1": {
>> "msgr_recv_messages": 18801593,
>> "msgr_send_messages": 18430264,
>> "msgr_recv_bytes": 306871760934,
>> "msgr_send_bytes": 192789048666,
>> "msgr_created_connections": 5773,
>> "msgr_active_connections": 2721,
>> "msgr_running_total_time": 816.821076305,
>> "msgr_running_send_time": 261.353228926,
>> "msgr_running_recv_time": 394.035587911,
>> "msgr_running_fast_dispatch_time": 104.012155720
>> },
>> "AsyncMessenger::Worker-2": {
>> "msgr_recv_messages": 18463400,
>> "msgr_send_messages": 18105856,
>> "msgr_recv_bytes": 187425453590,
>> "msgr_send_bytes": 220735102555,
>> "msgr_created_connections": 5897,
>> "msgr_active_connections": 2605,
>> "msgr_running_total_time": 807.186854324,
>> "msgr_running_send_time": 296.834435839,
>> "msgr_running_recv_time": 351.364389691,
>> "msgr_running_fast_dispatch_time": 101.215776792
>> },
>> "bluefs": {
>> "gift_bytes": 0,
>> "reclaim_bytes": 0,
>> "db_total_bytes": 256050724864,
>> "db_used_bytes": 12413042688,
>> "wal_total_bytes": 0,
>> "wal_used_bytes": 0,
>> "slow_total_bytes": 0,
>> "slow_used_bytes": 0,
>> "num_files": 209,
>> "log_bytes": 10383360,
>> "log_compactions": 14,
>> "logged_bytes": 336498688,
>> "files_written_wal": 2,
>> "files_written_sst": 4499,
>> "bytes_written_wal": 417989099783,
>> "bytes_written_sst": 213188750209
>> },
>> "bluestore": {
>> "kv_flush_lat": {
>> "avgcount": 26371957,
>> "sum": 26.734038497,
>> "avgtime": 0.000001013
>> },
>> "kv_commit_lat": {
>> "avgcount": 26371957,
>> "sum": 3397.491150603,
>> "avgtime": 0.000128829
>> },
>> "kv_lat": {
>> "avgcount": 26371957,
>> "sum": 3424.225189100,
>> "avgtime": 0.000129843
>> },
>> "state_prepare_lat": {
>> "avgcount": 30484924,
>> "sum": 3689.542105337,
>> "avgtime": 0.000121028
>> },
>> "state_aio_wait_lat": {
>> "avgcount": 30484924,
>> "sum": 509.864546111,
>> "avgtime": 0.000016725
>> },
>> "state_io_done_lat": {
>> "avgcount": 30484924,
>> "sum": 24.534052953,
>> "avgtime": 0.000000804
>> },
>> "state_kv_queued_lat": {
>> "avgcount": 30484924,
>> "sum": 3488.338424238,
>> "avgtime": 0.000114428
>> },
>> "state_kv_commiting_lat": {
>> "avgcount": 30484924,
>> "sum": 5660.437003432,
>> "avgtime": 0.000185679
>> },
>> "state_kv_done_lat": {
>> "avgcount": 30484924,
>> "sum": 7.763511500,
>> "avgtime": 0.000000254
>> },
>> "state_deferred_queued_lat": {
>> "avgcount": 26346134,
>> "sum": 666071.296856696,
>> "avgtime": 0.025281557
>> },
>> "state_deferred_aio_wait_lat": {
>> "avgcount": 26346134,
>> "sum": 1755.660547071,
>> "avgtime": 0.000066638
>> },
>> "state_deferred_cleanup_lat": {
>> "avgcount": 26346134,
>> "sum": 185465.151653703,
>> "avgtime": 0.007039558
>> },
>> "state_finishing_lat": {
>> "avgcount": 30484920,
>> "sum": 3.046847481,
>> "avgtime": 0.000000099
>> },
>> "state_done_lat": {
>> "avgcount": 30484920,
>> "sum": 13193.362685280,
>> "avgtime": 0.000432783
>> },
>> "throttle_lat": {
>> "avgcount": 30484924,
>> "sum": 14.634269979,
>> "avgtime": 0.000000480
>> },
>> "submit_lat": {
>> "avgcount": 30484924,
>> "sum": 3873.883076148,
>> "avgtime": 0.000127075
>> },
>> "commit_lat": {
>> "avgcount": 30484924,
>> "sum": 13376.492317331,
>> "avgtime": 0.000438790
>> },
>> "read_lat": {
>> "avgcount": 5873923,
>> "sum": 1817.167582057,
>> "avgtime": 0.000309361
>> },
>> "read_onode_meta_lat": {
>> "avgcount": 19608201,
>> "sum": 146.770464482,
>> "avgtime": 0.000007485
>> },
>> "read_wait_aio_lat": {
>> "avgcount": 13734278,
>> "sum": 2532.578077242,
>> "avgtime": 0.000184398
>> },
>> "compress_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "decompress_lat": {
>> "avgcount": 1346945,
>> "sum": 26.227575896,
>> "avgtime": 0.000019471
>> },
>> "csum_lat": {
>> "avgcount": 28020392,
>> "sum": 149.587819041,
>> "avgtime": 0.000005338
>> },
>> "compress_success_count": 0,
>> "compress_rejected_count": 0,
>> "write_pad_bytes": 352923605,
>> "deferred_write_ops": 24373340,
>> "deferred_write_bytes": 216791842816,
>> "write_penalty_read_ops": 8062366,
>> "bluestore_allocated": 3765566013440,
>> "bluestore_stored": 4186255221852,
>> "bluestore_compressed": 39981379040,
>> "bluestore_compressed_allocated": 73748348928,
>> "bluestore_compressed_original": 165041381376,
>> "bluestore_onodes": 104232,
>> "bluestore_onode_hits": 71206874,
>> "bluestore_onode_misses": 1217914,
>> "bluestore_onode_shard_hits": 260183292,
>> "bluestore_onode_shard_misses": 22851573,
>> "bluestore_extents": 3394513,
>> "bluestore_blobs": 2773587,
>> "bluestore_buffers": 0,
>> "bluestore_buffer_bytes": 0,
>> "bluestore_buffer_hit_bytes": 62026011221,
>> "bluestore_buffer_miss_bytes": 995233669922,
>> "bluestore_write_big": 5648815,
>> "bluestore_write_big_bytes": 552502214656,
>> "bluestore_write_big_blobs": 12440992,
>> "bluestore_write_small": 35883770,
>> "bluestore_write_small_bytes": 223436965719,
>> "bluestore_write_small_unused": 408125,
>> "bluestore_write_small_deferred": 34961455,
>> "bluestore_write_small_pre_read": 34961455,
>> "bluestore_write_small_new": 514190,
>> "bluestore_txc": 30484924,
>> "bluestore_onode_reshard": 5144189,
>> "bluestore_blob_split": 60104,
>> "bluestore_extent_compress": 53347252,
>> "bluestore_gc_merged": 21142528,
>> "bluestore_read_eio": 0,
>> "bluestore_fragmentation_micros": 67
>> },
>> "finisher-defered_finisher": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "finisher-finisher-0": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 26625163,
>> "sum": 1057.506990951,
>> "avgtime": 0.000039718
>> }
>> },
>> "finisher-objecter-finisher-0": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.0::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.0::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.1::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.1::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.2::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.2::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.3::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.3::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.4::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.4::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.5::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.5::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.6::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.6::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.7::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.7::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "objecter": {
>> "op_active": 0,
>> "op_laggy": 0,
>> "op_send": 0,
>> "op_send_bytes": 0,
>> "op_resend": 0,
>> "op_reply": 0,
>> "op": 0,
>> "op_r": 0,
>> "op_w": 0,
>> "op_rmw": 0,
>> "op_pg": 0,
>> "osdop_stat": 0,
>> "osdop_create": 0,
>> "osdop_read": 0,
>> "osdop_write": 0,
>> "osdop_writefull": 0,
>> "osdop_writesame": 0,
>> "osdop_append": 0,
>> "osdop_zero": 0,
>> "osdop_truncate": 0,
>> "osdop_delete": 0,
>> "osdop_mapext": 0,
>> "osdop_sparse_read": 0,
>> "osdop_clonerange": 0,
>> "osdop_getxattr": 0,
>> "osdop_setxattr": 0,
>> "osdop_cmpxattr": 0,
>> "osdop_rmxattr": 0,
>> "osdop_resetxattrs": 0,
>> "osdop_tmap_up": 0,
>> "osdop_tmap_put": 0,
>> "osdop_tmap_get": 0,
>> "osdop_call": 0,
>> "osdop_watch": 0,
>> "osdop_notify": 0,
>> "osdop_src_cmpxattr": 0,
>> "osdop_pgls": 0,
>> "osdop_pgls_filter": 0,
>> "osdop_other": 0,
>> "linger_active": 0,
>> "linger_send": 0,
>> "linger_resend": 0,
>> "linger_ping": 0,
>> "poolop_active": 0,
>> "poolop_send": 0,
>> "poolop_resend": 0,
>> "poolstat_active": 0,
>> "poolstat_send": 0,
>> "poolstat_resend": 0,
>> "statfs_active": 0,
>> "statfs_send": 0,
>> "statfs_resend": 0,
>> "command_active": 0,
>> "command_send": 0,
>> "command_resend": 0,
>> "map_epoch": 105913,
>> "map_full": 0,
>> "map_inc": 828,
>> "osd_sessions": 0,
>> "osd_session_open": 0,
>> "osd_session_close": 0,
>> "osd_laggy": 0,
>> "omap_wr": 0,
>> "omap_rd": 0,
>> "omap_del": 0
>> },
>> "osd": {
>> "op_wip": 0,
>> "op": 16758102,
>> "op_in_bytes": 238398820586,
>> "op_out_bytes": 165484999463,
>> "op_latency": {
>> "avgcount": 16758102,
>> "sum": 38242.481640842,
>> "avgtime": 0.002282029
>> },
>> "op_process_latency": {
>> "avgcount": 16758102,
>> "sum": 28644.906310687,
>> "avgtime": 0.001709316
>> },
>> "op_prepare_latency": {
>> "avgcount": 16761367,
>> "sum": 3489.856599934,
>> "avgtime": 0.000208208
>> },
>> "op_r": 6188565,
>> "op_r_out_bytes": 165484999463,
>> "op_r_latency": {
>> "avgcount": 6188565,
>> "sum": 4507.365756792,
>> "avgtime": 0.000728337
>> },
>> "op_r_process_latency": {
>> "avgcount": 6188565,
>> "sum": 942.363063429,
>> "avgtime": 0.000152274
>> },
>> "op_r_prepare_latency": {
>> "avgcount": 6188644,
>> "sum": 982.866710389,
>> "avgtime": 0.000158817
>> },
>> "op_w": 10546037,
>> "op_w_in_bytes": 238334329494,
>> "op_w_latency": {
>> "avgcount": 10546037,
>> "sum": 33160.719998316,
>> "avgtime": 0.003144377
>> },
>> "op_w_process_latency": {
>> "avgcount": 10546037,
>> "sum": 27668.702029030,
>> "avgtime": 0.002623611
>> },
>> "op_w_prepare_latency": {
>> "avgcount": 10548652,
>> "sum": 2499.688609173,
>> "avgtime": 0.000236967
>> },
>> "op_rw": 23500,
>> "op_rw_in_bytes": 64491092,
>> "op_rw_out_bytes": 0,
>> "op_rw_latency": {
>> "avgcount": 23500,
>> "sum": 574.395885734,
>> "avgtime": 0.024442378
>> },
>> "op_rw_process_latency": {
>> "avgcount": 23500,
>> "sum": 33.841218228,
>> "avgtime": 0.001440051
>> },
>> "op_rw_prepare_latency": {
>> "avgcount": 24071,
>> "sum": 7.301280372,
>> "avgtime": 0.000303322
>> },
>> "op_before_queue_op_lat": {
>> "avgcount": 57892986,
>> "sum": 1502.117718889,
>> "avgtime": 0.000025946
>> },
>> "op_before_dequeue_op_lat": {
>> "avgcount": 58091683,
>> "sum": 45194.453254037,
>> "avgtime": 0.000777984
>> },
>> "subop": 19784758,
>> "subop_in_bytes": 547174969754,
>> "subop_latency": {
>> "avgcount": 19784758,
>> "sum": 13019.714424060,
>> "avgtime": 0.000658067
>> },
>> "subop_w": 19784758,
>> "subop_w_in_bytes": 547174969754,
>> "subop_w_latency": {
>> "avgcount": 19784758,
>> "sum": 13019.714424060,
>> "avgtime": 0.000658067
>> },
>> "subop_pull": 0,
>> "subop_pull_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "subop_push": 0,
>> "subop_push_in_bytes": 0,
>> "subop_push_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "pull": 0,
>> "push": 2003,
>> "push_out_bytes": 5560009728,
>> "recovery_ops": 1940,
>> "loadavg": 118,
>> "buffer_bytes": 0,
>> "history_alloc_Mbytes": 0,
>> "history_alloc_num": 0,
>> "cached_crc": 0,
>> "cached_crc_adjusted": 0,
>> "missed_crc": 0,
>> "numpg": 243,
>> "numpg_primary": 82,
>> "numpg_replica": 161,
>> "numpg_stray": 0,
>> "numpg_removing": 0,
>> "heartbeat_to_peers": 10,
>> "map_messages": 7013,
>> "map_message_epochs": 7143,
>> "map_message_epoch_dups": 6315,
>> "messages_delayed_for_map": 0,
>> "osd_map_cache_hit": 203309,
>> "osd_map_cache_miss": 33,
>> "osd_map_cache_miss_low": 0,
>> "osd_map_cache_miss_low_avg": {
>> "avgcount": 0,
>> "sum": 0
>> },
>> "osd_map_bl_cache_hit": 47012,
>> "osd_map_bl_cache_miss": 1681,
>> "stat_bytes": 6401248198656,
>> "stat_bytes_used": 3777979072512,
>> "stat_bytes_avail": 2623269126144,
>> "copyfrom": 0,
>> "tier_promote": 0,
>> "tier_flush": 0,
>> "tier_flush_fail": 0,
>> "tier_try_flush": 0,
>> "tier_try_flush_fail": 0,
>> "tier_evict": 0,
>> "tier_whiteout": 1631,
>> "tier_dirty": 22360,
>> "tier_clean": 0,
>> "tier_delay": 0,
>> "tier_proxy_read": 0,
>> "tier_proxy_write": 0,
>> "agent_wake": 0,
>> "agent_skip": 0,
>> "agent_flush": 0,
>> "agent_evict": 0,
>> "object_ctx_cache_hit": 16311156,
>> "object_ctx_cache_total": 17426393,
>> "op_cache_hit": 0,
>> "osd_tier_flush_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_tier_promote_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_tier_r_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_pg_info": 30483113,
>> "osd_pg_fastinfo": 29619885,
>> "osd_pg_biginfo": 81703
>> },
>> "recoverystate_perf": {
>> "initial_latency": {
>> "avgcount": 243,
>> "sum": 6.869296500,
>> "avgtime": 0.028268709
>> },
>> "started_latency": {
>> "avgcount": 1125,
>> "sum": 13551384.917335850,
>> "avgtime": 12045.675482076
>> },
>> "reset_latency": {
>> "avgcount": 1368,
>> "sum": 1101.727799040,
>> "avgtime": 0.805356578
>> },
>> "start_latency": {
>> "avgcount": 1368,
>> "sum": 0.002014799,
>> "avgtime": 0.000001472
>> },
>> "primary_latency": {
>> "avgcount": 507,
>> "sum": 4575560.638823428,
>> "avgtime": 9024.774435549
>> },
>> "peering_latency": {
>> "avgcount": 550,
>> "sum": 499.372283616,
>> "avgtime": 0.907949606
>> },
>> "backfilling_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "waitremotebackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "waitlocalbackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "notbackfilling_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "repnotrecovering_latency": {
>> "avgcount": 1009,
>> "sum": 8975301.082274411,
>> "avgtime": 8895.243887288
>> },
>> "repwaitrecoveryreserved_latency": {
>> "avgcount": 420,
>> "sum": 99.846056520,
>> "avgtime": 0.237728706
>> },
>> "repwaitbackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "reprecovering_latency": {
>> "avgcount": 420,
>> "sum": 241.682764382,
>> "avgtime": 0.575435153
>> },
>> "activating_latency": {
>> "avgcount": 507,
>> "sum": 16.893347339,
>> "avgtime": 0.033320211
>> },
>> "waitlocalrecoveryreserved_latency": {
>> "avgcount": 199,
>> "sum": 672.335512769,
>> "avgtime": 3.378570415
>> },
>> "waitremoterecoveryreserved_latency": {
>> "avgcount": 199,
>> "sum": 213.536439363,
>> "avgtime": 1.073047433
>> },
>> "recovering_latency": {
>> "avgcount": 199,
>> "sum": 79.007696479,
>> "avgtime": 0.397023600
>> },
>> "recovered_latency": {
>> "avgcount": 507,
>> "sum": 14.000732748,
>> "avgtime": 0.027614857
>> },
>> "clean_latency": {
>> "avgcount": 395,
>> "sum": 4574325.900371083,
>> "avgtime": 11580.571899673
>> },
>> "active_latency": {
>> "avgcount": 425,
>> "sum": 4575107.630123680,
>> "avgtime": 10764.959129702
>> },
>> "replicaactive_latency": {
>> "avgcount": 589,
>> "sum": 8975184.499049954,
>> "avgtime": 15238.004242869
>> },
>> "stray_latency": {
>> "avgcount": 818,
>> "sum": 800.729455666,
>> "avgtime": 0.978886865
>> },
>> "getinfo_latency": {
>> "avgcount": 550,
>> "sum": 15.085667048,
>> "avgtime": 0.027428485
>> },
>> "getlog_latency": {
>> "avgcount": 546,
>> "sum": 3.482175693,
>> "avgtime": 0.006377611
>> },
>> "waitactingchange_latency": {
>> "avgcount": 39,
>> "sum": 35.444551284,
>> "avgtime": 0.908834648
>> },
>> "incomplete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "down_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "getmissing_latency": {
>> "avgcount": 507,
>> "sum": 6.702129624,
>> "avgtime": 0.013219190
>> },
>> "waitupthru_latency": {
>> "avgcount": 507,
>> "sum": 474.098261727,
>> "avgtime": 0.935105052
>> },
>> "notrecovering_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "rocksdb": {
>> "get": 28320977,
>> "submit_transaction": 30484924,
>> "submit_transaction_sync": 26371957,
>> "get_latency": {
>> "avgcount": 28320977,
>> "sum": 325.900908733,
>> "avgtime": 0.000011507
>> },
>> "submit_latency": {
>> "avgcount": 30484924,
>> "sum": 1835.888692371,
>> "avgtime": 0.000060222
>> },
>> "submit_sync_latency": {
>> "avgcount": 26371957,
>> "sum": 1431.555230628,
>> "avgtime": 0.000054283
>> },
>> "compact": 0,
>> "compact_range": 0,
>> "compact_queue_merge": 0,
>> "compact_queue_len": 0,
>> "rocksdb_write_wal_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_memtable_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_delay_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_pre_and_post_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> }
>> }
>>
>> ----- Original Message -----
>> From: "Igor Fedotov" <ifedotov-l3A5Bk7waGM@public.gmane.org>
>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>, "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>> Sent: Tuesday 5 February 2019 18:56:51
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>> (but I have bluestore_fragmentation_micros)
>>> ok, this is the same
>>>
>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>
>>>
>>> Here a graph on last month, with bluestore_fragmentation_micros and latency,
>>>
>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>> it? Is it the same for other OSDs?
>>
>> This proves there is some issue with the allocator - generally fragmentation
>> might grow, but it shouldn't reset on restart. It looks like some intervals
>> aren't properly merged at run-time.
>>
>> On the other hand, I'm not completely sure that the latency degradation is
>> caused by that - the fragmentation growth is relatively small, and I don't
>> see how it could impact performance that much.
>>
>> I'm wondering whether you have OSD mempool monitoring reports (dump_mempools
>> command output on the admin socket)? Do you have any historic data?
>>
>> If not, may I have the current output and, say, a couple more samples at
>> 8-12 hour intervals?
>>
>>
>> W.r.t. backporting the bitmap allocator to mimic - we haven't had such plans
>> so far, but I'll discuss this at the BlueStore meeting shortly.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>> ----- Original Message -----
>>> From: "Alexandre Derumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>>> To: "Igor Fedotov" <ifedotov-l3A5Bk7waGM@public.gmane.org>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>, "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>> Sent: Monday 4 February 2019 16:04:38
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Thanks Igor,
>>>
>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>> startup and once you get high latency.
>>>>>
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>> I'm already monitoring with
>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>>
>>> but I don't see l_bluestore_fragmentation counter.
>>>
>>> (but I have bluestore_fragmentation_micros)
>>>
>>>
>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>> patch to track latency and some other internal allocator parameters to
>>>>> make sure it's degraded and learn more details.
>>> Sorry, It's a critical production cluster, I can't test on it :(
>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce.
>>>
>>>
>>>
>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>> and try the difference...
>>> Any plan to backport it to mimic ? (But I can wait for Nautilus)
>>> perf results of new bitmap allocator seem very promising from what I've seen in PR.
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Igor Fedotov" <ifedotov-l3A5Bk7waGM@public.gmane.org>
>>> To: "Alexandre Derumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>, "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>, "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> Cc: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>> Sent: Monday 4 February 2019 15:51:30
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi Alexandre,
>>>
>>> looks like a bug in StupidAllocator.
>>>
>>> Could you please collect BlueStore performance counters right after OSD
>>> startup and once you get high latency.
>>>
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>
>>> Also if you're able to rebuild the code I can probably make a simple
>>> patch to track latency and some other internal allocator parameters to
>>> make sure it's degraded and learn more details.
>>>
>>>
>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>> and try the difference...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>>>> Hi again,
>>>>
>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>
>>>>
>>>> I have noticed something using a simple "perf top".
>>>>
>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>
>>>> when latency is bad, perf top gives me:
>>>>
>>>> StupidAllocator::_aligned_len
>>>> and
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>
>>>> (around 10-20% time for both)
>>>>
>>>>
>>>> when latency is good, I don't see them at all.
>>>>
>>>>
>>>> I have used Mark's wallclock profiler; here are the results:
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>>
>>>>
>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
>>>>
>>>>
>>>> + 100.00% clone
>>>> + 100.00% start_thread
>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Alexandre Derumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
>>>> Cc: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>>> Sent: Monday 4 February 2019 09:38:11
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> Some news:
>>>>
>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>
>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>
>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.)
>>>>
>>>>
>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB);
>>>> my other clusters use 1.6TB ssds.
>>>>
>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps.
>>>>
>>>>
>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
>>>> Cc: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>>> Sent: Wednesday 30 January 2019 19:58:15
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
>>>>>> but not op_r_latency?
>>>>>>
>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.)
>>>>
>>>> I just don't see a latency difference on reads. (Or it is very small compared with the write latency increase.)
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
>>>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>>>> Cc: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>>> Sent: Wednesday 30 January 2019 19:50:20
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>> Hi Stefan,
>>>>>
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>> I need to compare with bigger latencies.
>>>>>
>>>>> Here is an example, with all osds at 20-50ms before restart, then 1ms after restart (at 21:15):
>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>
>>>>> I observe the latency in my guest vms too, in disk iowait.
>>>>>
>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>> Here are my influxdb queries.
>>>>>
>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
>>>> but not op_r_latency?
>>>>
>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>>>
>>>> greets,
>>>> Stefan
>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
>>>>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>, "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
>>>>> Cc: "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>>>> Sent: Wednesday 30 January 2019 08:45:33
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>> Hi,
>>>>>>
>>>>>> here are some new results,
>>>>>> from a different osd on a different cluster
>>>>>>
>>>>>> before the osd restart, latency was between 2-5ms
>>>>>> after the osd restart, it is around 1-1.5ms
>>>>>>
>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>
>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>
>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>> exact values out of the daemon do you use for bluestore?
>>>>>
>>>>> I would like to check if i see the same behaviour.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
>>>>>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
>>>>>> Cc: "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>>>>>> Sent: Friday 25 January 2019 10:49:02
>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>> going on one of the OSDs with a high latency?
>>>>>>
>>>>>> Thanks!
>>>>>> sage
>>>>>>
>>>>>>
>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am seeing strange behaviour from my osds, on multiple clusters.
>>>>>>>
>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives;
>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>
>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>
>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>> values like 20-200ms.
>>>>>>>
>>>>>>> Some example graphs:
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>
>>>>>>> All osds have this behaviour, in all clusters.
>>>>>>>
>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>>>>
>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>
>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>
>>>>>>> Any hints for counters/logs to check?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Alexandre
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                     ` <1554220830.1076801.1550047328269.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  2019-02-15 12:46                                                                       ` Igor Fedotov
@ 2019-02-15 12:47                                                                       ` Igor Fedotov
       [not found]                                                                         ` <f97b81e4-265d-cd8e-3053-321d988720c4-l3A5Bk7waGM@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-15 12:47 UTC (permalink / raw)
  Cc: ceph-users, ceph-devel

Hi Alexandre,

I've read through your reports; nothing obvious so far.

The only thing I can see is that the average latency of OSD write ops
increases several times (values in seconds):
0.002040060 (first hour) vs.
0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and for OSD read ops, op_r_latency:

0.000408595 (first hour) vs.
0.000709031 (last 24 hours) vs.
0.004979540 (last hour)
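
For a rough sense of scale, the growth factors implied by the figures above can be computed directly. The Python sketch below only divides the last-hour averages by the first-hour averages; the dictionary keys are descriptive labels chosen for this example and are not necessarily the exact perf counter names.

```python
# Average latencies in seconds, copied from the figures quoted above.
# The keys are labels for this example only.
latencies = {
    "osd_write_ops": {
        "first_hour": 0.002040060,
        "last_24h": 0.002483516,
        "last_hour": 0.008382087,
    },
    "subop_w_latency": {
        "first_hour": 0.000478934,
        "last_24h": 0.000537956,
        "last_hour": 0.003073475,
    },
    "osd_read_ops": {
        "first_hour": 0.000408595,
        "last_24h": 0.000709031,
        "last_hour": 0.004979540,
    },
}


def growth_factor(samples):
    """Return the last-hour latency divided by the first-hour latency."""
    return samples["last_hour"] / samples["first_hour"]


factors = {name: growth_factor(s) for name, s in latencies.items()}

for name, factor in factors.items():
    print(f"{name}: x{factor:.1f} between first hour and last hour")
```

With these numbers the client write path comes out roughly 4 times slower, replica sub-op writes roughly 6 times, and reads roughly 12 times.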

What's interesting is that such latency differences aren't observed at
either the BlueStore level (any _lat params under the "bluestore" section)
or the rocksdb one.

This probably means that the issue is somewhere above BlueStore.

I suggest continuing to collect perf dumps to see if the picture
stays the same.
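
One possible way to compare two collected dumps, sketched in Python: walk every counter that carries an "avgtime" field and rank the counters by how much the average grew between an early and a late dump. The two small dictionaries below stand in for the parsed JSON printed by "ceph daemon osd.N perf dump"; their values are illustrative (the osd figures are the write latencies quoted above, the rocksdb figure is simply held constant), and the function names are made up for this sketch. Since avgtime is an average over everything since the counters were last reset, the late dump is only meaningful if it was taken after a counter reset.

```python
def latency_counters(dump):
    """Yield (section, counter, avgtime) for each counter that has an avgtime."""
    for section, counters in dump.items():
        for name, value in counters.items():
            if isinstance(value, dict) and "avgtime" in value:
                yield section, name, value["avgtime"]


def rank_growth(early, late):
    """Return rows (section, counter, early, late, ratio), largest ratio first."""
    baseline = {(s, n): t for s, n, t in latency_counters(early)}
    rows = []
    for section, name, avg in latency_counters(late):
        base = baseline.get((section, name))
        if base:  # skip counters that are missing or zero in the early dump
            rows.append((section, name, base, avg, avg / base))
    return sorted(rows, key=lambda row: row[4], reverse=True)


# Illustrative stand-ins for two parsed perf dumps (not real measurements
# beyond the two osd averages quoted in this mail).
early = {
    "osd": {"op": 100, "op_w_latency": {"avgcount": 1000, "avgtime": 0.002040060}},
    "rocksdb": {"submit_latency": {"avgcount": 1000, "avgtime": 0.000060222}},
}
late = {
    "osd": {"op": 100, "op_w_latency": {"avgcount": 1000, "avgtime": 0.008382087}},
    "rocksdb": {"submit_latency": {"avgcount": 1000, "avgtime": 0.000060222}},
}

for section, name, base, avg, ratio in rank_growth(early, late):
    print(f"{section}.{name}: {base:.6f} -> {avg:.6f} (x{ratio:.1f})")
```

Real dumps saved to files can be parsed with json.load and passed to rank_growth in place of the stand-in dictionaries. Counters whose ratio stays near 1 (as the rocksdb one does in this made-up input) are the ones that do not follow the latency increase.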

W.r.t. the memory usage you observed, I see nothing suspicious so far - the
lack of a decrease in the RSS report is a known artifact that seems to be safe.

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
 > Hi Igor,
 >
 > Thanks again for helping !
 >
 >
 >
 > I upgraded to the latest mimic this weekend, and with the new memory autotuning
 > I have set osd_memory_target to 8G. (my nvme are 6TB)
 >
 >
 > I have taken a lot of perf dumps, mempool dumps and ps output of the process (to see rss memory) at different hours;
 > here are the reports for osd.0:
 >
 > http://odisoweb1.odiso.net/perfanalysis/
 >
 >
 > The osd was started on 12-02-2019 at 08:00.
 >
 > First report, after 1h running:
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
 >
 >
 >
 > Report after 24h, before the counter reset:
 >
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
 >
 > Report 1h after the counter reset:
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
 >
 >
 >
 >
 > I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00:
 > http://odisoweb1.odiso.net/perfanalysis/graphs2.png
 > After that, it slowly decreases.
 >
 >
 > Another strange thing:
 > I'm seeing total bytes at 5G at 12-02-2018.13:30
 > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
 > It then decreases over time (around 3.7G this morning), but RSS is still at 8G.
 >
 >
 > I have been graphing mempool counters too since yesterday, so I'll be able to track them over time.
 >
 > ----- Original Message -----
 > From: "Igor Fedotov" <ifedotov@suse.de>
 > To: "Alexandre Derumier" <aderumier@odiso.com>
 > Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 > Sent: Monday 11 February 2019 12:03:17
 > Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >
 > On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
 >> another mempool dump after 1h run. (latency ok)
 >>
 >> Biggest difference:
 >>
 >> before restart
 >> -------------
 >> "bluestore_cache_other": {
 >> "items": 48661920,
 >> "bytes": 1539544228
 >> },
 >> "bluestore_cache_data": {
 >> "items": 54,
 >> "bytes": 643072
 >> },
 >> (other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)
 >>
 >>
 >> After restart
 >> -------------
 >> "bluestore_cache_other": {
 >> "items": 12432298,
 >> "bytes": 500834899
 >> },
 >> "bluestore_cache_data": {
 >> "items": 40084,
 >> "bytes": 1056235520
 >> },
 >>
 > This is fine as cache is warming after restart and some rebalancing
 > between data and metadata might occur.
 >
 > What relates to the allocator, and most probably to fragmentation growth, is:
 >
 > "bluestore_alloc": {
 > "items": 165053952,
 > "bytes": 165053952
 > },
 >
 > which had been higher before the reset (if I got the order of these dumps
 > right)
 >
 > "bluestore_alloc": {
 > "items": 210243456,
 > "bytes": 210243456
 > },
 >
 > But as I mentioned - I'm not 100% sure this might cause such a huge
 > latency increase...
 >
 > Do you have perf counters dump after the restart?
 >
 > Could you collect some more dumps - for both mempool and perf counters?
 >
 > So ideally I'd like to have:
 >
 > 1) mempool/perf counters dumps after the restart (1hour is OK)
 >
 > 2) mempool/perf counters dumps in 24+ hours after restart
 >
 > 3) reset perf counters after 2), wait for 1 hour (and without OSD
 > restart) and dump mempool/perf counters again.
 >
 > So we'll be able to learn both allocator mem usage growth and operation
 > latency distribution for the following periods:
 >
 > a) 1st hour after restart
 >
 > b) 25th hour.
 >
 >
 > Thanks,
 >
 > Igor
 >
 >
 >> full mempool dump after restart
 >> -------------------------------
 >>
 >> {
 >> "mempool": {
 >> "by_pool": {
 >> "bloom_filter": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "bluestore_alloc": {
 >> "items": 165053952,
 >> "bytes": 165053952
 >> },
 >> "bluestore_cache_data": {
 >> "items": 40084,
 >> "bytes": 1056235520
 >> },
 >> "bluestore_cache_onode": {
 >> "items": 22225,
 >> "bytes": 14935200
 >> },
 >> "bluestore_cache_other": {
 >> "items": 12432298,
 >> "bytes": 500834899
 >> },
 >> "bluestore_fsck": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "bluestore_txc": {
 >> "items": 11,
 >> "bytes": 8184
 >> },
 >> "bluestore_writing_deferred": {
 >> "items": 5047,
 >> "bytes": 22673736
 >> },
 >> "bluestore_writing": {
 >> "items": 91,
 >> "bytes": 1662976
 >> },
 >> "bluefs": {
 >> "items": 1907,
 >> "bytes": 95600
 >> },
 >> "buffer_anon": {
 >> "items": 19664,
 >> "bytes": 25486050
 >> },
 >> "buffer_meta": {
 >> "items": 46189,
 >> "bytes": 2956096
 >> },
 >> "osd": {
 >> "items": 243,
 >> "bytes": 3089016
 >> },
 >> "osd_mapbl": {
 >> "items": 17,
 >> "bytes": 214366
 >> },
 >> "osd_pglog": {
 >> "items": 889673,
 >> "bytes": 367160400
 >> },
 >> "osdmap": {
 >> "items": 3803,
 >> "bytes": 224552
 >> },
 >> "osdmap_mapping": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "pgmap": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "mds_co": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "unittest_1": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "unittest_2": {
 >> "items": 0,
 >> "bytes": 0
 >> }
 >> },
 >> "total": {
 >> "items": 178515204,
 >> "bytes": 2160630547
 >> }
 >> }
 >> }
 >>
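To see which pools moved most between the two `dump_mempools` outputs quoted in this thread, the byte counts can be diffed per pool. This is a small sketch: the numbers are copied from the two dumps, only a few pools are included, and the helper name `pool_deltas` is made up for illustration.

```python
# Byte counts copied from the two dump_mempools outputs in this thread.
before_restart = {
    "bluestore_alloc": 210243456,
    "bluestore_cache_data": 643072,
    "bluestore_cache_onode": 70988064,
    "bluestore_cache_other": 1539544228,
    "osd_pglog": 372459684,
}
after_restart = {
    "bluestore_alloc": 165053952,
    "bluestore_cache_data": 1056235520,
    "bluestore_cache_onode": 14935200,
    "bluestore_cache_other": 500834899,
    "osd_pglog": 367160400,
}


def pool_deltas(old, new):
    """Return (pool, new - old) pairs, largest absolute change first."""
    pools = sorted(set(old) | set(new))
    deltas = [(p, new.get(p, 0) - old.get(p, 0)) for p in pools]
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


for pool, delta in pool_deltas(before_restart, after_restart):
    print(f"{pool:24s} {delta / 2**20:+10.1f} MiB")
```

The two cache pools dominate the change, which fits the cache-warming explanation; `bluestore_alloc` changes by a much smaller amount.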
 >> ----- Original Message -----
 >> From: "aderumier" <aderumier@odiso.com>
 >> To: "Igor Fedotov" <ifedotov@suse.de>
 >> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >> Sent: Friday, 8 February 2019 16:14:54
 >> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>
 >> I'm just seeing
 >>
 >> StupidAllocator::_aligned_len
 >> and
 >> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
 >>
 >> on 1 OSD, both at 10%.
 >>
 >> Here is the dump_mempools output:
 >>
 >> {
 >> "mempool": {
 >> "by_pool": {
 >> "bloom_filter": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "bluestore_alloc": {
 >> "items": 210243456,
 >> "bytes": 210243456
 >> },
 >> "bluestore_cache_data": {
 >> "items": 54,
 >> "bytes": 643072
 >> },
 >> "bluestore_cache_onode": {
 >> "items": 105637,
 >> "bytes": 70988064
 >> },
 >> "bluestore_cache_other": {
 >> "items": 48661920,
 >> "bytes": 1539544228
 >> },
 >> "bluestore_fsck": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "bluestore_txc": {
 >> "items": 12,
 >> "bytes": 8928
 >> },
 >> "bluestore_writing_deferred": {
 >> "items": 406,
 >> "bytes": 4792868
 >> },
 >> "bluestore_writing": {
 >> "items": 66,
 >> "bytes": 1085440
 >> },
 >> "bluefs": {
 >> "items": 1882,
 >> "bytes": 93600
 >> },
 >> "buffer_anon": {
 >> "items": 138986,
 >> "bytes": 24983701
 >> },
 >> "buffer_meta": {
 >> "items": 544,
 >> "bytes": 34816
 >> },
 >> "osd": {
 >> "items": 243,
 >> "bytes": 3089016
 >> },
 >> "osd_mapbl": {
 >> "items": 36,
 >> "bytes": 179308
 >> },
 >> "osd_pglog": {
 >> "items": 952564,
 >> "bytes": 372459684
 >> },
 >> "osdmap": {
 >> "items": 3639,
 >> "bytes": 224664
 >> },
 >> "osdmap_mapping": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "pgmap": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "mds_co": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "unittest_1": {
 >> "items": 0,
 >> "bytes": 0
 >> },
 >> "unittest_2": {
 >> "items": 0,
 >> "bytes": 0
 >> }
 >> },
 >> "total": {
 >> "items": 260109445,
 >> "bytes": 2228370845
 >> }
 >> }
 >> }
 >>
 >>
 >> and the perf dump
 >>
 >> root@ceph5-2:~# ceph daemon osd.4 perf dump
 >> {
 >> "AsyncMessenger::Worker-0": {
 >> "msgr_recv_messages": 22948570,
 >> "msgr_send_messages": 22561570,
 >> "msgr_recv_bytes": 333085080271,
 >> "msgr_send_bytes": 261798871204,
 >> "msgr_created_connections": 6152,
 >> "msgr_active_connections": 2701,
 >> "msgr_running_total_time": 1055.197867330,
 >> "msgr_running_send_time": 352.764480121,
 >> "msgr_running_recv_time": 499.206831955,
 >> "msgr_running_fast_dispatch_time": 130.982201607
 >> },
 >> "AsyncMessenger::Worker-1": {
 >> "msgr_recv_messages": 18801593,
 >> "msgr_send_messages": 18430264,
 >> "msgr_recv_bytes": 306871760934,
 >> "msgr_send_bytes": 192789048666,
 >> "msgr_created_connections": 5773,
 >> "msgr_active_connections": 2721,
 >> "msgr_running_total_time": 816.821076305,
 >> "msgr_running_send_time": 261.353228926,
 >> "msgr_running_recv_time": 394.035587911,
 >> "msgr_running_fast_dispatch_time": 104.012155720
 >> },
 >> "AsyncMessenger::Worker-2": {
 >> "msgr_recv_messages": 18463400,
 >> "msgr_send_messages": 18105856,
 >> "msgr_recv_bytes": 187425453590,
 >> "msgr_send_bytes": 220735102555,
 >> "msgr_created_connections": 5897,
 >> "msgr_active_connections": 2605,
 >> "msgr_running_total_time": 807.186854324,
 >> "msgr_running_send_time": 296.834435839,
 >> "msgr_running_recv_time": 351.364389691,
 >> "msgr_running_fast_dispatch_time": 101.215776792
 >> },
 >> "bluefs": {
 >> "gift_bytes": 0,
 >> "reclaim_bytes": 0,
 >> "db_total_bytes": 256050724864,
 >> "db_used_bytes": 12413042688,
 >> "wal_total_bytes": 0,
 >> "wal_used_bytes": 0,
 >> "slow_total_bytes": 0,
 >> "slow_used_bytes": 0,
 >> "num_files": 209,
 >> "log_bytes": 10383360,
 >> "log_compactions": 14,
 >> "logged_bytes": 336498688,
 >> "files_written_wal": 2,
 >> "files_written_sst": 4499,
 >> "bytes_written_wal": 417989099783,
 >> "bytes_written_sst": 213188750209
 >> },
 >> "bluestore": {
 >> "kv_flush_lat": {
 >> "avgcount": 26371957,
 >> "sum": 26.734038497,
 >> "avgtime": 0.000001013
 >> },
 >> "kv_commit_lat": {
 >> "avgcount": 26371957,
 >> "sum": 3397.491150603,
 >> "avgtime": 0.000128829
 >> },
 >> "kv_lat": {
 >> "avgcount": 26371957,
 >> "sum": 3424.225189100,
 >> "avgtime": 0.000129843
 >> },
 >> "state_prepare_lat": {
 >> "avgcount": 30484924,
 >> "sum": 3689.542105337,
 >> "avgtime": 0.000121028
 >> },
 >> "state_aio_wait_lat": {
 >> "avgcount": 30484924,
 >> "sum": 509.864546111,
 >> "avgtime": 0.000016725
 >> },
 >> "state_io_done_lat": {
 >> "avgcount": 30484924,
 >> "sum": 24.534052953,
 >> "avgtime": 0.000000804
 >> },
 >> "state_kv_queued_lat": {
 >> "avgcount": 30484924,
 >> "sum": 3488.338424238,
 >> "avgtime": 0.000114428
 >> },
 >> "state_kv_commiting_lat": {
 >> "avgcount": 30484924,
 >> "sum": 5660.437003432,
 >> "avgtime": 0.000185679
 >> },
 >> "state_kv_done_lat": {
 >> "avgcount": 30484924,
 >> "sum": 7.763511500,
 >> "avgtime": 0.000000254
 >> },
 >> "state_deferred_queued_lat": {
 >> "avgcount": 26346134,
 >> "sum": 666071.296856696,
 >> "avgtime": 0.025281557
 >> },
 >> "state_deferred_aio_wait_lat": {
 >> "avgcount": 26346134,
 >> "sum": 1755.660547071,
 >> "avgtime": 0.000066638
 >> },
 >> "state_deferred_cleanup_lat": {
 >> "avgcount": 26346134,
 >> "sum": 185465.151653703,
 >> "avgtime": 0.007039558
 >> },
 >> "state_finishing_lat": {
 >> "avgcount": 30484920,
 >> "sum": 3.046847481,
 >> "avgtime": 0.000000099
 >> },
 >> "state_done_lat": {
 >> "avgcount": 30484920,
 >> "sum": 13193.362685280,
 >> "avgtime": 0.000432783
 >> },
 >> "throttle_lat": {
 >> "avgcount": 30484924,
 >> "sum": 14.634269979,
 >> "avgtime": 0.000000480
 >> },
 >> "submit_lat": {
 >> "avgcount": 30484924,
 >> "sum": 3873.883076148,
 >> "avgtime": 0.000127075
 >> },
 >> "commit_lat": {
 >> "avgcount": 30484924,
 >> "sum": 13376.492317331,
 >> "avgtime": 0.000438790
 >> },
 >> "read_lat": {
 >> "avgcount": 5873923,
 >> "sum": 1817.167582057,
 >> "avgtime": 0.000309361
 >> },
 >> "read_onode_meta_lat": {
 >> "avgcount": 19608201,
 >> "sum": 146.770464482,
 >> "avgtime": 0.000007485
 >> },
 >> "read_wait_aio_lat": {
 >> "avgcount": 13734278,
 >> "sum": 2532.578077242,
 >> "avgtime": 0.000184398
 >> },
 >> "compress_lat": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "decompress_lat": {
 >> "avgcount": 1346945,
 >> "sum": 26.227575896,
 >> "avgtime": 0.000019471
 >> },
 >> "csum_lat": {
 >> "avgcount": 28020392,
 >> "sum": 149.587819041,
 >> "avgtime": 0.000005338
 >> },
 >> "compress_success_count": 0,
 >> "compress_rejected_count": 0,
 >> "write_pad_bytes": 352923605,
 >> "deferred_write_ops": 24373340,
 >> "deferred_write_bytes": 216791842816,
 >> "write_penalty_read_ops": 8062366,
 >> "bluestore_allocated": 3765566013440,
 >> "bluestore_stored": 4186255221852,
 >> "bluestore_compressed": 39981379040,
 >> "bluestore_compressed_allocated": 73748348928,
 >> "bluestore_compressed_original": 165041381376,
 >> "bluestore_onodes": 104232,
 >> "bluestore_onode_hits": 71206874,
 >> "bluestore_onode_misses": 1217914,
 >> "bluestore_onode_shard_hits": 260183292,
 >> "bluestore_onode_shard_misses": 22851573,
 >> "bluestore_extents": 3394513,
 >> "bluestore_blobs": 2773587,
 >> "bluestore_buffers": 0,
 >> "bluestore_buffer_bytes": 0,
 >> "bluestore_buffer_hit_bytes": 62026011221,
 >> "bluestore_buffer_miss_bytes": 995233669922,
 >> "bluestore_write_big": 5648815,
 >> "bluestore_write_big_bytes": 552502214656,
 >> "bluestore_write_big_blobs": 12440992,
 >> "bluestore_write_small": 35883770,
 >> "bluestore_write_small_bytes": 223436965719,
 >> "bluestore_write_small_unused": 408125,
 >> "bluestore_write_small_deferred": 34961455,
 >> "bluestore_write_small_pre_read": 34961455,
 >> "bluestore_write_small_new": 514190,
 >> "bluestore_txc": 30484924,
 >> "bluestore_onode_reshard": 5144189,
 >> "bluestore_blob_split": 60104,
 >> "bluestore_extent_compress": 53347252,
 >> "bluestore_gc_merged": 21142528,
 >> "bluestore_read_eio": 0,
 >> "bluestore_fragmentation_micros": 67
 >> },
 >> "finisher-defered_finisher": {
 >> "queue_len": 0,
 >> "complete_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "finisher-finisher-0": {
 >> "queue_len": 0,
 >> "complete_latency": {
 >> "avgcount": 26625163,
 >> "sum": 1057.506990951,
 >> "avgtime": 0.000039718
 >> }
 >> },
 >> "finisher-objecter-finisher-0": {
 >> "queue_len": 0,
 >> "complete_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.0::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.0::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.1::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.1::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.2::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.2::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.3::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.3::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.4::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.4::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.5::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.5::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.6::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.6::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.7::sdata_wait_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "mutex-OSDShard.7::shard_lock": {
 >> "wait": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "objecter": {
 >> "op_active": 0,
 >> "op_laggy": 0,
 >> "op_send": 0,
 >> "op_send_bytes": 0,
 >> "op_resend": 0,
 >> "op_reply": 0,
 >> "op": 0,
 >> "op_r": 0,
 >> "op_w": 0,
 >> "op_rmw": 0,
 >> "op_pg": 0,
 >> "osdop_stat": 0,
 >> "osdop_create": 0,
 >> "osdop_read": 0,
 >> "osdop_write": 0,
 >> "osdop_writefull": 0,
 >> "osdop_writesame": 0,
 >> "osdop_append": 0,
 >> "osdop_zero": 0,
 >> "osdop_truncate": 0,
 >> "osdop_delete": 0,
 >> "osdop_mapext": 0,
 >> "osdop_sparse_read": 0,
 >> "osdop_clonerange": 0,
 >> "osdop_getxattr": 0,
 >> "osdop_setxattr": 0,
 >> "osdop_cmpxattr": 0,
 >> "osdop_rmxattr": 0,
 >> "osdop_resetxattrs": 0,
 >> "osdop_tmap_up": 0,
 >> "osdop_tmap_put": 0,
 >> "osdop_tmap_get": 0,
 >> "osdop_call": 0,
 >> "osdop_watch": 0,
 >> "osdop_notify": 0,
 >> "osdop_src_cmpxattr": 0,
 >> "osdop_pgls": 0,
 >> "osdop_pgls_filter": 0,
 >> "osdop_other": 0,
 >> "linger_active": 0,
 >> "linger_send": 0,
 >> "linger_resend": 0,
 >> "linger_ping": 0,
 >> "poolop_active": 0,
 >> "poolop_send": 0,
 >> "poolop_resend": 0,
 >> "poolstat_active": 0,
 >> "poolstat_send": 0,
 >> "poolstat_resend": 0,
 >> "statfs_active": 0,
 >> "statfs_send": 0,
 >> "statfs_resend": 0,
 >> "command_active": 0,
 >> "command_send": 0,
 >> "command_resend": 0,
 >> "map_epoch": 105913,
 >> "map_full": 0,
 >> "map_inc": 828,
 >> "osd_sessions": 0,
 >> "osd_session_open": 0,
 >> "osd_session_close": 0,
 >> "osd_laggy": 0,
 >> "omap_wr": 0,
 >> "omap_rd": 0,
 >> "omap_del": 0
 >> },
 >> "osd": {
 >> "op_wip": 0,
 >> "op": 16758102,
 >> "op_in_bytes": 238398820586,
 >> "op_out_bytes": 165484999463,
 >> "op_latency": {
 >> "avgcount": 16758102,
 >> "sum": 38242.481640842,
 >> "avgtime": 0.002282029
 >> },
 >> "op_process_latency": {
 >> "avgcount": 16758102,
 >> "sum": 28644.906310687,
 >> "avgtime": 0.001709316
 >> },
 >> "op_prepare_latency": {
 >> "avgcount": 16761367,
 >> "sum": 3489.856599934,
 >> "avgtime": 0.000208208
 >> },
 >> "op_r": 6188565,
 >> "op_r_out_bytes": 165484999463,
 >> "op_r_latency": {
 >> "avgcount": 6188565,
 >> "sum": 4507.365756792,
 >> "avgtime": 0.000728337
 >> },
 >> "op_r_process_latency": {
 >> "avgcount": 6188565,
 >> "sum": 942.363063429,
 >> "avgtime": 0.000152274
 >> },
 >> "op_r_prepare_latency": {
 >> "avgcount": 6188644,
 >> "sum": 982.866710389,
 >> "avgtime": 0.000158817
 >> },
 >> "op_w": 10546037,
 >> "op_w_in_bytes": 238334329494,
 >> "op_w_latency": {
 >> "avgcount": 10546037,
 >> "sum": 33160.719998316,
 >> "avgtime": 0.003144377
 >> },
 >> "op_w_process_latency": {
 >> "avgcount": 10546037,
 >> "sum": 27668.702029030,
 >> "avgtime": 0.002623611
 >> },
 >> "op_w_prepare_latency": {
 >> "avgcount": 10548652,
 >> "sum": 2499.688609173,
 >> "avgtime": 0.000236967
 >> },
 >> "op_rw": 23500,
 >> "op_rw_in_bytes": 64491092,
 >> "op_rw_out_bytes": 0,
 >> "op_rw_latency": {
 >> "avgcount": 23500,
 >> "sum": 574.395885734,
 >> "avgtime": 0.024442378
 >> },
 >> "op_rw_process_latency": {
 >> "avgcount": 23500,
 >> "sum": 33.841218228,
 >> "avgtime": 0.001440051
 >> },
 >> "op_rw_prepare_latency": {
 >> "avgcount": 24071,
 >> "sum": 7.301280372,
 >> "avgtime": 0.000303322
 >> },
 >> "op_before_queue_op_lat": {
 >> "avgcount": 57892986,
 >> "sum": 1502.117718889,
 >> "avgtime": 0.000025946
 >> },
 >> "op_before_dequeue_op_lat": {
 >> "avgcount": 58091683,
 >> "sum": 45194.453254037,
 >> "avgtime": 0.000777984
 >> },
 >> "subop": 19784758,
 >> "subop_in_bytes": 547174969754,
 >> "subop_latency": {
 >> "avgcount": 19784758,
 >> "sum": 13019.714424060,
 >> "avgtime": 0.000658067
 >> },
 >> "subop_w": 19784758,
 >> "subop_w_in_bytes": 547174969754,
 >> "subop_w_latency": {
 >> "avgcount": 19784758,
 >> "sum": 13019.714424060,
 >> "avgtime": 0.000658067
 >> },
 >> "subop_pull": 0,
 >> "subop_pull_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "subop_push": 0,
 >> "subop_push_in_bytes": 0,
 >> "subop_push_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "pull": 0,
 >> "push": 2003,
 >> "push_out_bytes": 5560009728,
 >> "recovery_ops": 1940,
 >> "loadavg": 118,
 >> "buffer_bytes": 0,
 >> "history_alloc_Mbytes": 0,
 >> "history_alloc_num": 0,
 >> "cached_crc": 0,
 >> "cached_crc_adjusted": 0,
 >> "missed_crc": 0,
 >> "numpg": 243,
 >> "numpg_primary": 82,
 >> "numpg_replica": 161,
 >> "numpg_stray": 0,
 >> "numpg_removing": 0,
 >> "heartbeat_to_peers": 10,
 >> "map_messages": 7013,
 >> "map_message_epochs": 7143,
 >> "map_message_epoch_dups": 6315,
 >> "messages_delayed_for_map": 0,
 >> "osd_map_cache_hit": 203309,
 >> "osd_map_cache_miss": 33,
 >> "osd_map_cache_miss_low": 0,
 >> "osd_map_cache_miss_low_avg": {
 >> "avgcount": 0,
 >> "sum": 0
 >> },
 >> "osd_map_bl_cache_hit": 47012,
 >> "osd_map_bl_cache_miss": 1681,
 >> "stat_bytes": 6401248198656,
 >> "stat_bytes_used": 3777979072512,
 >> "stat_bytes_avail": 2623269126144,
 >> "copyfrom": 0,
 >> "tier_promote": 0,
 >> "tier_flush": 0,
 >> "tier_flush_fail": 0,
 >> "tier_try_flush": 0,
 >> "tier_try_flush_fail": 0,
 >> "tier_evict": 0,
 >> "tier_whiteout": 1631,
 >> "tier_dirty": 22360,
 >> "tier_clean": 0,
 >> "tier_delay": 0,
 >> "tier_proxy_read": 0,
 >> "tier_proxy_write": 0,
 >> "agent_wake": 0,
 >> "agent_skip": 0,
 >> "agent_flush": 0,
 >> "agent_evict": 0,
 >> "object_ctx_cache_hit": 16311156,
 >> "object_ctx_cache_total": 17426393,
 >> "op_cache_hit": 0,
 >> "osd_tier_flush_lat": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "osd_tier_promote_lat": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "osd_tier_r_lat": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "osd_pg_info": 30483113,
 >> "osd_pg_fastinfo": 29619885,
 >> "osd_pg_biginfo": 81703
 >> },
 >> "recoverystate_perf": {
 >> "initial_latency": {
 >> "avgcount": 243,
 >> "sum": 6.869296500,
 >> "avgtime": 0.028268709
 >> },
 >> "started_latency": {
 >> "avgcount": 1125,
 >> "sum": 13551384.917335850,
 >> "avgtime": 12045.675482076
 >> },
 >> "reset_latency": {
 >> "avgcount": 1368,
 >> "sum": 1101.727799040,
 >> "avgtime": 0.805356578
 >> },
 >> "start_latency": {
 >> "avgcount": 1368,
 >> "sum": 0.002014799,
 >> "avgtime": 0.000001472
 >> },
 >> "primary_latency": {
 >> "avgcount": 507,
 >> "sum": 4575560.638823428,
 >> "avgtime": 9024.774435549
 >> },
 >> "peering_latency": {
 >> "avgcount": 550,
 >> "sum": 499.372283616,
 >> "avgtime": 0.907949606
 >> },
 >> "backfilling_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "waitremotebackfillreserved_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "waitlocalbackfillreserved_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "notbackfilling_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "repnotrecovering_latency": {
 >> "avgcount": 1009,
 >> "sum": 8975301.082274411,
 >> "avgtime": 8895.243887288
 >> },
 >> "repwaitrecoveryreserved_latency": {
 >> "avgcount": 420,
 >> "sum": 99.846056520,
 >> "avgtime": 0.237728706
 >> },
 >> "repwaitbackfillreserved_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "reprecovering_latency": {
 >> "avgcount": 420,
 >> "sum": 241.682764382,
 >> "avgtime": 0.575435153
 >> },
 >> "activating_latency": {
 >> "avgcount": 507,
 >> "sum": 16.893347339,
 >> "avgtime": 0.033320211
 >> },
 >> "waitlocalrecoveryreserved_latency": {
 >> "avgcount": 199,
 >> "sum": 672.335512769,
 >> "avgtime": 3.378570415
 >> },
 >> "waitremoterecoveryreserved_latency": {
 >> "avgcount": 199,
 >> "sum": 213.536439363,
 >> "avgtime": 1.073047433
 >> },
 >> "recovering_latency": {
 >> "avgcount": 199,
 >> "sum": 79.007696479,
 >> "avgtime": 0.397023600
 >> },
 >> "recovered_latency": {
 >> "avgcount": 507,
 >> "sum": 14.000732748,
 >> "avgtime": 0.027614857
 >> },
 >> "clean_latency": {
 >> "avgcount": 395,
 >> "sum": 4574325.900371083,
 >> "avgtime": 11580.571899673
 >> },
 >> "active_latency": {
 >> "avgcount": 425,
 >> "sum": 4575107.630123680,
 >> "avgtime": 10764.959129702
 >> },
 >> "replicaactive_latency": {
 >> "avgcount": 589,
 >> "sum": 8975184.499049954,
 >> "avgtime": 15238.004242869
 >> },
 >> "stray_latency": {
 >> "avgcount": 818,
 >> "sum": 800.729455666,
 >> "avgtime": 0.978886865
 >> },
 >> "getinfo_latency": {
 >> "avgcount": 550,
 >> "sum": 15.085667048,
 >> "avgtime": 0.027428485
 >> },
 >> "getlog_latency": {
 >> "avgcount": 546,
 >> "sum": 3.482175693,
 >> "avgtime": 0.006377611
 >> },
 >> "waitactingchange_latency": {
 >> "avgcount": 39,
 >> "sum": 35.444551284,
 >> "avgtime": 0.908834648
 >> },
 >> "incomplete_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "down_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "getmissing_latency": {
 >> "avgcount": 507,
 >> "sum": 6.702129624,
 >> "avgtime": 0.013219190
 >> },
 >> "waitupthru_latency": {
 >> "avgcount": 507,
 >> "sum": 474.098261727,
 >> "avgtime": 0.935105052
 >> },
 >> "notrecovering_latency": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> },
 >> "rocksdb": {
 >> "get": 28320977,
 >> "submit_transaction": 30484924,
 >> "submit_transaction_sync": 26371957,
 >> "get_latency": {
 >> "avgcount": 28320977,
 >> "sum": 325.900908733,
 >> "avgtime": 0.000011507
 >> },
 >> "submit_latency": {
 >> "avgcount": 30484924,
 >> "sum": 1835.888692371,
 >> "avgtime": 0.000060222
 >> },
 >> "submit_sync_latency": {
 >> "avgcount": 26371957,
 >> "sum": 1431.555230628,
 >> "avgtime": 0.000054283
 >> },
 >> "compact": 0,
 >> "compact_range": 0,
 >> "compact_queue_merge": 0,
 >> "compact_queue_len": 0,
 >> "rocksdb_write_wal_time": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "rocksdb_write_memtable_time": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "rocksdb_write_delay_time": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> },
 >> "rocksdb_write_pre_and_post_time": {
 >> "avgcount": 0,
 >> "sum": 0.000000000,
 >> "avgtime": 0.000000000
 >> }
 >> }
 >> }
 >>
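A dump of this size is easier to read when the latency counters are ranked by their average. The sketch below does that for a handful of `bluestore` counters copied from the perf dump above; the helper name `slowest` is hypothetical, and in practice the JSON would presumably come from the saved output of `ceph daemon osd.N perf dump`.

```python
import json

# A few "bluestore" counters copied from the perf dump above.
perf_dump = json.loads("""
{
  "bluestore": {
    "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603},
    "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696},
    "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703},
    "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331},
    "compress_lat": {"avgcount": 0, "sum": 0.0},
    "bluestore_txc": 30484924
  }
}
""")


def slowest(section, top=3):
    """Return the `top` latency counters of a section, highest average first."""
    rows = []
    for name, value in section.items():
        # Skip plain integer counters and latency counters that never fired.
        if isinstance(value, dict) and value.get("avgcount"):
            rows.append((name, value["sum"] / value["avgcount"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


for name, avg in slowest(perf_dump["bluestore"]):
    print(f"{name:28s} {avg * 1000:8.3f} ms")
```

These averages cover the whole lifetime of the counters, so they describe the period since the last restart or counter reset rather than the current behaviour.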
 >> ----- Original Message -----
 >> From: "Igor Fedotov" <ifedotov@suse.de>
 >> To: "aderumier" <aderumier@odiso.com>
 >> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >> Sent: Tuesday, 5 February 2019 18:56:51
 >> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>
 >> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
 >>>>> but I don't see l_bluestore_fragmentation counter.
 >>>>> (but I have bluestore_fragmentation_micros)
 >>> ok, this is the same
 >>>
 >>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
 >>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
 >>>
 >>>
 >>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
 >>>
 >>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
 >> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
 >> it? Is it the same for other OSDs?
 >>
 >> This proves some issue with the allocator: generally fragmentation
 >> might grow, but it shouldn't reset on restart. It looks like some intervals
 >> aren't properly merged at run time.
 >>
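One way to picture the ratio in the counter description (free extents / max possible number of free extents), and why unmerged adjacent intervals would inflate it until a restart rebuilds the free list, is the toy model below. It is only a sketch: it assumes the maximum number of free extents is the free space divided by an allocation unit, the 4096-byte unit and the extent list are hypothetical, and nothing here is taken from the Ceph allocator code.

```python
def merge_adjacent(extents):
    """Coalesce (offset, length) free extents that touch each other."""
    merged = []
    for offset, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


def fragmentation(extents, alloc_unit):
    """free extents / max possible number of free extents, in [0, 1]."""
    free_bytes = sum(length for _, length in extents)
    max_extents = free_bytes // alloc_unit
    return len(extents) / max_extents if max_extents else 0.0


ALLOC_UNIT = 4096  # hypothetical allocation unit in bytes
unmerged = [(0, 4096), (4096, 4096), (8192, 8192), (32768, 16384)]
merged = merge_adjacent(unmerged)
print(fragmentation(unmerged, ALLOC_UNIT), fragmentation(merged, ALLOC_UNIT))
```

The free space is identical in both lists; only the number of intervals describing it differs, which is the kind of change that would vanish when the intervals are rebuilt in merged form.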
 >> On the other hand, I'm not completely sure that the latency degradation is
 >> caused by that: the fragmentation growth is relatively small, and I don't see
 >> how it could impact performance that much.
 >>
 >> Do you have any OSD mempool monitoring reports (dump_mempools command
 >> output on the admin socket)? Do you have any historic data?
 >>
 >> If not, may I have the current output and, say, a couple more samples at
 >> 8-12 hour intervals?
 >>
 >>
 >> With regard to backporting the bitmap allocator to mimic: we haven't had such
 >> plans so far, but I'll discuss this at the BlueStore meeting shortly.
 >>
 >>
 >> Thanks,
 >>
 >> Igor
 >>
 >>> ----- Original Message -----
 >>> From: "Alexandre Derumier" <aderumier@odiso.com>
 >>> To: "Igor Fedotov" <ifedotov@suse.de>
 >>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>> Sent: Monday, 4 February 2019 16:04:38
 >>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>
 >>> Thanks Igor,
 >>>
 >>>>> Could you please collect BlueStore performance counters right after OSD
 >>>>> startup and once you get high latency.
 >>>>>
 >>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
 >>> I'm already monitoring with
 >>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
 >>>
 >>> but I don't see an l_bluestore_fragmentation counter.
 >>>
 >>> (but I do have bluestore_fragmentation_micros)
 >>>
 >>>
 >>>>> Also if you're able to rebuild the code I can probably make a simple
 >>>>> patch to track latency and some other internal allocator's parameter to
 >>>>> make sure it's degraded and learn more details.
 >>> Sorry, it's a critical production cluster, I can't test on it :(
 >>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
 >>>
 >>>
 >>>
 >>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
 >>>>> and try the difference...
 >>> Any plan to backport it to mimic? (But I can wait for Nautilus.)
 >>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
 >>>
 >>>
 >>>
 >>> ----- Original Message -----
 >>> From: "Igor Fedotov" <ifedotov@suse.de>
 >>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
 >>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>> Sent: Monday, 4 February 2019 15:51:30
 >>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>
 >>> Hi Alexandre,
 >>>
 >>> This looks like a bug in StupidAllocator.
 >>>
 >>> Could you please collect BlueStore performance counters right after OSD
 >>> startup and again once you get high latency?
 >>>
 >>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
 >>>
 >>> Also, if you're able to rebuild the code, I can probably make a simple
 >>> patch to track latency and some other internal allocator parameters to
 >>> make sure it's degraded and learn more details.
 >>>
 >>>
 >>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
 >>> and try the difference...
 >>>
 >>>
 >>> Thanks,
 >>>
 >>> Igor
 >>>
 >>>
 >>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
 >>>> Hi again,
 >>>>
 >>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
 >>>>
 >>>>
 >>>> I have noticed something using a simple "perf top".
 >>>>
 >>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
 >>>>
 >>>> when latency is bad, perf top gives me:
 >>>>
 >>>> StupidAllocator::_aligned_len
 >>>> and
 >>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
 >>>>
 >>>> (around 10-20% of the time for both)
 >>>>
 >>>>
 >>>> When latency is good, I don't see them at all.
 >>>>
 >>>>
 >>>> I have used Mark's wallclock profiler; here are the results:
 >>>>
 >>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
 >>>>
 >>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
 >>>>
 >>>>
 >>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
 >>>>
 >>>>
 >>>> + 100.00% clone
 >>>> + 100.00% start_thread
 >>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
 >>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
 >>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
 >>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
 >>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
 >>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
 >>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
 >>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
 >>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
 >>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
 >>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
 >>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
 >>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
 >>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
 >>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
 >>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
 >>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
 >>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
 >>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
 >>>>
 >>>>
 >>>>
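The trace above shows most of the write path spent in `StupidAllocator::allocate_int`, split between advancing a btree iterator and `_aligned_len`. A toy model of why such a scan can get slower as free space fragments is sketched below. It is not the StupidAllocator code: the linear first-fit walk, the alignment rule, and all sizes are assumptions made purely for illustration.

```python
def aligned_len(offset, length, alloc_unit):
    """Usable bytes of a free extent once its start is rounded up to alloc_unit."""
    skew = -offset % alloc_unit
    return length - skew if length > skew else 0


def first_fit(free_extents, want, alloc_unit):
    """Walk extents in offset order; return (aligned offset or None, extents visited)."""
    visited = 0
    for offset, length in free_extents:
        visited += 1
        if aligned_len(offset, length, alloc_unit) >= want:
            return offset + (-offset % alloc_unit), visited
    return None, visited


ALLOC_UNIT = 4096  # hypothetical allocation unit in bytes
# 10,000 misaligned 1 KiB holes followed by one large aligned extent.
holes = [(i * 8192 + 512, 1024) for i in range(10_000)]
big = (10_000 * 8192, 1 << 20)

print(first_fit([big], 65536, ALLOC_UNIT))          # fresh layout
print(first_fit(holes + [big], 65536, ALLOC_UNIT))  # fragmented layout
```

Both calls return the same offset, but the fragmented layout has to step over every unusable hole first, so each allocation pays one iterator increment and one aligned-length check per hole.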
 >>>> ----- Original Message -----
 >>>> From: "Alexandre Derumier" <aderumier@odiso.com>
 >>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
 >>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>>> Sent: Monday, 4 February 2019 09:38:11
 >>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>>
 >>>> Hi,
 >>>>
 >>>> Some news:
 >>>>
 >>>> I have tried different transparent hugepage values (madvise, never): no change.
 >>>>
 >>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
 >>>>
 >>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5 ms. (I need to wait a few more days to be sure.)
 >>>>
 >>>>
 >>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB);
 >>>> my other clusters use 1.6TB SSDs.
 >>>>
 >>>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe drive to see if it helps.
 >>>>
 >>>>
 >>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
 >>>>
 >>>>
 >>>> Regards,
 >>>>
 >>>> Alexandre
 >>>>
 >>>>
 >>>> ----- Original Message -----
 >>>> From: "aderumier" <aderumier@odiso.com>
 >>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
 >>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>>> Sent: Wednesday, 30 January 2019 19:58:15
 >>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>>
 >>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
 >>>>>> but not op_r_latency?
 >>>>>>
 >>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
 >>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.)
 >>>>
 >>>> I just don't see any latency difference on reads (or it is very small compared with the write latency increase).
 >>>>
 >>>>
 >>>>
 >>>> ----- Original Message -----
 >>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
 >>>> To: "aderumier" <aderumier@odiso.com>
 >>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>>> Sent: Wednesday, 30 January 2019 19:50:20
 >>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>>
 >>>> Hi,
 >>>>
 >>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
 >>>>> Hi Stefan,
 >>>>>
 >>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
 >>>>>>> as suggested. This report makes me a little nervous about my change.
 >>>>> Well, I'm really not sure that it's a tcmalloc bug.
 >>>>> It may be bluestore related (I don't have filestore anymore to compare).
 >>>>> I need to compare with bigger latencies.
 >>>>>
 >>>>> Here is an example: all OSDs were at 20-50 ms before the restart, then at 1 ms after the restart (at 21:15):
 >>>>> http://odisoweb1.odiso.net/latencybad.png
 >>>>>
 >>>>> I observe the latency in my guest VMs too, as disk iowait.
 >>>>>
 >>>>> http://odisoweb1.odiso.net/latencybadvm.png
 >>>>>
 >>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which
 >>>>>>> exact values out of the daemon do you use for bluestore?
 >>>>> Here are my InfluxDB queries.
 >>>>>
 >>>>> They take op_latency.sum/op_latency.avgcount over the last second.
 >>>>>
 >>>>>
 >>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
 >>>>>
 >>>>>
 >>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
 >>>>>
 >>>>>
 >>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
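
The queries above divide the rate of change of a counter's "sum" by the rate of change of its "avgcount", which gives the average latency of the ops completed in each interval. A minimal Python sketch of the same arithmetic on two consecutive perf dump samples follows. The function name and the sample values are hypothetical and only illustrate the calculation; returning None on a backwards counter is an assumption that roughly mirrors what non_negative_derivative does after an OSD restart or perf reset.

```python
def interval_avg_latency(prev, curr):
    """Average latency (seconds) of the ops completed between two samples.

    Each sample is a dict shaped like a perf dump latency entry:
    {"avgcount": <ops so far>, "sum": <total seconds so far>}.
    Returns None when no op completed in the interval, or when a counter
    went backwards (for example after an OSD restart or a perf reset).
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples taken one interval apart.
earlier = {"avgcount": 100, "sum": 0.1}
later = {"avgcount": 300, "sum": 0.5}
latency = interval_avg_latency(earlier, later)  # 0.4 s spread over 200 ops
```

Working on deltas rather than on the cumulative "avgtime" matters here, because the cumulative average since daemon start hides a slow drift in latency.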
 >>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
 >>>> but not op_r_latency?
 >>>>
 >>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
 >>>>
 >>>> greets,
 >>>> Stefan
 >>>>
 >>>>> ----- Original Message -----
 >>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
 >>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
 >>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>>>> Sent: Wednesday, 30 January 2019 08:45:33
 >>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
 >>>>>
 >>>>> Hi,
 >>>>>
 >>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
 >>>>>> Hi,
 >>>>>>
 >>>>>> Here are some new results,
 >>>>>> from a different OSD on a different cluster.
 >>>>>>
 >>>>>> Before the OSD restart, latency was between 2-5 ms;
 >>>>>> after the OSD restart, it is around 1-1.5 ms.
 >>>>>>
 >>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
 >>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
 >>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
 >>>>>>
 >>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
 >>>>>> (I'm using tcmalloc 2.5-2.2)
 >>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
 >>>>> as suggested. This report makes me a little nervous about my change.
 >>>>>
 >>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which
 >>>>> exact values out of the daemon do you use for bluestore?
 >>>>>
 >>>>> I would like to check whether I see the same behaviour.
 >>>>>
 >>>>> Greets,
 >>>>> Stefan
 >>>>>
 >>>>>> ----- Original Message -----
 >>>>>> From: "Sage Weil" <sage@newdream.net>
 >>>>>> To: "aderumier" <aderumier@odiso.com>
 >>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
 >>>>>> Sent: Friday, 25 January 2019 10:49:02
 >>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
 >>>>>>
 >>>>>> Can you capture a perf top or perf record to see where the CPU time is
 >>>>>> going on one of the OSDs with a high latency?
 >>>>>>
 >>>>>> Thanks!
 >>>>>> sage
 >>>>>>
 >>>>>>
 >>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
 >>>>>>
 >>>>>>> Hi,
 >>>>>>>
 >>>>>>> I see strange behaviour on my OSDs, on multiple clusters.
 >>>>>>>
 >>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives.
 >>>>>>> The workload is RBD only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
 >>>>>>>
 >>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1 ms.
 >>>>>>>
 >>>>>>> But over time, this latency increases slowly (maybe around 1 ms per day), until reaching crazy
 >>>>>>> values like 20-200 ms.
 >>>>>>>
 >>>>>>> Some example graphs:
 >>>>>>>
 >>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
 >>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
 >>>>>>>
 >>>>>>> All OSDs show this behaviour, in all clusters.
 >>>>>>>
 >>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.)
 >>>>>>>
 >>>>>>> And if I restart the OSD, the latency comes back to 0.5-1 ms.
 >>>>>>>
 >>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
 >>>>>>>
 >>>>>>> Any hints for counters/logs to check?
 >>>>>>>
 >>>>>>>
 >>>>>>> Regards,
 >>>>>>>
 >>>>>>> Alexandre
 >>>>>>>
 >>>>>>>
 >>>>>> _______________________________________________
 >>>>>> ceph-users mailing list
 >>>>>> ceph-users@lists.ceph.com
 >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 >>>>>>
 >>
 >
 >

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
> Hi Igor,
>
> Thanks again for helping !
>
>
>
> I upgraded to the latest mimic this weekend, and with the new memory autotuning
> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.)
>
>
> I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process (to see RSS memory) at different hours.
> Here are the reports for osd.0:
>
> http://odisoweb1.odiso.net/perfanalysis/
>
>
> The OSD was started on 12-02-2019 at 08:00.
>
> First report, after 1h of running:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>
>
>
> Report after 24h, before the counter reset:
>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>
> Report 1h after the counter reset:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>
>
>
>
> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00:
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
> After that, it slowly decreases.
>
>
> Another strange thing:
> I'm seeing total bytes at 5G in the 13:30 dump of 12-02-2019:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
> It then decreases over time (around 3.7G this morning), but RSS is still at 8G.
>
>
> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "Alexandre Derumier" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday, 11 February 2019 12:03:17
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
>> Another mempool dump, after a 1h run (latency OK).
>>
>> Biggest difference:
>>
>> before restart
>> -------------
>> "bluestore_cache_other": {
>> "items": 48661920,
>> "bytes": 1539544228
>> },
>> "bluestore_cache_data": {
>> "items": 54,
>> "bytes": 643072
>> },
>> (The other caches seem to be quite low too, as if bluestore_cache_other takes all the memory.)
>>
>>
>> After restart
>> -------------
>> "bluestore_cache_other": {
>> "items": 12432298,
>> "bytes": 500834899
>> },
>> "bluestore_cache_data": {
>> "items": 40084,
>> "bytes": 1056235520
>> },
>>
> This is fine as cache is warming after restart and some rebalancing
> between data and metadata might occur.
>
> What does relate to the allocator, and most probably to fragmentation growth, is:
>
> "bluestore_alloc": {
> "items": 165053952,
> "bytes": 165053952
> },
>
> which was higher before the restart (if I have the order of these dumps
> right):
>
> "bluestore_alloc": {
> "items": 210243456,
> "bytes": 210243456
> },
>
> But as I mentioned, I'm not 100% sure this could cause such a huge
> latency increase...
>
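
The comparison above can be automated by diffing two dump_mempools outputs pool by pool. The sketch below is a hedged illustration: the function name is hypothetical, the dump layout ("mempool" -> "by_pool" -> pool -> "bytes") is taken from the dumps quoted in this thread, and the two sample dictionaries contain only the three pools discussed here, with the byte values copied from those dumps.

```python
def mempool_byte_deltas(before, after):
    """Return (pool, after_bytes - before_bytes) pairs for pools whose size
    changed, largest absolute change first."""
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    deltas = {}
    for name in set(pools_before) | set(pools_after):
        b = pools_before.get(name, {}).get("bytes", 0)
        a = pools_after.get(name, {}).get("bytes", 0)
        if a != b:
            deltas[name] = a - b
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Excerpts of the two dumps quoted in this thread (bytes only).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"bytes": 210243456},
    "bluestore_cache_data": {"bytes": 643072},
    "bluestore_cache_other": {"bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"bytes": 165053952},
    "bluestore_cache_data": {"bytes": 1056235520},
    "bluestore_cache_other": {"bytes": 500834899},
}}}

for pool, delta in mempool_byte_deltas(before_restart, after_restart):
    print(f"{pool:24s} {delta:+d}")
```

Sorting by absolute change puts the cache rebalancing first and the smaller bluestore_alloc change last, which is why the allocator pool is easy to miss when reading the raw dumps.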
> Do you have a perf counters dump from after the restart?
>
> Could you collect some more dumps, for both mempool and perf counters?
>
> So ideally I'd like to have:
>
> 1) mempool/perf counters dumps after the restart (1 hour is OK)
>
> 2) mempool/perf counters dumps 24+ hours after the restart
>
> 3) reset perf counters after 2), wait for 1 hour (without an OSD
> restart) and dump mempool/perf counters again.
>
> That way we'll be able to learn both the allocator memory usage growth and
> the operation latency distribution for the following periods:
>
> a) the 1st hour after restart
>
> b) the 25th hour.
>
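
The three collection steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested tool: it assumes the admin socket commands "perf dump", "dump_mempools" and "perf reset all" are available through "ceph daemon" on the OSD host, and the helper names and output file naming are hypothetical.

```python
import subprocess


def dump_commands(osd_id):
    """Admin-socket commands for the two dumps requested at each step."""
    daemon = f"osd.{osd_id}"
    return {
        "perf": ["ceph", "daemon", daemon, "perf", "dump"],
        "dump_mempools": ["ceph", "daemon", daemon, "dump_mempools"],
    }


def reset_command(osd_id):
    """Command that zeroes the perf counters without restarting the OSD."""
    return ["ceph", "daemon", f"osd.{osd_id}", "perf", "reset", "all"]


def collect(osd_id, label, run=subprocess.check_output):
    """Run both dumps and save each one as osd.<id>.<label>.<kind>.txt.

    `run` is injectable so the function can be exercised without a cluster.
    Returns the list of files written.
    """
    written = []
    for kind, argv in dump_commands(osd_id).items():
        path = f"osd.{osd_id}.{label}.{kind}.txt"
        with open(path, "wb") as fh:
            fh.write(run(argv))
        written.append(path)
    return written


# Intended order on the OSD host (nothing is executed here):
#   collect(0, "after-restart-1h")                    # step 1
#   collect(0, "after-24h")                           # step 2
#   subprocess.check_call(reset_command(0))           # step 3: reset ...
#   ... wait one hour, then collect(0, "reset-plus-1h")
```

Resetting the counters in step 3 makes the final dump cover only the 25th hour, so its averages can be compared directly with the first-hour dump.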
>
> Thanks,
>
> Igor
>
>
>> full mempool dump after restart
>> -------------------------------
>>
>> {
>> "mempool": {
>> "by_pool": {
>> "bloom_filter": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_alloc": {
>> "items": 165053952,
>> "bytes": 165053952
>> },
>> "bluestore_cache_data": {
>> "items": 40084,
>> "bytes": 1056235520
>> },
>> "bluestore_cache_onode": {
>> "items": 22225,
>> "bytes": 14935200
>> },
>> "bluestore_cache_other": {
>> "items": 12432298,
>> "bytes": 500834899
>> },
>> "bluestore_fsck": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_txc": {
>> "items": 11,
>> "bytes": 8184
>> },
>> "bluestore_writing_deferred": {
>> "items": 5047,
>> "bytes": 22673736
>> },
>> "bluestore_writing": {
>> "items": 91,
>> "bytes": 1662976
>> },
>> "bluefs": {
>> "items": 1907,
>> "bytes": 95600
>> },
>> "buffer_anon": {
>> "items": 19664,
>> "bytes": 25486050
>> },
>> "buffer_meta": {
>> "items": 46189,
>> "bytes": 2956096
>> },
>> "osd": {
>> "items": 243,
>> "bytes": 3089016
>> },
>> "osd_mapbl": {
>> "items": 17,
>> "bytes": 214366
>> },
>> "osd_pglog": {
>> "items": 889673,
>> "bytes": 367160400
>> },
>> "osdmap": {
>> "items": 3803,
>> "bytes": 224552
>> },
>> "osdmap_mapping": {
>> "items": 0,
>> "bytes": 0
>> },
>> "pgmap": {
>> "items": 0,
>> "bytes": 0
>> },
>> "mds_co": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_1": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_2": {
>> "items": 0,
>> "bytes": 0
>> }
>> },
>> "total": {
>> "items": 178515204,
>> "bytes": 2160630547
>> }
>> }
>> }
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Friday, 8 February 2019 16:14:54
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> I'm just seeing
>>
>> StupidAllocator::_aligned_len
>> and
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>
>> on 1 OSD, both at 10%.
>>
>> Here is the dump_mempools output:
>>
>> {
>> "mempool": {
>> "by_pool": {
>> "bloom_filter": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_alloc": {
>> "items": 210243456,
>> "bytes": 210243456
>> },
>> "bluestore_cache_data": {
>> "items": 54,
>> "bytes": 643072
>> },
>> "bluestore_cache_onode": {
>> "items": 105637,
>> "bytes": 70988064
>> },
>> "bluestore_cache_other": {
>> "items": 48661920,
>> "bytes": 1539544228
>> },
>> "bluestore_fsck": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_txc": {
>> "items": 12,
>> "bytes": 8928
>> },
>> "bluestore_writing_deferred": {
>> "items": 406,
>> "bytes": 4792868
>> },
>> "bluestore_writing": {
>> "items": 66,
>> "bytes": 1085440
>> },
>> "bluefs": {
>> "items": 1882,
>> "bytes": 93600
>> },
>> "buffer_anon": {
>> "items": 138986,
>> "bytes": 24983701
>> },
>> "buffer_meta": {
>> "items": 544,
>> "bytes": 34816
>> },
>> "osd": {
>> "items": 243,
>> "bytes": 3089016
>> },
>> "osd_mapbl": {
>> "items": 36,
>> "bytes": 179308
>> },
>> "osd_pglog": {
>> "items": 952564,
>> "bytes": 372459684
>> },
>> "osdmap": {
>> "items": 3639,
>> "bytes": 224664
>> },
>> "osdmap_mapping": {
>> "items": 0,
>> "bytes": 0
>> },
>> "pgmap": {
>> "items": 0,
>> "bytes": 0
>> },
>> "mds_co": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_1": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_2": {
>> "items": 0,
>> "bytes": 0
>> }
>> },
>> "total": {
>> "items": 260109445,
>> "bytes": 2228370845
>> }
>> }
>> }
>>
>>
>> And the perf dump:
>>
>> root@ceph5-2:~# ceph daemon osd.4 perf dump
>> {
>> "AsyncMessenger::Worker-0": {
>> "msgr_recv_messages": 22948570,
>> "msgr_send_messages": 22561570,
>> "msgr_recv_bytes": 333085080271,
>> "msgr_send_bytes": 261798871204,
>> "msgr_created_connections": 6152,
>> "msgr_active_connections": 2701,
>> "msgr_running_total_time": 1055.197867330,
>> "msgr_running_send_time": 352.764480121,
>> "msgr_running_recv_time": 499.206831955,
>> "msgr_running_fast_dispatch_time": 130.982201607
>> },
>> "AsyncMessenger::Worker-1": {
>> "msgr_recv_messages": 18801593,
>> "msgr_send_messages": 18430264,
>> "msgr_recv_bytes": 306871760934,
>> "msgr_send_bytes": 192789048666,
>> "msgr_created_connections": 5773,
>> "msgr_active_connections": 2721,
>> "msgr_running_total_time": 816.821076305,
>> "msgr_running_send_time": 261.353228926,
>> "msgr_running_recv_time": 394.035587911,
>> "msgr_running_fast_dispatch_time": 104.012155720
>> },
>> "AsyncMessenger::Worker-2": {
>> "msgr_recv_messages": 18463400,
>> "msgr_send_messages": 18105856,
>> "msgr_recv_bytes": 187425453590,
>> "msgr_send_bytes": 220735102555,
>> "msgr_created_connections": 5897,
>> "msgr_active_connections": 2605,
>> "msgr_running_total_time": 807.186854324,
>> "msgr_running_send_time": 296.834435839,
>> "msgr_running_recv_time": 351.364389691,
>> "msgr_running_fast_dispatch_time": 101.215776792
>> },
>> "bluefs": {
>> "gift_bytes": 0,
>> "reclaim_bytes": 0,
>> "db_total_bytes": 256050724864,
>> "db_used_bytes": 12413042688,
>> "wal_total_bytes": 0,
>> "wal_used_bytes": 0,
>> "slow_total_bytes": 0,
>> "slow_used_bytes": 0,
>> "num_files": 209,
>> "log_bytes": 10383360,
>> "log_compactions": 14,
>> "logged_bytes": 336498688,
>> "files_written_wal": 2,
>> "files_written_sst": 4499,
>> "bytes_written_wal": 417989099783,
>> "bytes_written_sst": 213188750209
>> },
>> "bluestore": {
>> "kv_flush_lat": {
>> "avgcount": 26371957,
>> "sum": 26.734038497,
>> "avgtime": 0.000001013
>> },
>> "kv_commit_lat": {
>> "avgcount": 26371957,
>> "sum": 3397.491150603,
>> "avgtime": 0.000128829
>> },
>> "kv_lat": {
>> "avgcount": 26371957,
>> "sum": 3424.225189100,
>> "avgtime": 0.000129843
>> },
>> "state_prepare_lat": {
>> "avgcount": 30484924,
>> "sum": 3689.542105337,
>> "avgtime": 0.000121028
>> },
>> "state_aio_wait_lat": {
>> "avgcount": 30484924,
>> "sum": 509.864546111,
>> "avgtime": 0.000016725
>> },
>> "state_io_done_lat": {
>> "avgcount": 30484924,
>> "sum": 24.534052953,
>> "avgtime": 0.000000804
>> },
>> "state_kv_queued_lat": {
>> "avgcount": 30484924,
>> "sum": 3488.338424238,
>> "avgtime": 0.000114428
>> },
>> "state_kv_commiting_lat": {
>> "avgcount": 30484924,
>> "sum": 5660.437003432,
>> "avgtime": 0.000185679
>> },
>> "state_kv_done_lat": {
>> "avgcount": 30484924,
>> "sum": 7.763511500,
>> "avgtime": 0.000000254
>> },
>> "state_deferred_queued_lat": {
>> "avgcount": 26346134,
>> "sum": 666071.296856696,
>> "avgtime": 0.025281557
>> },
>> "state_deferred_aio_wait_lat": {
>> "avgcount": 26346134,
>> "sum": 1755.660547071,
>> "avgtime": 0.000066638
>> },
>> "state_deferred_cleanup_lat": {
>> "avgcount": 26346134,
>> "sum": 185465.151653703,
>> "avgtime": 0.007039558
>> },
>> "state_finishing_lat": {
>> "avgcount": 30484920,
>> "sum": 3.046847481,
>> "avgtime": 0.000000099
>> },
>> "state_done_lat": {
>> "avgcount": 30484920,
>> "sum": 13193.362685280,
>> "avgtime": 0.000432783
>> },
>> "throttle_lat": {
>> "avgcount": 30484924,
>> "sum": 14.634269979,
>> "avgtime": 0.000000480
>> },
>> "submit_lat": {
>> "avgcount": 30484924,
>> "sum": 3873.883076148,
>> "avgtime": 0.000127075
>> },
>> "commit_lat": {
>> "avgcount": 30484924,
>> "sum": 13376.492317331,
>> "avgtime": 0.000438790
>> },
>> "read_lat": {
>> "avgcount": 5873923,
>> "sum": 1817.167582057,
>> "avgtime": 0.000309361
>> },
>> "read_onode_meta_lat": {
>> "avgcount": 19608201,
>> "sum": 146.770464482,
>> "avgtime": 0.000007485
>> },
>> "read_wait_aio_lat": {
>> "avgcount": 13734278,
>> "sum": 2532.578077242,
>> "avgtime": 0.000184398
>> },
>> "compress_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "decompress_lat": {
>> "avgcount": 1346945,
>> "sum": 26.227575896,
>> "avgtime": 0.000019471
>> },
>> "csum_lat": {
>> "avgcount": 28020392,
>> "sum": 149.587819041,
>> "avgtime": 0.000005338
>> },
>> "compress_success_count": 0,
>> "compress_rejected_count": 0,
>> "write_pad_bytes": 352923605,
>> "deferred_write_ops": 24373340,
>> "deferred_write_bytes": 216791842816,
>> "write_penalty_read_ops": 8062366,
>> "bluestore_allocated": 3765566013440,
>> "bluestore_stored": 4186255221852,
>> "bluestore_compressed": 39981379040,
>> "bluestore_compressed_allocated": 73748348928,
>> "bluestore_compressed_original": 165041381376,
>> "bluestore_onodes": 104232,
>> "bluestore_onode_hits": 71206874,
>> "bluestore_onode_misses": 1217914,
>> "bluestore_onode_shard_hits": 260183292,
>> "bluestore_onode_shard_misses": 22851573,
>> "bluestore_extents": 3394513,
>> "bluestore_blobs": 2773587,
>> "bluestore_buffers": 0,
>> "bluestore_buffer_bytes": 0,
>> "bluestore_buffer_hit_bytes": 62026011221,
>> "bluestore_buffer_miss_bytes": 995233669922,
>> "bluestore_write_big": 5648815,
>> "bluestore_write_big_bytes": 552502214656,
>> "bluestore_write_big_blobs": 12440992,
>> "bluestore_write_small": 35883770,
>> "bluestore_write_small_bytes": 223436965719,
>> "bluestore_write_small_unused": 408125,
>> "bluestore_write_small_deferred": 34961455,
>> "bluestore_write_small_pre_read": 34961455,
>> "bluestore_write_small_new": 514190,
>> "bluestore_txc": 30484924,
>> "bluestore_onode_reshard": 5144189,
>> "bluestore_blob_split": 60104,
>> "bluestore_extent_compress": 53347252,
>> "bluestore_gc_merged": 21142528,
>> "bluestore_read_eio": 0,
>> "bluestore_fragmentation_micros": 67
>> },
>> "finisher-defered_finisher": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "finisher-finisher-0": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 26625163,
>> "sum": 1057.506990951,
>> "avgtime": 0.000039718
>> }
>> },
>> "finisher-objecter-finisher-0": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.0::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.0::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.1::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.1::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.2::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.2::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.3::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.3::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.4::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.4::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.5::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.5::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.6::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.6::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.7::sdata_wait_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "mutex-OSDShard.7::shard_lock": {
>> "wait": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "objecter": {
>> "op_active": 0,
>> "op_laggy": 0,
>> "op_send": 0,
>> "op_send_bytes": 0,
>> "op_resend": 0,
>> "op_reply": 0,
>> "op": 0,
>> "op_r": 0,
>> "op_w": 0,
>> "op_rmw": 0,
>> "op_pg": 0,
>> "osdop_stat": 0,
>> "osdop_create": 0,
>> "osdop_read": 0,
>> "osdop_write": 0,
>> "osdop_writefull": 0,
>> "osdop_writesame": 0,
>> "osdop_append": 0,
>> "osdop_zero": 0,
>> "osdop_truncate": 0,
>> "osdop_delete": 0,
>> "osdop_mapext": 0,
>> "osdop_sparse_read": 0,
>> "osdop_clonerange": 0,
>> "osdop_getxattr": 0,
>> "osdop_setxattr": 0,
>> "osdop_cmpxattr": 0,
>> "osdop_rmxattr": 0,
>> "osdop_resetxattrs": 0,
>> "osdop_tmap_up": 0,
>> "osdop_tmap_put": 0,
>> "osdop_tmap_get": 0,
>> "osdop_call": 0,
>> "osdop_watch": 0,
>> "osdop_notify": 0,
>> "osdop_src_cmpxattr": 0,
>> "osdop_pgls": 0,
>> "osdop_pgls_filter": 0,
>> "osdop_other": 0,
>> "linger_active": 0,
>> "linger_send": 0,
>> "linger_resend": 0,
>> "linger_ping": 0,
>> "poolop_active": 0,
>> "poolop_send": 0,
>> "poolop_resend": 0,
>> "poolstat_active": 0,
>> "poolstat_send": 0,
>> "poolstat_resend": 0,
>> "statfs_active": 0,
>> "statfs_send": 0,
>> "statfs_resend": 0,
>> "command_active": 0,
>> "command_send": 0,
>> "command_resend": 0,
>> "map_epoch": 105913,
>> "map_full": 0,
>> "map_inc": 828,
>> "osd_sessions": 0,
>> "osd_session_open": 0,
>> "osd_session_close": 0,
>> "osd_laggy": 0,
>> "omap_wr": 0,
>> "omap_rd": 0,
>> "omap_del": 0
>> },
>> "osd": {
>> "op_wip": 0,
>> "op": 16758102,
>> "op_in_bytes": 238398820586,
>> "op_out_bytes": 165484999463,
>> "op_latency": {
>> "avgcount": 16758102,
>> "sum": 38242.481640842,
>> "avgtime": 0.002282029
>> },
>> "op_process_latency": {
>> "avgcount": 16758102,
>> "sum": 28644.906310687,
>> "avgtime": 0.001709316
>> },
>> "op_prepare_latency": {
>> "avgcount": 16761367,
>> "sum": 3489.856599934,
>> "avgtime": 0.000208208
>> },
>> "op_r": 6188565,
>> "op_r_out_bytes": 165484999463,
>> "op_r_latency": {
>> "avgcount": 6188565,
>> "sum": 4507.365756792,
>> "avgtime": 0.000728337
>> },
>> "op_r_process_latency": {
>> "avgcount": 6188565,
>> "sum": 942.363063429,
>> "avgtime": 0.000152274
>> },
>> "op_r_prepare_latency": {
>> "avgcount": 6188644,
>> "sum": 982.866710389,
>> "avgtime": 0.000158817
>> },
>> "op_w": 10546037,
>> "op_w_in_bytes": 238334329494,
>> "op_w_latency": {
>> "avgcount": 10546037,
>> "sum": 33160.719998316,
>> "avgtime": 0.003144377
>> },
>> "op_w_process_latency": {
>> "avgcount": 10546037,
>> "sum": 27668.702029030,
>> "avgtime": 0.002623611
>> },
>> "op_w_prepare_latency": {
>> "avgcount": 10548652,
>> "sum": 2499.688609173,
>> "avgtime": 0.000236967
>> },
>> "op_rw": 23500,
>> "op_rw_in_bytes": 64491092,
>> "op_rw_out_bytes": 0,
>> "op_rw_latency": {
>> "avgcount": 23500,
>> "sum": 574.395885734,
>> "avgtime": 0.024442378
>> },
>> "op_rw_process_latency": {
>> "avgcount": 23500,
>> "sum": 33.841218228,
>> "avgtime": 0.001440051
>> },
>> "op_rw_prepare_latency": {
>> "avgcount": 24071,
>> "sum": 7.301280372,
>> "avgtime": 0.000303322
>> },
>> "op_before_queue_op_lat": {
>> "avgcount": 57892986,
>> "sum": 1502.117718889,
>> "avgtime": 0.000025946
>> },
>> "op_before_dequeue_op_lat": {
>> "avgcount": 58091683,
>> "sum": 45194.453254037,
>> "avgtime": 0.000777984
>> },
>> "subop": 19784758,
>> "subop_in_bytes": 547174969754,
>> "subop_latency": {
>> "avgcount": 19784758,
>> "sum": 13019.714424060,
>> "avgtime": 0.000658067
>> },
>> "subop_w": 19784758,
>> "subop_w_in_bytes": 547174969754,
>> "subop_w_latency": {
>> "avgcount": 19784758,
>> "sum": 13019.714424060,
>> "avgtime": 0.000658067
>> },
>> "subop_pull": 0,
>> "subop_pull_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "subop_push": 0,
>> "subop_push_in_bytes": 0,
>> "subop_push_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "pull": 0,
>> "push": 2003,
>> "push_out_bytes": 5560009728,
>> "recovery_ops": 1940,
>> "loadavg": 118,
>> "buffer_bytes": 0,
>> "history_alloc_Mbytes": 0,
>> "history_alloc_num": 0,
>> "cached_crc": 0,
>> "cached_crc_adjusted": 0,
>> "missed_crc": 0,
>> "numpg": 243,
>> "numpg_primary": 82,
>> "numpg_replica": 161,
>> "numpg_stray": 0,
>> "numpg_removing": 0,
>> "heartbeat_to_peers": 10,
>> "map_messages": 7013,
>> "map_message_epochs": 7143,
>> "map_message_epoch_dups": 6315,
>> "messages_delayed_for_map": 0,
>> "osd_map_cache_hit": 203309,
>> "osd_map_cache_miss": 33,
>> "osd_map_cache_miss_low": 0,
>> "osd_map_cache_miss_low_avg": {
>> "avgcount": 0,
>> "sum": 0
>> },
>> "osd_map_bl_cache_hit": 47012,
>> "osd_map_bl_cache_miss": 1681,
>> "stat_bytes": 6401248198656,
>> "stat_bytes_used": 3777979072512,
>> "stat_bytes_avail": 2623269126144,
>> "copyfrom": 0,
>> "tier_promote": 0,
>> "tier_flush": 0,
>> "tier_flush_fail": 0,
>> "tier_try_flush": 0,
>> "tier_try_flush_fail": 0,
>> "tier_evict": 0,
>> "tier_whiteout": 1631,
>> "tier_dirty": 22360,
>> "tier_clean": 0,
>> "tier_delay": 0,
>> "tier_proxy_read": 0,
>> "tier_proxy_write": 0,
>> "agent_wake": 0,
>> "agent_skip": 0,
>> "agent_flush": 0,
>> "agent_evict": 0,
>> "object_ctx_cache_hit": 16311156,
>> "object_ctx_cache_total": 17426393,
>> "op_cache_hit": 0,
>> "osd_tier_flush_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_tier_promote_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_tier_r_lat": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "osd_pg_info": 30483113,
>> "osd_pg_fastinfo": 29619885,
>> "osd_pg_biginfo": 81703
>> },
>> "recoverystate_perf": {
>> "initial_latency": {
>> "avgcount": 243,
>> "sum": 6.869296500,
>> "avgtime": 0.028268709
>> },
>> "started_latency": {
>> "avgcount": 1125,
>> "sum": 13551384.917335850,
>> "avgtime": 12045.675482076
>> },
>> "reset_latency": {
>> "avgcount": 1368,
>> "sum": 1101.727799040,
>> "avgtime": 0.805356578
>> },
>> "start_latency": {
>> "avgcount": 1368,
>> "sum": 0.002014799,
>> "avgtime": 0.000001472
>> },
>> "primary_latency": {
>> "avgcount": 507,
>> "sum": 4575560.638823428,
>> "avgtime": 9024.774435549
>> },
>> "peering_latency": {
>> "avgcount": 550,
>> "sum": 499.372283616,
>> "avgtime": 0.907949606
>> },
>> "backfilling_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "waitremotebackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "waitlocalbackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "notbackfilling_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "repnotrecovering_latency": {
>> "avgcount": 1009,
>> "sum": 8975301.082274411,
>> "avgtime": 8895.243887288
>> },
>> "repwaitrecoveryreserved_latency": {
>> "avgcount": 420,
>> "sum": 99.846056520,
>> "avgtime": 0.237728706
>> },
>> "repwaitbackfillreserved_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "reprecovering_latency": {
>> "avgcount": 420,
>> "sum": 241.682764382,
>> "avgtime": 0.575435153
>> },
>> "activating_latency": {
>> "avgcount": 507,
>> "sum": 16.893347339,
>> "avgtime": 0.033320211
>> },
>> "waitlocalrecoveryreserved_latency": {
>> "avgcount": 199,
>> "sum": 672.335512769,
>> "avgtime": 3.378570415
>> },
>> "waitremoterecoveryreserved_latency": {
>> "avgcount": 199,
>> "sum": 213.536439363,
>> "avgtime": 1.073047433
>> },
>> "recovering_latency": {
>> "avgcount": 199,
>> "sum": 79.007696479,
>> "avgtime": 0.397023600
>> },
>> "recovered_latency": {
>> "avgcount": 507,
>> "sum": 14.000732748,
>> "avgtime": 0.027614857
>> },
>> "clean_latency": {
>> "avgcount": 395,
>> "sum": 4574325.900371083,
>> "avgtime": 11580.571899673
>> },
>> "active_latency": {
>> "avgcount": 425,
>> "sum": 4575107.630123680,
>> "avgtime": 10764.959129702
>> },
>> "replicaactive_latency": {
>> "avgcount": 589,
>> "sum": 8975184.499049954,
>> "avgtime": 15238.004242869
>> },
>> "stray_latency": {
>> "avgcount": 818,
>> "sum": 800.729455666,
>> "avgtime": 0.978886865
>> },
>> "getinfo_latency": {
>> "avgcount": 550,
>> "sum": 15.085667048,
>> "avgtime": 0.027428485
>> },
>> "getlog_latency": {
>> "avgcount": 546,
>> "sum": 3.482175693,
>> "avgtime": 0.006377611
>> },
>> "waitactingchange_latency": {
>> "avgcount": 39,
>> "sum": 35.444551284,
>> "avgtime": 0.908834648
>> },
>> "incomplete_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "down_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "getmissing_latency": {
>> "avgcount": 507,
>> "sum": 6.702129624,
>> "avgtime": 0.013219190
>> },
>> "waitupthru_latency": {
>> "avgcount": 507,
>> "sum": 474.098261727,
>> "avgtime": 0.935105052
>> },
>> "notrecovering_latency": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> },
>> "rocksdb": {
>> "get": 28320977,
>> "submit_transaction": 30484924,
>> "submit_transaction_sync": 26371957,
>> "get_latency": {
>> "avgcount": 28320977,
>> "sum": 325.900908733,
>> "avgtime": 0.000011507
>> },
>> "submit_latency": {
>> "avgcount": 30484924,
>> "sum": 1835.888692371,
>> "avgtime": 0.000060222
>> },
>> "submit_sync_latency": {
>> "avgcount": 26371957,
>> "sum": 1431.555230628,
>> "avgtime": 0.000054283
>> },
>> "compact": 0,
>> "compact_range": 0,
>> "compact_queue_merge": 0,
>> "compact_queue_len": 0,
>> "rocksdb_write_wal_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_memtable_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_delay_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> },
>> "rocksdb_write_pre_and_post_time": {
>> "avgcount": 0,
>> "sum": 0.000000000,
>> "avgtime": 0.000000000
>> }
>> }
>> }
>>
>> ----- Original message -----
>> From: "Igor Fedotov" <ifedotov@suse.de>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Tuesday, 5 February 2019 18:56:51
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>> (but I have bluestore_fragmentation_micros)
>>> ok, this is the same
>>>
>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>
>>>
>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>>>
>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>> it? Is it the same for other OSDs?
>>
>> This proves there is some issue with the allocator: fragmentation may
>> generally grow, but it shouldn't reset on restart. It looks like some
>> intervals aren't properly merged at run time.
>>
>> On the other hand, I'm not completely sure that the latency degradation is
>> caused by that. The fragmentation growth is relatively small, and I don't
>> see how it could impact performance that much.
>>
>> I'm wondering if you have OSD mempool monitoring reports (dump_mempools
>> command output on the admin socket). Do you have any historic data?
>>
>> If not, may I have the current output and, say, a couple more samples at
>> 8-12 hour intervals?
>>
>>
>> W.r.t. backporting the bitmap allocator to mimic: we haven't had such plans
>> so far, but I'll discuss this at the BlueStore meeting shortly.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>> ----- Original message -----
>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 4 February 2019 16:04:38
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Thanks Igor,
>>>
>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>> startup and once you get high latency.
>>>>>
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>> I'm already monitoring with
>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>>
>>> but I don't see l_bluestore_fragmentation counter.
>>>
>>> (but I have bluestore_fragmentation_micros)
>>>
>>>
>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>> patch to track latency and some other internal allocator parameters to
>>>>> make sure it's degraded and learn more details.
>>> Sorry, It's a critical production cluster, I can't test on it :(
>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce.
>>>
>>>
>>>
>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>> and try the difference...
>>> Any plan to backport it to mimic ? (But I can wait for Nautilus)
>>> perf results of new bitmap allocator seem very promising from what I've seen in PR.
>>>
>>>
>>>
>>> ----- Original message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 4 February 2019 15:51:30
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi Alexandre,
>>>
>>> looks like a bug in StupidAllocator.
>>>
>>> Could you please collect BlueStore performance counters right after OSD
>>> startup and once you get high latency.
>>>
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>
>>> Also if you're able to rebuild the code I can probably make a simple
>>> patch to track latency and some other internal allocator parameters to
>>> make sure it's degraded and learn more details.
>>>
>>>
>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>> and try the difference...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>>>> Hi again,
>>>>
>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>
>>>>
>>>> I have noticed something using a simple "perf top".
>>>>
>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>> when latency is bad, perf top gives me:
>>>>
>>>> StupidAllocator::_aligned_len
>>>> and
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>
>>>> (around 10-20% time for both)
>>>>
>>>>
>>>> when latency is good, I don't see them at all.
>>>>
>>>>
>>>> I have used Mark's wallclock profiler; here are the results:
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>>
>>>>
>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
>>>>
>>>>
>>>> + 100.00% clone
>>>> + 100.00% start_thread
>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>
>>>>
>>>>
>>>> ----- Original message -----
>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 4 February 2019 09:38:11
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> some news:
>>>>
>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>
>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>
>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure)
>>>>
>>>>
>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB);
>>>> my other clusters use 1.6TB SSDs.
>>>>
>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try 2 OSDs per NVMe this week, to see if it helps.
>>>>
>>>>
>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>>
>>>>
>>>> ----- Original message -----
>>>> From: "aderumier" <aderumier@odiso.com>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>>>> op_r_latency but instead op_latency?
>>>>>>
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs).
>>>>
>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase)
>>>>
>>>>
>>>>
>>>> ----- Original message -----
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>> Hi Stefan,
>>>>>
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>> I need to compare with bigger latencies.
>>>>>
>>>>> Here is an example, with all OSDs at 20-50ms before restart, then 1ms after restart (at 21:15):
>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>
>>>>> I observe the latency in my guest VMs too, in disk iowait.
>>>>>
>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>> Here are my influxdb queries.
>>>>>
>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>
>>>>>
>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>> op_r_latency but instead op_latency?
>>>>
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>>
>>>> greets,
>>>> Stefan
>>>>
>>>>>
>>>>> ----- Original message -----
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>> Hi,
>>>>>>
>>>>>> here some new results,
>>>>>> different osd/ different cluster
>>>>>>
>>>>>> before osd restart latency was between 2-5ms
>>>>>> after osd restart is around 1-1.5ms
>>>>>>
>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>
>>>>>>  From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>
>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>> exact values out of the daemon do you use for bluestore?
>>>>>
>>>>> I would like to check if i see the same behaviour.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>> ----- Original message -----
>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>> going on one of the OSDs with a high latency?
>>>>>>
>>>>>> Thanks!
>>>>>> sage
>>>>>>
>>>>>>
>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am seeing strange behaviour on my OSDs, on multiple clusters.
>>>>>>>
>>>>>>> All clusters are running mimic 13.2.1 with bluestore, on SSD or NVMe drives.
>>>>>>> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>
>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>
>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>> values like 20-200ms.
>>>>>>>
>>>>>>> Some example graphs:
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>
>>>>>>> All OSDs show this behaviour, in all clusters.
>>>>>>>
>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded)
>>>>>>>
>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms.
>>>>>>>
>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>
>>>>>>> Any hints for counters/logs to check?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Alexandre
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <f97b81e4-265d-cd8e-3053-321d988720c4-l3A5Bk7waGM@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                         ` <f97b81e4-265d-cd8e-3053-321d988720c4-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-15 13:31                                                                           ` Alexandre DERUMIER
       [not found]                                                                             ` <19368722.1223708.1550237472044.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-15 13:31 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Thanks Igor.

I'll try creating multiple OSDs per NVMe disk (6TB) to see if the behaviour is different.

I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there.

----- Original message -----
From: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Friday, 15 February 2019 13:47:57
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexandre,

I've read through your reports, nothing obvious so far. 

I can only see a several-fold increase in average latency for OSD write ops
(in seconds):
0.002040060 (first hour) vs.
0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, op_r_latency:
0.000408595 (first hour) vs.
0.000709031 (last 24 hours) vs.
0.004979540 (last hour)

What's interesting is that such latency differences aren't observed at
either the BlueStore level (any _lat params under the "bluestore" section)
or the rocksdb one.

Which probably means that the issue is somewhere above BlueStore.

I suggest continuing to collect perf dumps to see if the picture
stays the same.
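
For reference, the per-period averages compared above can be derived from two perf dump samples of the same latency counter. This is only a minimal sketch in Python, assuming each counter is a dict with the "avgcount" and "sum" fields seen in the perf dumps in this thread; the function name and the two-sample approach are illustrative assumptions, not part of Ceph:

```python
def interval_avg_latency(earlier, later):
    """Average latency in seconds of the ops completed between two samples.

    Both arguments are latency counters as they appear in a perf dump,
    e.g. {"avgcount": 30484924, "sum": 13376.492317331}.
    Returns None when no op completed in between (or the counter was reset).
    """
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        return None
    return (later["sum"] - earlier["sum"]) / ops


# A freshly reset counter is equivalent to an all-zero earlier sample.
zero = {"avgcount": 0, "sum": 0.0}

# bluestore commit_lat values copied from the osd.4 perf dump in this thread.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}
since_reset = interval_avg_latency(zero, commit_lat)

# Hypothetical pair of samples: 2000 ops took 10 s in total between them.
hourly = interval_avg_latency({"avgcount": 1000, "sum": 2.0},
                              {"avgcount": 3000, "sum": 12.0})
```

With an all-zero earlier sample the result matches the "avgtime" field of the dump; with two real samples it gives the average for just that window, which is what a "last hour" figure needs.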

W.r.t. the memory usage you observed, I see nothing suspicious so far: RSS
not decreasing is a known artifact that seems to be harmless.

Thanks, 
Igor 

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
> Hi Igor, 
> 
> Thanks again for helping ! 
> 
> 
> 
> I upgraded to the latest mimic this weekend, and with the new memory autotuning,
> I have set osd_memory_target to 8G. (my NVMe drives are 6TB)
>
>
> I have taken many perf dumps, mempool dumps and ps snapshots of the process
> (to see RSS memory) at different hours.
> Here are the reports for osd.0:
> 
> http://odisoweb1.odiso.net/perfanalysis/ 
> 
> 
> The OSD was started on 12-02-2019 at 08:00.
>
> First report, after 1h running:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
> 
> 
> 
> Report after 24h, before the counter reset:
>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
> 
> Report 1h after the counter reset:
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
> 
> 
> 
> 
> I see the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00,
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
> then slowly decreasing after that.
> 
> 
> Another strange thing:
> I see total bytes at 5G at 12-02-2018.13:30
>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.
>
>
> I have been graphing the mempool counters too since yesterday, so I'll be able to
> track them over time.
> 
> ----- Original message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> To: "Alexandre Derumier" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday, 11 February 2019 12:03:17
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>> another mempool dump after 1h run. (latency ok) 
>> 
>> Biggest difference: 
>> 
>> before restart 
>> ------------- 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory)
>> 
>> 
>> After restart 
>> ------------- 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> 
> This is fine, as the cache is warming up after the restart and some
> rebalancing between data and metadata might occur.
>
> What does relate to the allocator, and most probably to fragmentation growth, is:
> 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes": 165053952 
> }, 
> 
> which had been higher before the reset (if I got the order of these dumps
> right):
> 
> "bluestore_alloc": { 
> "items": 210243456, 
> "bytes": 210243456 
> }, 
> 
> But as I mentioned - I'm not 100% sure this might cause such a huge 
> latency increase... 
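
The comparison above can be done mechanically over whole dumps. This is only a minimal sketch in Python, assuming both inputs have the dump_mempools layout shown in this thread ("mempool" -> "by_pool" -> pool name -> "bytes"); the function name is hypothetical, and the sample data is limited to the three pools whose values are quoted in this thread:

```python
def mempool_byte_deltas(fresh, aged):
    """Return {pool: aged_bytes - fresh_bytes}, largest absolute change first.

    Only pools present in both dumps are compared.
    """
    fresh_pools = fresh["mempool"]["by_pool"]
    aged_pools = aged["mempool"]["by_pool"]
    deltas = {
        name: aged_pools[name]["bytes"] - fresh_pools[name]["bytes"]
        for name in fresh_pools
        if name in aged_pools
    }
    return dict(sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Values copied from the "after restart" dump in this thread.
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

# Values copied from the "before restart" dump in this thread.
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}

deltas = mempool_byte_deltas(after_restart, before_restart)
for pool, delta in deltas.items():
    print(f"{pool}: {delta:+d} bytes")
```

Sorting by absolute change puts the cache pools first, so the much smaller bluestore_alloc difference has to be looked for explicitly.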
> 
> Do you have perf counters dump after the restart? 
> 
> Could you collect some more dumps - for both mempool and perf counters? 
> 
> So ideally I'd like to have: 
> 
> 1) mempool/perf counters dumps after the restart (1hour is OK) 
> 
> 2) mempool/perf counters dumps in 24+ hours after restart 
> 
> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
> restart) and dump mempool/perf counters again. 
> 
> So we'll be able to learn both allocator mem usage growth and operation 
> latency distribution for the following periods: 
> 
> a) 1st hour after restart 
> 
> b) 25th hour. 
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
>> full mempool dump after restart 
>> ------------------------------- 
>> 
>> { 
>> "mempool": { 
>> "by_pool": { 
>> "bloom_filter": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> "bluestore_cache_onode": { 
>> "items": 22225, 
>> "bytes": 14935200 
>> }, 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_fsck": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_txc": { 
>> "items": 11, 
>> "bytes": 8184 
>> }, 
>> "bluestore_writing_deferred": { 
>> "items": 5047, 
>> "bytes": 22673736 
>> }, 
>> "bluestore_writing": { 
>> "items": 91, 
>> "bytes": 1662976 
>> }, 
>> "bluefs": { 
>> "items": 1907, 
>> "bytes": 95600 
>> }, 
>> "buffer_anon": { 
>> "items": 19664, 
>> "bytes": 25486050 
>> }, 
>> "buffer_meta": { 
>> "items": 46189, 
>> "bytes": 2956096 
>> }, 
>> "osd": { 
>> "items": 243, 
>> "bytes": 3089016 
>> }, 
>> "osd_mapbl": { 
>> "items": 17, 
>> "bytes": 214366 
>> }, 
>> "osd_pglog": { 
>> "items": 889673, 
>> "bytes": 367160400 
>> }, 
>> "osdmap": { 
>> "items": 3803, 
>> "bytes": 224552 
>> }, 
>> "osdmap_mapping": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "pgmap": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "mds_co": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_1": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_2": { 
>> "items": 0, 
>> "bytes": 0 
>> } 
>> }, 
>> "total": { 
>> "items": 178515204, 
>> "bytes": 2160630547 
>> } 
>> } 
>> } 
>> 
>> ----- Original message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Friday, 8 February 2019 16:14:54
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> I'm just seeing 
>> 
>> StupidAllocator::_aligned_len 
>> and 
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>> 
>> on 1 osd, both 10%. 
>> 
>> here the dump_mempools 
>> 
>> { 
>> "mempool": { 
>> "by_pool": { 
>> "bloom_filter": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> "bluestore_cache_onode": { 
>> "items": 105637, 
>> "bytes": 70988064 
>> }, 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_fsck": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_txc": { 
>> "items": 12, 
>> "bytes": 8928 
>> }, 
>> "bluestore_writing_deferred": { 
>> "items": 406, 
>> "bytes": 4792868 
>> }, 
>> "bluestore_writing": { 
>> "items": 66, 
>> "bytes": 1085440 
>> }, 
>> "bluefs": { 
>> "items": 1882, 
>> "bytes": 93600 
>> }, 
>> "buffer_anon": { 
>> "items": 138986, 
>> "bytes": 24983701 
>> }, 
>> "buffer_meta": { 
>> "items": 544, 
>> "bytes": 34816 
>> }, 
>> "osd": { 
>> "items": 243, 
>> "bytes": 3089016 
>> }, 
>> "osd_mapbl": { 
>> "items": 36, 
>> "bytes": 179308 
>> }, 
>> "osd_pglog": { 
>> "items": 952564, 
>> "bytes": 372459684 
>> }, 
>> "osdmap": { 
>> "items": 3639, 
>> "bytes": 224664 
>> }, 
>> "osdmap_mapping": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "pgmap": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "mds_co": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_1": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_2": { 
>> "items": 0, 
>> "bytes": 0 
>> } 
>> }, 
>> "total": { 
>> "items": 260109445, 
>> "bytes": 2228370845 
>> } 
>> } 
>> } 
>> 
>> 
>> and the perf dump 
>> 
>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>> { 
>> "AsyncMessenger::Worker-0": { 
>> "msgr_recv_messages": 22948570, 
>> "msgr_send_messages": 22561570, 
>> "msgr_recv_bytes": 333085080271, 
>> "msgr_send_bytes": 261798871204, 
>> "msgr_created_connections": 6152, 
>> "msgr_active_connections": 2701, 
>> "msgr_running_total_time": 1055.197867330, 
>> "msgr_running_send_time": 352.764480121, 
>> "msgr_running_recv_time": 499.206831955, 
>> "msgr_running_fast_dispatch_time": 130.982201607 
>> }, 
>> "AsyncMessenger::Worker-1": { 
>> "msgr_recv_messages": 18801593, 
>> "msgr_send_messages": 18430264, 
>> "msgr_recv_bytes": 306871760934, 
>> "msgr_send_bytes": 192789048666, 
>> "msgr_created_connections": 5773, 
>> "msgr_active_connections": 2721, 
>> "msgr_running_total_time": 816.821076305, 
>> "msgr_running_send_time": 261.353228926, 
>> "msgr_running_recv_time": 394.035587911, 
>> "msgr_running_fast_dispatch_time": 104.012155720 
>> }, 
>> "AsyncMessenger::Worker-2": { 
>> "msgr_recv_messages": 18463400, 
>> "msgr_send_messages": 18105856, 
>> "msgr_recv_bytes": 187425453590, 
>> "msgr_send_bytes": 220735102555, 
>> "msgr_created_connections": 5897, 
>> "msgr_active_connections": 2605, 
>> "msgr_running_total_time": 807.186854324, 
>> "msgr_running_send_time": 296.834435839, 
>> "msgr_running_recv_time": 351.364389691, 
>> "msgr_running_fast_dispatch_time": 101.215776792 
>> }, 
>> "bluefs": { 
>> "gift_bytes": 0, 
>> "reclaim_bytes": 0, 
>> "db_total_bytes": 256050724864, 
>> "db_used_bytes": 12413042688, 
>> "wal_total_bytes": 0, 
>> "wal_used_bytes": 0, 
>> "slow_total_bytes": 0, 
>> "slow_used_bytes": 0, 
>> "num_files": 209, 
>> "log_bytes": 10383360, 
>> "log_compactions": 14, 
>> "logged_bytes": 336498688, 
>> "files_written_wal": 2, 
>> "files_written_sst": 4499, 
>> "bytes_written_wal": 417989099783, 
>> "bytes_written_sst": 213188750209 
>> }, 
>> "bluestore": { 
>> "kv_flush_lat": { 
>> "avgcount": 26371957, 
>> "sum": 26.734038497, 
>> "avgtime": 0.000001013 
>> }, 
>> "kv_commit_lat": { 
>> "avgcount": 26371957, 
>> "sum": 3397.491150603, 
>> "avgtime": 0.000128829 
>> }, 
>> "kv_lat": { 
>> "avgcount": 26371957, 
>> "sum": 3424.225189100, 
>> "avgtime": 0.000129843 
>> }, 
>> "state_prepare_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3689.542105337, 
>> "avgtime": 0.000121028 
>> }, 
>> "state_aio_wait_lat": { 
>> "avgcount": 30484924, 
>> "sum": 509.864546111, 
>> "avgtime": 0.000016725 
>> }, 
>> "state_io_done_lat": { 
>> "avgcount": 30484924, 
>> "sum": 24.534052953, 
>> "avgtime": 0.000000804 
>> }, 
>> "state_kv_queued_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3488.338424238, 
>> "avgtime": 0.000114428 
>> }, 
>> "state_kv_commiting_lat": { 
>> "avgcount": 30484924, 
>> "sum": 5660.437003432, 
>> "avgtime": 0.000185679 
>> }, 
>> "state_kv_done_lat": { 
>> "avgcount": 30484924, 
>> "sum": 7.763511500, 
>> "avgtime": 0.000000254 
>> }, 
>> "state_deferred_queued_lat": { 
>> "avgcount": 26346134, 
>> "sum": 666071.296856696, 
>> "avgtime": 0.025281557 
>> }, 
>> "state_deferred_aio_wait_lat": { 
>> "avgcount": 26346134, 
>> "sum": 1755.660547071, 
>> "avgtime": 0.000066638 
>> }, 
>> "state_deferred_cleanup_lat": { 
>> "avgcount": 26346134, 
>> "sum": 185465.151653703, 
>> "avgtime": 0.007039558 
>> }, 
>> "state_finishing_lat": { 
>> "avgcount": 30484920, 
>> "sum": 3.046847481, 
>> "avgtime": 0.000000099 
>> }, 
>> "state_done_lat": { 
>> "avgcount": 30484920, 
>> "sum": 13193.362685280, 
>> "avgtime": 0.000432783 
>> }, 
>> "throttle_lat": { 
>> "avgcount": 30484924, 
>> "sum": 14.634269979, 
>> "avgtime": 0.000000480 
>> }, 
>> "submit_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3873.883076148, 
>> "avgtime": 0.000127075 
>> }, 
>> "commit_lat": { 
>> "avgcount": 30484924, 
>> "sum": 13376.492317331, 
>> "avgtime": 0.000438790 
>> }, 
>> "read_lat": { 
>> "avgcount": 5873923, 
>> "sum": 1817.167582057, 
>> "avgtime": 0.000309361 
>> }, 
>> "read_onode_meta_lat": { 
>> "avgcount": 19608201, 
>> "sum": 146.770464482, 
>> "avgtime": 0.000007485 
>> }, 
>> "read_wait_aio_lat": { 
>> "avgcount": 13734278, 
>> "sum": 2532.578077242, 
>> "avgtime": 0.000184398 
>> }, 
>> "compress_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "decompress_lat": { 
>> "avgcount": 1346945, 
>> "sum": 26.227575896, 
>> "avgtime": 0.000019471 
>> }, 
>> "csum_lat": { 
>> "avgcount": 28020392, 
>> "sum": 149.587819041, 
>> "avgtime": 0.000005338 
>> }, 
>> "compress_success_count": 0, 
>> "compress_rejected_count": 0, 
>> "write_pad_bytes": 352923605, 
>> "deferred_write_ops": 24373340, 
>> "deferred_write_bytes": 216791842816, 
>> "write_penalty_read_ops": 8062366, 
>> "bluestore_allocated": 3765566013440, 
>> "bluestore_stored": 4186255221852, 
>> "bluestore_compressed": 39981379040, 
>> "bluestore_compressed_allocated": 73748348928, 
>> "bluestore_compressed_original": 165041381376, 
>> "bluestore_onodes": 104232, 
>> "bluestore_onode_hits": 71206874, 
>> "bluestore_onode_misses": 1217914, 
>> "bluestore_onode_shard_hits": 260183292, 
>> "bluestore_onode_shard_misses": 22851573, 
>> "bluestore_extents": 3394513, 
>> "bluestore_blobs": 2773587, 
>> "bluestore_buffers": 0, 
>> "bluestore_buffer_bytes": 0, 
>> "bluestore_buffer_hit_bytes": 62026011221, 
>> "bluestore_buffer_miss_bytes": 995233669922, 
>> "bluestore_write_big": 5648815, 
>> "bluestore_write_big_bytes": 552502214656, 
>> "bluestore_write_big_blobs": 12440992, 
>> "bluestore_write_small": 35883770, 
>> "bluestore_write_small_bytes": 223436965719, 
>> "bluestore_write_small_unused": 408125, 
>> "bluestore_write_small_deferred": 34961455, 
>> "bluestore_write_small_pre_read": 34961455, 
>> "bluestore_write_small_new": 514190, 
>> "bluestore_txc": 30484924, 
>> "bluestore_onode_reshard": 5144189, 
>> "bluestore_blob_split": 60104, 
>> "bluestore_extent_compress": 53347252, 
>> "bluestore_gc_merged": 21142528, 
>> "bluestore_read_eio": 0, 
>> "bluestore_fragmentation_micros": 67 
>> }, 
>> "finisher-defered_finisher": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "finisher-finisher-0": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 26625163, 
>> "sum": 1057.506990951, 
>> "avgtime": 0.000039718 
>> } 
>> }, 
>> "finisher-objecter-finisher-0": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.0::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.0::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.1::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.1::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.2::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.2::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.3::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.3::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.4::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.4::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.5::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.5::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.6::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.6::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.7::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.7::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "objecter": { 
>> "op_active": 0, 
>> "op_laggy": 0, 
>> "op_send": 0, 
>> "op_send_bytes": 0, 
>> "op_resend": 0, 
>> "op_reply": 0, 
>> "op": 0, 
>> "op_r": 0, 
>> "op_w": 0, 
>> "op_rmw": 0, 
>> "op_pg": 0, 
>> "osdop_stat": 0, 
>> "osdop_create": 0, 
>> "osdop_read": 0, 
>> "osdop_write": 0, 
>> "osdop_writefull": 0, 
>> "osdop_writesame": 0, 
>> "osdop_append": 0, 
>> "osdop_zero": 0, 
>> "osdop_truncate": 0, 
>> "osdop_delete": 0, 
>> "osdop_mapext": 0, 
>> "osdop_sparse_read": 0, 
>> "osdop_clonerange": 0, 
>> "osdop_getxattr": 0, 
>> "osdop_setxattr": 0, 
>> "osdop_cmpxattr": 0, 
>> "osdop_rmxattr": 0, 
>> "osdop_resetxattrs": 0, 
>> "osdop_tmap_up": 0, 
>> "osdop_tmap_put": 0, 
>> "osdop_tmap_get": 0, 
>> "osdop_call": 0, 
>> "osdop_watch": 0, 
>> "osdop_notify": 0, 
>> "osdop_src_cmpxattr": 0, 
>> "osdop_pgls": 0, 
>> "osdop_pgls_filter": 0, 
>> "osdop_other": 0, 
>> "linger_active": 0, 
>> "linger_send": 0, 
>> "linger_resend": 0, 
>> "linger_ping": 0, 
>> "poolop_active": 0, 
>> "poolop_send": 0, 
>> "poolop_resend": 0, 
>> "poolstat_active": 0, 
>> "poolstat_send": 0, 
>> "poolstat_resend": 0, 
>> "statfs_active": 0, 
>> "statfs_send": 0, 
>> "statfs_resend": 0, 
>> "command_active": 0, 
>> "command_send": 0, 
>> "command_resend": 0, 
>> "map_epoch": 105913, 
>> "map_full": 0, 
>> "map_inc": 828, 
>> "osd_sessions": 0, 
>> "osd_session_open": 0, 
>> "osd_session_close": 0, 
>> "osd_laggy": 0, 
>> "omap_wr": 0, 
>> "omap_rd": 0, 
>> "omap_del": 0 
>> }, 
>> "osd": { 
>> "op_wip": 0, 
>> "op": 16758102, 
>> "op_in_bytes": 238398820586, 
>> "op_out_bytes": 165484999463, 
>> "op_latency": { 
>> "avgcount": 16758102, 
>> "sum": 38242.481640842, 
>> "avgtime": 0.002282029 
>> }, 
>> "op_process_latency": { 
>> "avgcount": 16758102, 
>> "sum": 28644.906310687, 
>> "avgtime": 0.001709316 
>> }, 
>> "op_prepare_latency": { 
>> "avgcount": 16761367, 
>> "sum": 3489.856599934, 
>> "avgtime": 0.000208208 
>> }, 
>> "op_r": 6188565, 
>> "op_r_out_bytes": 165484999463, 
>> "op_r_latency": { 
>> "avgcount": 6188565, 
>> "sum": 4507.365756792, 
>> "avgtime": 0.000728337 
>> }, 
>> "op_r_process_latency": { 
>> "avgcount": 6188565, 
>> "sum": 942.363063429, 
>> "avgtime": 0.000152274 
>> }, 
>> "op_r_prepare_latency": { 
>> "avgcount": 6188644, 
>> "sum": 982.866710389, 
>> "avgtime": 0.000158817 
>> }, 
>> "op_w": 10546037, 
>> "op_w_in_bytes": 238334329494, 
>> "op_w_latency": { 
>> "avgcount": 10546037, 
>> "sum": 33160.719998316, 
>> "avgtime": 0.003144377 
>> }, 
>> "op_w_process_latency": { 
>> "avgcount": 10546037, 
>> "sum": 27668.702029030, 
>> "avgtime": 0.002623611 
>> }, 
>> "op_w_prepare_latency": { 
>> "avgcount": 10548652, 
>> "sum": 2499.688609173, 
>> "avgtime": 0.000236967 
>> }, 
>> "op_rw": 23500, 
>> "op_rw_in_bytes": 64491092, 
>> "op_rw_out_bytes": 0, 
>> "op_rw_latency": { 
>> "avgcount": 23500, 
>> "sum": 574.395885734, 
>> "avgtime": 0.024442378 
>> }, 
>> "op_rw_process_latency": { 
>> "avgcount": 23500, 
>> "sum": 33.841218228, 
>> "avgtime": 0.001440051 
>> }, 
>> "op_rw_prepare_latency": { 
>> "avgcount": 24071, 
>> "sum": 7.301280372, 
>> "avgtime": 0.000303322 
>> }, 
>> "op_before_queue_op_lat": { 
>> "avgcount": 57892986, 
>> "sum": 1502.117718889, 
>> "avgtime": 0.000025946 
>> }, 
>> "op_before_dequeue_op_lat": { 
>> "avgcount": 58091683, 
>> "sum": 45194.453254037, 
>> "avgtime": 0.000777984 
>> }, 
>> "subop": 19784758, 
>> "subop_in_bytes": 547174969754, 
>> "subop_latency": { 
>> "avgcount": 19784758, 
>> "sum": 13019.714424060, 
>> "avgtime": 0.000658067 
>> }, 
>> "subop_w": 19784758, 
>> "subop_w_in_bytes": 547174969754, 
>> "subop_w_latency": { 
>> "avgcount": 19784758, 
>> "sum": 13019.714424060, 
>> "avgtime": 0.000658067 
>> }, 
>> "subop_pull": 0, 
>> "subop_pull_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "subop_push": 0, 
>> "subop_push_in_bytes": 0, 
>> "subop_push_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "pull": 0, 
>> "push": 2003, 
>> "push_out_bytes": 5560009728, 
>> "recovery_ops": 1940, 
>> "loadavg": 118, 
>> "buffer_bytes": 0, 
>> "history_alloc_Mbytes": 0, 
>> "history_alloc_num": 0, 
>> "cached_crc": 0, 
>> "cached_crc_adjusted": 0, 
>> "missed_crc": 0, 
>> "numpg": 243, 
>> "numpg_primary": 82, 
>> "numpg_replica": 161, 
>> "numpg_stray": 0, 
>> "numpg_removing": 0, 
>> "heartbeat_to_peers": 10, 
>> "map_messages": 7013, 
>> "map_message_epochs": 7143, 
>> "map_message_epoch_dups": 6315, 
>> "messages_delayed_for_map": 0, 
>> "osd_map_cache_hit": 203309, 
>> "osd_map_cache_miss": 33, 
>> "osd_map_cache_miss_low": 0, 
>> "osd_map_cache_miss_low_avg": { 
>> "avgcount": 0, 
>> "sum": 0 
>> }, 
>> "osd_map_bl_cache_hit": 47012, 
>> "osd_map_bl_cache_miss": 1681, 
>> "stat_bytes": 6401248198656, 
>> "stat_bytes_used": 3777979072512, 
>> "stat_bytes_avail": 2623269126144, 
>> "copyfrom": 0, 
>> "tier_promote": 0, 
>> "tier_flush": 0, 
>> "tier_flush_fail": 0, 
>> "tier_try_flush": 0, 
>> "tier_try_flush_fail": 0, 
>> "tier_evict": 0, 
>> "tier_whiteout": 1631, 
>> "tier_dirty": 22360, 
>> "tier_clean": 0, 
>> "tier_delay": 0, 
>> "tier_proxy_read": 0, 
>> "tier_proxy_write": 0, 
>> "agent_wake": 0, 
>> "agent_skip": 0, 
>> "agent_flush": 0, 
>> "agent_evict": 0, 
>> "object_ctx_cache_hit": 16311156, 
>> "object_ctx_cache_total": 17426393, 
>> "op_cache_hit": 0, 
>> "osd_tier_flush_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_tier_promote_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_tier_r_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_pg_info": 30483113, 
>> "osd_pg_fastinfo": 29619885, 
>> "osd_pg_biginfo": 81703 
>> }, 
>> "recoverystate_perf": { 
>> "initial_latency": { 
>> "avgcount": 243, 
>> "sum": 6.869296500, 
>> "avgtime": 0.028268709 
>> }, 
>> "started_latency": { 
>> "avgcount": 1125, 
>> "sum": 13551384.917335850, 
>> "avgtime": 12045.675482076 
>> }, 
>> "reset_latency": { 
>> "avgcount": 1368, 
>> "sum": 1101.727799040, 
>> "avgtime": 0.805356578 
>> }, 
>> "start_latency": { 
>> "avgcount": 1368, 
>> "sum": 0.002014799, 
>> "avgtime": 0.000001472 
>> }, 
>> "primary_latency": { 
>> "avgcount": 507, 
>> "sum": 4575560.638823428, 
>> "avgtime": 9024.774435549 
>> }, 
>> "peering_latency": { 
>> "avgcount": 550, 
>> "sum": 499.372283616, 
>> "avgtime": 0.907949606 
>> }, 
>> "backfilling_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "waitremotebackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "waitlocalbackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "notbackfilling_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "repnotrecovering_latency": { 
>> "avgcount": 1009, 
>> "sum": 8975301.082274411, 
>> "avgtime": 8895.243887288 
>> }, 
>> "repwaitrecoveryreserved_latency": { 
>> "avgcount": 420, 
>> "sum": 99.846056520, 
>> "avgtime": 0.237728706 
>> }, 
>> "repwaitbackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "reprecovering_latency": { 
>> "avgcount": 420, 
>> "sum": 241.682764382, 
>> "avgtime": 0.575435153 
>> }, 
>> "activating_latency": { 
>> "avgcount": 507, 
>> "sum": 16.893347339, 
>> "avgtime": 0.033320211 
>> }, 
>> "waitlocalrecoveryreserved_latency": { 
>> "avgcount": 199, 
>> "sum": 672.335512769, 
>> "avgtime": 3.378570415 
>> }, 
>> "waitremoterecoveryreserved_latency": { 
>> "avgcount": 199, 
>> "sum": 213.536439363, 
>> "avgtime": 1.073047433 
>> }, 
>> "recovering_latency": { 
>> "avgcount": 199, 
>> "sum": 79.007696479, 
>> "avgtime": 0.397023600 
>> }, 
>> "recovered_latency": { 
>> "avgcount": 507, 
>> "sum": 14.000732748, 
>> "avgtime": 0.027614857 
>> }, 
>> "clean_latency": { 
>> "avgcount": 395, 
>> "sum": 4574325.900371083, 
>> "avgtime": 11580.571899673 
>> }, 
>> "active_latency": { 
>> "avgcount": 425, 
>> "sum": 4575107.630123680, 
>> "avgtime": 10764.959129702 
>> }, 
>> "replicaactive_latency": { 
>> "avgcount": 589, 
>> "sum": 8975184.499049954, 
>> "avgtime": 15238.004242869 
>> }, 
>> "stray_latency": { 
>> "avgcount": 818, 
>> "sum": 800.729455666, 
>> "avgtime": 0.978886865 
>> }, 
>> "getinfo_latency": { 
>> "avgcount": 550, 
>> "sum": 15.085667048, 
>> "avgtime": 0.027428485 
>> }, 
>> "getlog_latency": { 
>> "avgcount": 546, 
>> "sum": 3.482175693, 
>> "avgtime": 0.006377611 
>> }, 
>> "waitactingchange_latency": { 
>> "avgcount": 39, 
>> "sum": 35.444551284, 
>> "avgtime": 0.908834648 
>> }, 
>> "incomplete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "down_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "getmissing_latency": { 
>> "avgcount": 507, 
>> "sum": 6.702129624, 
>> "avgtime": 0.013219190 
>> }, 
>> "waitupthru_latency": { 
>> "avgcount": 507, 
>> "sum": 474.098261727, 
>> "avgtime": 0.935105052 
>> }, 
>> "notrecovering_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "rocksdb": { 
>> "get": 28320977, 
>> "submit_transaction": 30484924, 
>> "submit_transaction_sync": 26371957, 
>> "get_latency": { 
>> "avgcount": 28320977, 
>> "sum": 325.900908733, 
>> "avgtime": 0.000011507 
>> }, 
>> "submit_latency": { 
>> "avgcount": 30484924, 
>> "sum": 1835.888692371, 
>> "avgtime": 0.000060222 
>> }, 
>> "submit_sync_latency": { 
>> "avgcount": 26371957, 
>> "sum": 1431.555230628, 
>> "avgtime": 0.000054283 
>> }, 
>> "compact": 0, 
>> "compact_range": 0, 
>> "compact_queue_merge": 0, 
>> "compact_queue_len": 0, 
>> "rocksdb_write_wal_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_memtable_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_delay_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_pre_and_post_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> } 
>> } 
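A minimal sketch for reading a dump like the one above: each latency counter carries "avgcount" and "sum", and "avgtime" is simply sum / avgcount over the life of the process. The function name rank_latency_counters is hypothetical; the sample values are copied from the "bluestore" section of the dump. Ranked this way, state_deferred_queued_lat (about 25 ms) stands out well above the other counters sampled here.

```python
import json


def rank_latency_counters(section):
    """Return (name, avg_seconds, count) tuples sorted by average latency, highest first."""
    ranked = []
    for name, value in section.items():
        # Latency counters are objects with "avgcount" and "sum"; plain integers are skipped.
        if isinstance(value, dict) and "avgcount" in value and "sum" in value:
            count = value["avgcount"]
            avg = value["sum"] / count if count else 0.0
            ranked.append((name, avg, count))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# A few counters copied from the "bluestore" section of the dump above.
sample = json.loads("""
{
  "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603, "avgtime": 0.000128829},
  "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696, "avgtime": 0.025281557},
  "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703, "avgtime": 0.007039558},
  "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331, "avgtime": 0.000438790},
  "bluestore_txc": 30484924
}
""")

ranking = rank_latency_counters(sample)
for name, avg, count in ranking:
    print(f"{name:32s} {avg * 1000:10.3f} ms  ({count} samples)")
```

Because these are lifetime averages, a slow drift is easier to see by comparing two dumps taken some time apart than by reading a single one.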
>> 
>> ----- Original Message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Tuesday, 5 February 2019 18:56:51 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>> (but I have bluestore_fragmentation_micros) 
>>> ok, this is the same 
>>> 
>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
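As a worked illustration of the counter description quoted above, here is a minimal sketch of the ratio it describes. The function name and both input values are hypothetical; how the allocator actually derives the "max possible number of free extents" is not stated in this thread, so it is passed in as a plain number.

```python
def fragmentation_score(free_extents, max_free_extents):
    """(free extents / max possible number of free extents) * 1000, as an integer.

    Both arguments are assumptions for illustration; the real allocator computes
    them internally.
    """
    if max_free_extents <= 0:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    return free_extents * 1000 // max_free_extents


# Hypothetical numbers chosen so that the result matches the value 67 seen in the dump.
score = fragmentation_score(free_extents=67_000, max_free_extents=1_000_000)
print(score)
```

On this scale 0 would mean the free space is one contiguous run and 1000 would mean every free allocation unit is its own extent.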
>>> 
>>> 
>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>> 
>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>> it? Is it the same for other OSDs? 
>> 
>> This proves there is some issue with the allocator: fragmentation may well 
>> grow in general, but it shouldn't reset on restart. It looks like some intervals 
>> aren't properly merged at run time. 
>> 
>> On the other hand, I'm not completely sure that the latency degradation is 
>> caused by that. The fragmentation growth is relatively small, and I don't see 
>> how it could impact performance that much. 
>> 
>> Do you have OSD mempool monitoring reports (dump_mempools command 
>> output on the admin socket)? Do you have any historic data? 
>> 
>> If not, may I have the current output and, say, a couple more samples at 
>> 8-12 hour intervals? 
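Comparing two such dump_mempools samples can be sketched as below. The JSON layout ({"mempool": {"by_pool": {...}}}) is an assumption about the admin socket output and may differ between releases; the function name mempool_growth and all byte and item counts are made up for illustration.

```python
def mempool_growth(earlier, later):
    """Return (pool, byte_growth) pairs between two samples, largest growth first."""
    def by_pool(dump):
        # Assumed layout: {"mempool": {"by_pool": {name: {"items": n, "bytes": n}}}}
        return dump["mempool"]["by_pool"]

    before, after = by_pool(earlier), by_pool(later)
    growth = {
        pool: stats["bytes"] - before.get(pool, {}).get("bytes", 0)
        for pool, stats in after.items()
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples taken some hours apart (values are not from this thread).
sample_early = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 1000, "bytes": 64_000},
    "bluestore_cache_data": {"items": 200, "bytes": 1_000_000},
}}}
sample_late = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 9000, "bytes": 576_000},
    "bluestore_cache_data": {"items": 210, "bytes": 1_050_000},
}}}

growth = mempool_growth(sample_early, sample_late)
for pool, delta in growth:
    print(f"{pool:24s} {delta:+d} bytes")
```

A pool that keeps growing across samples while the others stay flat would be the first place to look.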
>> 
>> 
>> With regard to backporting the bitmap allocator to mimic: we haven't had such plans 
>> before, but I'll discuss this at the BlueStore meeting shortly. 
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>>> ----- Original Message ----- 
>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 4 February 2019 16:04:38 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Thanks Igor, 
>>> 
>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>> startup and once you get high latency. 
>>>>> 
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>> I'm already monitoring with 
>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>> 
>>> but I don't see the l_bluestore_fragmentation counter. 
>>> 
>>> (but I have bluestore_fragmentation_micros) 
>>> 
>>> 
>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>> make sure it's degraded and learn more details. 
>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>> 
>>> 
>>> 
>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>> and try the difference... 
>>> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 4 February 2019 15:51:30 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi Alexandre, 
>>> 
>>> looks like a bug in StupidAllocator. 
>>> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>> 
>>> Also if you're able to rebuild the code I can probably make a simple 
>>> patch to track latency and some other internal allocator's parameter to 
>>> make sure it's degraded and learn more details. 
>>> 
>>> 
>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>> and try the difference... 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>> 
>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>> Hi again, 
>>>> 
>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>> 
>>>> 
>>>> I noticed something using a simple "perf top": 
>>>> 
>>>> each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>> 
>>>> when latency is bad, perf top gives me: 
>>>> 
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>> 
>>>> (around 10-20% time for both) 
>>>> 
>>>> 
>>>> when latency is good, I don't see them at all. 
>>>> 
>>>> 
>>>> I used Mark's wallclock profiler; here are the results: 
>>>> 
>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>> 
>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>> 
>>>> 
>>>> here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len 
>>>> 
>>>> 
>>>> + 100.00% clone 
>>>> + 100.00% start_thread 
>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Hi, 
>>>> 
>>>> some news: 
>>>> 
>>>> I have tried different transparent hugepage values (madvise, never): no change 
>>>> 
>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change 
>>>> 
>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb: it seems to help, after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure) 
>>>> 
>>>> 
>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>> my other clusters use 1.6TB ssds. 
>>>> 
>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps. 
>>>> 
>>>> 
>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>> 
>>>> 
>>>> Regards, 
>>>> 
>>>> Alexandre 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "aderumier" <aderumier@odiso.com> 
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>> 
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs). 
>>>> 
>>>> I just don't see a latency difference on reads. (or it is very small compared with the write latency increase) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Hi, 
>>>> 
>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>> Hi Stefan, 
>>>>> 
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>> I need to compare with bigger latencies. 
>>>>> 
>>>>> Here is an example, with all osds at 20-50ms before restart, then 1ms after restart (at 21:15): 
>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>> 
>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>> 
>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>> 
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>> Here are my influxdb queries: 
>>>>> 
>>>>> They take op_latency.sum / op_latency.avgcount over the last second. 
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
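The idea behind these queries (the change in "sum" divided by the change in "avgcount" between two samples) can be sketched outside InfluxDB as follows. The function name interval_latency is hypothetical, the first sample is op_w_latency from the perf dump above, and the second sample is invented for illustration.

```python
def interval_latency(prev, curr):
    """Average latency in seconds for the ops completed between two perf dump samples.

    Returns None when the counters went backwards (for example after an OSD
    restart) or when no ops completed, which is similar in spirit to what
    non_negative_derivative does by dropping negative rates.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count


# First sample: op_w_latency from the perf dump above.
prev = {"avgcount": 10546037, "sum": 33160.719998316}
# Second sample: hypothetical values, 1000 more writes taking 2 s in total.
curr = {"avgcount": 10547037, "sum": 33162.719998316}

latency = interval_latency(prev, curr)
print(f"{latency * 1000:.3f} ms")
```

Unlike the "avgtime" field, which averages over the whole life of the daemon, this reflects only the most recent interval, which is what makes a gradual increase visible on a graph.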
>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>> op_r_latency but instead op_latency? 
>>>> 
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>> 
>>>> greets, 
>>>> Stefan 
>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>> Hi, 
>>>>>> 
>>>>>> here are some new results, 
>>>>>> from a different osd / different cluster: 
>>>>>> 
>>>>>> before the osd restart, latency was between 2-5ms 
>>>>>> after the osd restart, it is around 1-1.5ms 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>> 
>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>> 
>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>> exact values out of the daemon do you use for bluestore? 
>>>>> 
>>>>> I would like to check if i see the same behaviour. 
>>>>> 
>>>>> Greets, 
>>>>> Stefan 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>> going on one of the OSDs with a high latency? 
>>>>>> 
>>>>>> Thanks! 
>>>>>> sage 
>>>>>> 
>>>>>> 
>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>> 
>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives; 
>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>> 
>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>> 
>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>> values like 20-200ms. 
>>>>>>> 
>>>>>>> Some example graphs: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>> 
>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>> 
>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded) 
>>>>>>> 
>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms. 
>>>>>>> 
>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>> 
>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>> 
>> 
> 
> 

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
> Hi Igor, 
> 
> Thanks again for helping ! 
> 
> 
> 
> I upgraded to the latest mimic release this weekend and, with the new memory autotuning, 
> I have set osd_memory_target to 8G (my NVMe drives are 6TB). 
> 
> 
> I have taken many perf dumps, mempool dumps and ps snapshots of the process to track RSS memory at different hours. 
> Here are the reports for osd.0: 
> 
> http://odisoweb1.odiso.net/perfanalysis/ 
> 
> 
> The OSD was started on 12-02-2019 at 08:00. 
> 
> First report, after 1h of running: 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
> 
> 
> 
> Report after 24h, before the counter reset: 
> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
> 
> Report 1h after the counter reset: 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
> 
> 
> 
> 
> I'm seeing the bluestore buffer bytes increase up to 4G around 12-02-2019 at 14:00: 
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
> After that, it slowly decreases. 
> 
> 
> Another strange thing: 
> I'm seeing the total bytes at 5G on 12-02-2019 at 13:30: 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
> It then decreases over time (around 3.7G this morning), but RSS is still at 8G. 
> 
> 
> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
> 
> ----- Original Message ----- 
> From: "Igor Fedotov" <ifedotov@suse.de> 
> To: "Alexandre Derumier" <aderumier@odiso.com> 
> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Monday, 11 February 2019 12:03:17 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>> Another mempool dump, after a 1h run (latency OK). 
>> 
>> Biggest difference: 
>> 
>> before restart 
>> ------------- 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> (the other caches seem quite low too, as if bluestore_cache_other takes all the memory) 
>> 
>> 
>> After restart 
>> ------------- 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> 
> This is fine, as the cache is warming up after the restart and some rebalancing 
> between data and metadata might occur. 
> 
> What relates to the allocator, and most probably to fragmentation growth, is: 
> 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes": 165053952 
> }, 
> 
> which was higher before the restart (if I have the order of these dumps 
> right): 
> 
> "bluestore_alloc": { 
> "items": 210243456, 
> "bytes": 210243456 
> }, 
> 
> But as I mentioned, I'm not 100% sure this could cause such a huge 
> latency increase... 
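
The before/after comparison above can be automated. What follows is only a minimal sketch, assuming two dump_mempools outputs with the JSON layout shown in this thread (mempool, by_pool, then items/bytes per pool). The function name diff_mempools is illustrative and not part of Ceph; the sample values are copied from the two dumps quoted in this thread.

```python
def diff_mempools(before, after):
    """Return {pool: byte delta} for pools whose byte usage changed,
    ordered by the size of the change (largest first)."""
    b = before["mempool"]["by_pool"]
    a = after["mempool"]["by_pool"]
    deltas = {}
    for name in sorted(set(b) | set(a)):
        delta = a.get(name, {}).get("bytes", 0) - b.get(name, {}).get("bytes", 0)
        if delta:
            deltas[name] = delta
    return dict(sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Three pools copied from the dumps in this thread (before and after the restart).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

if __name__ == "__main__":
    for pool, delta in diff_mempools(before_restart, after_restart).items():
        print(f"{pool:24s} {delta / 2**20:+10.1f} MiB")
```

With real data, the two arguments would be the parsed JSON of two saved dumps, for example loaded with json.load.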
> 
> Do you have a perf counters dump from after the restart? 
> 
> Could you collect some more dumps, for both mempool and perf counters? 
> 
> So ideally I'd like to have: 
> 
> 1) mempool/perf counters dumps after the restart (1 hour is OK) 
> 
> 2) mempool/perf counters dumps 24+ hours after the restart 
> 
> 3) reset perf counters after 2), wait for 1 hour (without an OSD 
> restart) and dump mempool/perf counters again. 
> 
> That way we'll be able to learn both the allocator memory usage growth and the operation 
> latency distribution for the following periods: 
> 
> a) 1st hour after restart 
> 
> b) 25th hour. 
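
The reason for resetting the counters in step 3 is that each latency counter in a perf dump is cumulative: avgtime is sum divided by avgcount since the last reset. Where two dumps of the same counter are available without a reset in between, the average for just that interval can also be derived from the differences. This is a minimal sketch under that assumption; the helper name interval_avg and the sample numbers are hypothetical.

```python
def interval_avg(earlier, later):
    """Average latency (seconds) for operations completed between two dumps
    of the same counter. Each argument is a dict with "avgcount" and "sum",
    as found in the perf dump output. Returns None if no operations
    completed in between or if the counter was reset (count went down)."""
    dcount = later["avgcount"] - earlier["avgcount"]
    dsum = later["sum"] - earlier["sum"]
    if dcount <= 0:
        return None
    return dsum / dcount


# Hypothetical samples of one counter taken one hour apart.
sample_t0 = {"avgcount": 100, "sum": 0.05}
sample_t1 = {"avgcount": 300, "sum": 0.45}

if __name__ == "__main__":
    print(interval_avg(sample_t0, sample_t1))
```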
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
>> full mempool dump after restart 
>> ------------------------------- 
>> 
>> { 
>> "mempool": { 
>> "by_pool": { 
>> "bloom_filter": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> "bluestore_cache_onode": { 
>> "items": 22225, 
>> "bytes": 14935200 
>> }, 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_fsck": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_txc": { 
>> "items": 11, 
>> "bytes": 8184 
>> }, 
>> "bluestore_writing_deferred": { 
>> "items": 5047, 
>> "bytes": 22673736 
>> }, 
>> "bluestore_writing": { 
>> "items": 91, 
>> "bytes": 1662976 
>> }, 
>> "bluefs": { 
>> "items": 1907, 
>> "bytes": 95600 
>> }, 
>> "buffer_anon": { 
>> "items": 19664, 
>> "bytes": 25486050 
>> }, 
>> "buffer_meta": { 
>> "items": 46189, 
>> "bytes": 2956096 
>> }, 
>> "osd": { 
>> "items": 243, 
>> "bytes": 3089016 
>> }, 
>> "osd_mapbl": { 
>> "items": 17, 
>> "bytes": 214366 
>> }, 
>> "osd_pglog": { 
>> "items": 889673, 
>> "bytes": 367160400 
>> }, 
>> "osdmap": { 
>> "items": 3803, 
>> "bytes": 224552 
>> }, 
>> "osdmap_mapping": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "pgmap": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "mds_co": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_1": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_2": { 
>> "items": 0, 
>> "bytes": 0 
>> } 
>> }, 
>> "total": { 
>> "items": 178515204, 
>> "bytes": 2160630547 
>> } 
>> } 
>> } 
>> 
>> ----- Original Message ----- 
>> From: "aderumier" <aderumier@odiso.com> 
>> To: "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday, 8 February 2019 16:14:54 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> I'm just seeing 
>> 
>> StupidAllocator::_aligned_len 
>> and 
>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>> 
>> on 1 OSD, both at 10%. 
>> 
>> Here is the dump_mempools output: 
>> 
>> { 
>> "mempool": { 
>> "by_pool": { 
>> "bloom_filter": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> "bluestore_cache_onode": { 
>> "items": 105637, 
>> "bytes": 70988064 
>> }, 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_fsck": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "bluestore_txc": { 
>> "items": 12, 
>> "bytes": 8928 
>> }, 
>> "bluestore_writing_deferred": { 
>> "items": 406, 
>> "bytes": 4792868 
>> }, 
>> "bluestore_writing": { 
>> "items": 66, 
>> "bytes": 1085440 
>> }, 
>> "bluefs": { 
>> "items": 1882, 
>> "bytes": 93600 
>> }, 
>> "buffer_anon": { 
>> "items": 138986, 
>> "bytes": 24983701 
>> }, 
>> "buffer_meta": { 
>> "items": 544, 
>> "bytes": 34816 
>> }, 
>> "osd": { 
>> "items": 243, 
>> "bytes": 3089016 
>> }, 
>> "osd_mapbl": { 
>> "items": 36, 
>> "bytes": 179308 
>> }, 
>> "osd_pglog": { 
>> "items": 952564, 
>> "bytes": 372459684 
>> }, 
>> "osdmap": { 
>> "items": 3639, 
>> "bytes": 224664 
>> }, 
>> "osdmap_mapping": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "pgmap": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "mds_co": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_1": { 
>> "items": 0, 
>> "bytes": 0 
>> }, 
>> "unittest_2": { 
>> "items": 0, 
>> "bytes": 0 
>> } 
>> }, 
>> "total": { 
>> "items": 260109445, 
>> "bytes": 2228370845 
>> } 
>> } 
>> } 
>> 
>> 
>> And the perf dump: 
>> 
>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>> { 
>> "AsyncMessenger::Worker-0": { 
>> "msgr_recv_messages": 22948570, 
>> "msgr_send_messages": 22561570, 
>> "msgr_recv_bytes": 333085080271, 
>> "msgr_send_bytes": 261798871204, 
>> "msgr_created_connections": 6152, 
>> "msgr_active_connections": 2701, 
>> "msgr_running_total_time": 1055.197867330, 
>> "msgr_running_send_time": 352.764480121, 
>> "msgr_running_recv_time": 499.206831955, 
>> "msgr_running_fast_dispatch_time": 130.982201607 
>> }, 
>> "AsyncMessenger::Worker-1": { 
>> "msgr_recv_messages": 18801593, 
>> "msgr_send_messages": 18430264, 
>> "msgr_recv_bytes": 306871760934, 
>> "msgr_send_bytes": 192789048666, 
>> "msgr_created_connections": 5773, 
>> "msgr_active_connections": 2721, 
>> "msgr_running_total_time": 816.821076305, 
>> "msgr_running_send_time": 261.353228926, 
>> "msgr_running_recv_time": 394.035587911, 
>> "msgr_running_fast_dispatch_time": 104.012155720 
>> }, 
>> "AsyncMessenger::Worker-2": { 
>> "msgr_recv_messages": 18463400, 
>> "msgr_send_messages": 18105856, 
>> "msgr_recv_bytes": 187425453590, 
>> "msgr_send_bytes": 220735102555, 
>> "msgr_created_connections": 5897, 
>> "msgr_active_connections": 2605, 
>> "msgr_running_total_time": 807.186854324, 
>> "msgr_running_send_time": 296.834435839, 
>> "msgr_running_recv_time": 351.364389691, 
>> "msgr_running_fast_dispatch_time": 101.215776792 
>> }, 
>> "bluefs": { 
>> "gift_bytes": 0, 
>> "reclaim_bytes": 0, 
>> "db_total_bytes": 256050724864, 
>> "db_used_bytes": 12413042688, 
>> "wal_total_bytes": 0, 
>> "wal_used_bytes": 0, 
>> "slow_total_bytes": 0, 
>> "slow_used_bytes": 0, 
>> "num_files": 209, 
>> "log_bytes": 10383360, 
>> "log_compactions": 14, 
>> "logged_bytes": 336498688, 
>> "files_written_wal": 2, 
>> "files_written_sst": 4499, 
>> "bytes_written_wal": 417989099783, 
>> "bytes_written_sst": 213188750209 
>> }, 
>> "bluestore": { 
>> "kv_flush_lat": { 
>> "avgcount": 26371957, 
>> "sum": 26.734038497, 
>> "avgtime": 0.000001013 
>> }, 
>> "kv_commit_lat": { 
>> "avgcount": 26371957, 
>> "sum": 3397.491150603, 
>> "avgtime": 0.000128829 
>> }, 
>> "kv_lat": { 
>> "avgcount": 26371957, 
>> "sum": 3424.225189100, 
>> "avgtime": 0.000129843 
>> }, 
>> "state_prepare_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3689.542105337, 
>> "avgtime": 0.000121028 
>> }, 
>> "state_aio_wait_lat": { 
>> "avgcount": 30484924, 
>> "sum": 509.864546111, 
>> "avgtime": 0.000016725 
>> }, 
>> "state_io_done_lat": { 
>> "avgcount": 30484924, 
>> "sum": 24.534052953, 
>> "avgtime": 0.000000804 
>> }, 
>> "state_kv_queued_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3488.338424238, 
>> "avgtime": 0.000114428 
>> }, 
>> "state_kv_commiting_lat": { 
>> "avgcount": 30484924, 
>> "sum": 5660.437003432, 
>> "avgtime": 0.000185679 
>> }, 
>> "state_kv_done_lat": { 
>> "avgcount": 30484924, 
>> "sum": 7.763511500, 
>> "avgtime": 0.000000254 
>> }, 
>> "state_deferred_queued_lat": { 
>> "avgcount": 26346134, 
>> "sum": 666071.296856696, 
>> "avgtime": 0.025281557 
>> }, 
>> "state_deferred_aio_wait_lat": { 
>> "avgcount": 26346134, 
>> "sum": 1755.660547071, 
>> "avgtime": 0.000066638 
>> }, 
>> "state_deferred_cleanup_lat": { 
>> "avgcount": 26346134, 
>> "sum": 185465.151653703, 
>> "avgtime": 0.007039558 
>> }, 
>> "state_finishing_lat": { 
>> "avgcount": 30484920, 
>> "sum": 3.046847481, 
>> "avgtime": 0.000000099 
>> }, 
>> "state_done_lat": { 
>> "avgcount": 30484920, 
>> "sum": 13193.362685280, 
>> "avgtime": 0.000432783 
>> }, 
>> "throttle_lat": { 
>> "avgcount": 30484924, 
>> "sum": 14.634269979, 
>> "avgtime": 0.000000480 
>> }, 
>> "submit_lat": { 
>> "avgcount": 30484924, 
>> "sum": 3873.883076148, 
>> "avgtime": 0.000127075 
>> }, 
>> "commit_lat": { 
>> "avgcount": 30484924, 
>> "sum": 13376.492317331, 
>> "avgtime": 0.000438790 
>> }, 
>> "read_lat": { 
>> "avgcount": 5873923, 
>> "sum": 1817.167582057, 
>> "avgtime": 0.000309361 
>> }, 
>> "read_onode_meta_lat": { 
>> "avgcount": 19608201, 
>> "sum": 146.770464482, 
>> "avgtime": 0.000007485 
>> }, 
>> "read_wait_aio_lat": { 
>> "avgcount": 13734278, 
>> "sum": 2532.578077242, 
>> "avgtime": 0.000184398 
>> }, 
>> "compress_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "decompress_lat": { 
>> "avgcount": 1346945, 
>> "sum": 26.227575896, 
>> "avgtime": 0.000019471 
>> }, 
>> "csum_lat": { 
>> "avgcount": 28020392, 
>> "sum": 149.587819041, 
>> "avgtime": 0.000005338 
>> }, 
>> "compress_success_count": 0, 
>> "compress_rejected_count": 0, 
>> "write_pad_bytes": 352923605, 
>> "deferred_write_ops": 24373340, 
>> "deferred_write_bytes": 216791842816, 
>> "write_penalty_read_ops": 8062366, 
>> "bluestore_allocated": 3765566013440, 
>> "bluestore_stored": 4186255221852, 
>> "bluestore_compressed": 39981379040, 
>> "bluestore_compressed_allocated": 73748348928, 
>> "bluestore_compressed_original": 165041381376, 
>> "bluestore_onodes": 104232, 
>> "bluestore_onode_hits": 71206874, 
>> "bluestore_onode_misses": 1217914, 
>> "bluestore_onode_shard_hits": 260183292, 
>> "bluestore_onode_shard_misses": 22851573, 
>> "bluestore_extents": 3394513, 
>> "bluestore_blobs": 2773587, 
>> "bluestore_buffers": 0, 
>> "bluestore_buffer_bytes": 0, 
>> "bluestore_buffer_hit_bytes": 62026011221, 
>> "bluestore_buffer_miss_bytes": 995233669922, 
>> "bluestore_write_big": 5648815, 
>> "bluestore_write_big_bytes": 552502214656, 
>> "bluestore_write_big_blobs": 12440992, 
>> "bluestore_write_small": 35883770, 
>> "bluestore_write_small_bytes": 223436965719, 
>> "bluestore_write_small_unused": 408125, 
>> "bluestore_write_small_deferred": 34961455, 
>> "bluestore_write_small_pre_read": 34961455, 
>> "bluestore_write_small_new": 514190, 
>> "bluestore_txc": 30484924, 
>> "bluestore_onode_reshard": 5144189, 
>> "bluestore_blob_split": 60104, 
>> "bluestore_extent_compress": 53347252, 
>> "bluestore_gc_merged": 21142528, 
>> "bluestore_read_eio": 0, 
>> "bluestore_fragmentation_micros": 67 
>> }, 
>> "finisher-defered_finisher": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "finisher-finisher-0": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 26625163, 
>> "sum": 1057.506990951, 
>> "avgtime": 0.000039718 
>> } 
>> }, 
>> "finisher-objecter-finisher-0": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.0::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.0::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.1::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.1::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.2::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.2::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.3::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.3::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.4::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.4::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.5::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.5::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.6::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.6::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.7::sdata_wait_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "mutex-OSDShard.7::shard_lock": { 
>> "wait": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "objecter": { 
>> "op_active": 0, 
>> "op_laggy": 0, 
>> "op_send": 0, 
>> "op_send_bytes": 0, 
>> "op_resend": 0, 
>> "op_reply": 0, 
>> "op": 0, 
>> "op_r": 0, 
>> "op_w": 0, 
>> "op_rmw": 0, 
>> "op_pg": 0, 
>> "osdop_stat": 0, 
>> "osdop_create": 0, 
>> "osdop_read": 0, 
>> "osdop_write": 0, 
>> "osdop_writefull": 0, 
>> "osdop_writesame": 0, 
>> "osdop_append": 0, 
>> "osdop_zero": 0, 
>> "osdop_truncate": 0, 
>> "osdop_delete": 0, 
>> "osdop_mapext": 0, 
>> "osdop_sparse_read": 0, 
>> "osdop_clonerange": 0, 
>> "osdop_getxattr": 0, 
>> "osdop_setxattr": 0, 
>> "osdop_cmpxattr": 0, 
>> "osdop_rmxattr": 0, 
>> "osdop_resetxattrs": 0, 
>> "osdop_tmap_up": 0, 
>> "osdop_tmap_put": 0, 
>> "osdop_tmap_get": 0, 
>> "osdop_call": 0, 
>> "osdop_watch": 0, 
>> "osdop_notify": 0, 
>> "osdop_src_cmpxattr": 0, 
>> "osdop_pgls": 0, 
>> "osdop_pgls_filter": 0, 
>> "osdop_other": 0, 
>> "linger_active": 0, 
>> "linger_send": 0, 
>> "linger_resend": 0, 
>> "linger_ping": 0, 
>> "poolop_active": 0, 
>> "poolop_send": 0, 
>> "poolop_resend": 0, 
>> "poolstat_active": 0, 
>> "poolstat_send": 0, 
>> "poolstat_resend": 0, 
>> "statfs_active": 0, 
>> "statfs_send": 0, 
>> "statfs_resend": 0, 
>> "command_active": 0, 
>> "command_send": 0, 
>> "command_resend": 0, 
>> "map_epoch": 105913, 
>> "map_full": 0, 
>> "map_inc": 828, 
>> "osd_sessions": 0, 
>> "osd_session_open": 0, 
>> "osd_session_close": 0, 
>> "osd_laggy": 0, 
>> "omap_wr": 0, 
>> "omap_rd": 0, 
>> "omap_del": 0 
>> }, 
>> "osd": { 
>> "op_wip": 0, 
>> "op": 16758102, 
>> "op_in_bytes": 238398820586, 
>> "op_out_bytes": 165484999463, 
>> "op_latency": { 
>> "avgcount": 16758102, 
>> "sum": 38242.481640842, 
>> "avgtime": 0.002282029 
>> }, 
>> "op_process_latency": { 
>> "avgcount": 16758102, 
>> "sum": 28644.906310687, 
>> "avgtime": 0.001709316 
>> }, 
>> "op_prepare_latency": { 
>> "avgcount": 16761367, 
>> "sum": 3489.856599934, 
>> "avgtime": 0.000208208 
>> }, 
>> "op_r": 6188565, 
>> "op_r_out_bytes": 165484999463, 
>> "op_r_latency": { 
>> "avgcount": 6188565, 
>> "sum": 4507.365756792, 
>> "avgtime": 0.000728337 
>> }, 
>> "op_r_process_latency": { 
>> "avgcount": 6188565, 
>> "sum": 942.363063429, 
>> "avgtime": 0.000152274 
>> }, 
>> "op_r_prepare_latency": { 
>> "avgcount": 6188644, 
>> "sum": 982.866710389, 
>> "avgtime": 0.000158817 
>> }, 
>> "op_w": 10546037, 
>> "op_w_in_bytes": 238334329494, 
>> "op_w_latency": { 
>> "avgcount": 10546037, 
>> "sum": 33160.719998316, 
>> "avgtime": 0.003144377 
>> }, 
>> "op_w_process_latency": { 
>> "avgcount": 10546037, 
>> "sum": 27668.702029030, 
>> "avgtime": 0.002623611 
>> }, 
>> "op_w_prepare_latency": { 
>> "avgcount": 10548652, 
>> "sum": 2499.688609173, 
>> "avgtime": 0.000236967 
>> }, 
>> "op_rw": 23500, 
>> "op_rw_in_bytes": 64491092, 
>> "op_rw_out_bytes": 0, 
>> "op_rw_latency": { 
>> "avgcount": 23500, 
>> "sum": 574.395885734, 
>> "avgtime": 0.024442378 
>> }, 
>> "op_rw_process_latency": { 
>> "avgcount": 23500, 
>> "sum": 33.841218228, 
>> "avgtime": 0.001440051 
>> }, 
>> "op_rw_prepare_latency": { 
>> "avgcount": 24071, 
>> "sum": 7.301280372, 
>> "avgtime": 0.000303322 
>> }, 
>> "op_before_queue_op_lat": { 
>> "avgcount": 57892986, 
>> "sum": 1502.117718889, 
>> "avgtime": 0.000025946 
>> }, 
>> "op_before_dequeue_op_lat": { 
>> "avgcount": 58091683, 
>> "sum": 45194.453254037, 
>> "avgtime": 0.000777984 
>> }, 
>> "subop": 19784758, 
>> "subop_in_bytes": 547174969754, 
>> "subop_latency": { 
>> "avgcount": 19784758, 
>> "sum": 13019.714424060, 
>> "avgtime": 0.000658067 
>> }, 
>> "subop_w": 19784758, 
>> "subop_w_in_bytes": 547174969754, 
>> "subop_w_latency": { 
>> "avgcount": 19784758, 
>> "sum": 13019.714424060, 
>> "avgtime": 0.000658067 
>> }, 
>> "subop_pull": 0, 
>> "subop_pull_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "subop_push": 0, 
>> "subop_push_in_bytes": 0, 
>> "subop_push_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "pull": 0, 
>> "push": 2003, 
>> "push_out_bytes": 5560009728, 
>> "recovery_ops": 1940, 
>> "loadavg": 118, 
>> "buffer_bytes": 0, 
>> "history_alloc_Mbytes": 0, 
>> "history_alloc_num": 0, 
>> "cached_crc": 0, 
>> "cached_crc_adjusted": 0, 
>> "missed_crc": 0, 
>> "numpg": 243, 
>> "numpg_primary": 82, 
>> "numpg_replica": 161, 
>> "numpg_stray": 0, 
>> "numpg_removing": 0, 
>> "heartbeat_to_peers": 10, 
>> "map_messages": 7013, 
>> "map_message_epochs": 7143, 
>> "map_message_epoch_dups": 6315, 
>> "messages_delayed_for_map": 0, 
>> "osd_map_cache_hit": 203309, 
>> "osd_map_cache_miss": 33, 
>> "osd_map_cache_miss_low": 0, 
>> "osd_map_cache_miss_low_avg": { 
>> "avgcount": 0, 
>> "sum": 0 
>> }, 
>> "osd_map_bl_cache_hit": 47012, 
>> "osd_map_bl_cache_miss": 1681, 
>> "stat_bytes": 6401248198656, 
>> "stat_bytes_used": 3777979072512, 
>> "stat_bytes_avail": 2623269126144, 
>> "copyfrom": 0, 
>> "tier_promote": 0, 
>> "tier_flush": 0, 
>> "tier_flush_fail": 0, 
>> "tier_try_flush": 0, 
>> "tier_try_flush_fail": 0, 
>> "tier_evict": 0, 
>> "tier_whiteout": 1631, 
>> "tier_dirty": 22360, 
>> "tier_clean": 0, 
>> "tier_delay": 0, 
>> "tier_proxy_read": 0, 
>> "tier_proxy_write": 0, 
>> "agent_wake": 0, 
>> "agent_skip": 0, 
>> "agent_flush": 0, 
>> "agent_evict": 0, 
>> "object_ctx_cache_hit": 16311156, 
>> "object_ctx_cache_total": 17426393, 
>> "op_cache_hit": 0, 
>> "osd_tier_flush_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_tier_promote_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_tier_r_lat": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "osd_pg_info": 30483113, 
>> "osd_pg_fastinfo": 29619885, 
>> "osd_pg_biginfo": 81703 
>> }, 
>> "recoverystate_perf": { 
>> "initial_latency": { 
>> "avgcount": 243, 
>> "sum": 6.869296500, 
>> "avgtime": 0.028268709 
>> }, 
>> "started_latency": { 
>> "avgcount": 1125, 
>> "sum": 13551384.917335850, 
>> "avgtime": 12045.675482076 
>> }, 
>> "reset_latency": { 
>> "avgcount": 1368, 
>> "sum": 1101.727799040, 
>> "avgtime": 0.805356578 
>> }, 
>> "start_latency": { 
>> "avgcount": 1368, 
>> "sum": 0.002014799, 
>> "avgtime": 0.000001472 
>> }, 
>> "primary_latency": { 
>> "avgcount": 507, 
>> "sum": 4575560.638823428, 
>> "avgtime": 9024.774435549 
>> }, 
>> "peering_latency": { 
>> "avgcount": 550, 
>> "sum": 499.372283616, 
>> "avgtime": 0.907949606 
>> }, 
>> "backfilling_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "waitremotebackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "waitlocalbackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "notbackfilling_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "repnotrecovering_latency": { 
>> "avgcount": 1009, 
>> "sum": 8975301.082274411, 
>> "avgtime": 8895.243887288 
>> }, 
>> "repwaitrecoveryreserved_latency": { 
>> "avgcount": 420, 
>> "sum": 99.846056520, 
>> "avgtime": 0.237728706 
>> }, 
>> "repwaitbackfillreserved_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "reprecovering_latency": { 
>> "avgcount": 420, 
>> "sum": 241.682764382, 
>> "avgtime": 0.575435153 
>> }, 
>> "activating_latency": { 
>> "avgcount": 507, 
>> "sum": 16.893347339, 
>> "avgtime": 0.033320211 
>> }, 
>> "waitlocalrecoveryreserved_latency": { 
>> "avgcount": 199, 
>> "sum": 672.335512769, 
>> "avgtime": 3.378570415 
>> }, 
>> "waitremoterecoveryreserved_latency": { 
>> "avgcount": 199, 
>> "sum": 213.536439363, 
>> "avgtime": 1.073047433 
>> }, 
>> "recovering_latency": { 
>> "avgcount": 199, 
>> "sum": 79.007696479, 
>> "avgtime": 0.397023600 
>> }, 
>> "recovered_latency": { 
>> "avgcount": 507, 
>> "sum": 14.000732748, 
>> "avgtime": 0.027614857 
>> }, 
>> "clean_latency": { 
>> "avgcount": 395, 
>> "sum": 4574325.900371083, 
>> "avgtime": 11580.571899673 
>> }, 
>> "active_latency": { 
>> "avgcount": 425, 
>> "sum": 4575107.630123680, 
>> "avgtime": 10764.959129702 
>> }, 
>> "replicaactive_latency": { 
>> "avgcount": 589, 
>> "sum": 8975184.499049954, 
>> "avgtime": 15238.004242869 
>> }, 
>> "stray_latency": { 
>> "avgcount": 818, 
>> "sum": 800.729455666, 
>> "avgtime": 0.978886865 
>> }, 
>> "getinfo_latency": { 
>> "avgcount": 550, 
>> "sum": 15.085667048, 
>> "avgtime": 0.027428485 
>> }, 
>> "getlog_latency": { 
>> "avgcount": 546, 
>> "sum": 3.482175693, 
>> "avgtime": 0.006377611 
>> }, 
>> "waitactingchange_latency": { 
>> "avgcount": 39, 
>> "sum": 35.444551284, 
>> "avgtime": 0.908834648 
>> }, 
>> "incomplete_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "down_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "getmissing_latency": { 
>> "avgcount": 507, 
>> "sum": 6.702129624, 
>> "avgtime": 0.013219190 
>> }, 
>> "waitupthru_latency": { 
>> "avgcount": 507, 
>> "sum": 474.098261727, 
>> "avgtime": 0.935105052 
>> }, 
>> "notrecovering_latency": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> }, 
>> "rocksdb": { 
>> "get": 28320977, 
>> "submit_transaction": 30484924, 
>> "submit_transaction_sync": 26371957, 
>> "get_latency": { 
>> "avgcount": 28320977, 
>> "sum": 325.900908733, 
>> "avgtime": 0.000011507 
>> }, 
>> "submit_latency": { 
>> "avgcount": 30484924, 
>> "sum": 1835.888692371, 
>> "avgtime": 0.000060222 
>> }, 
>> "submit_sync_latency": { 
>> "avgcount": 26371957, 
>> "sum": 1431.555230628, 
>> "avgtime": 0.000054283 
>> }, 
>> "compact": 0, 
>> "compact_range": 0, 
>> "compact_queue_merge": 0, 
>> "compact_queue_len": 0, 
>> "rocksdb_write_wal_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_memtable_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_delay_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> }, 
>> "rocksdb_write_pre_and_post_time": { 
>> "avgcount": 0, 
>> "sum": 0.000000000, 
>> "avgtime": 0.000000000 
>> } 
>> } 
>> } 
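
A dump of this size is easier to read when the latency counters are ranked. The sketch below assumes the layout shown above, where each latency counter is an object with avgcount and sum, and recomputes the average as sum / avgcount. The function name rank_latencies is illustrative; the sample values are copied from the "bluestore" section of this dump.

```python
def rank_latencies(perf, section):
    """Return [(counter, average seconds)] for one section of a perf dump,
    highest average first. Plain numeric counters and latency counters
    with no samples (avgcount == 0) are skipped."""
    rows = []
    for name, value in perf.get(section, {}).items():
        if isinstance(value, dict) and "sum" in value and value.get("avgcount", 0) > 0:
            rows.append((name, value["sum"] / value["avgcount"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# A few counters copied from the perf dump above.
sample_perf = {"bluestore": {
    "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603},
    "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696},
    "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331},
    "compress_lat": {"avgcount": 0, "sum": 0.0},
    "bluestore_txc": 30484924,
}}

if __name__ == "__main__":
    for name, avg in rank_latencies(sample_perf, "bluestore"):
        print(f"{name:28s} {avg * 1000:10.3f} ms")
```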
>> 
>> ----- Original Message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Tuesday, 5 February 2019 18:56:51 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>> (but I have bluestore_fragmentation_micros) 
>>> OK, this is the same counter: 
>>> 
>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
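
As a rough illustration of the ratio described in that help string, here is a minimal sketch. The scale of 1000 is taken from the quoted text; the counter name says "micros", so the real implementation may scale differently. How Ceph determines the maximum possible number of free extents is not stated in this thread, so it is left as an input, and the function name is hypothetical.

```python
def fragmentation_score(free_extents, max_free_extents, scale=1000):
    """(free extents / max possible number of free extents) * scale.

    Both extent counts are inputs here; deriving them from an allocator
    is outside the scope of this sketch."""
    if max_free_extents <= 0:
        return 0.0
    return free_extents * scale / max_free_extents


if __name__ == "__main__":
    # Hypothetical numbers: 1 free extent out of a possible 4.
    print(fragmentation_score(1, 4))
```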
>>> 
>>> 
>>> Here is a graph for the last month, with bluestore_fragmentation_micros and latency: 
>>> 
>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>> it? Is it the same for other OSDs? 
>> 
>> This proves there is some issue with the allocator: generally fragmentation 
>> might grow, but it shouldn't reset on restart. It looks like some intervals 
>> aren't properly merged at run time. 
>> 
>> On the other hand, I'm not completely sure that the latency degradation is 
>> caused by that. The fragmentation growth is relatively small, and I don't see 
>> how it could impact performance that much. 
>> 
>> Do you have any OSD mempool monitoring reports (dump_mempools command 
>> output on the admin socket)? Do you have any historic data? 
>> 
>> If not, may I have the current output and, say, a couple more samples at 
>> 8-12 hour intervals? 
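
Taking such samples can be scripted. The sketch below shells out to the admin socket commands that appear in this thread ("ceph daemon osd.N perf dump" and dump_mempools) and returns both outputs with a timestamp. The function names and the injectable run/now parameters are illustrative choices made so the logic can be exercised without a running OSD; scheduling the samples (cron, a loop) is left out.

```python
import json
import subprocess
import time


def admin_socket_json(osd_id, command, run=subprocess.check_output):
    """Run `ceph daemon osd.<id> <command>` and parse its JSON output."""
    argv = ["ceph", "daemon", f"osd.{osd_id}", *command.split()]
    return json.loads(run(argv))


def snapshot(osd_id, run=subprocess.check_output, now=time.time):
    """Collect one timestamped sample of mempool and perf counters."""
    return {
        "timestamp": now(),
        "mempools": admin_socket_json(osd_id, "dump_mempools", run),
        "perf": admin_socket_json(osd_id, "perf dump", run),
    }


if __name__ == "__main__":
    # Requires access to the OSD admin socket on the local host.
    print(json.dumps(snapshot(0), indent=2))
```

Each sample can then be written to a file named after its timestamp and compared with earlier ones.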
>> 
>> 
>> With regard to backporting the bitmap allocator to mimic: we had no such plans 
>> before, but I'll discuss this at the BlueStore meeting shortly. 
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>>> ----- Original Message ----- 
>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 4 February 2019 16:04:38 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Thanks Igor, 
>>> 
>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>> startup and once you get high latency. 
>>>>> 
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>> I'm already monitoring with 
>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>> 
>>> but I don't see an l_bluestore_fragmentation counter. 
>>> 
>>> (I do have bluestore_fragmentation_micros, though.) 
>>> 
>>> 
>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>> patch to track latency and some other internal allocator parameters to 
>>>>> make sure it's degraded and learn more details. 
>>> Sorry, it's a critical production cluster, so I can't test on it :( 
>>> But I have a test cluster; maybe I can put some load on it and try to reproduce the problem. 
>>> 
>>> 
>>> 
>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>> and try the difference... 
>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 4 February 2019 15:51:30 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi Alexandre, 
>>> 
>>> This looks like a bug in StupidAllocator. 
>>> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and again once you get high latency? 
>>> 
>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>> 
>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>> patch to track latency and some other internal allocator parameters to 
>>> make sure it's degraded and learn more details. 
>>> 
>>> 
>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>> and try the difference... 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>> 
>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>> Hi again, 
>>>> 
>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>
>>>>
>>>> I have noticed something using a simple "perf top".
>>>>
>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>
>>>> when latency is bad, perf top gives me:
>>>> 
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>> 
>>>> (around 10-20% of the time for both)
>>>>
>>>>
>>>> When latency is good, I don't see them at all.
>>>>
>>>>
>>>> I used Mark's wallclock profiler; here are the results:
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>>
>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>>
>>>>
>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>> 
>>>> 
>>>> + 100.00% clone 
>>>> + 100.00% start_thread 
>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 4 February 2019 09:38:11
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> Hi, 
>>>> 
>>>> Some news:
>>>>
>>>> I tried different transparent hugepage values (madvise, never): no change.
>>>>
>>>> I tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>
>>>> I tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait a few more days to be sure.)
>>>>
>>>>
>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB);
>>>> my other clusters use 1.6TB ssds.
>>>>
>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but this week I'll try 2 osds per nvme to see if it helps.
>>>>
>>>>
>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>> 
>>>> 
>>>> Regards, 
>>>> 
>>>> Alexandre 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: "aderumier" <aderumier@odiso.com>
>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>> 
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>> I monitor reads too. (I have all the metrics from the osd sockets, and a lot of graphs.)
>>>>
>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase).
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi,
>>>>
>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>> Hi Stefan, 
>>>>> 
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>> I need to compare with bigger latencies.
>>>>>
>>>>> Here is an example, with all osds at 20-50ms before restart, then at 1ms after restart (at 21:15):
>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>
>>>>> I observe the latency in my guest vms too, as disk iowait.
>>>>>
>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>> Here are my influxdb queries:
>>>>>
>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>> 
>>>>> 
>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
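
The queries above divide the rate of change of a counter's "sum" by the rate of change of its "avgcount". As a minimal sketch of the same arithmetic outside InfluxDB, assuming two successive perf dump samples of one latency counter (the function name and the sample values are hypothetical, not taken from the thread):

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds of the ops completed between two samples.

    Each sample is a dict with 'sum' (cumulative seconds) and 'avgcount'
    (cumulative op count), the shape used by latency counters in a perf dump.
    Returns None when no ops completed or a counter went backwards (for
    example after an OSD restart), in the spirit of non_negative_derivative.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples: 500 ops completed in the interval, taking 1.0 s in total.
prev_sample = {"avgcount": 1000, "sum": 1.0}
curr_sample = {"avgcount": 1500, "sum": 2.0}
print(interval_avg_latency(prev_sample, curr_sample))
```

Dividing the two deltas gives the average over the sampling interval only, whereas the "avgtime" field of a perf dump is the average since the counters were last reset.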
>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>> op_r_latency but instead op_latency? 
>>>> 
>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>> 
>>>> greets, 
>>>> Stefan 
>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>> Hi, 
>>>>>> 
>>>>>> Here are some new results,
>>>>>> from a different osd on a different cluster.
>>>>>>
>>>>>> Before the osd restart, latency was between 2-5ms;
>>>>>> after the osd restart, it is around 1-1.5ms.
>>>>>>
>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>
>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>> 
>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>> exact values out of the daemon do you use for bluestore? 
>>>>> 
>>>>> I would like to check if i see the same behaviour. 
>>>>> 
>>>>> Greets, 
>>>>> Stefan 
>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>> going on one of the OSDs with a high latency?
>>>>>> 
>>>>>> Thanks! 
>>>>>> sage 
>>>>>> 
>>>>>> 
>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> I'm seeing strange behaviour on my osds, on multiple clusters.
>>>>>>>
>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives;
>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>
>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>
>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>> values like 20-200ms.
>>>>>>>
>>>>>>> Some example graphs:
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>
>>>>>>> All osds have this behaviour, in all clusters.
>>>>>>>
>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>>>>
>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>
>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>
>>>>>>> Any hints for counters/logs to check?
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>> _______________________________________________ 
>>>>>> ceph-users mailing list 
>>>>>> ceph-users@lists.ceph.com 
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <19368722.1223708.1550237472044.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                             ` <19368722.1223708.1550237472044.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-15 13:50                                                                               ` Wido den Hollander
       [not found]                                                                                 ` <056c13b4-fbcf-787f-cfbe-bb37044161f8-fspyXLx8qC4@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Wido den Hollander @ 2019-02-15 13:50 UTC (permalink / raw)
  To: Alexandre DERUMIER, Igor Fedotov; +Cc: ceph-users, ceph-devel

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:
> Thanks Igor.
> 
> I'll try to create multiple osds per nvme disk (6TB) to see if the behaviour is different.
> 
> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there.
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

But we noticed this on two different 12.2.10/11 clusters.

A restart made the latency drop. Not only the numbers, but the
real-world latency as experienced by a VM as well.

Wido

> 
> 
> 
> 
> 
> ----- Original Message -----
> From: "Igor Fedotov" <ifedotov@suse.de>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, 15 February 2019 13:47:57
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> Hi Alexander, 
> 
> I've read through your reports, nothing obvious so far. 
> 
> I can only see a several-fold increase in average latency for OSD write ops
> (in seconds):
> 0.002040060 (first hour) vs.
> 0.002483516 (last 24 hours) vs.
> 0.008382087 (last hour)
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, osd_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
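
To put the figures above side by side, here is a minimal sketch that computes the last-hour to first-hour ratio for each of the three series. The dictionary labels are descriptive only (the message does not give a counter name for the write ops series), and the helper name is hypothetical:

```python
# Average latencies in seconds, copied from the figures quoted above.
# Labels are descriptive; they are not guaranteed to be perf counter names.
first_hour = {
    "write ops": 0.002040060,
    "subop_w_latency": 0.000478934,
    "read ops": 0.000408595,
}
last_hour = {
    "write ops": 0.008382087,
    "subop_w_latency": 0.003073475,
    "read ops": 0.004979540,
}


def latency_growth(before, after):
    """Return {label: after / before} for labels present in both inputs."""
    return {
        label: after[label] / before[label]
        for label in before
        if label in after and before[label] > 0
    }


growth = latency_growth(first_hour, last_hour)
# Largest growth first.
for label, factor in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {factor:.1f}x")
```

With these inputs the growth is roughly 12x for read ops, 6x for subop_w_latency and 4x for write ops.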
> 
> What's interesting is that such latency differences are observed at
> neither the BlueStore level (any _lat params under the "bluestore" section) nor
> the rocksdb one.
> 
> Which probably means that the issue is rather somewhere above BlueStore.
> 
> I suggest proceeding with perf dump collection to see if the picture
> stays the same.
> 
> W.r.t. the memory usage you observed, I see nothing suspicious so far - no
> decrease in the RSS report is a known artifact that seems to be safe.
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>>
>> Thanks again for helping ! 
>>
>>
>>
>> I upgraded to the latest mimic this weekend, and with the new memory autotuning,
>> I have set osd_memory_target to 8G. (my nvme are 6TB)
>>
>>
>> I have taken a lot of perf dumps, mempool dumps and ps of the process (to see rss memory) at different hours;
>> here are the reports for osd.0:
>>
>> http://odisoweb1.odiso.net/perfanalysis/ 
>>
>>
>> The osd was started on 12-02-2019 at 08:00.
>>
>> First report, after 1h running:
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>>
>>
>>
>> Report after 24h, before the counter reset:
>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>>
>> Report 1h after the counter reset:
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>>
>>
>>
>>
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00:
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>> Then after that, it slowly decreases.
>>
>>
>> Another strange thing:
>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>
>>
>> I've been graphing the mempool counters too since yesterday, so I'll be able to track them over time.
>>
>> ----- Original Message -----
>> From: "Igor Fedotov" <ifedotov@suse.de>
>> To: "Alexandre Derumier" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 11 February 2019 12:03:17
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>>
>>> Biggest difference: 
>>>
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)
>>>
>>>
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>>
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>>
>> What relates to the allocator, and most probably to fragmentation growth, is:
>>
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>>
>> which had been higher before the reset (if I got these dumps' order 
>> properly) 
>>
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>>
>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>> latency increase... 
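
The comparison above can also be done mechanically. A minimal sketch, assuming two dump_mempools outputs parsed into dicts with the "mempool" / "by_pool" layout shown in this thread; the helper name is hypothetical and the sample data is trimmed to three pools from the dumps posted here:

```python
def mempool_byte_deltas(before, after):
    """Return [(pool, bytes_after - bytes_before)], largest absolute change first."""
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    deltas = {
        name: pools_after[name]["bytes"] - pools_before[name]["bytes"]
        for name in pools_before
        if name in pools_after
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Values taken from the two dumps in this thread (before and after the restart).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for name, delta in mempool_byte_deltas(before_restart, after_restart):
    print(f"{name}: {delta:+d} bytes")
```

Sorting by absolute change puts the cache rebalancing first (bluestore_cache_data up, bluestore_cache_other down, about 1 GB each), with the bluestore_alloc drop of about 45 MB after it.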
>>
>> Do you have perf counters dump after the restart? 
>>
>> Could you collect some more dumps - for both mempool and perf counters? 
>>
>> So ideally I'd like to have: 
>>
>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>
>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>
>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>> restart) and dump mempool/perf counters again. 
>>
>> So we'll be able to learn both allocator mem usage growth and operation 
>> latency distribution for the following periods: 
>>
>> a) 1st hour after restart 
>>
>> b) 25th hour. 
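
The collection steps above could be scripted roughly as follows. This is only a sketch under assumptions: it runs on the OSD host with access to the admin socket, "perf dump" and "dump_mempools" are the commands already used in this thread, and "perf reset all" is assumed to be the admin socket command for resetting the counters. The helper names are hypothetical.

```python
import json
import subprocess


def admin_socket_argv(osd_id, *command):
    """Build the argv for an admin socket command, e.g. 'ceph daemon osd.0 perf dump'."""
    return ["ceph", "daemon", f"osd.{osd_id}", *command]


def snapshot(osd_id, run=subprocess.check_output):
    """Collect perf counters and mempool stats for one OSD as parsed JSON."""
    return {
        "perf": json.loads(run(admin_socket_argv(osd_id, "perf", "dump"))),
        "mempools": json.loads(run(admin_socket_argv(osd_id, "dump_mempools"))),
    }


def reset_perf_counters(osd_id, run=subprocess.check_call):
    """Reset all perf counters (assumed command: 'perf reset all')."""
    run(admin_socket_argv(osd_id, "perf", "reset", "all"))


# Intended sequence, following the three steps above (not executed here):
#   first = snapshot(0)       # 1) about 1 hour after the OSD restart
#   day = snapshot(0)         # 2) 24+ hours after the restart
#   reset_perf_counters(0)    # 3) reset, wait 1 hour without restarting the OSD
#   last = snapshot(0)        #    then dump again
```

The `run` parameter exists only so that the command invocation can be replaced in a dry run; by default the functions call the ceph CLI directly.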
>>
>>
>> Thanks, 
>>
>> Igor 
>>
>>
>>> full mempool dump after restart 
>>> ------------------------------- 
>>>
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>>
>>> ----- Original Message -----
>>> From: "aderumier" <aderumier@odiso.com>
>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday, 8 February 2019 16:14:54
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> I'm just seeing 
>>>
>>> StupidAllocator::_aligned_len 
>>> and 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>>
>>> on 1 osd, both 10%. 
>>>
>>> here the dump_mempools 
>>>
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>>
>>>
>>> and the perf dump 
>>>
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
>>>
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "aderumier" <aderumier@odiso.com> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>>
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>
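To make the counter description quoted above concrete, here is a small Python sketch of the ratio it states. It only mirrors the description string; the function name and the sample numbers are hypothetical and this is not the allocator's actual code.

```python
def fragmentation_score(free_extents, max_free_extents):
    """Ratio from the counter description:
    (free extents / max possible number of free extents) * 1000."""
    if max_free_extents <= 0:
        raise ValueError("max_free_extents must be positive")
    return free_extents / max_free_extents * 1000


# Hypothetical numbers: 5,000 free extents where up to 1,000,000 would be possible.
print(fragmentation_score(5_000, 1_000_000))
```

A few large free extents give a low score, while many small ones push it toward 1000.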
>>>>
>>>> Here is a graph of the last month, showing bluestore_fragmentation_micros and latency: 
>>>>
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>> it? Is it the same for other OSDs? 
>>>
>>> This proves some issue with the allocator - generally fragmentation 
>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>> aren't properly merged in run-time. 
>>>
>>> On the other hand, I'm not completely sure that the latency degradation is 
>>> caused by that - fragmentation growth is relatively small - I don't see 
>>> how this might impact performance that much. 
>>>
>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>> output on admin socket) reports? Do you have any historic data? 
>>>
>>> If not may I have current output and say a couple more samples with 
>>> 8-12 hours interval? 
>>>
>>>
>>> Regarding backporting the bitmap allocator to mimic - we haven't had such 
>>> plans before, but I'll discuss this at the BlueStore meeting shortly. 
>>>
>>>
>>> Thanks, 
>>>
>>> Igor 
>>>
>>>> ----- Original Message ----- 
>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>
>>>> Thanks Igor, 
>>>>
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> I'm already monitoring with 
>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>
>>>> (but I have bluestore_fragmentation_micros) 
>>>>
>>>>
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>> make sure it's degraded and learn more details. 
>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>
>>>>
>>>>
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>
>>>>
>>>>
>>>> ----- Original Message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>
>>>> Hi Alexandre, 
>>>>
>>>> looks like a bug in StupidAllocator. 
>>>>
>>>> Could you please collect BlueStore performance counters right after OSD 
>>>> startup and once you get high latency. 
>>>>
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>
>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>> patch to track latency and some other internal allocator's parameter to 
>>>> make sure it's degraded and learn more details. 
>>>>
>>>>
>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>> and try the difference... 
>>>>
>>>>
>>>> Thanks, 
>>>>
>>>> Igor 
>>>>
>>>>
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>>
>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>
>>>>>
>>>>> I have noticed something using a simple "perf top". 
>>>>>
>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>
>>>>> when latency is bad, perf top gives me: 
>>>>>
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>>
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>
>>>>> (around 10-20% time for both) 
>>>>>
>>>>>
>>>>> when latency is good, I don't see them at all. 
>>>>>
>>>>>
>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>
>>>>>
>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len: 
>>>>>
>>>>>
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Hi, 
>>>>>
>>>>> some news: 
>>>>>
>>>>> I have tried different transparent hugepage values (madvise, never): no change 
>>>>>
>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change 
>>>>>
>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256 MB: it seems to help; after 24h I'm still around 1.5 ms. (I need to wait a few more days to be sure.) 
>>>>>
>>>>>
>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>> my other clusters use 1.6TB SSDs. 
>>>>>
>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps. 
>>>>>
>>>>>
>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>
>>>>>
>>>>> Regards, 
>>>>>
>>>>> Alexandre 
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.) 
>>>>>
>>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase). 
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Hi, 
>>>>>
>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>> Hi Stefan, 
>>>>>>
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>> I need to compare with bigger latencies. 
>>>>>>
>>>>>> Here is an example, where all OSDs were at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>
>>>>>> I observe the latency in my guest VMs too, in disk iowait. 
>>>>>>
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> Here are my influxdb queries: 
>>>>>>
>>>>>> They take op_latency.sum / op_latency.avgcount over the last second. 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
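The queries above divide the change in a counter's `sum` by the change in its `avgcount`, which gives the average latency of only the operations completed in that interval (the `avgtime` field in a perf dump is the average since the counters were last reset). A minimal Python sketch of the same arithmetic on two perf dump samples follows; the function name is hypothetical, and the sample values are invented for illustration.

```python
def interval_latency(prev, curr, counter="op_w_latency"):
    """Average latency in seconds for ops completed between two
    'perf dump' samples, or None if nothing completed or the
    counters went backwards (e.g. OSD restart or counter reset)."""
    d_sum = curr["osd"][counter]["sum"] - prev["osd"][counter]["sum"]
    d_count = curr["osd"][counter]["avgcount"] - prev["osd"][counter]["avgcount"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Invented samples: 1000 more writes took 3.0 more seconds in total.
prev = {"osd": {"op_w_latency": {"avgcount": 1000, "sum": 100.0}}}
curr = {"osd": {"op_w_latency": {"avgcount": 2000, "sum": 103.0}}}
print(interval_latency(prev, curr))
```

Returning None on a negative delta plays the role of `non_negative_derivative` in the queries: a restart resets the counters and would otherwise produce a bogus negative latency.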
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>>
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>
>>>>> greets, 
>>>>> Stefan 
>>>>>
>>>>>> ----- Original Message ----- 
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>> Hi, 
>>>>>>>
>>>>>>> Here are some new results, 
>>>>>>> from a different osd / different cluster. 
>>>>>>>
>>>>>>> Before the osd restart, latency was between 2-5ms; 
>>>>>>> after the osd restart, it is around 1-1.5ms. 
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>
>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>
>>>>>> I would like to check if i see the same behaviour. 
>>>>>>
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>>
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>
>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>
>>>>>>>> Hi, 
>>>>>>>>
>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>
>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives, 
>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>>
>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>
>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>> values like 20-200ms. 
>>>>>>>>
>>>>>>>> Some example graphs: 
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>
>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>
>>>>>>>> The latency of the physical disks is ok. (The clusters are far from being fully loaded.) 
>>>>>>>>
>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms. 
>>>>>>>>
>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>
>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards, 
>>>>>>>>
>>>>>>>> Alexandre 
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>>
>>
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>>
>> Thanks again for helping ! 
>>
>>
>>
>> I upgraded to the latest mimic this weekend, and with the new memory autotuning, 
>> I have set osd_memory_target to 8G. (my nvme are 6TB) 
>>
>>
>> I have taken a lot of perf dumps, mempool dumps, and ps output of the process (to see RSS memory) at different hours; 
>> here are the reports for osd.0: 
>>
>> http://odisoweb1.odiso.net/perfanalysis/ 
>>
>>
>> The osd was started on 12-02-2019 at 08:00. 
>>
>> First report, after 1h running: 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>
>>
>>
>> Report after 24h, before the counter reset: 
>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>
>>
>>
>>
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, it slowly decreases. 
>>
>>
>> Another strange thing: 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>
>>
>> I have been graphing the mempool counters too since yesterday, so I'll be able to track them over time. 
>>
>> ----- Original Message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Monday, 11 February 2019 12:03:17 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>>
>>> Biggest difference: 
>>>
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (the other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>>
>>>
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>>
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>>
>> What does relate to the allocator, and most probably to fragmentation growth, is:
>>
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>>
>> which had been higher before the reset (if I got the order of these
>> dumps right)
>>
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>>
>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>> latency increase... 
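
The before/after comparison made above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the helper names `mempool_bytes` and `mempool_delta` are hypothetical, and the input is assumed to be the parsed JSON of a dump_mempools output. The figures are the ones quoted in this thread, restricted to three pools.

```python
def mempool_bytes(dump):
    """Return {pool_name: bytes} from a parsed dump_mempools document."""
    return {name: pool["bytes"] for name, pool in dump["mempool"]["by_pool"].items()}


def mempool_delta(before, after):
    """Per-pool byte change (after - before), largest absolute change first."""
    b, a = mempool_bytes(before), mempool_bytes(after)
    delta = {name: a.get(name, 0) - b.get(name, 0) for name in set(a) | set(b)}
    return sorted(delta.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Values copied from the two dumps quoted in this thread (three pools only).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for pool, change in mempool_delta(before_restart, after_restart):
    print(f"{pool:24s} {change:+d} bytes")
```

Sorting by absolute change puts the cache pools first, which matches the observation that the data/metadata cache rebalancing dominates while the bluestore_alloc change is comparatively small.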
>>
>> Do you have perf counters dump after the restart? 
>>
>> Could you collect some more dumps - for both mempool and perf counters? 
>>
>> So ideally I'd like to have: 
>>
>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>
>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>
>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>> restart) and dump mempool/perf counters again. 
>>
>> So we'll be able to learn both allocator mem usage growth and operation 
>> latency distribution for the following periods: 
>>
>> a) 1st hour after restart 
>>
>> b) 25th hour. 
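
The collection steps above could be scripted roughly as follows. This is a minimal sketch, not a tested tool: `admin_cmd`, `snapshot` and `reset_perf` are hypothetical helper names, the "perf dump" and "dump_mempools" admin-socket commands are the ones mentioned in this thread, and "perf reset all" is assumed to be available on the installed release (please check before relying on it). The command runner is injectable so the logic can be exercised without a running OSD.

```python
import json
import subprocess


def admin_cmd(osd_id, *args):
    """Build a 'ceph daemon osd.N ...' admin-socket command line."""
    return ["ceph", "daemon", f"osd.{osd_id}", *args]


def run_json(cmd, runner=subprocess.check_output):
    """Run a command and parse its stdout as JSON."""
    return json.loads(runner(cmd))


def snapshot(osd_id, runner=subprocess.check_output):
    """Collect one mempool dump and one perf counter dump for an OSD."""
    return {
        "mempools": run_json(admin_cmd(osd_id, "dump_mempools"), runner),
        "perf": run_json(admin_cmd(osd_id, "perf", "dump"), runner),
    }


def reset_perf(osd_id, runner=subprocess.check_output):
    """Reset all perf counters (assumed command: 'perf reset all')."""
    runner(admin_cmd(osd_id, "perf", "reset", "all"))


# Intended sequence (run on the OSD host, not executed here):
#   1) snapshot(4) about 1 hour after the restart
#   2) snapshot(4) 24+ hours after the restart
#   3) reset_perf(4), wait 1 hour without restarting, then snapshot(4) again
```

Keeping the three snapshots as separate JSON files makes it easy to compare allocator memory between 1) and 2), and latency distribution between 1) and 3).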
>>
>>
>> Thanks, 
>>
>> Igor 
>>
>>
>>> full mempool dump after restart 
>>> ------------------------------- 
>>>
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>>
>>> ----- Original Message -----
>>> From: "aderumier" <aderumier@odiso.com>
>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday 8 February 2019 16:14:54
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> I'm just seeing 
>>>
>>> StupidAllocator::_aligned_len 
>>> and 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>
>>> on 1 osd, both at 10%.
>>>
>>> Here is the dump_mempools output:
>>>
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>>
>>>
>>> And here is the perf dump:
>>>
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
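
Each latency counter in the dump above is a cumulative (avgcount, sum) pair, and avgtime is the sum divided by the count since the counters were last reset. The sketch below, with hypothetical helper names `avg_latency` and `interval_latency`, shows that arithmetic and how two successive dumps can be differenced to get the average over just the interval between them, which is the reason for resetting or differencing counters before comparing a "good" hour with a "bad" one. The sample values are the bluestore commit_lat figures from the dump above.

```python
def avg_latency(counter):
    """Average seconds per operation for one cumulative latency counter."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0


def interval_latency(earlier, later):
    """Average latency over the interval between two dumps of the same counter."""
    dcount = later["avgcount"] - earlier["avgcount"]
    dsum = later["sum"] - earlier["sum"]
    return dsum / dcount if dcount > 0 else 0.0


# bluestore commit_lat as quoted in the perf dump above.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}
print(f"lifetime average: {avg_latency(commit_lat) * 1000:.3f} ms")
```

A lifetime average like this one hides a slow drift: a counter that has accumulated many fast operations will barely move when the most recent hour is slow, so `interval_latency` on two dumps taken an hour apart is the more useful number.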
>>>
>>> ----- Original Message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Tuesday 5 February 2019 18:56:51
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>>
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
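
Read literally, the description string above defines the counter as a ratio scaled by 1000. The following sketch only restates that description in code; it is not the BlueStore implementation, and the function name `fragmentation_score` and its parameters are hypothetical. Integer arithmetic is used so the result is an exact whole number.

```python
def fragmentation_score(free_extents, max_free_extents, scale=1000):
    """(free extents / max possible number of free extents) * scale, as an integer.

    Returns 0 when there is no free space to be fragmented.
    """
    if max_free_extents <= 0:
        return 0
    return free_extents * scale // max_free_extents


# Hypothetical illustration: 67 free extents where at most 1000 could exist.
print(fragmentation_score(67, 1000))
```

On this reading, a value near 0 means free space sits in a few large extents and a value near 1000 means it is split into as many extents as possible.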
>>>>
>>>>
>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>>>>
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>>> it? Is it the same for other OSDs?
>>>
>>> This proves some issue with the allocator - generally fragmentation
>>> might grow, but it shouldn't reset on restart. It looks like some intervals
>>> aren't properly merged at run time.
>>>
>>> On the other hand, I'm not completely sure that the latency degradation is
>>> caused by that - the fragmentation growth is relatively small, and I don't see
>>> how it could impact performance that much.
>>>
>>> I'm wondering whether you have OSD mempool monitoring reports
>>> (dump_mempools command output on the admin socket). Do you have any historic data?
>>>
>>> If not, may I have the current output and, say, a couple more samples at
>>> 8-12 hour intervals?
>>>
>>>
>>> With regard to backporting the bitmap allocator to mimic - we haven't had such
>>> plans so far, but I'll discuss this at the BlueStore meeting shortly.
>>>
>>>
>>> Thanks, 
>>>
>>> Igor 
>>>
>>>> ----- Original Message -----
>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday 4 February 2019 16:04:38
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Thanks Igor, 
>>>>
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> I'm already monitoring with
>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters)
>>>>
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>
>>>> (but I have bluestore_fragmentation_micros) 
>>>>
>>>>
>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>> make sure it's degraded and learn more details.
>>>> Sorry, it's a critical production cluster, I can't test on it :(
>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
>>>>
>>>>
>>>>
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>>>> and try the difference...
>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday 4 February 2019 15:51:30
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> Hi Alexandre, 
>>>>
>>>> looks like a bug in StupidAllocator. 
>>>>
>>>> Could you please collect BlueStore performance counters right after OSD 
>>>> startup and once you get high latency. 
>>>>
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>
>>>> Also if you're able to rebuild the code I can probably make a simple
>>>> patch to track latency and some other internal allocator parameters to
>>>> make sure it's degraded and learn more details.
>>>>
>>>>
>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>> and try the difference...
>>>>
>>>>
>>>> Thanks, 
>>>>
>>>> Igor 
>>>>
>>>>
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>>
>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>
>>>>>
>>>>> I have noticed something using a simple "perf top".
>>>>>
>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>
>>>>> when latency is bad, perf top gives me:
>>>>>
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>
>>>>> (around 10-20% of the time for both)
>>>>>
>>>>>
>>>>> when latency is good, I don't see them at all. 
>>>>>
>>>>>
>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>
>>>>>
>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len:
>>>>>
>>>>>
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Hi, 
>>>>>
>>>>> Some news: 
>>>>>
>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>
>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>
>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256 MB: it seems to help; after 24h I'm still around 1.5 ms (I need to wait a few more days to be sure). 
>>>>>
>>>>>
>>>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>> my other clusters use 1.6TB SSDs. 
>>>>>
>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe to see if it helps. 
>>>>>
>>>>>
>>>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>
>>>>>
>>>>> Regards, 
>>>>>
>>>>> Alexandre 
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.) 
>>>>>
>>>>> I just don't see a latency difference on reads (or it is very, very small compared to the write latency increase). 
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Hi, 
>>>>>
>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>> Hi Stefan, 
>>>>>>
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>> It may be bluestore related (I don't have filestore anymore to compare). 
>>>>>> I need to compare with bigger latencies. 
>>>>>>
>>>>>> Here is an example where all OSDs are at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>
>>>>>> I observe the latency in my guest VMs too, in disk iowait. 
>>>>>>
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> Here are my influxdb queries: 
>>>>>>
>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
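The queries above divide the growth of a counter's "sum" by the growth of its "avgcount" over an interval. A minimal Python sketch of the same arithmetic on two raw samples follows; the function name and the sample values are hypothetical, and returning None when a counter goes backwards is an assumption meant to loosely mirror what non_negative_derivative filters out.

```python
from typing import Optional


def interval_avg_latency(prev: dict, curr: dict) -> Optional[float]:
    """Average latency (seconds) of the ops completed between two samples.

    Each sample is a perf-dump style counter: {"avgcount": int, "sum": float}.
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    # A counter that went backwards (OSD restart, perf reset) or an interval
    # without any op gives no meaningful average.
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples of op_w_latency taken one interval apart.
earlier = {"avgcount": 1000, "sum": 1.0}
later = {"avgcount": 1500, "sum": 2.0}
print(interval_avg_latency(earlier, later))  # 1.0 s spread over 500 ops
```

Unlike the cumulative "avgtime" field in a perf dump, this value only reflects the ops of the chosen interval, which is why it shows the slow drift on a graph.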
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>>
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>
>>>>> greets, 
>>>>> Stefan 
>>>>>
>>>>>>
>>>>>> ----- Original Message ----- 
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>> Hi, 
>>>>>>>
>>>>>>> here some new results, 
>>>>>>> different osd/ different cluster 
>>>>>>>
>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>
>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>
>>>>>> I would like to check if i see the same behaviour. 
>>>>>>
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>>
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>
>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>
>>>>>>>> Hi, 
>>>>>>>>
>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>
>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>>>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>
>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>
>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>> values like 20-200ms. 
>>>>>>>>
>>>>>>>> Some example graphs: 
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>
>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>
>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.) 
>>>>>>>>
>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>
>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>
>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards, 
>>>>>>>>
>>>>>>>> Alexandre 
>>>>>>>>
>>>>>>>>
>>>>>>> _______________________________________________ 
>>>>>>> ceph-users mailing list 
>>>>>>> ceph-users@lists.ceph.com 
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>


* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                 ` <056c13b4-fbcf-787f-cfbe-bb37044161f8-fspyXLx8qC4@public.gmane.org>
@ 2019-02-15 13:54                                                                                   ` Alexandre DERUMIER
       [not found]                                                                                     ` <1345632100.1225626.1550238886648.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  2019-02-28 20:57                                                                                   ` Stefan Kooman
  1 sibling, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-15 13:54 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-users, ceph-devel

>>Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>OSDs as well. Over time their latency increased until we started to 
>>notice I/O-wait inside VMs. 

I also notice it in the VMs. BTW, what is your NVMe disk size?


>>A restart fixed it. We also increased memory target from 4G to 6G on 
>>these OSDs as the memory would allow it. 

I set the memory target to 6GB this morning, with 2 OSDs of 3TB each on the 6TB NVMe.
(My last test was 8GB with 1 OSD of 6TB, but that didn't help.)


----- Original Message -----
From: "Wido den Hollander" <wido@42on.com>
To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Friday, 15 February 2019 14:50:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
> Thanks Igor. 
> 
> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different. 
> 
> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there. 
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
OSDs as well. Over time their latency increased until we started to 
notice I/O-wait inside VMs. 

A restart fixed it. We also increased memory target from 4G to 6G on 
these OSDs as the memory would allow it. 

But we noticed this on two different 12.2.10/11 clusters. 

A restart made the latency drop. Not only the numbers, but the 
real-world latency as experienced by a VM as well. 

Wido 

> 
> 
> 
> 
> 
> ----- Original Message ----- 
> From: "Igor Fedotov" <ifedotov@suse.de> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday, 15 February 2019 13:47:57 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> Hi Alexandre, 
> 
> I've read through your reports; nothing obvious so far. 
> 
> The only thing I can see is a several-fold increase in average latency for OSD write ops 
> (in seconds): 
> 0.002040060 (first hour) vs. 
> 
> 0.002483516 (last 24 hours) vs. 
> 0.008382087 (last hour) 
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, op_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
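To put a number on "several times", the ratios between the first-hour and last-hour averages can be computed directly. The following is only an illustrative sketch: the figures are copied from the message above, while the dictionary keys and the helper name are made up for the example.

```python
# Average latencies in seconds, copied from the figures quoted above.
latencies = {
    "OSD write ops": {"first_hour": 0.002040060, "last_24h": 0.002483516, "last_hour": 0.008382087},
    "subop_w_latency": {"first_hour": 0.000478934, "last_24h": 0.000537956, "last_hour": 0.003073475},
    "OSD read ops": {"first_hour": 0.000408595, "last_24h": 0.000709031, "last_hour": 0.004979540},
}


def fold_increase(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


ratios = {
    name: fold_increase(values["first_hour"], values["last_hour"])
    for name, values in latencies.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x slower in the last hour than in the first hour")
```

The 24-hour averages stay close to the first-hour values because they are cumulative since the OSD start, so most of the degradation is only visible once the counters are reset.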
> 
> What's interesting is that such latency differences are observed neither at the 
> BlueStore level (any _lat params under the "bluestore" section) nor at the 
> RocksDB one. 
> 
> Which probably means that the issue is somewhere above BlueStore. 
> 
> I suggest continuing to collect perf dumps to see if the picture 
> stays the same. 
> 
> W.r.t. the memory usage you observed, I see nothing suspicious so far. RSS not 
> decreasing is a known artifact that seems to be safe. 
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I upgraded to the latest mimic this weekend, and with the new memory autotuning 
>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.) 
>> 
>> 
>> I have taken a lot of perf dumps, mempool dumps and ps output of the process (to see RSS memory) at different hours; 
>> here are the reports for osd.0: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> The OSD was started on 12-02-2019 at 08:00. 
>> 
>> First report, after running for 1h: 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> Report after 24h, before the counter reset: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> Report 1h after the counter reset: 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00: 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> After that, it slowly decreases. 
>> 
>> 
>> Another strange thing: 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30: 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>> It then decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>> 
>> 
>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
>> 
>> ----- Original Message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Monday, 11 February 2019 12:03:17 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>> 
>>> Biggest difference: 
>>> 
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (the other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>> 
>>> 
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> 
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>> 
>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>> 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> 
>> which had been higher before the reset (if I got the order of these dumps 
>> right): 
>> 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> 
>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>> latency increase... 
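The per-pool comparison done by hand above can be automated. This is a minimal sketch, assuming two dump_mempools outputs already parsed from JSON with the "mempool" / "by_pool" layout shown in this thread; the helper names are hypothetical and the sample dictionaries only contain the three pools quoted above.

```python
def mempool_bytes(dump: dict) -> dict:
    """Map each pool name to its byte count."""
    return {name: pool["bytes"] for name, pool in dump["mempool"]["by_pool"].items()}


def mempool_delta(before: dict, after: dict) -> dict:
    """Per-pool change in bytes (after minus before)."""
    b, a = mempool_bytes(before), mempool_bytes(after)
    return {name: a.get(name, 0) - b.get(name, 0) for name in sorted(set(a) | set(b))}


# Trimmed-down dumps holding only values quoted in this thread.
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

delta = mempool_delta(before_restart, after_restart)
# Largest absolute changes first.
for name, change in sorted(delta.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {change:+d} bytes")
```

With real dumps, `json.load()` on the saved files would replace the two literal dictionaries.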
>> 
>> Do you have perf counters dump after the restart? 
>> 
>> Could you collect some more dumps - for both mempool and perf counters? 
>> 
>> So ideally I'd like to have: 
>> 
>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>> 
>> 2) mempool/perf counters dumps in 24+ hours after restart 
>> 
>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>> restart) and dump mempool/perf counters again. 
>> 
>> So we'll be able to learn both allocator mem usage growth and operation 
>> latency distribution for the following periods: 
>> 
>> a) 1st hour after restart 
>> 
>> b) 25th hour. 
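The collection steps listed above could be scripted roughly as follows. This is a sketch under assumptions: it relies on the admin-socket commands "perf dump" and "dump_mempools" seen in this thread, it assumes "perf reset all" is available on the running release, and the file naming and function names are purely hypothetical.

```python
import os
import subprocess
from typing import Callable, List


def dump_commands(osd_id: int) -> List[List[str]]:
    """Admin-socket commands for one snapshot (perf counters and mempools)."""
    daemon = f"osd.{osd_id}"
    return [
        ["ceph", "daemon", daemon, "perf", "dump"],
        ["ceph", "daemon", daemon, "dump_mempools"],
    ]


def reset_command(osd_id: int) -> List[str]:
    """Command to zero the perf counters before step 3 (assumed to be available)."""
    return ["ceph", "daemon", f"osd.{osd_id}", "perf", "reset", "all"]


def run(cmd: List[str]) -> str:
    """Run a command on the OSD host and return its standard output."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def snapshot(osd_id: int, label: str, out_dir: str = ".",
             runner: Callable[[List[str]], str] = run) -> List[str]:
    """Save one perf dump and one mempool dump, tagged with `label`."""
    paths = []
    for cmd in dump_commands(osd_id):
        kind = "perf" if cmd[-2:] == ["perf", "dump"] else cmd[-1]
        path = os.path.join(out_dir, f"osd.{osd_id}.{label}.{kind}.json")
        with open(path, "w") as fh:
            fh.write(runner(cmd))
        paths.append(path)
    return paths


# Intended sequence on the OSD host (not executed here):
#   snapshot(0, "1h-after-restart")      # step 1
#   snapshot(0, "24h-after-restart")     # step 2
#   run(reset_command(0))                # step 3: reset, then wait one hour
#   snapshot(0, "1h-after-reset")
```

The `runner` parameter exists only so the file handling can be exercised without a live cluster.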
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> 
>>> full mempool dump after restart 
>>> ------------------------------- 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Original Message ----- 
>>> From: "aderumier" <aderumier@odiso.com> 
>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Friday, 8 February 2019 16:14:54 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> I'm just seeing 
>>> 
>>> StupidAllocator::_aligned_len 
>>> and 
>>> 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>> 
>>> on 1 osd, both 10%. 
>>> 
>>> Here is the dump_mempools output: 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>> 
>>> 
>>> and the perf dump 
>>> 
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
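Editor's sketch: every latency entry in the dump above is a pair of `avgcount` and `sum` (seconds), and `avgtime` is simply their ratio. To find which counters dominate, one could walk the JSON and rank them by mean latency. The snippet below is a minimal illustration, assuming the output of `ceph daemon osd.x perf dump` is already available as JSON; the helper name `latency_counters` is hypothetical, and the sample holds only a few values copied from the dump.

```python
import json

def latency_counters(node, prefix=""):
    """Yield (name, avgcount, mean_seconds) for each avgcount/sum counter."""
    for key, value in node.items():
        if not isinstance(value, dict):
            continue  # plain integer counters carry no latency information
        name = f"{prefix}.{key}" if prefix else key
        if "avgcount" in value and "sum" in value:
            count = value["avgcount"]
            mean = value["sum"] / count if count else 0.0
            yield name, count, mean
        else:
            # Descend into sections such as "osd" or "rocksdb".
            yield from latency_counters(value, name)

# A few entries copied from the dump above; a real run would load the full JSON.
sample = json.loads("""
{
  "osd": {
    "op_r_latency": {"avgcount": 6188565, "sum": 4507.365756792},
    "op_w_latency": {"avgcount": 10546037, "sum": 33160.719998316},
    "subop_pull_latency": {"avgcount": 0, "sum": 0.0}
  },
  "rocksdb": {
    "submit_latency": {"avgcount": 30484924, "sum": 1835.888692371}
  }
}
""")

# Highest mean latency first.
ranked = sorted(latency_counters(sample), key=lambda c: c[2], reverse=True)
for name, count, mean in ranked:
    print(f"{name:28s} {count:>10d} ops  {mean * 1000:8.3f} ms")
```

These means cover the whole period since the counters were last reset, so a slow drift is diluted by older samples; comparing two dumps taken some time apart is more telling.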
>>> 
>>> ----- Original message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Tuesday, 5 February 2019 18:56:51
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>> 
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>> 
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>> 
>>>> 
>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>>>> 
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> Hmm, so fragmentation grows over time and drops on OSD restart, doesn't
>>> it? Is it the same for the other OSDs?
>>>
>>> This proves there is some issue with the allocator - generally fragmentation
>>> might grow, but it shouldn't reset on restart. It looks like some intervals
>>> aren't properly merged at run time.
>>>
>>> On the other hand, I'm not completely sure that the latency degradation is
>>> caused by that - the fragmentation growth is relatively small, and I don't see
>>> how it could impact performance that much.
>>> 
>>> Do you have any OSD mempool monitoring reports (output of the
>>> dump_mempools command on the admin socket)? Do you have any historic data?
>>>
>>> If not, may I have the current output and, say, a couple more samples at
>>> 8-12 hour intervals?
>>>
>>>
>>> With regard to backporting the bitmap allocator to mimic - we haven't had
>>> such plans before, but I'll discuss this at the BlueStore meeting shortly.
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>>> ----- Original message -----
>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 4 February 2019 16:04:38
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> Thanks Igor, 
>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>> startup and once you get high latency.
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>> I'm already monitoring with
>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>>>
>>>> but I don't see an l_bluestore_fragmentation counter.
>>>>
>>>> (I do have bluestore_fragmentation_micros, though.)
>>>> 
>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>> patch to track latency and some other internal allocator's parameter to
>>>>>> make sure it's degraded and learn more details.
>>>> Sorry, it's a critical production cluster, I can't test on it :(
>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
>>>> 
>>>> 
>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>>> and try the difference...
>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>>>> 
>>>> 
>>>> 
>>>> ----- Original message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 4 February 2019 15:51:30
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> looks like a bug in StupidAllocator. 
>>>> 
>>>> Could you please collect BlueStore performance counters right after OSD
>>>> startup and again once you get high latency?
>>>>
>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
>>>>
>>>> Also, if you're able to rebuild the code, I can probably make a simple
>>>> patch to track latency and some other internal allocator parameters to
>>>> make sure it's degraded and learn more details.
>>>>
>>>>
>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>> and try the difference...
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>> 
>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>
>>>>>
>>>>> I have noticed something using a simple "perf top".
>>>>>
>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>
>>>>> when latency is bad, perf top gives me:
>>>>>
>>>>> StupidAllocator::_aligned_len
>>>>> and
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>
>>>>> (around 10-20% of the time for both)
>>>>>
>>>>>
>>>>> When latency is good, I don't see them at all.
>>>>>
>>>>>
>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>> 
>>>>> 
>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>>> 
>>>>> 
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, 
> ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, 
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% 
> PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
> ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% 
> ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, 
> std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% 
> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>&, 
> boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>&, 
> boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% 
> BlueStore::_do_alloc_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>, 
> boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, 
> unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, 
> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, 
> unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
> long, unsigned long, std::less<unsigned long>, 
> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
> long const, unsigned long> >, 256> >, std::pair<unsigned long const, 
> unsigned long>&, std::pair<unsigned long const, unsigned 
> long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% 
> StupidAllocator::_aligned_len(interval_set<unsigned long, 
> btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, 
> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
> long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> some news: 
>>>>> 
>>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>>
>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>>
>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait a few more days to be sure).
>>>>>
>>>>>
>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB);
>>>>> my other clusters use 1.6TB ssds.
>>>>>
>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme to see if it helps.
>>>>>
>>>>>
>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.)
>>>>>
>>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase).
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
>>>>>> Hi Stefan, 
>>>>>> 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>>> I need to compare with bigger latencies.
>>>>>>
>>>>>> Here is an example where all osds are at 20-50ms before restart, then 1ms after restart (at 21:15):
>>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>>
>>>>>> I observe the latency in my guest vms too, as disk iowait.
>>>>>>
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>>
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>> Here are my influxdb queries.
>>>>>>
>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>
>>>>>>
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
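Editor's sketch: the queries above divide the growth of a counter's `sum` by the growth of its `avgcount`, which gives the average latency of only those ops completed between two samples (rather than the lifetime `avgtime`). The same arithmetic can be written out in a few lines. The function name and the two samples below are made up for illustration; they are not taken from the thread.

```python
def interval_latency(prev, curr):
    """Average latency (seconds) of ops completed between two samples.

    Each sample is one perf counter, e.g. perf["osd"]["op_w_latency"].
    Returns None when no ops completed or when the counters went backwards
    (an OSD restart or a perf reset); this is an assumption made here to
    mirror how non_negative_derivative drops negative values.
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count

# Hypothetical samples of one counter, taken a few seconds apart.
earlier = {"avgcount": 1000, "sum": 2.0}
later = {"avgcount": 1500, "sum": 4.5}

latency = interval_latency(earlier, later)
print(f"{latency * 1000:.1f} ms")        # 500 ops took 2.5 s in total
print(interval_latency(later, earlier))  # counters went backwards
```

Working on deltas is what makes the slow drift visible: the lifetime average moves very little once millions of ops have been counted.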
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>> 
>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>>>> 
>>>>> greets, 
>>>>> Stefan 
>>>>> 
>>>>>> ----- Original message -----
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Here are some new results,
>>>>>>> from a different osd / different cluster:
>>>>>>>
>>>>>>> before the osd restart, latency was between 2-5ms
>>>>>>> after the osd restart, it is around 1-1.5ms
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>> 
>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
>>>>>> as suggested. This report makes me a little nervous about my change.
>>>>>>
>>>>>> Also, I'm currently only monitoring latency for filestore osds. Which
>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>
>>>>>> I would like to check if I see the same behaviour.
>>>>>> 
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Original message -----
>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>> 
>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>> going on one of the OSDs with a high latency?
>>>>>>> 
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> I see strange behaviour on my osds, on multiple clusters.
>>>>>>>>
>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives;
>>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>>
>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>>
>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>>> values like 20-200ms.
>>>>>>>>
>>>>>>>> Some example graphs:
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>>
>>>>>>>> All osds have this behaviour, in all clusters.
>>>>>>>>
>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>>>>>
>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>>
>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>>
>>>>>>>> Any hints for counters/logs to check?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards, 
>>>>>>>> 
>>>>>>>> Alexandre 
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________ 
>>>>>>> ceph-users mailing list 
>>>>>>> ceph-users@lists.ceph.com 
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>> 
>>> 
>> 
>> 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I upgraded to the latest mimic this weekend, and with the new memory autotuning
>> I have set osd_memory_target to 8G. (My nvme drives are 6TB.)
>>
>>
>> I have taken a lot of perf dumps, mempool dumps, and ps output of the process (to see RSS memory) at different hours;
>> here are the reports for osd.0:
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> The osd was started on 12-02-2019 at 08:00.
>>
>> First report, after 1h running:
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> Report after 24h, before the counter reset:
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> Report 1h after the counter reset:
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00:
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>> After that, it slowly decreases.
>>
>>
>> Another strange thing:
>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>
>>
>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>> 
>> ----- Original message -----
>> From: "Igor Fedotov" <ifedotov@suse.de>
>> To: "Alexandre Derumier" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 11 February 2019 12:03:17
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> Another mempool dump after a 1h run (latency ok).
>>> 
>>> Biggest difference: 
>>> 
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (The other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory.)
>>> 
>>> 
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> 
>> This is fine, as the cache is warming after restart and some rebalancing
>> between data and metadata might occur.
>>
>> What relates to the allocator, and most probably to fragmentation growth, is:
>> 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> 
>> which had been higher before the reset (if I got the order of these dumps
>> right):
>> 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> 
>> But as I mentioned - I'm not 100% sure this could cause such a huge
>> latency increase...
>>
>> Do you have a perf counters dump from after the restart?
>>
>> Could you collect some more dumps - for both mempool and perf counters?
>> 
>> So ideally I'd like to have: 
>> 
>> 1) mempool/perf counters dumps after the restart (1 hour is OK)
>>
>> 2) mempool/perf counters dumps 24+ hours after the restart
>>
>> 3) reset perf counters after 2), wait for 1 hour (without an OSD
>> restart) and dump mempool/perf counters again.
>>
>> That way we'll be able to learn both the allocator memory usage growth and
>> the operation latency distribution for the following periods:
>> 
>> a) 1st hour after restart 
>> 
>> b) 25th hour. 
>> 
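The `sum` and `avgcount` fields of a latency counter in `perf dump` are cumulative, so the average latency for a given window can also be estimated from two successive dumps by dividing the change in `sum` by the change in `avgcount`. The following is only a minimal Python sketch of that arithmetic, not part of any Ceph tool: the function name `window_avg` and the sample numbers are made up for illustration.

```python
def window_avg(earlier, later):
    """Average latency (seconds) of operations completed between two samples.

    Each sample is a dict holding the cumulative 'avgcount' and 'sum'
    fields of one latency counter, as found in 'perf dump' output.
    Returns None when no operations completed in the window.
    """
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        # No new operations, or the counters were reset between samples.
        return None
    return (later["sum"] - earlier["sum"]) / ops


# Hypothetical samples of a single counter, taken one hour apart.
sample_t0 = {"avgcount": 1000, "sum": 0.5}
sample_t1 = {"avgcount": 3000, "sum": 4.5}

print(window_avg(sample_t0, sample_t1))
```

Applied to dumps taken at the start and end of the 1st and 25th hours, this would give a per-window figure without relying on a counter reset, assuming the OSD was not restarted in between.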
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> 
>>> full mempool dump after restart 
>>> ------------------------------- 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Original Message ----- 
>>> From: "aderumier" <aderumier@odiso.com> 
>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Friday, 8 February 2019 16:14:54 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> I'm just seeing 
>>> 
>>> StupidAllocator::_aligned_len 
>>> and 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>> 
>>> on 1 OSD, both at 10%. 
>>> 
>>> Here is the dump_mempools output: 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>> 
>>> 
>>> and the perf dump 
>>> 
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
>>> 
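To see which stage dominates in a dump like the one above, the latency counters of a section can be ranked by their average (`sum / avgcount`). The following is a minimal Python sketch under stated assumptions: the function name `rank_latencies` is hypothetical, and only a handful of counters from the "bluestore" section of the dump are copied in.

```python
def rank_latencies(section):
    """Return (name, avg_seconds) pairs, slowest first, for every counter
    in a perf-dump section that has a non-zero 'avgcount'."""
    ranked = []
    for name, value in section.items():
        # Plain integer counters and never-used latency counters are skipped.
        if isinstance(value, dict) and value.get("avgcount"):
            ranked.append((name, value["sum"] / value["avgcount"]))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Subset of the "bluestore" section, values copied from the dump above.
bluestore = {
    "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603, "avgtime": 0.000128829},
    "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696, "avgtime": 0.025281557},
    "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703, "avgtime": 0.007039558},
    "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331, "avgtime": 0.000438790},
    "compress_lat": {"avgcount": 0, "sum": 0.0, "avgtime": 0.0},
    "deferred_write_ops": 24373340,
}

for name, avg in rank_latencies(bluestore):
    print(f"{name:28s} {avg * 1000:.3f} ms")
```

Note that these averages cover the whole period since the counters were last reset, so they say nothing about how latency evolved within that period.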
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "aderumier" <aderumier@odiso.com> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>> 
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>> 
>>>> 
>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>> 
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>> it? Is it the same for the other OSDs? 
>>> 
>>> This proves there is some issue with the allocator: generally fragmentation 
>>> might grow, but it shouldn't reset on restart. It looks like some intervals 
>>> aren't properly merged at run time. 
>>> 
>>> On the other hand, I'm not completely sure that the latency degradation is 
>>> caused by that: the fragmentation growth is relatively small, and I don't see 
>>> how it could impact performance that much. 
>>> 
>>> I'm wondering whether you have OSD mempool monitoring reports (dump_mempools 
>>> command output on the admin socket). Do you have any historic data? 
>>> 
>>> If not, may I have the current output and, say, a couple more samples at 
>>> 8-12 hour intervals? 
>>> 
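One way to compare two dump_mempools samples is to diff the per-pool byte counts. The following is a minimal Python sketch, not part of any Ceph tool: the function name `mempool_delta` is hypothetical, it assumes the JSON layout shown in the dumps in this thread (`mempool.by_pool.<name>.bytes`), and the sample values are three pools copied from the before-restart and after-restart dumps posted here.

```python
import json


def mempool_delta(before, after):
    """Return {pool: bytes_after - bytes_before} for pools present in both dumps."""
    b = before["mempool"]["by_pool"]
    a = after["mempool"]["by_pool"]
    return {name: a[name]["bytes"] - b[name]["bytes"] for name in b if name in a}


# Subset of pools, values copied from the two dumps in this thread.
before_restart = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_alloc":       {"items": 210243456, "bytes": 210243456},
  "bluestore_cache_data":  {"items": 54,        "bytes": 643072},
  "bluestore_cache_other": {"items": 48661920,  "bytes": 1539544228}
}}}
""")
after_restart = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_alloc":       {"items": 165053952, "bytes": 165053952},
  "bluestore_cache_data":  {"items": 40084,     "bytes": 1056235520},
  "bluestore_cache_other": {"items": 12432298,  "bytes": 500834899}
}}}
""")

delta = mempool_delta(before_restart, after_restart)
# Print the pools that shrank the most first.
for pool, diff in sorted(delta.items(), key=lambda kv: kv[1]):
    print(f"{pool:24s} {diff:+d} bytes")
```

The same function could be applied to samples taken 8-12 hours apart to see which pools grow between restarts.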
>>> 
>>> W.r.t. backporting the bitmap allocator to mimic: we had no such plans 
>>> until now, but I'll discuss this at the BlueStore meeting shortly. 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>>> ----- Original Message ----- 
>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Thanks Igor, 
>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> I'm already monitoring with 
>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>> 
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>> 
>>>> (but I have bluestore_fragmentation_micros) 
>>>> 
>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>> Sorry, it's a critical production cluster, so I can't test on it :( 
>>>> But I have a test cluster; maybe I can put some load on it and try to reproduce the problem. 
>>>> 
>>>> 
>>>> 
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> looks like a bug in StupidAllocator. 
>>>> 
>>>> Could you please collect BlueStore performance counters right after OSD 
>>>> startup and once you get high latency. 
>>>> 
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> 
>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>> patch to track latency and some other internal allocator parameters to 
>>>> make sure it's degraded and learn more details. 
>>>> 
>>>> 
>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>> and try the difference... 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>> 
>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>> 
>>>>> 
>>>>> I have noticed something using a simple "perf top". 
>>>>> 
>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>> 
>>>>> when latency is bad, perf top gives me: 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>> 
>>>>> (around 10-20% of the time for each) 
>>>>> 
>>>>> 
>>>>> when latency is good, I don't see them at all. 
>>>>> 
>>>>> 
>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>> 
>>>>> 
>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>> 
>>>>> 
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> Some news: 
>>>>> 
>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>> 
>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>> 
>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait some more days to be sure). 
>>>>> 
>>>>> 
>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>> my other clusters use 1.6TB SSDs. 
>>>>> 
>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps. 
>>>>> 
>>>>> 
>>>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>>>>> 
>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>> Hi Stefan, 
>>>>>> 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>> Maybe it is bluestore related (I don't have filestore anymore to compare). 
>>>>>> I need to compare with bigger latencies. 
>>>>>> 
>>>>>> Here is an example, where all OSDs were at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>> 
>>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> Here are my influxdb queries. 
>>>>>> 
>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
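The queries above divide the rate of change of a counter's "sum" by the rate of change of its "avgcount". As a rough sketch of the same idea outside InfluxDB, the average latency over an interval can be computed from two successive samples of a latency counter. The function name and the sample values below are hypothetical and only illustrate the arithmetic.

```python
def interval_avg_latency(prev, curr):
    """Average latency (seconds) of the ops completed between two samples.

    Each sample is a dict holding the cumulative "sum" (seconds) and
    "avgcount" (number of ops) fields of a latency counter.
    Returns None when no op completed or the counter went backwards
    (for example after an OSD restart or a counter reset).
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples taken one second apart (not from a real OSD).
prev = {"avgcount": 1000, "sum": 2.0}
curr = {"avgcount": 1500, "sum": 3.5}
print(interval_avg_latency(prev, curr))  # 500 ops took 1.5 s in total
```

Returning None on a negative delta plays the same role as non_negative_derivative in the queries: a restart resets the counters and should not produce a bogus negative latency.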
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>> 
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> 
>>>>> greets, 
>>>>> Stefan 
>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Here are some new results, 
>>>>>>> from a different OSD on a different cluster. 
>>>>>>> 
>>>>>>> Before the OSD restart, latency was between 2-5ms; 
>>>>>>> after the OSD restart it is around 1-1.5ms. 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>> 
>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> 
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> 
>>>>>> I would like to check if i see the same behaviour. 
>>>>>> 
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>> 
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> I see a strange behaviour of my OSDs, on multiple clusters. 
>>>>>>>> 
>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>>>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>> 
>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>> 
>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>> values like 20-200ms. 
>>>>>>>> 
>>>>>>>> Some example graphs: 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>> 
>>>>>>>> All OSDs have this behaviour, in all clusters. 
>>>>>>>> 
>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>> 
>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>> 
>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>> 
>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards, 
>>>>>>>> 
>>>>>>>> Alexandre 
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________ 
>>>>>>> ceph-users mailing list 
>>>>>>> ceph-users@lists.ceph.com 
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1345632100.1225626.1550238886648.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                     ` <1345632100.1225626.1550238886648.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-15 13:59                                                                                       ` Wido den Hollander
       [not found]                                                                                         ` <fdd3eaa2-567b-8e02-aadb-64a19c78bc23-fspyXLx8qC4@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Wido den Hollander @ 2019-02-15 13:59 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel



On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>> OSDs as well. Over time their latency increased until we started to 
>>> notice I/O-wait inside VMs. 
> 
> I also notice it in the VMs. BTW, what is your NVMe disk size?

Samsung PM983 3.84TB SSDs in both clusters.

> 
> 
>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>> these OSDs as the memory would allow it. 
> 
> I have set memory to 6GB this morning, with 2 OSDs of 3TB each on the 6TB NVMe.  
> (my last test was 8GB with 1 OSD of 6TB, but that didn't help)

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with the memory target at 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD,
so it will max out at 80GB, leaving 16GB as spare.
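For anyone sizing this on their own hosts, the budget above is simple arithmetic. A minimal sketch follows; the function name is made up for illustration, and it assumes every OSD stays at its memory target, which in practice is a target rather than a hard limit.

```python
def memory_budget(total_gb, osd_count, target_gb):
    """Return (memory used by all OSDs, memory left over), in GB.

    Assumes each OSD uses exactly its memory target.
    """
    used = osd_count * target_gb
    return used, total_gb - used


# 10 OSDs on a 96GB host, with the 6G and 8G targets discussed above.
for target in (6, 8):
    used, spare = memory_budget(96, 10, target)
    print(f"target {target}G: {used}G for OSDs, {spare}G spare")
```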

As these OSDs were all restarted earlier this week I can't tell how it
will hold up over a longer period. Monitoring (Zabbix) shows the latency
is fine at the moment.

Wido

> 
> 
> ----- Original Message -----
> From: "Wido den Hollander" <wido@42on.com>
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, 15 February 2019 14:50:34
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>> Thanks Igor. 
>>
>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different. 
>>
>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there. 
>>
>>
> 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 
> 
> A restart fixed it. We also increased memory target from 4G to 6G on 
> these OSDs as the memory would allow it. 
> 
> But we noticed this on two different 12.2.10/11 clusters. 
> 
> A restart made the latency drop. Not only the numbers, but the 
> real-world latency as experienced by a VM as well. 
> 
> Wido 
> 
>>
>>
>>
>>
>>
>> ----- Original Message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday, 15 February 2019 13:47:57 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>
>> Hi Alexander, 
>>
>> I've read through your reports, nothing obvious so far. 
>>
>> I can only see a several-fold increase in average latency for OSD write ops 
>> (in seconds): 
>> 0.002040060 (first hour) vs. 
>>
>> 0.002483516 (last 24 hours) vs. 
>> 0.008382087 (last hour) 
>>
>> subop_w_latency: 
>> 0.000478934 (first hour) vs. 
>> 0.000537956 (last 24 hours) vs. 
>> 0.003073475 (last hour) 
>>
>> and OSD read ops, osd_r_latency: 
>>
>> 0.000408595 (first hour) 
>> 0.000709031 (24 hours) 
>> 0.004979540 (last hour) 
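To put the figures above in proportion, the growth between the first hour and the last hour can be computed directly. This is only a sketch over the numbers quoted in the message; the dictionary labels follow the wording of the message and the helper name is made up.

```python
# Average latencies in seconds, copied from the message above:
# (first hour, last 24 hours, last hour)
latencies = {
    "OSD write ops": (0.002040060, 0.002483516, 0.008382087),
    "subop_w_latency": (0.000478934, 0.000537956, 0.003073475),
    "osd_r_latency": (0.000408595, 0.000709031, 0.004979540),
}


def growth_factor(first_hour, last_hour):
    """How many times larger the last-hour average is than the first-hour one."""
    return last_hour / first_hour


factors = {name: growth_factor(v[0], v[2]) for name, v in latencies.items()}
for name, factor in factors.items():
    print(f"{name}: x{factor:.1f}")
```

By this measure writes grew roughly 4x, replica writes roughly 6x and reads roughly 12x between the first and the last hour.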
>>
>> What's interesting is that such latency differences are observed at 
>> neither the BlueStore level (any _lat params under the "bluestore" section) nor 
>> the rocksdb one. 
>>
>> This probably means that the issue is somewhere above BlueStore. 
>>
>> I suggest continuing to collect perf dumps to see if the picture 
>> stays the same. 
>>
>> W.r.t. the memory usage you observed, I see nothing suspicious so far. RSS 
>> not decreasing is a known artifact that seems to be safe. 
>>
>> Thanks, 
>> Igor 
>>
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>> Hi Igor, 
>>>
>>> Thanks again for helping ! 
>>>
>>>
>>>
>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning, 
>>> I have set osd_memory_target to 8G. (my NVMe drives are 6TB) 
>>>
>>>
>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process (to see RSS memory) at different hours; 
>>> here are the reports for osd.0: 
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>
>>>
>>> The OSD was started on 12-02-2019 at 08:00. 
>>>
>>> First report, after 1h running: 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>
>>>
>>>
>>> Report after 24h, before the counter reset: 
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>
>>> Report 1h after the counter reset: 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>
>>>
>>>
>>>
>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00: 
>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>> After that, it slowly decreases. 
>>>
>>>
>>> Another strange thing: 
>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>>
>>>
>>> I have been graphing the mempool counters too since yesterday, so I'll be able to track them over time. 
>>>
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 11 February 2019 12:03:17 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>
>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>> another mempool dump after 1h run. (latency ok) 
>>>>
>>>> Biggest difference: 
>>>>
>>>> before restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>>>
>>>>
>>>> After restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>>
>>> This is fine as cache is warming after restart and some rebalancing 
>>> between data and metadata might occur. 
>>>
>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>>
>>> which had been higher before the reset (if I got the order of these 
>>> dumps right): 
>>>
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>>
>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>> latency increase... 
>>>
>>> Do you have perf counters dump after the restart? 
>>>
>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>
>>> So ideally I'd like to have: 
>>>
>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>
>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>
>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>> restart) and dump mempool/perf counters again. 
>>>
>>> So we'll be able to learn both allocator mem usage growth and operation 
>>> latency distribution for the following periods: 
>>>
>>> a) 1st hour after restart 
>>>
>>> b) 25th hour. 
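Once two mempool dumps from the steps above are available, the per-pool growth can be compared mechanically. The sketch below assumes the dumps have already been parsed from JSON into dicts with the layout shown in this thread ("mempool" -> "by_pool" -> pool -> "bytes"); the function names are made up, and the sample values are the three pools quoted earlier in the thread.

```python
def pool_bytes(dump):
    """Map each mempool name to its "bytes" value."""
    return {name: pool["bytes"] for name, pool in dump["mempool"]["by_pool"].items()}


def mempool_byte_deltas(before, after):
    """Per-pool change in bytes between two dumps (after minus before)."""
    b, a = pool_bytes(before), pool_bytes(after)
    return {name: a.get(name, 0) - b.get(name, 0) for name in sorted(set(b) | set(a))}


# Values quoted in this thread: before the OSD restart vs. after it.
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

deltas = mempool_byte_deltas(before_restart, after_restart)
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} bytes")
```

The same function works in the other direction (first hour vs. 25th hour) to see how much bluestore_alloc grows while the OSD keeps running.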
>>>
>>>
>>> Thanks, 
>>>
>>> Igor 
>>>
>>>
>>>> full mempool dump after restart 
>>>> ------------------------------- 
>>>>
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 22225, 
>>>> "bytes": 14935200 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 11, 
>>>> "bytes": 8184 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 5047, 
>>>> "bytes": 22673736 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 91, 
>>>> "bytes": 1662976 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1907, 
>>>> "bytes": 95600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 19664, 
>>>> "bytes": 25486050 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 46189, 
>>>> "bytes": 2956096 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 17, 
>>>> "bytes": 214366 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 889673, 
>>>> "bytes": 367160400 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3803, 
>>>> "bytes": 224552 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 178515204, 
>>>> "bytes": 2160630547 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>> ----- Original Message ----- 
>>>> From: "aderumier" <aderumier@odiso.com> 
>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>
>>>> I'm just seeing 
>>>>
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>>
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>
>>>> on 1 OSD, both at 10%. 
>>>>
>>>> Here is the dump_mempools output: 
>>>>
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 105637, 
>>>> "bytes": 70988064 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 12, 
>>>> "bytes": 8928 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 406, 
>>>> "bytes": 4792868 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 66, 
>>>> "bytes": 1085440 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1882, 
>>>> "bytes": 93600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 138986, 
>>>> "bytes": 24983701 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 544, 
>>>> "bytes": 34816 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 36, 
>>>> "bytes": 179308 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 952564, 
>>>> "bytes": 372459684 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3639, 
>>>> "bytes": 224664 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 260109445, 
>>>> "bytes": 2228370845 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>>
>>>> and the perf dump 
>>>>
>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>> { 
>>>> "AsyncMessenger::Worker-0": { 
>>>> "msgr_recv_messages": 22948570, 
>>>> "msgr_send_messages": 22561570, 
>>>> "msgr_recv_bytes": 333085080271, 
>>>> "msgr_send_bytes": 261798871204, 
>>>> "msgr_created_connections": 6152, 
>>>> "msgr_active_connections": 2701, 
>>>> "msgr_running_total_time": 1055.197867330, 
>>>> "msgr_running_send_time": 352.764480121, 
>>>> "msgr_running_recv_time": 499.206831955, 
>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>> }, 
>>>> "AsyncMessenger::Worker-1": { 
>>>> "msgr_recv_messages": 18801593, 
>>>> "msgr_send_messages": 18430264, 
>>>> "msgr_recv_bytes": 306871760934, 
>>>> "msgr_send_bytes": 192789048666, 
>>>> "msgr_created_connections": 5773, 
>>>> "msgr_active_connections": 2721, 
>>>> "msgr_running_total_time": 816.821076305, 
>>>> "msgr_running_send_time": 261.353228926, 
>>>> "msgr_running_recv_time": 394.035587911, 
>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>> }, 
>>>> "AsyncMessenger::Worker-2": { 
>>>> "msgr_recv_messages": 18463400, 
>>>> "msgr_send_messages": 18105856, 
>>>> "msgr_recv_bytes": 187425453590, 
>>>> "msgr_send_bytes": 220735102555, 
>>>> "msgr_created_connections": 5897, 
>>>> "msgr_active_connections": 2605, 
>>>> "msgr_running_total_time": 807.186854324, 
>>>> "msgr_running_send_time": 296.834435839, 
>>>> "msgr_running_recv_time": 351.364389691, 
>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>> }, 
>>>> "bluefs": { 
>>>> "gift_bytes": 0, 
>>>> "reclaim_bytes": 0, 
>>>> "db_total_bytes": 256050724864, 
>>>> "db_used_bytes": 12413042688, 
>>>> "wal_total_bytes": 0, 
>>>> "wal_used_bytes": 0, 
>>>> "slow_total_bytes": 0, 
>>>> "slow_used_bytes": 0, 
>>>> "num_files": 209, 
>>>> "log_bytes": 10383360, 
>>>> "log_compactions": 14, 
>>>> "logged_bytes": 336498688, 
>>>> "files_written_wal": 2, 
>>>> "files_written_sst": 4499, 
>>>> "bytes_written_wal": 417989099783, 
>>>> "bytes_written_sst": 213188750209 
>>>> }, 
>>>> "bluestore": { 
>>>> "kv_flush_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 26.734038497, 
>>>> "avgtime": 0.000001013 
>>>> }, 
>>>> "kv_commit_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3397.491150603, 
>>>> "avgtime": 0.000128829 
>>>> }, 
>>>> "kv_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3424.225189100, 
>>>> "avgtime": 0.000129843 
>>>> }, 
>>>> "state_prepare_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3689.542105337, 
>>>> "avgtime": 0.000121028 
>>>> }, 
>>>> "state_aio_wait_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 509.864546111, 
>>>> "avgtime": 0.000016725 
>>>> }, 
>>>> "state_io_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 24.534052953, 
>>>> "avgtime": 0.000000804 
>>>> }, 
>>>> "state_kv_queued_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3488.338424238, 
>>>> "avgtime": 0.000114428 
>>>> }, 
>>>> "state_kv_commiting_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 5660.437003432, 
>>>> "avgtime": 0.000185679 
>>>> }, 
>>>> "state_kv_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 7.763511500, 
>>>> "avgtime": 0.000000254 
>>>> }, 
>>>> "state_deferred_queued_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 666071.296856696, 
>>>> "avgtime": 0.025281557 
>>>> }, 
>>>> "state_deferred_aio_wait_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 1755.660547071, 
>>>> "avgtime": 0.000066638 
>>>> }, 
>>>> "state_deferred_cleanup_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 185465.151653703, 
>>>> "avgtime": 0.007039558 
>>>> }, 
>>>> "state_finishing_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 3.046847481, 
>>>> "avgtime": 0.000000099 
>>>> }, 
>>>> "state_done_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 13193.362685280, 
>>>> "avgtime": 0.000432783 
>>>> }, 
>>>> "throttle_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 14.634269979, 
>>>> "avgtime": 0.000000480 
>>>> }, 
>>>> "submit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3873.883076148, 
>>>> "avgtime": 0.000127075 
>>>> }, 
>>>> "commit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 13376.492317331, 
>>>> "avgtime": 0.000438790 
>>>> }, 
>>>> "read_lat": { 
>>>> "avgcount": 5873923, 
>>>> "sum": 1817.167582057, 
>>>> "avgtime": 0.000309361 
>>>> }, 
>>>> "read_onode_meta_lat": { 
>>>> "avgcount": 19608201, 
>>>> "sum": 146.770464482, 
>>>> "avgtime": 0.000007485 
>>>> }, 
>>>> "read_wait_aio_lat": { 
>>>> "avgcount": 13734278, 
>>>> "sum": 2532.578077242, 
>>>> "avgtime": 0.000184398 
>>>> }, 
>>>> "compress_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "decompress_lat": { 
>>>> "avgcount": 1346945, 
>>>> "sum": 26.227575896, 
>>>> "avgtime": 0.000019471 
>>>> }, 
>>>> "csum_lat": { 
>>>> "avgcount": 28020392, 
>>>> "sum": 149.587819041, 
>>>> "avgtime": 0.000005338 
>>>> }, 
>>>> "compress_success_count": 0, 
>>>> "compress_rejected_count": 0, 
>>>> "write_pad_bytes": 352923605, 
>>>> "deferred_write_ops": 24373340, 
>>>> "deferred_write_bytes": 216791842816, 
>>>> "write_penalty_read_ops": 8062366, 
>>>> "bluestore_allocated": 3765566013440, 
>>>> "bluestore_stored": 4186255221852, 
>>>> "bluestore_compressed": 39981379040, 
>>>> "bluestore_compressed_allocated": 73748348928, 
>>>> "bluestore_compressed_original": 165041381376, 
>>>> "bluestore_onodes": 104232, 
>>>> "bluestore_onode_hits": 71206874, 
>>>> "bluestore_onode_misses": 1217914, 
>>>> "bluestore_onode_shard_hits": 260183292, 
>>>> "bluestore_onode_shard_misses": 22851573, 
>>>> "bluestore_extents": 3394513, 
>>>> "bluestore_blobs": 2773587, 
>>>> "bluestore_buffers": 0, 
>>>> "bluestore_buffer_bytes": 0, 
>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>> "bluestore_write_big": 5648815, 
>>>> "bluestore_write_big_bytes": 552502214656, 
>>>> "bluestore_write_big_blobs": 12440992, 
>>>> "bluestore_write_small": 35883770, 
>>>> "bluestore_write_small_bytes": 223436965719, 
>>>> "bluestore_write_small_unused": 408125, 
>>>> "bluestore_write_small_deferred": 34961455, 
>>>> "bluestore_write_small_pre_read": 34961455, 
>>>> "bluestore_write_small_new": 514190, 
>>>> "bluestore_txc": 30484924, 
>>>> "bluestore_onode_reshard": 5144189, 
>>>> "bluestore_blob_split": 60104, 
>>>> "bluestore_extent_compress": 53347252, 
>>>> "bluestore_gc_merged": 21142528, 
>>>> "bluestore_read_eio": 0, 
>>>> "bluestore_fragmentation_micros": 67 
>>>> }, 
>>>> "finisher-defered_finisher": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "finisher-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 26625163, 
>>>> "sum": 1057.506990951, 
>>>> "avgtime": 0.000039718 
>>>> } 
>>>> }, 
>>>> "finisher-objecter-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "objecter": { 
>>>> "op_active": 0, 
>>>> "op_laggy": 0, 
>>>> "op_send": 0, 
>>>> "op_send_bytes": 0, 
>>>> "op_resend": 0, 
>>>> "op_reply": 0, 
>>>> "op": 0, 
>>>> "op_r": 0, 
>>>> "op_w": 0, 
>>>> "op_rmw": 0, 
>>>> "op_pg": 0, 
>>>> "osdop_stat": 0, 
>>>> "osdop_create": 0, 
>>>> "osdop_read": 0, 
>>>> "osdop_write": 0, 
>>>> "osdop_writefull": 0, 
>>>> "osdop_writesame": 0, 
>>>> "osdop_append": 0, 
>>>> "osdop_zero": 0, 
>>>> "osdop_truncate": 0, 
>>>> "osdop_delete": 0, 
>>>> "osdop_mapext": 0, 
>>>> "osdop_sparse_read": 0, 
>>>> "osdop_clonerange": 0, 
>>>> "osdop_getxattr": 0, 
>>>> "osdop_setxattr": 0, 
>>>> "osdop_cmpxattr": 0, 
>>>> "osdop_rmxattr": 0, 
>>>> "osdop_resetxattrs": 0, 
>>>> "osdop_tmap_up": 0, 
>>>> "osdop_tmap_put": 0, 
>>>> "osdop_tmap_get": 0, 
>>>> "osdop_call": 0, 
>>>> "osdop_watch": 0, 
>>>> "osdop_notify": 0, 
>>>> "osdop_src_cmpxattr": 0, 
>>>> "osdop_pgls": 0, 
>>>> "osdop_pgls_filter": 0, 
>>>> "osdop_other": 0, 
>>>> "linger_active": 0, 
>>>> "linger_send": 0, 
>>>> "linger_resend": 0, 
>>>> "linger_ping": 0, 
>>>> "poolop_active": 0, 
>>>> "poolop_send": 0, 
>>>> "poolop_resend": 0, 
>>>> "poolstat_active": 0, 
>>>> "poolstat_send": 0, 
>>>> "poolstat_resend": 0, 
>>>> "statfs_active": 0, 
>>>> "statfs_send": 0, 
>>>> "statfs_resend": 0, 
>>>> "command_active": 0, 
>>>> "command_send": 0, 
>>>> "command_resend": 0, 
>>>> "map_epoch": 105913, 
>>>> "map_full": 0, 
>>>> "map_inc": 828, 
>>>> "osd_sessions": 0, 
>>>> "osd_session_open": 0, 
>>>> "osd_session_close": 0, 
>>>> "osd_laggy": 0, 
>>>> "omap_wr": 0, 
>>>> "omap_rd": 0, 
>>>> "omap_del": 0 
>>>> }, 
>>>> "osd": { 
>>>> "op_wip": 0, 
>>>> "op": 16758102, 
>>>> "op_in_bytes": 238398820586, 
>>>> "op_out_bytes": 165484999463, 
>>>> "op_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 38242.481640842, 
>>>> "avgtime": 0.002282029 
>>>> }, 
>>>> "op_process_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 28644.906310687, 
>>>> "avgtime": 0.001709316 
>>>> }, 
>>>> "op_prepare_latency": { 
>>>> "avgcount": 16761367, 
>>>> "sum": 3489.856599934, 
>>>> "avgtime": 0.000208208 
>>>> }, 
>>>> "op_r": 6188565, 
>>>> "op_r_out_bytes": 165484999463, 
>>>> "op_r_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 4507.365756792, 
>>>> "avgtime": 0.000728337 
>>>> }, 
>>>> "op_r_process_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 942.363063429, 
>>>> "avgtime": 0.000152274 
>>>> }, 
>>>> "op_r_prepare_latency": { 
>>>> "avgcount": 6188644, 
>>>> "sum": 982.866710389, 
>>>> "avgtime": 0.000158817 
>>>> }, 
>>>> "op_w": 10546037, 
>>>> "op_w_in_bytes": 238334329494, 
>>>> "op_w_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 33160.719998316, 
>>>> "avgtime": 0.003144377 
>>>> }, 
>>>> "op_w_process_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 27668.702029030, 
>>>> "avgtime": 0.002623611 
>>>> }, 
>>>> "op_w_prepare_latency": { 
>>>> "avgcount": 10548652, 
>>>> "sum": 2499.688609173, 
>>>> "avgtime": 0.000236967 
>>>> }, 
>>>> "op_rw": 23500, 
>>>> "op_rw_in_bytes": 64491092, 
>>>> "op_rw_out_bytes": 0, 
>>>> "op_rw_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 574.395885734, 
>>>> "avgtime": 0.024442378 
>>>> }, 
>>>> "op_rw_process_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 33.841218228, 
>>>> "avgtime": 0.001440051 
>>>> }, 
>>>> "op_rw_prepare_latency": { 
>>>> "avgcount": 24071, 
>>>> "sum": 7.301280372, 
>>>> "avgtime": 0.000303322 
>>>> }, 
>>>> "op_before_queue_op_lat": { 
>>>> "avgcount": 57892986, 
>>>> "sum": 1502.117718889, 
>>>> "avgtime": 0.000025946 
>>>> }, 
>>>> "op_before_dequeue_op_lat": { 
>>>> "avgcount": 58091683, 
>>>> "sum": 45194.453254037, 
>>>> "avgtime": 0.000777984 
>>>> }, 
>>>> "subop": 19784758, 
>>>> "subop_in_bytes": 547174969754, 
>>>> "subop_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_w": 19784758, 
>>>> "subop_w_in_bytes": 547174969754, 
>>>> "subop_w_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_pull": 0, 
>>>> "subop_pull_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "subop_push": 0, 
>>>> "subop_push_in_bytes": 0, 
>>>> "subop_push_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "pull": 0, 
>>>> "push": 2003, 
>>>> "push_out_bytes": 5560009728, 
>>>> "recovery_ops": 1940, 
>>>> "loadavg": 118, 
>>>> "buffer_bytes": 0, 
>>>> "history_alloc_Mbytes": 0, 
>>>> "history_alloc_num": 0, 
>>>> "cached_crc": 0, 
>>>> "cached_crc_adjusted": 0, 
>>>> "missed_crc": 0, 
>>>> "numpg": 243, 
>>>> "numpg_primary": 82, 
>>>> "numpg_replica": 161, 
>>>> "numpg_stray": 0, 
>>>> "numpg_removing": 0, 
>>>> "heartbeat_to_peers": 10, 
>>>> "map_messages": 7013, 
>>>> "map_message_epochs": 7143, 
>>>> "map_message_epoch_dups": 6315, 
>>>> "messages_delayed_for_map": 0, 
>>>> "osd_map_cache_hit": 203309, 
>>>> "osd_map_cache_miss": 33, 
>>>> "osd_map_cache_miss_low": 0, 
>>>> "osd_map_cache_miss_low_avg": { 
>>>> "avgcount": 0, 
>>>> "sum": 0 
>>>> }, 
>>>> "osd_map_bl_cache_hit": 47012, 
>>>> "osd_map_bl_cache_miss": 1681, 
>>>> "stat_bytes": 6401248198656, 
>>>> "stat_bytes_used": 3777979072512, 
>>>> "stat_bytes_avail": 2623269126144, 
>>>> "copyfrom": 0, 
>>>> "tier_promote": 0, 
>>>> "tier_flush": 0, 
>>>> "tier_flush_fail": 0, 
>>>> "tier_try_flush": 0, 
>>>> "tier_try_flush_fail": 0, 
>>>> "tier_evict": 0, 
>>>> "tier_whiteout": 1631, 
>>>> "tier_dirty": 22360, 
>>>> "tier_clean": 0, 
>>>> "tier_delay": 0, 
>>>> "tier_proxy_read": 0, 
>>>> "tier_proxy_write": 0, 
>>>> "agent_wake": 0, 
>>>> "agent_skip": 0, 
>>>> "agent_flush": 0, 
>>>> "agent_evict": 0, 
>>>> "object_ctx_cache_hit": 16311156, 
>>>> "object_ctx_cache_total": 17426393, 
>>>> "op_cache_hit": 0, 
>>>> "osd_tier_flush_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_promote_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_r_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_pg_info": 30483113, 
>>>> "osd_pg_fastinfo": 29619885, 
>>>> "osd_pg_biginfo": 81703 
>>>> }, 
>>>> "recoverystate_perf": { 
>>>> "initial_latency": { 
>>>> "avgcount": 243, 
>>>> "sum": 6.869296500, 
>>>> "avgtime": 0.028268709 
>>>> }, 
>>>> "started_latency": { 
>>>> "avgcount": 1125, 
>>>> "sum": 13551384.917335850, 
>>>> "avgtime": 12045.675482076 
>>>> }, 
>>>> "reset_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 1101.727799040, 
>>>> "avgtime": 0.805356578 
>>>> }, 
>>>> "start_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 0.002014799, 
>>>> "avgtime": 0.000001472 
>>>> }, 
>>>> "primary_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 4575560.638823428, 
>>>> "avgtime": 9024.774435549 
>>>> }, 
>>>> "peering_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 499.372283616, 
>>>> "avgtime": 0.907949606 
>>>> }, 
>>>> "backfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitremotebackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitlocalbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "notbackfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "repnotrecovering_latency": { 
>>>> "avgcount": 1009, 
>>>> "sum": 8975301.082274411, 
>>>> "avgtime": 8895.243887288 
>>>> }, 
>>>> "repwaitrecoveryreserved_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 99.846056520, 
>>>> "avgtime": 0.237728706 
>>>> }, 
>>>> "repwaitbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "reprecovering_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 241.682764382, 
>>>> "avgtime": 0.575435153 
>>>> }, 
>>>> "activating_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 16.893347339, 
>>>> "avgtime": 0.033320211 
>>>> }, 
>>>> "waitlocalrecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 672.335512769, 
>>>> "avgtime": 3.378570415 
>>>> }, 
>>>> "waitremoterecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 213.536439363, 
>>>> "avgtime": 1.073047433 
>>>> }, 
>>>> "recovering_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 79.007696479, 
>>>> "avgtime": 0.397023600 
>>>> }, 
>>>> "recovered_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 14.000732748, 
>>>> "avgtime": 0.027614857 
>>>> }, 
>>>> "clean_latency": { 
>>>> "avgcount": 395, 
>>>> "sum": 4574325.900371083, 
>>>> "avgtime": 11580.571899673 
>>>> }, 
>>>> "active_latency": { 
>>>> "avgcount": 425, 
>>>> "sum": 4575107.630123680, 
>>>> "avgtime": 10764.959129702 
>>>> }, 
>>>> "replicaactive_latency": { 
>>>> "avgcount": 589, 
>>>> "sum": 8975184.499049954, 
>>>> "avgtime": 15238.004242869 
>>>> }, 
>>>> "stray_latency": { 
>>>> "avgcount": 818, 
>>>> "sum": 800.729455666, 
>>>> "avgtime": 0.978886865 
>>>> }, 
>>>> "getinfo_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 15.085667048, 
>>>> "avgtime": 0.027428485 
>>>> }, 
>>>> "getlog_latency": { 
>>>> "avgcount": 546, 
>>>> "sum": 3.482175693, 
>>>> "avgtime": 0.006377611 
>>>> }, 
>>>> "waitactingchange_latency": { 
>>>> "avgcount": 39, 
>>>> "sum": 35.444551284, 
>>>> "avgtime": 0.908834648 
>>>> }, 
>>>> "incomplete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "down_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "getmissing_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 6.702129624, 
>>>> "avgtime": 0.013219190 
>>>> }, 
>>>> "waitupthru_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 474.098261727, 
>>>> "avgtime": 0.935105052 
>>>> }, 
>>>> "notrecovering_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "rocksdb": { 
>>>> "get": 28320977, 
>>>> "submit_transaction": 30484924, 
>>>> "submit_transaction_sync": 26371957, 
>>>> "get_latency": { 
>>>> "avgcount": 28320977, 
>>>> "sum": 325.900908733, 
>>>> "avgtime": 0.000011507 
>>>> }, 
>>>> "submit_latency": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 1835.888692371, 
>>>> "avgtime": 0.000060222 
>>>> }, 
>>>> "submit_sync_latency": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 1431.555230628, 
>>>> "avgtime": 0.000054283 
>>>> }, 
>>>> "compact": 0, 
>>>> "compact_range": 0, 
>>>> "compact_queue_merge": 0, 
>>>> "compact_queue_len": 0, 
>>>> "rocksdb_write_wal_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_memtable_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_delay_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_pre_and_post_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Tuesday, 5 February 2019 18:56:51
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> ok, this is the same 
>>>>>
>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>>>
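For reference, the counter description above reduces to a simple ratio. The sketch below is a hypothetical Python rendering of that formula, not the BlueStore code: it assumes that the maximum possible number of free extents is the free space divided by the allocation unit (every free unit isolated), and the function and parameter names are invented for illustration.

```python
def fragmentation_score(free_extents, free_bytes, alloc_unit):
    """(free extents / max possible number of free extents) * 1000.

    Assumption: the worst case is one free extent per allocation unit,
    so max extents = free_bytes // alloc_unit.
    """
    max_extents = free_bytes // alloc_unit
    if max_extents == 0:
        return 0.0
    return free_extents / max_extents * 1000


# Invented example: 64 KiB free in 6 extents, 4 KiB allocation unit.
print(fragmentation_score(free_extents=6, free_bytes=64 * 1024, alloc_unit=4096))
```

A value near 0 would mean the free space sits in a few large extents; a value near 1000 would mean almost every free allocation unit is isolated.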
>>>>>
>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>>>>>
>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>>>> it? Is it the same for the other OSDs?
>>>>
>>>> This proves some issue with the allocator - generally fragmentation 
>>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>>> aren't properly merged in run-time. 
>>>>
>>>> On the other hand, I'm not completely sure that the latency degradation is
>>>> caused by that - the fragmentation growth is relatively small, and I don't see
>>>> how it could impact performance that much.
>>>>
>>>> Do you have any OSD mempool monitoring (output of the dump_mempools
>>>> command on the admin socket)? Any historic data?
>>>>
>>>> If not, may I have the current output and, say, a couple more samples at
>>>> 8-12 hour intervals?
>>>>
>>>>
>>>> With regard to backporting the bitmap allocator to mimic - we haven't had such
>>>> plans so far, but I'll discuss this at the BlueStore meeting shortly.
>>>>
>>>>
>>>> Thanks, 
>>>>
>>>> Igor 
>>>>
>>>>> ----- Original Message -----
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Monday, 4 February 2019 16:04:38
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Thanks Igor, 
>>>>>
>>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>>> startup and again once you get high latency?
>>>>>>>
>>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
>>>>> I'm already monitoring with "ceph daemon osd.x perf dump"
>>>>> (I have 2 months of history with all counters),
>>>>>
>>>>> but I don't see an l_bluestore_fragmentation counter.
>>>>>
>>>>> (I do have bluestore_fragmentation_micros, though.)
>>>>>
>>>>>
>>>>>>> Also, if you're able to rebuild the code, I can probably make a simple
>>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>>> make sure it's degraded and learn more details.
>>>>> Sorry, it's a critical production cluster, I can't test on it :(
>>>>> But I have a test cluster; maybe I can try to put some load on it
>>>>> and try to reproduce.
>>>>>
>>>>>
>>>>>
>>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>>>>> and see the difference...
>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Monday, 4 February 2019 15:51:30
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> Hi Alexandre, 
>>>>>
>>>>> looks like a bug in StupidAllocator. 
>>>>>
>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>> startup and again once you get high latency?
>>>>>
>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest.
>>>>>
>>>>> Also, if you're able to rebuild the code, I can probably make a simple
>>>>> patch to track latency and some other internal allocator parameters to
>>>>> make sure it's degraded and learn more details.
>>>>>
>>>>>
>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus
>>>>> and see the difference...
>>>>>
>>>>>
>>>>> Thanks, 
>>>>>
>>>>> Igor 
>>>>>
>>>>>
>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>> Hi again, 
>>>>>>
>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>>
>>>>>>
>>>>>> I have noticed something using a simple "perf top":
>>>>>>
>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>>
>>>>>> when latency is bad, perf top gives me:
>>>>>>
>>>>>> StupidAllocator::_aligned_len
>>>>>> and
>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>
>>>>>> (around 10-20% time for both) 
>>>>>>
>>>>>>
>>>>>> when latency is good, I don't see them at all. 
>>>>>>
>>>>>>
>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>>
>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>
>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>
>>>>>>
>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>>>>
>>>>>>
>>>>>> + 100.00% clone
>>>>>> + 100.00% start_thread
>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>>>
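The stack above shows most of the write path's time inside StupidAllocator::allocate_int, split between advancing a btree iterator and computing aligned lengths, which is what one would expect if each allocation walks many free extents before finding one that fits. The toy Python model below only illustrates that effect: it is a plain first-fit scan over a sorted list of (offset, length) extents, not StupidAllocator's actual algorithm, and all names and sizes are invented.

```python
def first_fit(free_extents, want, align):
    """Return (offset, steps) for the first extent that can hold `want`
    bytes at an `align`-aligned offset, or (None, steps) if none fits."""
    steps = 0
    for offset, length in free_extents:
        steps += 1
        skew = (-offset) % align  # bytes skipped to reach an aligned offset
        if length - skew >= want:
            return offset + skew, steps
    return None, steps


def fragmented(n_small, small=4096, big=65536):
    """n_small isolated 4 KiB holes followed by one 64 KiB free extent."""
    extents = [(i * 2 * small, small) for i in range(n_small)]
    extents.append((n_small * 2 * small, big))
    return extents


# The number of extents visited per allocation grows with the number of holes.
for n in (10, 1000, 100000):
    _, steps = first_fit(fragmented(n), want=65536, align=4096)
    print(n, "small holes ->", steps, "extents visited")
```

In this model the cost of one allocation is proportional to the number of small free extents in front of the first usable one, so latency would rise as the free list fragments and fall back once the list is rebuilt in merged form.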
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> some news: 
>>>>>>
>>>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>>>
>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>>>
>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256 MB: it seems to help, after 24h I'm still around 1.5 ms (I need to wait a few more days to be sure).
>>>>>>
>>>>>>
>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB);
>>>>>> my other clusters use 1.6TB SSDs.
>>>>>>
>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe to see if it helps.
>>>>>>
>>>>>>
>>>>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>>>
>>>>>>
>>>>>> Regards, 
>>>>>>
>>>>>> Alexandre 
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>
>>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>>>>> I monitor reads too (I have all metrics from the OSD sockets, and a lot of graphs).
>>>>>>
>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase).
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>> Hi Stefan, 
>>>>>>>
>>>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
>>>>>>>>> as suggested. This report makes me a little nervous about my change.
>>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>>>> I need to compare with bigger latencies.
>>>>>>>
>>>>>>> Here is an example: all OSDs at 20-50 ms before restart, then 1 ms after restart (at 21:15):
>>>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>>>
>>>>>>> I observe the latency in my guest VMs too, as disk iowait.
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>>>
>>>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which
>>>>>>>>> exact values from the daemon do you use for bluestore?
>>>>>>> Here are my influxdb queries:
>>>>>>>
>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
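The queries above divide the per-second change in `sum` by the per-second change in `avgcount`, which gives the mean latency of the operations completed in that interval rather than the average since OSD start. A minimal Python sketch of the same arithmetic on two `perf dump` samples follows; the function name and the sample values are made up for illustration.

```python
def interval_latency(prev, curr):
    """Mean latency (seconds) of the ops completed between two samples.

    prev and curr are {"avgcount": int, "sum": float} dicts, as found under
    e.g. "op_w_latency" in the perf dump output.
    Returns None when no op completed or the counters went backwards
    (for example after an OSD restart or a counter reset).
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples: 500 ops completed, taking 1.5 s in total.
before = {"avgcount": 1000, "sum": 2.0}
after = {"avgcount": 1500, "sum": 3.5}
print(interval_latency(before, after))
```

Returning None on a backwards step plays the same role as non_negative_derivative in the queries: a restart resets both counters and must not show up as a negative latency.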
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>>
>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency?
>>>>>>
>>>>>> greets, 
>>>>>> Stefan 
>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>
>>>>>>> Hi, 
>>>>>>>
>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>> Hi, 
>>>>>>>>
>>>>>>>> here are some new results,
>>>>>>>> from a different OSD on a different cluster:
>>>>>>>>
>>>>>>>> before the OSD restart, latency was between 2-5 ms
>>>>>>>> after the OSD restart, it is around 1-1.5 ms
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>
>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc
>>>>>>> as suggested. This report makes me a little nervous about my change.
>>>>>>>
>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which
>>>>>>> exact values from the daemon do you use for bluestore?
>>>>>>>
>>>>>>> I would like to check whether I see the same behaviour.
>>>>>>>
>>>>>>> Greets, 
>>>>>>> Stefan 
>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>>
>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>>
>>>>>>>> Thanks! 
>>>>>>>> sage 
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>
>>>>>>>>> Hi, 
>>>>>>>>>
>>>>>>>>> I see strange behaviour on my OSDs, on multiple clusters.
>>>>>>>>>
>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives;
>>>>>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>>>
>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1 ms.
>>>>>>>>>
>>>>>>>>> But over time, this latency increases slowly (maybe around 1 ms per day), until reaching crazy
>>>>>>>>> values like 20-200 ms.
>>>>>>>>>
>>>>>>>>> Some example graphs:
>>>>>>>>>
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>>>
>>>>>>>>> All OSDs show this behaviour, in all clusters.
>>>>>>>>>
>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.)
>>>>>>>>>
>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1 ms.
>>>>>>>>>
>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>>>
>>>>>>>>> Any hints for counters/logs to check?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards, 
>>>>>>>>>
>>>>>>>>> Alexandre 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>
>>>
>>>
>>
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>> Hi Igor, 
>>>
>>> Thanks again for helping ! 
>>>
>>>
>>>
>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning
>>> I have set osd_memory_target to 8G (my NVMe drives are 6TB).
>>>
>>>
>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process (to see RSS memory) at different hours;
>>> here are the reports for osd.0:
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>
>>>
>>> The OSD was started on 12-02-2019 at 08:00.
>>>
>>> First report, after 1h running:
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>
>>>
>>>
>>> Report after 24h, before the counter reset:
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>
>>> Report 1h after the counter reset:
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>
>>>
>>>
>>>
>>> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00:
>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>> After that, it slowly decreases.
>>>
>>>
>>> Another strange thing:
>>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>> It then decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>>
>>>
>>> I've been graphing the mempool counters too since yesterday, so I'll be able to track them over time.
>>>
>>> ----- Original Message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 11 February 2019 12:03:17
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>> another mempool dump after 1h run. (latency ok) 
>>>>
>>>> Biggest difference: 
>>>>
>>>> before restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> (the other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>>>
>>>>
>>>> After restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>>
>>> This is fine as cache is warming after restart and some rebalancing 
>>> between data and metadata might occur. 
>>>
>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>>
>>> which had been higher before the reset (if I got the order of these 
>>> dumps right): 
>>>
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>>
>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>> latency increase... 
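
A comparison like the one above can be done pool by pool. The following is a minimal sketch, assuming only the dump_mempools JSON layout quoted in this thread ("mempool" -> "by_pool" -> pool -> "bytes"). The function name mempool_byte_deltas is hypothetical, and the sample values are a three-pool subset copied from the two dumps in this thread.

```python
import json


def mempool_byte_deltas(before, after):
    """Return {pool: bytes_after - bytes_before} for every pool in either dump."""
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    deltas = {}
    for name in sorted(set(pools_before) | set(pools_after)):
        b = pools_before.get(name, {}).get("bytes", 0)
        a = pools_after.get(name, {}).get("bytes", 0)
        deltas[name] = a - b
    return deltas


# Subset of the dump taken before the restart (values from this thread).
before_restart = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_alloc":       {"items": 210243456, "bytes": 210243456},
  "bluestore_cache_data":  {"items": 54,        "bytes": 643072},
  "bluestore_cache_other": {"items": 48661920,  "bytes": 1539544228}
}}}
""")

# Subset of the dump taken about 1h after the restart (values from this thread).
after_restart = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_alloc":       {"items": 165053952, "bytes": 165053952},
  "bluestore_cache_data":  {"items": 40084,     "bytes": 1056235520},
  "bluestore_cache_other": {"items": 12432298,  "bytes": 500834899}
}}}
""")

deltas = mempool_byte_deltas(before_restart, after_restart)

# Print the pools with the largest absolute change first.
for name, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:24s} {delta:+d} bytes")
```

Sorting by absolute change puts the cache rebalancing (bluestore_cache_data up, bluestore_cache_other down) first, with the smaller bluestore_alloc drop after it.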
>>>
>>> Do you have perf counters dump after the restart? 
>>>
>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>
>>> So ideally I'd like to have: 
>>>
>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>
>>> 2) mempool/perf counters dumps 24+ hours after the restart 
>>>
>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>> restart) and dump mempool/perf counters again. 
>>>
>>> So we'll be able to learn both allocator mem usage growth and operation 
>>> latency distribution for the following periods: 
>>>
>>> a) 1st hour after restart 
>>>
>>> b) 25th hour. 
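
The latency counters in a perf dump are cumulative ("avgcount" and "sum" since start or since the last reset), so the average for one period can also be derived from two snapshots without resetting anything. This is a minimal sketch under that assumption. The function name interval_avg is hypothetical, the first sample is the commit_lat counter quoted in this thread, and the second sample is invented purely for illustration.

```python
def interval_avg(earlier, later):
    """Average seconds per operation between two samples of one latency counter.

    Each sample is a dict with "avgcount" and "sum", as in the perf dump output.
    Returns None when no operations were recorded in the interval, or when the
    counter went backwards (for example after a reset or an OSD restart).
    """
    dcount = later["avgcount"] - earlier["avgcount"]
    dsum = later["sum"] - earlier["sum"]
    if dcount <= 0:
        return None
    return dsum / dcount


# commit_lat as quoted in this thread (cumulative since OSD start).
sample_1 = {"avgcount": 30484924, "sum": 13376.492317331}

# Hypothetical later sample: 1,000,000 more commits taking 2500 s in total.
sample_2 = {"avgcount": 30484924 + 1000000, "sum": 13376.492317331 + 2500.0}

# Average over the whole lifetime: compare against an all-zero starting point.
lifetime = interval_avg({"avgcount": 0, "sum": 0.0}, sample_1)

# Average over the interval between the two samples only.
recent = interval_avg(sample_1, sample_2)

print(f"lifetime avg: {lifetime * 1000:.3f} ms")
print(f"interval avg: {recent * 1000:.3f} ms")
```

The lifetime figure reproduces the "avgtime" field of the dump, while the interval figure isolates what happened between the two samples, which is what the 1st-hour versus 25th-hour comparison needs.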
>>>
>>>
>>> Thanks, 
>>>
>>> Igor 
>>>
>>>
>>>> full mempool dump after restart 
>>>> ------------------------------- 
>>>>
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 22225, 
>>>> "bytes": 14935200 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 11, 
>>>> "bytes": 8184 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 5047, 
>>>> "bytes": 22673736 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 91, 
>>>> "bytes": 1662976 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1907, 
>>>> "bytes": 95600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 19664, 
>>>> "bytes": 25486050 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 46189, 
>>>> "bytes": 2956096 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 17, 
>>>> "bytes": 214366 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 889673, 
>>>> "bytes": 367160400 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3803, 
>>>> "bytes": 224552 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 178515204, 
>>>> "bytes": 2160630547 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>> ----- Original Message ----- 
>>>> From: "aderumier" <aderumier@odiso.com> 
>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>
>>>> I'm just seeing 
>>>>
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>
>>>> on 1 osd, both 10%. 
>>>>
>>>> Here is the dump_mempools output: 
>>>>
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 105637, 
>>>> "bytes": 70988064 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 12, 
>>>> "bytes": 8928 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 406, 
>>>> "bytes": 4792868 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 66, 
>>>> "bytes": 1085440 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1882, 
>>>> "bytes": 93600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 138986, 
>>>> "bytes": 24983701 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 544, 
>>>> "bytes": 34816 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 36, 
>>>> "bytes": 179308 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 952564, 
>>>> "bytes": 372459684 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3639, 
>>>> "bytes": 224664 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 260109445, 
>>>> "bytes": 2228370845 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>>
>>>> And the perf dump: 
>>>>
>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>> { 
>>>> "AsyncMessenger::Worker-0": { 
>>>> "msgr_recv_messages": 22948570, 
>>>> "msgr_send_messages": 22561570, 
>>>> "msgr_recv_bytes": 333085080271, 
>>>> "msgr_send_bytes": 261798871204, 
>>>> "msgr_created_connections": 6152, 
>>>> "msgr_active_connections": 2701, 
>>>> "msgr_running_total_time": 1055.197867330, 
>>>> "msgr_running_send_time": 352.764480121, 
>>>> "msgr_running_recv_time": 499.206831955, 
>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>> }, 
>>>> "AsyncMessenger::Worker-1": { 
>>>> "msgr_recv_messages": 18801593, 
>>>> "msgr_send_messages": 18430264, 
>>>> "msgr_recv_bytes": 306871760934, 
>>>> "msgr_send_bytes": 192789048666, 
>>>> "msgr_created_connections": 5773, 
>>>> "msgr_active_connections": 2721, 
>>>> "msgr_running_total_time": 816.821076305, 
>>>> "msgr_running_send_time": 261.353228926, 
>>>> "msgr_running_recv_time": 394.035587911, 
>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>> }, 
>>>> "AsyncMessenger::Worker-2": { 
>>>> "msgr_recv_messages": 18463400, 
>>>> "msgr_send_messages": 18105856, 
>>>> "msgr_recv_bytes": 187425453590, 
>>>> "msgr_send_bytes": 220735102555, 
>>>> "msgr_created_connections": 5897, 
>>>> "msgr_active_connections": 2605, 
>>>> "msgr_running_total_time": 807.186854324, 
>>>> "msgr_running_send_time": 296.834435839, 
>>>> "msgr_running_recv_time": 351.364389691, 
>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>> }, 
>>>> "bluefs": { 
>>>> "gift_bytes": 0, 
>>>> "reclaim_bytes": 0, 
>>>> "db_total_bytes": 256050724864, 
>>>> "db_used_bytes": 12413042688, 
>>>> "wal_total_bytes": 0, 
>>>> "wal_used_bytes": 0, 
>>>> "slow_total_bytes": 0, 
>>>> "slow_used_bytes": 0, 
>>>> "num_files": 209, 
>>>> "log_bytes": 10383360, 
>>>> "log_compactions": 14, 
>>>> "logged_bytes": 336498688, 
>>>> "files_written_wal": 2, 
>>>> "files_written_sst": 4499, 
>>>> "bytes_written_wal": 417989099783, 
>>>> "bytes_written_sst": 213188750209 
>>>> }, 
>>>> "bluestore": { 
>>>> "kv_flush_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 26.734038497, 
>>>> "avgtime": 0.000001013 
>>>> }, 
>>>> "kv_commit_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3397.491150603, 
>>>> "avgtime": 0.000128829 
>>>> }, 
>>>> "kv_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3424.225189100, 
>>>> "avgtime": 0.000129843 
>>>> }, 
>>>> "state_prepare_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3689.542105337, 
>>>> "avgtime": 0.000121028 
>>>> }, 
>>>> "state_aio_wait_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 509.864546111, 
>>>> "avgtime": 0.000016725 
>>>> }, 
>>>> "state_io_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 24.534052953, 
>>>> "avgtime": 0.000000804 
>>>> }, 
>>>> "state_kv_queued_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3488.338424238, 
>>>> "avgtime": 0.000114428 
>>>> }, 
>>>> "state_kv_commiting_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 5660.437003432, 
>>>> "avgtime": 0.000185679 
>>>> }, 
>>>> "state_kv_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 7.763511500, 
>>>> "avgtime": 0.000000254 
>>>> }, 
>>>> "state_deferred_queued_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 666071.296856696, 
>>>> "avgtime": 0.025281557 
>>>> }, 
>>>> "state_deferred_aio_wait_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 1755.660547071, 
>>>> "avgtime": 0.000066638 
>>>> }, 
>>>> "state_deferred_cleanup_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 185465.151653703, 
>>>> "avgtime": 0.007039558 
>>>> }, 
>>>> "state_finishing_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 3.046847481, 
>>>> "avgtime": 0.000000099 
>>>> }, 
>>>> "state_done_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 13193.362685280, 
>>>> "avgtime": 0.000432783 
>>>> }, 
>>>> "throttle_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 14.634269979, 
>>>> "avgtime": 0.000000480 
>>>> }, 
>>>> "submit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3873.883076148, 
>>>> "avgtime": 0.000127075 
>>>> }, 
>>>> "commit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 13376.492317331, 
>>>> "avgtime": 0.000438790 
>>>> }, 
>>>> "read_lat": { 
>>>> "avgcount": 5873923, 
>>>> "sum": 1817.167582057, 
>>>> "avgtime": 0.000309361 
>>>> }, 
>>>> "read_onode_meta_lat": { 
>>>> "avgcount": 19608201, 
>>>> "sum": 146.770464482, 
>>>> "avgtime": 0.000007485 
>>>> }, 
>>>> "read_wait_aio_lat": { 
>>>> "avgcount": 13734278, 
>>>> "sum": 2532.578077242, 
>>>> "avgtime": 0.000184398 
>>>> }, 
>>>> "compress_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "decompress_lat": { 
>>>> "avgcount": 1346945, 
>>>> "sum": 26.227575896, 
>>>> "avgtime": 0.000019471 
>>>> }, 
>>>> "csum_lat": { 
>>>> "avgcount": 28020392, 
>>>> "sum": 149.587819041, 
>>>> "avgtime": 0.000005338 
>>>> }, 
>>>> "compress_success_count": 0, 
>>>> "compress_rejected_count": 0, 
>>>> "write_pad_bytes": 352923605, 
>>>> "deferred_write_ops": 24373340, 
>>>> "deferred_write_bytes": 216791842816, 
>>>> "write_penalty_read_ops": 8062366, 
>>>> "bluestore_allocated": 3765566013440, 
>>>> "bluestore_stored": 4186255221852, 
>>>> "bluestore_compressed": 39981379040, 
>>>> "bluestore_compressed_allocated": 73748348928, 
>>>> "bluestore_compressed_original": 165041381376, 
>>>> "bluestore_onodes": 104232, 
>>>> "bluestore_onode_hits": 71206874, 
>>>> "bluestore_onode_misses": 1217914, 
>>>> "bluestore_onode_shard_hits": 260183292, 
>>>> "bluestore_onode_shard_misses": 22851573, 
>>>> "bluestore_extents": 3394513, 
>>>> "bluestore_blobs": 2773587, 
>>>> "bluestore_buffers": 0, 
>>>> "bluestore_buffer_bytes": 0, 
>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>> "bluestore_write_big": 5648815, 
>>>> "bluestore_write_big_bytes": 552502214656, 
>>>> "bluestore_write_big_blobs": 12440992, 
>>>> "bluestore_write_small": 35883770, 
>>>> "bluestore_write_small_bytes": 223436965719, 
>>>> "bluestore_write_small_unused": 408125, 
>>>> "bluestore_write_small_deferred": 34961455, 
>>>> "bluestore_write_small_pre_read": 34961455, 
>>>> "bluestore_write_small_new": 514190, 
>>>> "bluestore_txc": 30484924, 
>>>> "bluestore_onode_reshard": 5144189, 
>>>> "bluestore_blob_split": 60104, 
>>>> "bluestore_extent_compress": 53347252, 
>>>> "bluestore_gc_merged": 21142528, 
>>>> "bluestore_read_eio": 0, 
>>>> "bluestore_fragmentation_micros": 67 
>>>> }, 
>>>> "finisher-defered_finisher": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "finisher-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 26625163, 
>>>> "sum": 1057.506990951, 
>>>> "avgtime": 0.000039718 
>>>> } 
>>>> }, 
>>>> "finisher-objecter-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "objecter": { 
>>>> "op_active": 0, 
>>>> "op_laggy": 0, 
>>>> "op_send": 0, 
>>>> "op_send_bytes": 0, 
>>>> "op_resend": 0, 
>>>> "op_reply": 0, 
>>>> "op": 0, 
>>>> "op_r": 0, 
>>>> "op_w": 0, 
>>>> "op_rmw": 0, 
>>>> "op_pg": 0, 
>>>> "osdop_stat": 0, 
>>>> "osdop_create": 0, 
>>>> "osdop_read": 0, 
>>>> "osdop_write": 0, 
>>>> "osdop_writefull": 0, 
>>>> "osdop_writesame": 0, 
>>>> "osdop_append": 0, 
>>>> "osdop_zero": 0, 
>>>> "osdop_truncate": 0, 
>>>> "osdop_delete": 0, 
>>>> "osdop_mapext": 0, 
>>>> "osdop_sparse_read": 0, 
>>>> "osdop_clonerange": 0, 
>>>> "osdop_getxattr": 0, 
>>>> "osdop_setxattr": 0, 
>>>> "osdop_cmpxattr": 0, 
>>>> "osdop_rmxattr": 0, 
>>>> "osdop_resetxattrs": 0, 
>>>> "osdop_tmap_up": 0, 
>>>> "osdop_tmap_put": 0, 
>>>> "osdop_tmap_get": 0, 
>>>> "osdop_call": 0, 
>>>> "osdop_watch": 0, 
>>>> "osdop_notify": 0, 
>>>> "osdop_src_cmpxattr": 0, 
>>>> "osdop_pgls": 0, 
>>>> "osdop_pgls_filter": 0, 
>>>> "osdop_other": 0, 
>>>> "linger_active": 0, 
>>>> "linger_send": 0, 
>>>> "linger_resend": 0, 
>>>> "linger_ping": 0, 
>>>> "poolop_active": 0, 
>>>> "poolop_send": 0, 
>>>> "poolop_resend": 0, 
>>>> "poolstat_active": 0, 
>>>> "poolstat_send": 0, 
>>>> "poolstat_resend": 0, 
>>>> "statfs_active": 0, 
>>>> "statfs_send": 0, 
>>>> "statfs_resend": 0, 
>>>> "command_active": 0, 
>>>> "command_send": 0, 
>>>> "command_resend": 0, 
>>>> "map_epoch": 105913, 
>>>> "map_full": 0, 
>>>> "map_inc": 828, 
>>>> "osd_sessions": 0, 
>>>> "osd_session_open": 0, 
>>>> "osd_session_close": 0, 
>>>> "osd_laggy": 0, 
>>>> "omap_wr": 0, 
>>>> "omap_rd": 0, 
>>>> "omap_del": 0 
>>>> }, 
>>>> "osd": { 
>>>> "op_wip": 0, 
>>>> "op": 16758102, 
>>>> "op_in_bytes": 238398820586, 
>>>> "op_out_bytes": 165484999463, 
>>>> "op_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 38242.481640842, 
>>>> "avgtime": 0.002282029 
>>>> }, 
>>>> "op_process_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 28644.906310687, 
>>>> "avgtime": 0.001709316 
>>>> }, 
>>>> "op_prepare_latency": { 
>>>> "avgcount": 16761367, 
>>>> "sum": 3489.856599934, 
>>>> "avgtime": 0.000208208 
>>>> }, 
>>>> "op_r": 6188565, 
>>>> "op_r_out_bytes": 165484999463, 
>>>> "op_r_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 4507.365756792, 
>>>> "avgtime": 0.000728337 
>>>> }, 
>>>> "op_r_process_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 942.363063429, 
>>>> "avgtime": 0.000152274 
>>>> }, 
>>>> "op_r_prepare_latency": { 
>>>> "avgcount": 6188644, 
>>>> "sum": 982.866710389, 
>>>> "avgtime": 0.000158817 
>>>> }, 
>>>> "op_w": 10546037, 
>>>> "op_w_in_bytes": 238334329494, 
>>>> "op_w_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 33160.719998316, 
>>>> "avgtime": 0.003144377 
>>>> }, 
>>>> "op_w_process_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 27668.702029030, 
>>>> "avgtime": 0.002623611 
>>>> }, 
>>>> "op_w_prepare_latency": { 
>>>> "avgcount": 10548652, 
>>>> "sum": 2499.688609173, 
>>>> "avgtime": 0.000236967 
>>>> }, 
>>>> "op_rw": 23500, 
>>>> "op_rw_in_bytes": 64491092, 
>>>> "op_rw_out_bytes": 0, 
>>>> "op_rw_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 574.395885734, 
>>>> "avgtime": 0.024442378 
>>>> }, 
>>>> "op_rw_process_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 33.841218228, 
>>>> "avgtime": 0.001440051 
>>>> }, 
>>>> "op_rw_prepare_latency": { 
>>>> "avgcount": 24071, 
>>>> "sum": 7.301280372, 
>>>> "avgtime": 0.000303322 
>>>> }, 
>>>> "op_before_queue_op_lat": { 
>>>> "avgcount": 57892986, 
>>>> "sum": 1502.117718889, 
>>>> "avgtime": 0.000025946 
>>>> }, 
>>>> "op_before_dequeue_op_lat": { 
>>>> "avgcount": 58091683, 
>>>> "sum": 45194.453254037, 
>>>> "avgtime": 0.000777984 
>>>> }, 
>>>> "subop": 19784758, 
>>>> "subop_in_bytes": 547174969754, 
>>>> "subop_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_w": 19784758, 
>>>> "subop_w_in_bytes": 547174969754, 
>>>> "subop_w_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_pull": 0, 
>>>> "subop_pull_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "subop_push": 0, 
>>>> "subop_push_in_bytes": 0, 
>>>> "subop_push_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "pull": 0, 
>>>> "push": 2003, 
>>>> "push_out_bytes": 5560009728, 
>>>> "recovery_ops": 1940, 
>>>> "loadavg": 118, 
>>>> "buffer_bytes": 0, 
>>>> "history_alloc_Mbytes": 0, 
>>>> "history_alloc_num": 0, 
>>>> "cached_crc": 0, 
>>>> "cached_crc_adjusted": 0, 
>>>> "missed_crc": 0, 
>>>> "numpg": 243, 
>>>> "numpg_primary": 82, 
>>>> "numpg_replica": 161, 
>>>> "numpg_stray": 0, 
>>>> "numpg_removing": 0, 
>>>> "heartbeat_to_peers": 10, 
>>>> "map_messages": 7013, 
>>>> "map_message_epochs": 7143, 
>>>> "map_message_epoch_dups": 6315, 
>>>> "messages_delayed_for_map": 0, 
>>>> "osd_map_cache_hit": 203309, 
>>>> "osd_map_cache_miss": 33, 
>>>> "osd_map_cache_miss_low": 0, 
>>>> "osd_map_cache_miss_low_avg": { 
>>>> "avgcount": 0, 
>>>> "sum": 0 
>>>> }, 
>>>> "osd_map_bl_cache_hit": 47012, 
>>>> "osd_map_bl_cache_miss": 1681, 
>>>> "stat_bytes": 6401248198656, 
>>>> "stat_bytes_used": 3777979072512, 
>>>> "stat_bytes_avail": 2623269126144, 
>>>> "copyfrom": 0, 
>>>> "tier_promote": 0, 
>>>> "tier_flush": 0, 
>>>> "tier_flush_fail": 0, 
>>>> "tier_try_flush": 0, 
>>>> "tier_try_flush_fail": 0, 
>>>> "tier_evict": 0, 
>>>> "tier_whiteout": 1631, 
>>>> "tier_dirty": 22360, 
>>>> "tier_clean": 0, 
>>>> "tier_delay": 0, 
>>>> "tier_proxy_read": 0, 
>>>> "tier_proxy_write": 0, 
>>>> "agent_wake": 0, 
>>>> "agent_skip": 0, 
>>>> "agent_flush": 0, 
>>>> "agent_evict": 0, 
>>>> "object_ctx_cache_hit": 16311156, 
>>>> "object_ctx_cache_total": 17426393, 
>>>> "op_cache_hit": 0, 
>>>> "osd_tier_flush_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_promote_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_r_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_pg_info": 30483113, 
>>>> "osd_pg_fastinfo": 29619885, 
>>>> "osd_pg_biginfo": 81703 
>>>> }, 
>>>> "recoverystate_perf": { 
>>>> "initial_latency": { 
>>>> "avgcount": 243, 
>>>> "sum": 6.869296500, 
>>>> "avgtime": 0.028268709 
>>>> }, 
>>>> "started_latency": { 
>>>> "avgcount": 1125, 
>>>> "sum": 13551384.917335850, 
>>>> "avgtime": 12045.675482076 
>>>> }, 
>>>> "reset_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 1101.727799040, 
>>>> "avgtime": 0.805356578 
>>>> }, 
>>>> "start_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 0.002014799, 
>>>> "avgtime": 0.000001472 
>>>> }, 
>>>> "primary_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 4575560.638823428, 
>>>> "avgtime": 9024.774435549 
>>>> }, 
>>>> "peering_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 499.372283616, 
>>>> "avgtime": 0.907949606 
>>>> }, 
>>>> "backfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitremotebackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitlocalbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "notbackfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "repnotrecovering_latency": { 
>>>> "avgcount": 1009, 
>>>> "sum": 8975301.082274411, 
>>>> "avgtime": 8895.243887288 
>>>> }, 
>>>> "repwaitrecoveryreserved_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 99.846056520, 
>>>> "avgtime": 0.237728706 
>>>> }, 
>>>> "repwaitbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "reprecovering_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 241.682764382, 
>>>> "avgtime": 0.575435153 
>>>> }, 
>>>> "activating_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 16.893347339, 
>>>> "avgtime": 0.033320211 
>>>> }, 
>>>> "waitlocalrecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 672.335512769, 
>>>> "avgtime": 3.378570415 
>>>> }, 
>>>> "waitremoterecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 213.536439363, 
>>>> "avgtime": 1.073047433 
>>>> }, 
>>>> "recovering_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 79.007696479, 
>>>> "avgtime": 0.397023600 
>>>> }, 
>>>> "recovered_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 14.000732748, 
>>>> "avgtime": 0.027614857 
>>>> }, 
>>>> "clean_latency": { 
>>>> "avgcount": 395, 
>>>> "sum": 4574325.900371083, 
>>>> "avgtime": 11580.571899673 
>>>> }, 
>>>> "active_latency": { 
>>>> "avgcount": 425, 
>>>> "sum": 4575107.630123680, 
>>>> "avgtime": 10764.959129702 
>>>> }, 
>>>> "replicaactive_latency": { 
>>>> "avgcount": 589, 
>>>> "sum": 8975184.499049954, 
>>>> "avgtime": 15238.004242869 
>>>> }, 
>>>> "stray_latency": { 
>>>> "avgcount": 818, 
>>>> "sum": 800.729455666, 
>>>> "avgtime": 0.978886865 
>>>> }, 
>>>> "getinfo_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 15.085667048, 
>>>> "avgtime": 0.027428485 
>>>> }, 
>>>> "getlog_latency": { 
>>>> "avgcount": 546, 
>>>> "sum": 3.482175693, 
>>>> "avgtime": 0.006377611 
>>>> }, 
>>>> "waitactingchange_latency": { 
>>>> "avgcount": 39, 
>>>> "sum": 35.444551284, 
>>>> "avgtime": 0.908834648 
>>>> }, 
>>>> "incomplete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "down_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "getmissing_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 6.702129624, 
>>>> "avgtime": 0.013219190 
>>>> }, 
>>>> "waitupthru_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 474.098261727, 
>>>> "avgtime": 0.935105052 
>>>> }, 
>>>> "notrecovering_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "rocksdb": { 
>>>> "get": 28320977, 
>>>> "submit_transaction": 30484924, 
>>>> "submit_transaction_sync": 26371957, 
>>>> "get_latency": { 
>>>> "avgcount": 28320977, 
>>>> "sum": 325.900908733, 
>>>> "avgtime": 0.000011507 
>>>> }, 
>>>> "submit_latency": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 1835.888692371, 
>>>> "avgtime": 0.000060222 
>>>> }, 
>>>> "submit_sync_latency": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 1431.555230628, 
>>>> "avgtime": 0.000054283 
>>>> }, 
>>>> "compact": 0, 
>>>> "compact_range": 0, 
>>>> "compact_queue_merge": 0, 
>>>> "compact_queue_len": 0, 
>>>> "rocksdb_write_wal_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_memtable_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_delay_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_pre_and_post_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> } 
>>>> } 
>>>>
>>>> ----- Original Message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>
>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> OK, this is the same counter: 
>>>>>
>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>
>>>>>
>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>
>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>> it? Is it the same for other OSDs? 
>>>>
>>>> This proves some issue with the allocator - generally fragmentation 
>>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>>> aren't properly merged in run-time. 
>>>>
>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>> caused by that - the fragmentation growth is relatively small - I don't see 
>>>> how this might impact performance that much. 
>>>>
>>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>>> output on admin socket) reports? Do you have any historic data? 
>>>>
>>>> If not, may I have the current output and, say, a couple more samples at 
>>>> 8-12 hour intervals? 
>>>>
>>>>
>>>> With regard to backporting the bitmap allocator to mimic - we haven't had 
>>>> such plans until now, but I'll discuss this at the BlueStore meeting shortly. 
>>>>
>>>>
>>>> Thanks, 
>>>>
>>>> Igor 
>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Thanks Igor, 
>>>>>
>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>> startup and once you get high latency. 
>>>>>>>
>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>> I'm already monitoring with 
>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>
>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>
>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>
>>>>>
>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>> make sure it's degraded and learn more details. 
>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>
>>>>>
>>>>>
>>>>>>> A more vigorous fix would be to backport the bitmap allocator from 
>>>>>>> Nautilus and see the difference... 
>>>>> Any plans to backport it to mimic? (But I can wait for Nautilus.) 
>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>
>>>>> Hi Alexandre, 
>>>>>
>>>>> looks like a bug in StupidAllocator. 
>>>>>
>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>> startup and once you get high latency. 
>>>>>
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>
>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>> patch to track latency and some other internal allocator parameters to 
>>>>> make sure it's degraded and learn more details. 
>>>>>
>>>>>
>>>>> A more vigorous fix would be to backport the bitmap allocator from 
>>>>> Nautilus and see the difference... 
>>>>>
>>>>>
>>>>> Thanks, 
>>>>>
>>>>> Igor 
>>>>>
>>>>>
>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>> Hi again, 
>>>>>>
>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>
>>>>>>
>>>>>> I have noticed something using a simple "perf top": 
>>>>>>
>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>
>>>>>> when latency is bad, perf top gives me: 
>>>>>>
>>>>>> StupidAllocator::_aligned_len 
>>>>>> and 
>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>
>>>>>> (around 10-20% time for both) 
>>>>>>
>>>>>>
>>>>>> when latency is good, I don't see them at all. 
>>>>>>
>>>>>>
>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>
>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>
>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>
>>>>>>
>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>>>
>>>>>>
>>>>>> + 100.00% clone 
>>>>>> + 100.00% start_thread 
>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> some news: 
>>>>>>
>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>
>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>
>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait a few more days to be sure). 
>>>>>>
>>>>>>
>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>>> my other clusters use 1.6TB SSDs. 
>>>>>>
>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps. 
>>>>>>
>>>>>>
>>>>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>
>>>>>>
>>>>>> Regards, 
>>>>>>
>>>>>> Alexandre 
>>>>>>
>>>>>>
>>>>>> ----- Original message ----- 
>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>
>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>
>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 
>>>>>>
>>>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase) 
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original message ----- 
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>
>>>>>> Hi, 
>>>>>>
>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>> Hi Stefan, 
>>>>>>>
>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>> Maybe it's BlueStore related (I don't have filestore anymore to compare). 
>>>>>>> I need to compare with bigger latencies. 
>>>>>>>
>>>>>>> Here is an example where all OSDs are at 20-50ms before the restart, then at 1ms after the restart (at 21:15): 
>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>
>>>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>
>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>> Here are my InfluxDB queries. 
>>>>>>>
>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>
>>>>>>>
>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>>
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>
>>>>>> greets, 
>>>>>> Stefan 
>>>>>>
>>>>>>>
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>
>>>>>>> Hi, 
>>>>>>>
>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>> Hi, 
>>>>>>>>
>>>>>>>> here are some new results, 
>>>>>>>> from a different OSD on a different cluster: 
>>>>>>>>
>>>>>>>> before the OSD restart, latency was between 2-5ms; 
>>>>>>>> after the OSD restart, it is around 1-1.5ms 
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>
>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>
>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>
>>>>>>> Greets, 
>>>>>>> Stefan 
>>>>>>>
>>>>>>>> ----- Original message ----- 
>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>
>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>
>>>>>>>> Thanks! 
>>>>>>>> sage 
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>
>>>>>>>>> Hi, 
>>>>>>>>>
>>>>>>>>> I see a strange behaviour of my OSDs, on multiple clusters. 
>>>>>>>>>
>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>>>>>>>> the workload is RBD only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>
>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5 and 1ms. 
>>>>>>>>>
>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>> values like 20-200ms. 
>>>>>>>>>
>>>>>>>>> Some example graphs: 
>>>>>>>>>
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>
>>>>>>>>> All OSDs show this behaviour, in all clusters. 
>>>>>>>>>
>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>>>
>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>
>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>>
>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards, 
>>>>>>>>>
>>>>>>>>> Alexandre 
>>>>>>>>>
>>>>>>>>>
>>>>>>>> _______________________________________________ 
>>>>>>>> ceph-users mailing list 
>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <fdd3eaa2-567b-8e02-aadb-64a19c78bc23-fspyXLx8qC4@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                         ` <fdd3eaa2-567b-8e02-aadb-64a19c78bc23-fspyXLx8qC4@public.gmane.org>
@ 2019-02-16  8:29                                                                                           ` Alexandre DERUMIER
       [not found]                                                                                             ` <622347904.1243911.1550305749920.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-16  8:29 UTC (permalink / raw)
  To: Wido den Hollander, Igor Fedotov; +Cc: ceph-users, ceph-devel

>>There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>running with the memory target at 6G right now to make sure there is no 
>>leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>so it will max out at 80GB, leaving 16GB as spare. 

Thanks Wido. I'll send results on Monday with my increased memory target.



@Igor:

I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an OSD on node1 (restarted 12h ago, for example),
restarting the OSDs on other nodes (last restarted some days ago, so with higher latency) reduces the latency on the node1 OSD too.

Does the "op_w_process_latency" counter include replication time?
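
For anyone who wants to cross-check these per-OSD latency figures without InfluxDB: the latency counters in "ceph daemon osd.N perf dump" are cumulative avgcount/sum pairs, so the average over an interval is the difference in sum divided by the difference in avgcount between two dumps. Below is a minimal Python sketch of that calculation. The function name and the sample values are made up for illustration, and it says nothing about whether op_w_process_latency includes replication time.

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds between two samples of one counter.

    prev and curr are dicts with "avgcount" and "sum" keys, as found for
    latency counters in the perf dump output quoted in this thread.
    Returns None when no operation completed in the interval or when the
    counters went backwards (for example after an OSD restart or reset).
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples of one counter, taken some time apart.
earlier = {"avgcount": 1000, "sum": 0.5}
later = {"avgcount": 3000, "sum": 4.5}
print(interval_avg_latency(earlier, later))
```

With these made-up samples, 2000 operations took 4.0 seconds in total, so the interval average is about 2 ms. Comparing the same interval on the node1 OSD before and after restarting the other nodes would show whether the drop is real.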

----- Original message -----
From: "Wido den Hollander" <wido@42on.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Friday, 15 February 2019 14:59:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>> OSDs as well. Over time their latency increased until we started to 
>>> notice I/O-wait inside VMs. 
> 
> I also notice it in the VMs. BTW, what is your NVMe disk size? 

Samsung PM983 3.84TB SSDs in both clusters. 

> 
> 
>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>> these OSDs as the memory would allow it. 
> 
> I set the memory target to 6GB this morning, with 2 OSDs of 3TB each on the 6TB NVMe. 
> (My last test was 8GB with 1 OSD of 6TB, but that didn't help.) 

There are 10 OSDs in these systems with 96GB of memory in total. We are 
running with the memory target at 6G right now to make sure there is no 
leakage. If this runs fine for a longer period we will go to 8GB per OSD 
so it will max out at 80GB, leaving 16GB as spare. 

As these OSDs were all restarted earlier this week I can't tell how it 
will hold up over a longer period. Monitoring (Zabbix) shows the latency 
is fine at the moment. 

Wido 
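
The memory budget described above is simple arithmetic. A small Python sketch of it follows; the function names are hypothetical, and it assumes that osd_memory_target is configured as a byte count and that "G" here means GiB.

```python
GIB = 1024 ** 3


def spare_memory_gb(total_gb, osd_count, target_gb_per_osd):
    """Memory left on the host if every OSD grows to its memory target."""
    return total_gb - osd_count * target_gb_per_osd


def memory_target_bytes(target_gib):
    """Memory target expressed in bytes (assumed unit of osd_memory_target)."""
    return target_gib * GIB


# Figures from the message above: 10 OSDs on a host with 96GB of RAM.
for target in (6, 8):
    print(target, spare_memory_gb(96, 10, target), memory_target_bytes(target))
```

At 8GB per OSD this leaves the 16GB of spare memory mentioned above; the spare has to cover the page cache, the OS, and any OSD that overshoots its target.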

> 
> 
> ----- Original message ----- 
> From: "Wido den Hollander" <wido@42on.com> 
> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday, 15 February 2019 14:50:34 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>> Thanks Igor. 
>> 
>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different. 
>> 
>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there. 
>> 
>> 
> 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 
> 
> A restart fixed it. We also increased memory target from 4G to 6G on 
> these OSDs as the memory would allow it. 
> 
> But we noticed this on two different 12.2.10/11 clusters. 
> 
> A restart made the latency drop. Not only the numbers, but the 
> real-world latency as experienced by a VM as well. 
> 
> Wido 
> 
>> 
>> 
>> 
>> 
>> 
>> ----- Original message ----- 
>> From: "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday, 15 February 2019 13:47:57 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> Hi Alexander, 
>> 
>> I've read through your reports, nothing obvious so far. 
>> 
>> I can only see a several-fold increase in average latency for OSD write ops 
>> (in seconds): 
>> 0.002040060 (first hour) vs. 
>> 
>> 0.002483516 (last 24 hours) vs. 
>> 0.008382087 (last hour) 
>> 
>> subop_w_latency: 
>> 0.000478934 (first hour) vs. 
>> 0.000537956 (last 24 hours) vs. 
>> 0.003073475 (last hour) 
>> 
>> and OSD read ops, osd_r_latency: 
>> 
>> 0.000408595 (first hour) 
>> 0.000709031 (24 hours) 
>> 0.004979540 (last hour) 
>> 
>> What's interesting is that such latency differences are observed at 
>> neither the BlueStore level (any _lat params under the "bluestore" section) nor 
>> the rocksdb one. 
>> 
>> Which probably means that the issue is somewhere above BlueStore. 
>> 
>> I suggest proceeding with perf dump collection to see if the picture 
>> stays the same. 
>> 
>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>> decrease in RSS report is a known artifact that seems to be safe. 
>> 
>> Thanks, 
>> Igor 
>> 
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>> Hi Igor, 
>>> 
>>> Thanks again for helping ! 
>>> 
>>> 
>>> 
>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning 
>>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.) 
>>> 
>>> 
>>> I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process 
>>> (to see RSS memory) at different hours; 
>>> here are the reports for osd.0: 
>>> 
>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>> 
>>> 
>>> The OSD was started on 12-02-2019 at 08:00. 
>>> 
>>> First report, after 1h running: 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>> 
>>> 
>>> 
>>> Report after 24h, before the counter reset: 
>>> 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>> 
>>> Report 1h after the counter reset: 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>> 
>>> 
>>> 
>>> 
>>> I'm seeing the bluestore buffer bytes memory increase up to 4G 
>>> around 12-02-2019 at 14:00: 
>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>> Then, after that, it slowly decreases. 
>>> 
>>> 
>>> Another strange thing: 
>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>> Then it decreases over time (around 3.7G this morning), but RSS is 
>>> still at 8G. 
>>> 
>>> 
>>> I have also been graphing mempool counters since yesterday, so I'll be able to 
>>> track them over time. 
>>> 
>>> ----- Original message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Monday, 11 February 2019 12:03:17 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>> another mempool dump after 1h run. (latency ok) 
>>>> 
>>>> Biggest difference: 
>>>> 
>>>> before restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>>> 
>>>> 
>>>> After restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> 
>>> This is fine as cache is warming after restart and some rebalancing 
>>> between data and metadata might occur. 
>>> 
>>> What relates to the allocator, and most probably to fragmentation growth, is: 
>>> 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> 
>>> which had been higher before the reset (if I got the order of these 
>>> dumps right) 
>>> 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> 
>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>> latency increase... 
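
The before/after comparison above can be automated for any two dump_mempools outputs. Below is a minimal Python sketch; the function name is hypothetical, and the sample dicts carry only three of the pools, with the byte counts copied from the two dumps quoted in this thread. It assumes the JSON layout shown in those dumps ("mempool" -> "by_pool" -> pool -> "bytes").

```python
def mempool_byte_deltas(before, after):
    """Per-pool change in bytes (after minus before), largest change first."""
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    deltas = {}
    for name in set(pools_before) | set(pools_after):
        old = pools_before.get(name, {}).get("bytes", 0)
        new = pools_after.get(name, {}).get("bytes", 0)
        deltas[name] = new - old
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Byte counts taken from the dumps in this thread (before and after restart).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for pool, delta in mempool_byte_deltas(before_restart, after_restart):
    print(f"{pool:24s} {delta:+d}")
```

On these figures the cache pools move by roughly 1GB each in opposite directions, while bluestore_alloc shrinks by about 45MB after the restart.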
>>> 
>>> Do you have perf counters dump after the restart? 
>>> 
>>> Could you collect some more dumps - for both mempool and perf counters? 
>>> 
>>> So ideally I'd like to have: 
>>> 
>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>> 
>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>> 
>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>> restart) and dump mempool/perf counters again. 
>>> 
>>> So we'll be able to learn both allocator mem usage growth and operation 
>>> latency distribution for the following periods: 
>>> 
>>> a) 1st hour after restart 
>>> 
>>> b) 25th hour. 
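
The collection steps above could be scripted roughly as follows. This is only a sketch under assumptions: "perf dump" and "dump_mempools" are the admin-socket commands used elsewhere in this thread, but the "perf reset all" command, the function names and the output file naming are assumptions made for illustration, so check them against your Ceph version before relying on them.

```python
import subprocess
from datetime import datetime


def dump_commands(osd_id):
    """Admin-socket commands for the two dumps requested above."""
    daemon = f"osd.{osd_id}"
    return {
        "perf": ["ceph", "daemon", daemon, "perf", "dump"],
        "dump_mempools": ["ceph", "daemon", daemon, "dump_mempools"],
    }


def reset_command(osd_id):
    """Command assumed to reset all perf counters without restarting the OSD."""
    return ["ceph", "daemon", f"osd.{osd_id}", "perf", "reset", "all"]


def collect(osd_id, outdir=".", run=subprocess.run):
    """Run both dumps and save each one to a timestamped text file."""
    stamp = datetime.now().strftime("%d-%m-%Y.%H:%M")
    written = []
    for kind, argv in dump_commands(osd_id).items():
        result = run(argv, capture_output=True, text=True, check=True)
        path = f"{outdir}/osd.{osd_id}.{stamp}.{kind}.txt"
        with open(path, "w") as handle:
            handle.write(result.stdout)
        written.append(path)
    return written
```

The idea would be to call collect() about 1 hour after the restart and again after 24+ hours, then run reset_command() through subprocess.run, wait 1 hour and call collect() a third time, which gives the three sets of dumps listed in 1) to 3).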
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>> 
>>>> full mempool dump after restart 
>>>> ------------------------------- 
>>>> 
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 22225, 
>>>> "bytes": 14935200 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 11, 
>>>> "bytes": 8184 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 5047, 
>>>> "bytes": 22673736 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 91, 
>>>> "bytes": 1662976 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1907, 
>>>> "bytes": 95600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 19664, 
>>>> "bytes": 25486050 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 46189, 
>>>> "bytes": 2956096 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 17, 
>>>> "bytes": 214366 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 889673, 
>>>> "bytes": 367160400 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3803, 
>>>> "bytes": 224552 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 178515204, 
>>>> "bytes": 2160630547 
>>>> } 
>>>> } 
>>>> } 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "aderumier" <aderumier@odiso.com> 
>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> I'm just seeing 
>>>> 
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>> 
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>> 
>>>> on 1 osd, both 10%. 
>>>> 
>>>> here the dump_mempools 
>>>> 
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 105637, 
>>>> "bytes": 70988064 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 12, 
>>>> "bytes": 8928 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 406, 
>>>> "bytes": 4792868 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 66, 
>>>> "bytes": 1085440 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1882, 
>>>> "bytes": 93600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 138986, 
>>>> "bytes": 24983701 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 544, 
>>>> "bytes": 34816 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 36, 
>>>> "bytes": 179308 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 952564, 
>>>> "bytes": 372459684 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3639, 
>>>> "bytes": 224664 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 260109445, 
>>>> "bytes": 2228370845 
>>>> } 
>>>> } 
>>>> } 
>>>> 
>>>> 
>>>> and the perf dump 
>>>> 
>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>> { 
>>>> "AsyncMessenger::Worker-0": { 
>>>> "msgr_recv_messages": 22948570, 
>>>> "msgr_send_messages": 22561570, 
>>>> "msgr_recv_bytes": 333085080271, 
>>>> "msgr_send_bytes": 261798871204, 
>>>> "msgr_created_connections": 6152, 
>>>> "msgr_active_connections": 2701, 
>>>> "msgr_running_total_time": 1055.197867330, 
>>>> "msgr_running_send_time": 352.764480121, 
>>>> "msgr_running_recv_time": 499.206831955, 
>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>> }, 
>>>> "AsyncMessenger::Worker-1": { 
>>>> "msgr_recv_messages": 18801593, 
>>>> "msgr_send_messages": 18430264, 
>>>> "msgr_recv_bytes": 306871760934, 
>>>> "msgr_send_bytes": 192789048666, 
>>>> "msgr_created_connections": 5773, 
>>>> "msgr_active_connections": 2721, 
>>>> "msgr_running_total_time": 816.821076305, 
>>>> "msgr_running_send_time": 261.353228926, 
>>>> "msgr_running_recv_time": 394.035587911, 
>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>> }, 
>>>> "AsyncMessenger::Worker-2": { 
>>>> "msgr_recv_messages": 18463400, 
>>>> "msgr_send_messages": 18105856, 
>>>> "msgr_recv_bytes": 187425453590, 
>>>> "msgr_send_bytes": 220735102555, 
>>>> "msgr_created_connections": 5897, 
>>>> "msgr_active_connections": 2605, 
>>>> "msgr_running_total_time": 807.186854324, 
>>>> "msgr_running_send_time": 296.834435839, 
>>>> "msgr_running_recv_time": 351.364389691, 
>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>> }, 
>>>> "bluefs": { 
>>>> "gift_bytes": 0, 
>>>> "reclaim_bytes": 0, 
>>>> "db_total_bytes": 256050724864, 
>>>> "db_used_bytes": 12413042688, 
>>>> "wal_total_bytes": 0, 
>>>> "wal_used_bytes": 0, 
>>>> "slow_total_bytes": 0, 
>>>> "slow_used_bytes": 0, 
>>>> "num_files": 209, 
>>>> "log_bytes": 10383360, 
>>>> "log_compactions": 14, 
>>>> "logged_bytes": 336498688, 
>>>> "files_written_wal": 2, 
>>>> "files_written_sst": 4499, 
>>>> "bytes_written_wal": 417989099783, 
>>>> "bytes_written_sst": 213188750209 
>>>> }, 
>>>> "bluestore": { 
>>>> "kv_flush_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 26.734038497, 
>>>> "avgtime": 0.000001013 
>>>> }, 
>>>> "kv_commit_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3397.491150603, 
>>>> "avgtime": 0.000128829 
>>>> }, 
>>>> "kv_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3424.225189100, 
>>>> "avgtime": 0.000129843 
>>>> }, 
>>>> "state_prepare_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3689.542105337, 
>>>> "avgtime": 0.000121028 
>>>> }, 
>>>> "state_aio_wait_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 509.864546111, 
>>>> "avgtime": 0.000016725 
>>>> }, 
>>>> "state_io_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 24.534052953, 
>>>> "avgtime": 0.000000804 
>>>> }, 
>>>> "state_kv_queued_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3488.338424238, 
>>>> "avgtime": 0.000114428 
>>>> }, 
>>>> "state_kv_commiting_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 5660.437003432, 
>>>> "avgtime": 0.000185679 
>>>> }, 
>>>> "state_kv_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 7.763511500, 
>>>> "avgtime": 0.000000254 
>>>> }, 
>>>> "state_deferred_queued_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 666071.296856696, 
>>>> "avgtime": 0.025281557 
>>>> }, 
>>>> "state_deferred_aio_wait_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 1755.660547071, 
>>>> "avgtime": 0.000066638 
>>>> }, 
>>>> "state_deferred_cleanup_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 185465.151653703, 
>>>> "avgtime": 0.007039558 
>>>> }, 
>>>> "state_finishing_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 3.046847481, 
>>>> "avgtime": 0.000000099 
>>>> }, 
>>>> "state_done_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 13193.362685280, 
>>>> "avgtime": 0.000432783 
>>>> }, 
>>>> "throttle_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 14.634269979, 
>>>> "avgtime": 0.000000480 
>>>> }, 
>>>> "submit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3873.883076148, 
>>>> "avgtime": 0.000127075 
>>>> }, 
>>>> "commit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 13376.492317331, 
>>>> "avgtime": 0.000438790 
>>>> }, 
>>>> "read_lat": { 
>>>> "avgcount": 5873923, 
>>>> "sum": 1817.167582057, 
>>>> "avgtime": 0.000309361 
>>>> }, 
>>>> "read_onode_meta_lat": { 
>>>> "avgcount": 19608201, 
>>>> "sum": 146.770464482, 
>>>> "avgtime": 0.000007485 
>>>> }, 
>>>> "read_wait_aio_lat": { 
>>>> "avgcount": 13734278, 
>>>> "sum": 2532.578077242, 
>>>> "avgtime": 0.000184398 
>>>> }, 
>>>> "compress_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "decompress_lat": { 
>>>> "avgcount": 1346945, 
>>>> "sum": 26.227575896, 
>>>> "avgtime": 0.000019471 
>>>> }, 
>>>> "csum_lat": { 
>>>> "avgcount": 28020392, 
>>>> "sum": 149.587819041, 
>>>> "avgtime": 0.000005338 
>>>> }, 
>>>> "compress_success_count": 0, 
>>>> "compress_rejected_count": 0, 
>>>> "write_pad_bytes": 352923605, 
>>>> "deferred_write_ops": 24373340, 
>>>> "deferred_write_bytes": 216791842816, 
>>>> "write_penalty_read_ops": 8062366, 
>>>> "bluestore_allocated": 3765566013440, 
>>>> "bluestore_stored": 4186255221852, 
>>>> "bluestore_compressed": 39981379040, 
>>>> "bluestore_compressed_allocated": 73748348928, 
>>>> "bluestore_compressed_original": 165041381376, 
>>>> "bluestore_onodes": 104232, 
>>>> "bluestore_onode_hits": 71206874, 
>>>> "bluestore_onode_misses": 1217914, 
>>>> "bluestore_onode_shard_hits": 260183292, 
>>>> "bluestore_onode_shard_misses": 22851573, 
>>>> "bluestore_extents": 3394513, 
>>>> "bluestore_blobs": 2773587, 
>>>> "bluestore_buffers": 0, 
>>>> "bluestore_buffer_bytes": 0, 
>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>> "bluestore_write_big": 5648815, 
>>>> "bluestore_write_big_bytes": 552502214656, 
>>>> "bluestore_write_big_blobs": 12440992, 
>>>> "bluestore_write_small": 35883770, 
>>>> "bluestore_write_small_bytes": 223436965719, 
>>>> "bluestore_write_small_unused": 408125, 
>>>> "bluestore_write_small_deferred": 34961455, 
>>>> "bluestore_write_small_pre_read": 34961455, 
>>>> "bluestore_write_small_new": 514190, 
>>>> "bluestore_txc": 30484924, 
>>>> "bluestore_onode_reshard": 5144189, 
>>>> "bluestore_blob_split": 60104, 
>>>> "bluestore_extent_compress": 53347252, 
>>>> "bluestore_gc_merged": 21142528, 
>>>> "bluestore_read_eio": 0, 
>>>> "bluestore_fragmentation_micros": 67 
>>>> }, 
>>>> "finisher-defered_finisher": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "finisher-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 26625163, 
>>>> "sum": 1057.506990951, 
>>>> "avgtime": 0.000039718 
>>>> } 
>>>> }, 
>>>> "finisher-objecter-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "objecter": { 
>>>> "op_active": 0, 
>>>> "op_laggy": 0, 
>>>> "op_send": 0, 
>>>> "op_send_bytes": 0, 
>>>> "op_resend": 0, 
>>>> "op_reply": 0, 
>>>> "op": 0, 
>>>> "op_r": 0, 
>>>> "op_w": 0, 
>>>> "op_rmw": 0, 
>>>> "op_pg": 0, 
>>>> "osdop_stat": 0, 
>>>> "osdop_create": 0, 
>>>> "osdop_read": 0, 
>>>> "osdop_write": 0, 
>>>> "osdop_writefull": 0, 
>>>> "osdop_writesame": 0, 
>>>> "osdop_append": 0, 
>>>> "osdop_zero": 0, 
>>>> "osdop_truncate": 0, 
>>>> "osdop_delete": 0, 
>>>> "osdop_mapext": 0, 
>>>> "osdop_sparse_read": 0, 
>>>> "osdop_clonerange": 0, 
>>>> "osdop_getxattr": 0, 
>>>> "osdop_setxattr": 0, 
>>>> "osdop_cmpxattr": 0, 
>>>> "osdop_rmxattr": 0, 
>>>> "osdop_resetxattrs": 0, 
>>>> "osdop_tmap_up": 0, 
>>>> "osdop_tmap_put": 0, 
>>>> "osdop_tmap_get": 0, 
>>>> "osdop_call": 0, 
>>>> "osdop_watch": 0, 
>>>> "osdop_notify": 0, 
>>>> "osdop_src_cmpxattr": 0, 
>>>> "osdop_pgls": 0, 
>>>> "osdop_pgls_filter": 0, 
>>>> "osdop_other": 0, 
>>>> "linger_active": 0, 
>>>> "linger_send": 0, 
>>>> "linger_resend": 0, 
>>>> "linger_ping": 0, 
>>>> "poolop_active": 0, 
>>>> "poolop_send": 0, 
>>>> "poolop_resend": 0, 
>>>> "poolstat_active": 0, 
>>>> "poolstat_send": 0, 
>>>> "poolstat_resend": 0, 
>>>> "statfs_active": 0, 
>>>> "statfs_send": 0, 
>>>> "statfs_resend": 0, 
>>>> "command_active": 0, 
>>>> "command_send": 0, 
>>>> "command_resend": 0, 
>>>> "map_epoch": 105913, 
>>>> "map_full": 0, 
>>>> "map_inc": 828, 
>>>> "osd_sessions": 0, 
>>>> "osd_session_open": 0, 
>>>> "osd_session_close": 0, 
>>>> "osd_laggy": 0, 
>>>> "omap_wr": 0, 
>>>> "omap_rd": 0, 
>>>> "omap_del": 0 
>>>> }, 
>>>> "osd": { 
>>>> "op_wip": 0, 
>>>> "op": 16758102, 
>>>> "op_in_bytes": 238398820586, 
>>>> "op_out_bytes": 165484999463, 
>>>> "op_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 38242.481640842, 
>>>> "avgtime": 0.002282029 
>>>> }, 
>>>> "op_process_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 28644.906310687, 
>>>> "avgtime": 0.001709316 
>>>> }, 
>>>> "op_prepare_latency": { 
>>>> "avgcount": 16761367, 
>>>> "sum": 3489.856599934, 
>>>> "avgtime": 0.000208208 
>>>> }, 
>>>> "op_r": 6188565, 
>>>> "op_r_out_bytes": 165484999463, 
>>>> "op_r_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 4507.365756792, 
>>>> "avgtime": 0.000728337 
>>>> }, 
>>>> "op_r_process_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 942.363063429, 
>>>> "avgtime": 0.000152274 
>>>> }, 
>>>> "op_r_prepare_latency": { 
>>>> "avgcount": 6188644, 
>>>> "sum": 982.866710389, 
>>>> "avgtime": 0.000158817 
>>>> }, 
>>>> "op_w": 10546037, 
>>>> "op_w_in_bytes": 238334329494, 
>>>> "op_w_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 33160.719998316, 
>>>> "avgtime": 0.003144377 
>>>> }, 
>>>> "op_w_process_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 27668.702029030, 
>>>> "avgtime": 0.002623611 
>>>> }, 
>>>> "op_w_prepare_latency": { 
>>>> "avgcount": 10548652, 
>>>> "sum": 2499.688609173, 
>>>> "avgtime": 0.000236967 
>>>> }, 
>>>> "op_rw": 23500, 
>>>> "op_rw_in_bytes": 64491092, 
>>>> "op_rw_out_bytes": 0, 
>>>> "op_rw_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 574.395885734, 
>>>> "avgtime": 0.024442378 
>>>> }, 
>>>> "op_rw_process_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 33.841218228, 
>>>> "avgtime": 0.001440051 
>>>> }, 
>>>> "op_rw_prepare_latency": { 
>>>> "avgcount": 24071, 
>>>> "sum": 7.301280372, 
>>>> "avgtime": 0.000303322 
>>>> }, 
>>>> "op_before_queue_op_lat": { 
>>>> "avgcount": 57892986, 
>>>> "sum": 1502.117718889, 
>>>> "avgtime": 0.000025946 
>>>> }, 
>>>> "op_before_dequeue_op_lat": { 
>>>> "avgcount": 58091683, 
>>>> "sum": 45194.453254037, 
>>>> "avgtime": 0.000777984 
>>>> }, 
>>>> "subop": 19784758, 
>>>> "subop_in_bytes": 547174969754, 
>>>> "subop_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_w": 19784758, 
>>>> "subop_w_in_bytes": 547174969754, 
>>>> "subop_w_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_pull": 0, 
>>>> "subop_pull_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "subop_push": 0, 
>>>> "subop_push_in_bytes": 0, 
>>>> "subop_push_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "pull": 0, 
>>>> "push": 2003, 
>>>> "push_out_bytes": 5560009728, 
>>>> "recovery_ops": 1940, 
>>>> "loadavg": 118, 
>>>> "buffer_bytes": 0, 
>>>> "history_alloc_Mbytes": 0, 
>>>> "history_alloc_num": 0, 
>>>> "cached_crc": 0, 
>>>> "cached_crc_adjusted": 0, 
>>>> "missed_crc": 0, 
>>>> "numpg": 243, 
>>>> "numpg_primary": 82, 
>>>> "numpg_replica": 161, 
>>>> "numpg_stray": 0, 
>>>> "numpg_removing": 0, 
>>>> "heartbeat_to_peers": 10, 
>>>> "map_messages": 7013, 
>>>> "map_message_epochs": 7143, 
>>>> "map_message_epoch_dups": 6315, 
>>>> "messages_delayed_for_map": 0, 
>>>> "osd_map_cache_hit": 203309, 
>>>> "osd_map_cache_miss": 33, 
>>>> "osd_map_cache_miss_low": 0, 
>>>> "osd_map_cache_miss_low_avg": { 
>>>> "avgcount": 0, 
>>>> "sum": 0 
>>>> }, 
>>>> "osd_map_bl_cache_hit": 47012, 
>>>> "osd_map_bl_cache_miss": 1681, 
>>>> "stat_bytes": 6401248198656, 
>>>> "stat_bytes_used": 3777979072512, 
>>>> "stat_bytes_avail": 2623269126144, 
>>>> "copyfrom": 0, 
>>>> "tier_promote": 0, 
>>>> "tier_flush": 0, 
>>>> "tier_flush_fail": 0, 
>>>> "tier_try_flush": 0, 
>>>> "tier_try_flush_fail": 0, 
>>>> "tier_evict": 0, 
>>>> "tier_whiteout": 1631, 
>>>> "tier_dirty": 22360, 
>>>> "tier_clean": 0, 
>>>> "tier_delay": 0, 
>>>> "tier_proxy_read": 0, 
>>>> "tier_proxy_write": 0, 
>>>> "agent_wake": 0, 
>>>> "agent_skip": 0, 
>>>> "agent_flush": 0, 
>>>> "agent_evict": 0, 
>>>> "object_ctx_cache_hit": 16311156, 
>>>> "object_ctx_cache_total": 17426393, 
>>>> "op_cache_hit": 0, 
>>>> "osd_tier_flush_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_promote_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_r_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_pg_info": 30483113, 
>>>> "osd_pg_fastinfo": 29619885, 
>>>> "osd_pg_biginfo": 81703 
>>>> }, 
>>>> "recoverystate_perf": { 
>>>> "initial_latency": { 
>>>> "avgcount": 243, 
>>>> "sum": 6.869296500, 
>>>> "avgtime": 0.028268709 
>>>> }, 
>>>> "started_latency": { 
>>>> "avgcount": 1125, 
>>>> "sum": 13551384.917335850, 
>>>> "avgtime": 12045.675482076 
>>>> }, 
>>>> "reset_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 1101.727799040, 
>>>> "avgtime": 0.805356578 
>>>> }, 
>>>> "start_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 0.002014799, 
>>>> "avgtime": 0.000001472 
>>>> }, 
>>>> "primary_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 4575560.638823428, 
>>>> "avgtime": 9024.774435549 
>>>> }, 
>>>> "peering_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 499.372283616, 
>>>> "avgtime": 0.907949606 
>>>> }, 
>>>> "backfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitremotebackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitlocalbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "notbackfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "repnotrecovering_latency": { 
>>>> "avgcount": 1009, 
>>>> "sum": 8975301.082274411, 
>>>> "avgtime": 8895.243887288 
>>>> }, 
>>>> "repwaitrecoveryreserved_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 99.846056520, 
>>>> "avgtime": 0.237728706 
>>>> }, 
>>>> "repwaitbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "reprecovering_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 241.682764382, 
>>>> "avgtime": 0.575435153 
>>>> }, 
>>>> "activating_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 16.893347339, 
>>>> "avgtime": 0.033320211 
>>>> }, 
>>>> "waitlocalrecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 672.335512769, 
>>>> "avgtime": 3.378570415 
>>>> }, 
>>>> "waitremoterecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 213.536439363, 
>>>> "avgtime": 1.073047433 
>>>> }, 
>>>> "recovering_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 79.007696479, 
>>>> "avgtime": 0.397023600 
>>>> }, 
>>>> "recovered_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 14.000732748, 
>>>> "avgtime": 0.027614857 
>>>> }, 
>>>> "clean_latency": { 
>>>> "avgcount": 395, 
>>>> "sum": 4574325.900371083, 
>>>> "avgtime": 11580.571899673 
>>>> }, 
>>>> "active_latency": { 
>>>> "avgcount": 425, 
>>>> "sum": 4575107.630123680, 
>>>> "avgtime": 10764.959129702 
>>>> }, 
>>>> "replicaactive_latency": { 
>>>> "avgcount": 589, 
>>>> "sum": 8975184.499049954, 
>>>> "avgtime": 15238.004242869 
>>>> }, 
>>>> "stray_latency": { 
>>>> "avgcount": 818, 
>>>> "sum": 800.729455666, 
>>>> "avgtime": 0.978886865 
>>>> }, 
>>>> "getinfo_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 15.085667048, 
>>>> "avgtime": 0.027428485 
>>>> }, 
>>>> "getlog_latency": { 
>>>> "avgcount": 546, 
>>>> "sum": 3.482175693, 
>>>> "avgtime": 0.006377611 
>>>> }, 
>>>> "waitactingchange_latency": { 
>>>> "avgcount": 39, 
>>>> "sum": 35.444551284, 
>>>> "avgtime": 0.908834648 
>>>> }, 
>>>> "incomplete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "down_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "getmissing_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 6.702129624, 
>>>> "avgtime": 0.013219190 
>>>> }, 
>>>> "waitupthru_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 474.098261727, 
>>>> "avgtime": 0.935105052 
>>>> }, 
>>>> "notrecovering_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "rocksdb": { 
>>>> "get": 28320977, 
>>>> "submit_transaction": 30484924, 
>>>> "submit_transaction_sync": 26371957, 
>>>> "get_latency": { 
>>>> "avgcount": 28320977, 
>>>> "sum": 325.900908733, 
>>>> "avgtime": 0.000011507 
>>>> }, 
>>>> "submit_latency": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 1835.888692371, 
>>>> "avgtime": 0.000060222 
>>>> }, 
>>>> "submit_sync_latency": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 1431.555230628, 
>>>> "avgtime": 0.000054283 
>>>> }, 
>>>> "compact": 0, 
>>>> "compact_range": 0, 
>>>> "compact_queue_merge": 0, 
>>>> "compact_queue_len": 0, 
>>>> "rocksdb_write_wal_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_memtable_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_delay_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_pre_and_post_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> } 
>>>> } 
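Each latency counter in the dump above carries "avgcount", "sum" and "avgtime", where avgtime is sum divided by avgcount. A minimal Python sketch for ranking those counters, assuming the dump has been loaded as JSON; the function name is hypothetical, the sample values are copied from the dump, and placing the two *_lat counters under a "bluestore" section is an assumption:

```python
import json


def latency_counters(node, prefix=""):
    """Yield (dotted_name, avgtime) for every {avgcount, sum} pair in a perf dump tree."""
    for key, value in node.items():
        if not isinstance(value, dict):
            continue  # plain integer counters are not latencies
        name = prefix + key
        if "avgcount" in value and "sum" in value:
            count = value["avgcount"]
            # avgtime is sum / avgcount; report 0.0 for counters that were never hit
            yield name, (value["sum"] / count if count else 0.0)
        else:
            yield from latency_counters(value, name + ".")


# Three counters copied from the dump above (section placement is an assumption).
sample = json.loads("""
{"bluestore": {"commit_lat": {"avgcount": 30484924, "sum": 13376.492317331},
               "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703}},
 "osd": {"op_w_latency": {"avgcount": 10546037, "sum": 33160.719998316}}}
""")

# Highest average latency first.
ranked = sorted(latency_counters(sample), key=lambda kv: kv[1], reverse=True)
for name, avg in ranked:
    print(f"{name}: {avg * 1000:.3f} ms")
```

Note that these are averages over the whole lifetime of the OSD process, so a recent degradation is diluted by earlier good samples.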
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> ok, this is the same 
>>>>> 
>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>> 
>>>>> 
>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>> Hmm, so fragmentation grows over time and drops on OSD restart, doesn't 
>>>> it? Is it the same for other OSDs? 
>>>> 
>>>> This proves there is some issue with the allocator: fragmentation may 
>>>> generally grow, but it should not reset on restart. It looks like some 
>>>> intervals are not properly merged at run time. 
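To illustrate the hypothesis above, here is a toy Python sketch of a free-extent list. It is not Ceph's StupidAllocator; the class and method names are hypothetical. It only shows how freed extents that are not coalesced with their neighbours inflate the extent count, and how rebuilding the list from scratch (an assumed stand-in for what happens at OSD start) brings the count back down:

```python
import bisect


class FreeList:
    """Toy free-space map: a sorted list of non-overlapping (offset, length) extents."""

    def __init__(self, merge=True):
        self.extents = []
        self.merge = merge

    def release(self, offset, length):
        """Return an extent to the free list, optionally coalescing neighbours."""
        i = bisect.bisect_left(self.extents, (offset, length))
        if self.merge:
            # Coalesce with the previous extent if it ends where this one starts.
            if i > 0:
                p_off, p_len = self.extents[i - 1]
                if p_off + p_len == offset:
                    offset, length = p_off, p_len + length
                    del self.extents[i - 1]
                    i -= 1
            # Coalesce with the next extent if it starts where this one ends.
            if i < len(self.extents):
                n_off, n_len = self.extents[i]
                if offset + length == n_off:
                    length += n_len
                    del self.extents[i]
        self.extents.insert(i, (offset, length))

    def rebuild(self):
        """Re-insert every extent with merging enabled (assumed analogue of a restart)."""
        fresh = FreeList(merge=True)
        for off, length in self.extents:
            fresh.release(off, length)
        self.extents = fresh.extents


BLOCK = 4096
leaky = FreeList(merge=False)      # models intervals that are never merged
for k in (2, 0, 3, 1):             # free four adjacent blocks out of order
    leaky.release(k * BLOCK, BLOCK)
print(len(leaky.extents))          # four separate extents
leaky.rebuild()
print(len(leaky.extents))          # one merged extent
```

A longer extent list means more entries to walk for every allocation, which would be consistent with allocator iteration showing up in a profile, although the sketch does not prove that this is what happens in the OSD.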
>>>> 
>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>> caused by that: the fragmentation growth is relatively small, and I don't see 
>>>> how it could impact performance that much. 
>>>> 
>>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>>> output on admin socket) reports? Do you have any historic data? 
>>>> 
>>>> If not may I have current output and say a couple more samples with 
>>>> 8-12 hours interval? 
>>>> 
>>>> 
>>>> With regard to backporting the bitmap allocator to mimic: we haven't had such 
>>>> plans so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Thanks Igor, 
>>>>> 
>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>> startup and once you get high latency. 
>>>>>>> 
>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>> I'm already monitoring with 
>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>> 
>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>> 
>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> 
>>>>> 
>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>> make sure it's degraded and learn more details. 
>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>> 
>>>>> 
>>>>> 
>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>> and try the difference... 
>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi Alexandre, 
>>>>> 
>>>>> looks like a bug in StupidAllocator. 
>>>>> 
>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>> startup and once you get high latency. 
>>>>> 
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>> 
>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>> patch to track latency and some other internal allocator parameters to 
>>>>> make sure it's degraded and learn more details. 
>>>>> 
>>>>> 
>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>> and try the difference... 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>> 
>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>> Hi again, 
>>>>>> 
>>>>>> I spoke too soon: the problem has occurred again, so it is not related to the tcmalloc cache size. 
>>>>>> 
>>>>>> 
>>>>>> I have noticed something using a simple "perf top". 
>>>>>> 
>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>> 
>>>>>> when latency is bad, perf top gives me: 
>>>>>> 
>>>>>> StupidAllocator::_aligned_len 
>>>>>> and 
>>>>>> 
>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>> 
>>>>>> (around 10-20% time for both) 
>>>>>> 
>>>>>> 
>>>>>> when latency is good, I don't see them at all. 
>>>>>> 
>>>>>> 
>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>> 
>>>>>> 
>>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len: 
>>>>>> 
>>>>>> 
>>>>>> + 100.00% clone 
>>>>>> + 100.00% start_thread 
>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> some news: 
>>>>>> 
>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>> 
>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>> 
>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb: it seems to help, after 24h I'm still around 1,5ms. (I need to wait a few more days to be sure.) 
>>>>>> 
>>>>>> 
>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>>>> my other clusters use 1,6TB ssd. 
>>>>>> 
>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000iops per osd), but I'll try this week with 2 osd per nvme, to see if it helps. 
>>>>>> 
>>>>>> 
>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>> 
>>>>>> 
>>>>>> Regards, 
>>>>>> 
>>>>>> Alexandre 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>> 
>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.) 
>>>>>> 
>>>>>> I just don't see a latency difference on reads (or it is very, very small compared with the write latency increase). 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>> Hi Stefan, 
>>>>>>> 
>>>>>>>>> I'm currently in the process of switching back from jemalloc to tcmalloc, 
>>>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>> Maybe it is bluestore related (I don't have filestore anymore to compare). 
>>>>>>> I need to compare with bigger latencies. 
>>>>>>> 
>>>>>>> Here is an example: all osds are at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>> 
>>>>>>> I observe the latency in my guest vms too, as iowait on the disks. 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>> 
>>>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which 
>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>> Here are my influxdb queries. 
>>>>>>> 
>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
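The queries above divide the change in a counter's "sum" by the change in its "avgcount", which gives the average latency of only the ops completed between two samples rather than the lifetime average. A minimal Python sketch of the same calculation, assuming two samples of one counter taken from "perf dump"; the function name and the sample values are hypothetical:

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds of the ops completed between two samples.

    prev and curr are {"avgcount": int, "sum": float} dicts for the same counter.
    Returns None when no op completed in the interval or when the counters went
    backwards (for example after an OSD restart), which plays the role of
    non_negative_derivative in the queries.
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples of op_w_latency taken some seconds apart.
earlier = {"avgcount": 1000, "sum": 1.0}
later = {"avgcount": 1500, "sum": 2.5}

latency = interval_avg_latency(earlier, later)
print(latency)  # 1.5 s spread over 500 ops
```

Computing the ratio per interval is what makes a slow drift visible; the lifetime "avgtime" field would hide it behind weeks of earlier samples.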
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>> 
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>> 
>>>>>> greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> here some new results, 
>>>>>>>> different osd/ different cluster 
>>>>>>>> 
>>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>> 
>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>> I'm currently in the process of switching back from jemalloc to tcmalloc, 
>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>> 
>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which 
>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>> 
>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>> 
>>>>>>> Greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message ----- 
>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>> 
>>>>>>>> Thanks! 
>>>>>>>> sage 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>> 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>> 
>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives; 
>>>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>> 
>>>>>>>>> When the osds are freshly started, the commit latency is between 0,5-1ms. 
>>>>>>>>> 
>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>> values like 20-200ms. 
>>>>>>>>> 
>>>>>>>>> Some example graphs: 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>> 
>>>>>>>>> All osds show this behaviour, in all clusters.
>>>>>>>>>
>>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>>>>>>
>>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>>>
>>>>>>>>> This reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug?
>>>>>>>>>
>>>>>>>>> Any hints for counters/logs to check?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Regards, 
>>>>>>>>> 
>>>>>>>>> Alexandre 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>> 
>>> 
>>> 
>> 
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>> Hi Igor, 
>>> 
>>> Thanks again for helping ! 
>>> 
>>> 
>>> 
>>> I upgraded to the latest mimic this weekend and, with the new memory autotuning,
>>> I have set osd_memory_target to 8G. (My nvme drives are 6TB.)
>>>
>>>
>>> I have taken many perf dumps, mempool dumps and ps snapshots of the process (to see RSS memory) at different hours;
>>> here are the reports for osd.0:
>>> 
>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>> 
>>> 
>>> The osd was started on 12-02-2019 at 08:00.
>>>
>>> First report, after 1h of running:
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>> 
>>> 
>>> 
>>> Report after 24h, before the counter reset:
>>> 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>> 
>>> Report 1h after the counter reset:
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>> 
>>> 
>>> 
>>> 
>>> I see the bluestore buffer bytes memory increasing up to 4G around 14:00 on 12-02-2019:
>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>> After that, it slowly decreases.
>>> 
>>> 
>>> Another strange thing:
>>> I see total bytes at 5G in the 12-02-2018.13:30 dump
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>> It then decreases over time (to around 3.7G this morning), but RSS is still at 8G.
>>>
>>>
>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>>> 
>>> ----- Original Message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 11 February 2019 12:03:17
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>> 
>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>> another mempool dump after 1h run. (latency ok) 
>>>> 
>>>> Biggest difference: 
>>>> 
>>>> before restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> (the other caches seem to be quite low too, as if bluestore_cache_other takes all the memory)
>>>> 
>>>> 
>>>> After restart 
>>>> ------------- 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> 
>>> This is fine: the cache is warming up after the restart, and some rebalancing
>>> between data and metadata might occur.
>>>
>>> What does relate to the allocator, and most probably to fragmentation growth, is:
>>> 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> 
>>> which had been higher before the restart (if I got the order of these
>>> dumps right):
>>> 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> 
>>> But as I mentioned, I'm not 100% sure this could cause such a huge
>>> latency increase...
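
The before/after comparison above can be automated. The following is only a minimal sketch, assuming the JSON layout shown by dump_mempools in this thread (`mempool` -> `by_pool` -> pool -> `bytes`); the function names are hypothetical, and the sample values are copied from the two dumps quoted here (three pools only).

```python
def mempool_bytes(dump):
    """Return {pool_name: bytes} from a parsed dump_mempools document."""
    pools = dump["mempool"]["by_pool"]
    return {name: stats["bytes"] for name, stats in pools.items()}


def diff_mempools(before, after):
    """Return (pool, before_bytes, after_bytes, delta) rows, largest |delta| first."""
    b, a = mempool_bytes(before), mempool_bytes(after)
    rows = []
    for name in sorted(set(b) | set(a)):
        old, new = b.get(name, 0), a.get(name, 0)
        if old != new:
            rows.append((name, old, new, new - old))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


# Values copied from the dumps in this thread (subset of pools).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for name, old, new, delta in diff_mempools(before_restart, after_restart):
    print(f"{name:24s} {old:>12d} -> {new:>12d}  ({delta:+d})")
```

With these inputs the cache pools dominate the difference, while bluestore_alloc shrinks by about 45 MB across the restart, which is the number of interest for the allocator question.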
>>> 
>>> Do you have a perf counters dump from after the restart?
>>>
>>> Could you collect some more dumps, for both mempool and perf counters?
>>>
>>> Ideally I'd like to have:
>>>
>>> 1) mempool/perf counters dumps after the restart (1 hour is OK)
>>>
>>> 2) mempool/perf counters dumps 24+ hours after the restart
>>>
>>> 3) reset the perf counters after 2), wait for 1 hour (without an OSD
>>> restart) and dump mempool/perf counters again.
>>>
>>> That way we'll be able to learn both the allocator memory usage growth and the
>>> operation latency distribution for the following periods:
>>> 
>>> a) 1st hour after restart 
>>> 
>>> b) 25th hour. 
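
The three collection steps above could be scripted roughly as follows. This is a sketch under assumptions: it must run on the OSD host with access to the admin socket, the output file names are made up for illustration, and `perf reset all` is assumed to be available on the admin socket of the release in use.

```python
import json
import subprocess


def admin_socket_cmd(osd_id, *args):
    """Build the argv for an OSD admin-socket command."""
    return ["ceph", "daemon", f"osd.{osd_id}", *args]


def run_json(osd_id, *args):
    """Run an admin-socket command and parse its JSON output."""
    result = subprocess.run(admin_socket_cmd(osd_id, *args), check=True,
                            capture_output=True, text=True)
    return json.loads(result.stdout)


def snapshot(osd_id, label):
    """Save perf and mempool dumps under hypothetical file names."""
    dumps = {"perf": ("perf", "dump"), "mempools": ("dump_mempools",)}
    for name, args in dumps.items():
        with open(f"osd.{osd_id}.{label}.{name}.json", "w") as out:
            json.dump(run_json(osd_id, *args), out, indent=2)


def reset_perf_counters(osd_id):
    """Zero the cumulative perf counters without restarting the OSD."""
    subprocess.run(admin_socket_cmd(osd_id, "perf", "reset", "all"), check=True)


# Intended schedule (driven by cron or by hand, not executed here):
#   1) about 1h after the restart:  snapshot(0, "1h")
#   2) 24+h after the restart:      snapshot(0, "24h"); reset_perf_counters(0)
#   3) 1h after the reset:          snapshot(0, "25h")
```

Resetting the counters before the last sample means the final perf dump describes only the 25th hour, so it can be compared directly with the first-hour dump.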
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>> 
>>>> full mempool dump after restart 
>>>> ------------------------------- 
>>>> 
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 40084, 
>>>> "bytes": 1056235520 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 22225, 
>>>> "bytes": 14935200 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 12432298, 
>>>> "bytes": 500834899 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 11, 
>>>> "bytes": 8184 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 5047, 
>>>> "bytes": 22673736 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 91, 
>>>> "bytes": 1662976 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1907, 
>>>> "bytes": 95600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 19664, 
>>>> "bytes": 25486050 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 46189, 
>>>> "bytes": 2956096 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 17, 
>>>> "bytes": 214366 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 889673, 
>>>> "bytes": 367160400 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3803, 
>>>> "bytes": 224552 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 178515204, 
>>>> "bytes": 2160630547 
>>>> } 
>>>> } 
>>>> } 
>>>> 
>>>> ----- Original Message -----
>>>> From: "aderumier" <aderumier@odiso.com>
>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Friday, 8 February 2019 16:14:54
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> I'm just seeing 
>>>> 
>>>> StupidAllocator::_aligned_len 
>>>> and 
>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>> 
>>>> on 1 osd, both at 10%.
>>>>
>>>> Here is the dump_mempools output:
>>>> 
>>>> { 
>>>> "mempool": { 
>>>> "by_pool": { 
>>>> "bloom_filter": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> "bluestore_cache_data": { 
>>>> "items": 54, 
>>>> "bytes": 643072 
>>>> }, 
>>>> "bluestore_cache_onode": { 
>>>> "items": 105637, 
>>>> "bytes": 70988064 
>>>> }, 
>>>> "bluestore_cache_other": { 
>>>> "items": 48661920, 
>>>> "bytes": 1539544228 
>>>> }, 
>>>> "bluestore_fsck": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "bluestore_txc": { 
>>>> "items": 12, 
>>>> "bytes": 8928 
>>>> }, 
>>>> "bluestore_writing_deferred": { 
>>>> "items": 406, 
>>>> "bytes": 4792868 
>>>> }, 
>>>> "bluestore_writing": { 
>>>> "items": 66, 
>>>> "bytes": 1085440 
>>>> }, 
>>>> "bluefs": { 
>>>> "items": 1882, 
>>>> "bytes": 93600 
>>>> }, 
>>>> "buffer_anon": { 
>>>> "items": 138986, 
>>>> "bytes": 24983701 
>>>> }, 
>>>> "buffer_meta": { 
>>>> "items": 544, 
>>>> "bytes": 34816 
>>>> }, 
>>>> "osd": { 
>>>> "items": 243, 
>>>> "bytes": 3089016 
>>>> }, 
>>>> "osd_mapbl": { 
>>>> "items": 36, 
>>>> "bytes": 179308 
>>>> }, 
>>>> "osd_pglog": { 
>>>> "items": 952564, 
>>>> "bytes": 372459684 
>>>> }, 
>>>> "osdmap": { 
>>>> "items": 3639, 
>>>> "bytes": 224664 
>>>> }, 
>>>> "osdmap_mapping": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "pgmap": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "mds_co": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_1": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> }, 
>>>> "unittest_2": { 
>>>> "items": 0, 
>>>> "bytes": 0 
>>>> } 
>>>> }, 
>>>> "total": { 
>>>> "items": 260109445, 
>>>> "bytes": 2228370845 
>>>> } 
>>>> } 
>>>> } 
>>>> 
>>>> 
>>>> and the perf dump 
>>>> 
>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>> { 
>>>> "AsyncMessenger::Worker-0": { 
>>>> "msgr_recv_messages": 22948570, 
>>>> "msgr_send_messages": 22561570, 
>>>> "msgr_recv_bytes": 333085080271, 
>>>> "msgr_send_bytes": 261798871204, 
>>>> "msgr_created_connections": 6152, 
>>>> "msgr_active_connections": 2701, 
>>>> "msgr_running_total_time": 1055.197867330, 
>>>> "msgr_running_send_time": 352.764480121, 
>>>> "msgr_running_recv_time": 499.206831955, 
>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>> }, 
>>>> "AsyncMessenger::Worker-1": { 
>>>> "msgr_recv_messages": 18801593, 
>>>> "msgr_send_messages": 18430264, 
>>>> "msgr_recv_bytes": 306871760934, 
>>>> "msgr_send_bytes": 192789048666, 
>>>> "msgr_created_connections": 5773, 
>>>> "msgr_active_connections": 2721, 
>>>> "msgr_running_total_time": 816.821076305, 
>>>> "msgr_running_send_time": 261.353228926, 
>>>> "msgr_running_recv_time": 394.035587911, 
>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>> }, 
>>>> "AsyncMessenger::Worker-2": { 
>>>> "msgr_recv_messages": 18463400, 
>>>> "msgr_send_messages": 18105856, 
>>>> "msgr_recv_bytes": 187425453590, 
>>>> "msgr_send_bytes": 220735102555, 
>>>> "msgr_created_connections": 5897, 
>>>> "msgr_active_connections": 2605, 
>>>> "msgr_running_total_time": 807.186854324, 
>>>> "msgr_running_send_time": 296.834435839, 
>>>> "msgr_running_recv_time": 351.364389691, 
>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>> }, 
>>>> "bluefs": { 
>>>> "gift_bytes": 0, 
>>>> "reclaim_bytes": 0, 
>>>> "db_total_bytes": 256050724864, 
>>>> "db_used_bytes": 12413042688, 
>>>> "wal_total_bytes": 0, 
>>>> "wal_used_bytes": 0, 
>>>> "slow_total_bytes": 0, 
>>>> "slow_used_bytes": 0, 
>>>> "num_files": 209, 
>>>> "log_bytes": 10383360, 
>>>> "log_compactions": 14, 
>>>> "logged_bytes": 336498688, 
>>>> "files_written_wal": 2, 
>>>> "files_written_sst": 4499, 
>>>> "bytes_written_wal": 417989099783, 
>>>> "bytes_written_sst": 213188750209 
>>>> }, 
>>>> "bluestore": { 
>>>> "kv_flush_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 26.734038497, 
>>>> "avgtime": 0.000001013 
>>>> }, 
>>>> "kv_commit_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3397.491150603, 
>>>> "avgtime": 0.000128829 
>>>> }, 
>>>> "kv_lat": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 3424.225189100, 
>>>> "avgtime": 0.000129843 
>>>> }, 
>>>> "state_prepare_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3689.542105337, 
>>>> "avgtime": 0.000121028 
>>>> }, 
>>>> "state_aio_wait_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 509.864546111, 
>>>> "avgtime": 0.000016725 
>>>> }, 
>>>> "state_io_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 24.534052953, 
>>>> "avgtime": 0.000000804 
>>>> }, 
>>>> "state_kv_queued_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3488.338424238, 
>>>> "avgtime": 0.000114428 
>>>> }, 
>>>> "state_kv_commiting_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 5660.437003432, 
>>>> "avgtime": 0.000185679 
>>>> }, 
>>>> "state_kv_done_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 7.763511500, 
>>>> "avgtime": 0.000000254 
>>>> }, 
>>>> "state_deferred_queued_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 666071.296856696, 
>>>> "avgtime": 0.025281557 
>>>> }, 
>>>> "state_deferred_aio_wait_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 1755.660547071, 
>>>> "avgtime": 0.000066638 
>>>> }, 
>>>> "state_deferred_cleanup_lat": { 
>>>> "avgcount": 26346134, 
>>>> "sum": 185465.151653703, 
>>>> "avgtime": 0.007039558 
>>>> }, 
>>>> "state_finishing_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 3.046847481, 
>>>> "avgtime": 0.000000099 
>>>> }, 
>>>> "state_done_lat": { 
>>>> "avgcount": 30484920, 
>>>> "sum": 13193.362685280, 
>>>> "avgtime": 0.000432783 
>>>> }, 
>>>> "throttle_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 14.634269979, 
>>>> "avgtime": 0.000000480 
>>>> }, 
>>>> "submit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 3873.883076148, 
>>>> "avgtime": 0.000127075 
>>>> }, 
>>>> "commit_lat": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 13376.492317331, 
>>>> "avgtime": 0.000438790 
>>>> }, 
>>>> "read_lat": { 
>>>> "avgcount": 5873923, 
>>>> "sum": 1817.167582057, 
>>>> "avgtime": 0.000309361 
>>>> }, 
>>>> "read_onode_meta_lat": { 
>>>> "avgcount": 19608201, 
>>>> "sum": 146.770464482, 
>>>> "avgtime": 0.000007485 
>>>> }, 
>>>> "read_wait_aio_lat": { 
>>>> "avgcount": 13734278, 
>>>> "sum": 2532.578077242, 
>>>> "avgtime": 0.000184398 
>>>> }, 
>>>> "compress_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "decompress_lat": { 
>>>> "avgcount": 1346945, 
>>>> "sum": 26.227575896, 
>>>> "avgtime": 0.000019471 
>>>> }, 
>>>> "csum_lat": { 
>>>> "avgcount": 28020392, 
>>>> "sum": 149.587819041, 
>>>> "avgtime": 0.000005338 
>>>> }, 
>>>> "compress_success_count": 0, 
>>>> "compress_rejected_count": 0, 
>>>> "write_pad_bytes": 352923605, 
>>>> "deferred_write_ops": 24373340, 
>>>> "deferred_write_bytes": 216791842816, 
>>>> "write_penalty_read_ops": 8062366, 
>>>> "bluestore_allocated": 3765566013440, 
>>>> "bluestore_stored": 4186255221852, 
>>>> "bluestore_compressed": 39981379040, 
>>>> "bluestore_compressed_allocated": 73748348928, 
>>>> "bluestore_compressed_original": 165041381376, 
>>>> "bluestore_onodes": 104232, 
>>>> "bluestore_onode_hits": 71206874, 
>>>> "bluestore_onode_misses": 1217914, 
>>>> "bluestore_onode_shard_hits": 260183292, 
>>>> "bluestore_onode_shard_misses": 22851573, 
>>>> "bluestore_extents": 3394513, 
>>>> "bluestore_blobs": 2773587, 
>>>> "bluestore_buffers": 0, 
>>>> "bluestore_buffer_bytes": 0, 
>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>> "bluestore_write_big": 5648815, 
>>>> "bluestore_write_big_bytes": 552502214656, 
>>>> "bluestore_write_big_blobs": 12440992, 
>>>> "bluestore_write_small": 35883770, 
>>>> "bluestore_write_small_bytes": 223436965719, 
>>>> "bluestore_write_small_unused": 408125, 
>>>> "bluestore_write_small_deferred": 34961455, 
>>>> "bluestore_write_small_pre_read": 34961455, 
>>>> "bluestore_write_small_new": 514190, 
>>>> "bluestore_txc": 30484924, 
>>>> "bluestore_onode_reshard": 5144189, 
>>>> "bluestore_blob_split": 60104, 
>>>> "bluestore_extent_compress": 53347252, 
>>>> "bluestore_gc_merged": 21142528, 
>>>> "bluestore_read_eio": 0, 
>>>> "bluestore_fragmentation_micros": 67 
>>>> }, 
>>>> "finisher-defered_finisher": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "finisher-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 26625163, 
>>>> "sum": 1057.506990951, 
>>>> "avgtime": 0.000039718 
>>>> } 
>>>> }, 
>>>> "finisher-objecter-finisher-0": { 
>>>> "queue_len": 0, 
>>>> "complete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.0::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.1::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.2::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.3::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.4::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.5::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.6::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "mutex-OSDShard.7::shard_lock": { 
>>>> "wait": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "objecter": { 
>>>> "op_active": 0, 
>>>> "op_laggy": 0, 
>>>> "op_send": 0, 
>>>> "op_send_bytes": 0, 
>>>> "op_resend": 0, 
>>>> "op_reply": 0, 
>>>> "op": 0, 
>>>> "op_r": 0, 
>>>> "op_w": 0, 
>>>> "op_rmw": 0, 
>>>> "op_pg": 0, 
>>>> "osdop_stat": 0, 
>>>> "osdop_create": 0, 
>>>> "osdop_read": 0, 
>>>> "osdop_write": 0, 
>>>> "osdop_writefull": 0, 
>>>> "osdop_writesame": 0, 
>>>> "osdop_append": 0, 
>>>> "osdop_zero": 0, 
>>>> "osdop_truncate": 0, 
>>>> "osdop_delete": 0, 
>>>> "osdop_mapext": 0, 
>>>> "osdop_sparse_read": 0, 
>>>> "osdop_clonerange": 0, 
>>>> "osdop_getxattr": 0, 
>>>> "osdop_setxattr": 0, 
>>>> "osdop_cmpxattr": 0, 
>>>> "osdop_rmxattr": 0, 
>>>> "osdop_resetxattrs": 0, 
>>>> "osdop_tmap_up": 0, 
>>>> "osdop_tmap_put": 0, 
>>>> "osdop_tmap_get": 0, 
>>>> "osdop_call": 0, 
>>>> "osdop_watch": 0, 
>>>> "osdop_notify": 0, 
>>>> "osdop_src_cmpxattr": 0, 
>>>> "osdop_pgls": 0, 
>>>> "osdop_pgls_filter": 0, 
>>>> "osdop_other": 0, 
>>>> "linger_active": 0, 
>>>> "linger_send": 0, 
>>>> "linger_resend": 0, 
>>>> "linger_ping": 0, 
>>>> "poolop_active": 0, 
>>>> "poolop_send": 0, 
>>>> "poolop_resend": 0, 
>>>> "poolstat_active": 0, 
>>>> "poolstat_send": 0, 
>>>> "poolstat_resend": 0, 
>>>> "statfs_active": 0, 
>>>> "statfs_send": 0, 
>>>> "statfs_resend": 0, 
>>>> "command_active": 0, 
>>>> "command_send": 0, 
>>>> "command_resend": 0, 
>>>> "map_epoch": 105913, 
>>>> "map_full": 0, 
>>>> "map_inc": 828, 
>>>> "osd_sessions": 0, 
>>>> "osd_session_open": 0, 
>>>> "osd_session_close": 0, 
>>>> "osd_laggy": 0, 
>>>> "omap_wr": 0, 
>>>> "omap_rd": 0, 
>>>> "omap_del": 0 
>>>> }, 
>>>> "osd": { 
>>>> "op_wip": 0, 
>>>> "op": 16758102, 
>>>> "op_in_bytes": 238398820586, 
>>>> "op_out_bytes": 165484999463, 
>>>> "op_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 38242.481640842, 
>>>> "avgtime": 0.002282029 
>>>> }, 
>>>> "op_process_latency": { 
>>>> "avgcount": 16758102, 
>>>> "sum": 28644.906310687, 
>>>> "avgtime": 0.001709316 
>>>> }, 
>>>> "op_prepare_latency": { 
>>>> "avgcount": 16761367, 
>>>> "sum": 3489.856599934, 
>>>> "avgtime": 0.000208208 
>>>> }, 
>>>> "op_r": 6188565, 
>>>> "op_r_out_bytes": 165484999463, 
>>>> "op_r_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 4507.365756792, 
>>>> "avgtime": 0.000728337 
>>>> }, 
>>>> "op_r_process_latency": { 
>>>> "avgcount": 6188565, 
>>>> "sum": 942.363063429, 
>>>> "avgtime": 0.000152274 
>>>> }, 
>>>> "op_r_prepare_latency": { 
>>>> "avgcount": 6188644, 
>>>> "sum": 982.866710389, 
>>>> "avgtime": 0.000158817 
>>>> }, 
>>>> "op_w": 10546037, 
>>>> "op_w_in_bytes": 238334329494, 
>>>> "op_w_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 33160.719998316, 
>>>> "avgtime": 0.003144377 
>>>> }, 
>>>> "op_w_process_latency": { 
>>>> "avgcount": 10546037, 
>>>> "sum": 27668.702029030, 
>>>> "avgtime": 0.002623611 
>>>> }, 
>>>> "op_w_prepare_latency": { 
>>>> "avgcount": 10548652, 
>>>> "sum": 2499.688609173, 
>>>> "avgtime": 0.000236967 
>>>> }, 
>>>> "op_rw": 23500, 
>>>> "op_rw_in_bytes": 64491092, 
>>>> "op_rw_out_bytes": 0, 
>>>> "op_rw_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 574.395885734, 
>>>> "avgtime": 0.024442378 
>>>> }, 
>>>> "op_rw_process_latency": { 
>>>> "avgcount": 23500, 
>>>> "sum": 33.841218228, 
>>>> "avgtime": 0.001440051 
>>>> }, 
>>>> "op_rw_prepare_latency": { 
>>>> "avgcount": 24071, 
>>>> "sum": 7.301280372, 
>>>> "avgtime": 0.000303322 
>>>> }, 
>>>> "op_before_queue_op_lat": { 
>>>> "avgcount": 57892986, 
>>>> "sum": 1502.117718889, 
>>>> "avgtime": 0.000025946 
>>>> }, 
>>>> "op_before_dequeue_op_lat": { 
>>>> "avgcount": 58091683, 
>>>> "sum": 45194.453254037, 
>>>> "avgtime": 0.000777984 
>>>> }, 
>>>> "subop": 19784758, 
>>>> "subop_in_bytes": 547174969754, 
>>>> "subop_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_w": 19784758, 
>>>> "subop_w_in_bytes": 547174969754, 
>>>> "subop_w_latency": { 
>>>> "avgcount": 19784758, 
>>>> "sum": 13019.714424060, 
>>>> "avgtime": 0.000658067 
>>>> }, 
>>>> "subop_pull": 0, 
>>>> "subop_pull_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "subop_push": 0, 
>>>> "subop_push_in_bytes": 0, 
>>>> "subop_push_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "pull": 0, 
>>>> "push": 2003, 
>>>> "push_out_bytes": 5560009728, 
>>>> "recovery_ops": 1940, 
>>>> "loadavg": 118, 
>>>> "buffer_bytes": 0, 
>>>> "history_alloc_Mbytes": 0, 
>>>> "history_alloc_num": 0, 
>>>> "cached_crc": 0, 
>>>> "cached_crc_adjusted": 0, 
>>>> "missed_crc": 0, 
>>>> "numpg": 243, 
>>>> "numpg_primary": 82, 
>>>> "numpg_replica": 161, 
>>>> "numpg_stray": 0, 
>>>> "numpg_removing": 0, 
>>>> "heartbeat_to_peers": 10, 
>>>> "map_messages": 7013, 
>>>> "map_message_epochs": 7143, 
>>>> "map_message_epoch_dups": 6315, 
>>>> "messages_delayed_for_map": 0, 
>>>> "osd_map_cache_hit": 203309, 
>>>> "osd_map_cache_miss": 33, 
>>>> "osd_map_cache_miss_low": 0, 
>>>> "osd_map_cache_miss_low_avg": { 
>>>> "avgcount": 0, 
>>>> "sum": 0 
>>>> }, 
>>>> "osd_map_bl_cache_hit": 47012, 
>>>> "osd_map_bl_cache_miss": 1681, 
>>>> "stat_bytes": 6401248198656, 
>>>> "stat_bytes_used": 3777979072512, 
>>>> "stat_bytes_avail": 2623269126144, 
>>>> "copyfrom": 0, 
>>>> "tier_promote": 0, 
>>>> "tier_flush": 0, 
>>>> "tier_flush_fail": 0, 
>>>> "tier_try_flush": 0, 
>>>> "tier_try_flush_fail": 0, 
>>>> "tier_evict": 0, 
>>>> "tier_whiteout": 1631, 
>>>> "tier_dirty": 22360, 
>>>> "tier_clean": 0, 
>>>> "tier_delay": 0, 
>>>> "tier_proxy_read": 0, 
>>>> "tier_proxy_write": 0, 
>>>> "agent_wake": 0, 
>>>> "agent_skip": 0, 
>>>> "agent_flush": 0, 
>>>> "agent_evict": 0, 
>>>> "object_ctx_cache_hit": 16311156, 
>>>> "object_ctx_cache_total": 17426393, 
>>>> "op_cache_hit": 0, 
>>>> "osd_tier_flush_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_promote_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_tier_r_lat": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "osd_pg_info": 30483113, 
>>>> "osd_pg_fastinfo": 29619885, 
>>>> "osd_pg_biginfo": 81703 
>>>> }, 
>>>> "recoverystate_perf": { 
>>>> "initial_latency": { 
>>>> "avgcount": 243, 
>>>> "sum": 6.869296500, 
>>>> "avgtime": 0.028268709 
>>>> }, 
>>>> "started_latency": { 
>>>> "avgcount": 1125, 
>>>> "sum": 13551384.917335850, 
>>>> "avgtime": 12045.675482076 
>>>> }, 
>>>> "reset_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 1101.727799040, 
>>>> "avgtime": 0.805356578 
>>>> }, 
>>>> "start_latency": { 
>>>> "avgcount": 1368, 
>>>> "sum": 0.002014799, 
>>>> "avgtime": 0.000001472 
>>>> }, 
>>>> "primary_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 4575560.638823428, 
>>>> "avgtime": 9024.774435549 
>>>> }, 
>>>> "peering_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 499.372283616, 
>>>> "avgtime": 0.907949606 
>>>> }, 
>>>> "backfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitremotebackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "waitlocalbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "notbackfilling_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "repnotrecovering_latency": { 
>>>> "avgcount": 1009, 
>>>> "sum": 8975301.082274411, 
>>>> "avgtime": 8895.243887288 
>>>> }, 
>>>> "repwaitrecoveryreserved_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 99.846056520, 
>>>> "avgtime": 0.237728706 
>>>> }, 
>>>> "repwaitbackfillreserved_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "reprecovering_latency": { 
>>>> "avgcount": 420, 
>>>> "sum": 241.682764382, 
>>>> "avgtime": 0.575435153 
>>>> }, 
>>>> "activating_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 16.893347339, 
>>>> "avgtime": 0.033320211 
>>>> }, 
>>>> "waitlocalrecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 672.335512769, 
>>>> "avgtime": 3.378570415 
>>>> }, 
>>>> "waitremoterecoveryreserved_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 213.536439363, 
>>>> "avgtime": 1.073047433 
>>>> }, 
>>>> "recovering_latency": { 
>>>> "avgcount": 199, 
>>>> "sum": 79.007696479, 
>>>> "avgtime": 0.397023600 
>>>> }, 
>>>> "recovered_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 14.000732748, 
>>>> "avgtime": 0.027614857 
>>>> }, 
>>>> "clean_latency": { 
>>>> "avgcount": 395, 
>>>> "sum": 4574325.900371083, 
>>>> "avgtime": 11580.571899673 
>>>> }, 
>>>> "active_latency": { 
>>>> "avgcount": 425, 
>>>> "sum": 4575107.630123680, 
>>>> "avgtime": 10764.959129702 
>>>> }, 
>>>> "replicaactive_latency": { 
>>>> "avgcount": 589, 
>>>> "sum": 8975184.499049954, 
>>>> "avgtime": 15238.004242869 
>>>> }, 
>>>> "stray_latency": { 
>>>> "avgcount": 818, 
>>>> "sum": 800.729455666, 
>>>> "avgtime": 0.978886865 
>>>> }, 
>>>> "getinfo_latency": { 
>>>> "avgcount": 550, 
>>>> "sum": 15.085667048, 
>>>> "avgtime": 0.027428485 
>>>> }, 
>>>> "getlog_latency": { 
>>>> "avgcount": 546, 
>>>> "sum": 3.482175693, 
>>>> "avgtime": 0.006377611 
>>>> }, 
>>>> "waitactingchange_latency": { 
>>>> "avgcount": 39, 
>>>> "sum": 35.444551284, 
>>>> "avgtime": 0.908834648 
>>>> }, 
>>>> "incomplete_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "down_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "getmissing_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 6.702129624, 
>>>> "avgtime": 0.013219190 
>>>> }, 
>>>> "waitupthru_latency": { 
>>>> "avgcount": 507, 
>>>> "sum": 474.098261727, 
>>>> "avgtime": 0.935105052 
>>>> }, 
>>>> "notrecovering_latency": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> }, 
>>>> "rocksdb": { 
>>>> "get": 28320977, 
>>>> "submit_transaction": 30484924, 
>>>> "submit_transaction_sync": 26371957, 
>>>> "get_latency": { 
>>>> "avgcount": 28320977, 
>>>> "sum": 325.900908733, 
>>>> "avgtime": 0.000011507 
>>>> }, 
>>>> "submit_latency": { 
>>>> "avgcount": 30484924, 
>>>> "sum": 1835.888692371, 
>>>> "avgtime": 0.000060222 
>>>> }, 
>>>> "submit_sync_latency": { 
>>>> "avgcount": 26371957, 
>>>> "sum": 1431.555230628, 
>>>> "avgtime": 0.000054283 
>>>> }, 
>>>> "compact": 0, 
>>>> "compact_range": 0, 
>>>> "compact_queue_merge": 0, 
>>>> "compact_queue_len": 0, 
>>>> "rocksdb_write_wal_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_memtable_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_delay_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> }, 
>>>> "rocksdb_write_pre_and_post_time": { 
>>>> "avgcount": 0, 
>>>> "sum": 0.000000000, 
>>>> "avgtime": 0.000000000 
>>>> } 
>>>> } 
>>>> } 
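
Each latency entry in the perf dump above is cumulative since the last counter reset: `avgtime` is simply `sum / avgcount`. The sketch below (helper names are hypothetical) recomputes that average and also shows how two cumulative samples can be differenced to get the average over just the interval between them, which is what matters when latency drifts slowly over days.

```python
def avgtime(counter):
    """Average seconds per operation for an {'avgcount', 'sum'} latency counter."""
    if not counter["avgcount"]:
        return 0.0
    return counter["sum"] / counter["avgcount"]


def interval_avgtime(earlier, later):
    """Average latency over the interval between two cumulative samples."""
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        return 0.0
    return (later["sum"] - earlier["sum"]) / ops


# commit_lat as reported in the perf dump above.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}
print(f"lifetime commit_lat: {avgtime(commit_lat) * 1000:.3f} ms")

# A made-up later sample, for illustration only: 1000 more ops taking 2 s in total.
later = {"avgcount": commit_lat["avgcount"] + 1000,
         "sum": commit_lat["sum"] + 2.0}
print(f"interval commit_lat: {interval_avgtime(commit_lat, later) * 1000:.3f} ms")
```

A lifetime average can hide a recent degradation behind millions of earlier fast operations; the interval form, or a counter reset followed by a fresh dump, avoids that.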
>>>> 
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "aderumier" <aderumier@odiso.com>
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Tuesday, 5 February 2019 18:56:51
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> ok, this is the same 
>>>>> 
>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>> 
>>>>> 
>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency:
>>>>> 
>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>>>> it? Is it the same for other OSDs?
>>>>
>>>> This points to an issue with the allocator: fragmentation may
>>>> generally grow, but it shouldn't reset on restart. It looks like some
>>>> intervals aren't properly merged at run time.
>>>>
>>>> On the other hand, I'm not completely sure that the latency degradation
>>>> is caused by that: the fragmentation growth is relatively small, and I
>>>> don't see how it could impact performance that much.
>>>>
>>>> Do you have OSD mempool monitoring reports (dump_mempools command
>>>> output on the admin socket)? Do you have any historic data?
>>>>
>>>> If not, may I have the current output and, say, a couple more samples
>>>> at 8-12 hour intervals?
>>>> 
>>>> 
>>>> W.r.t. backporting the bitmap allocator to mimic: we haven't had such
>>>> plans so far, but I'll discuss this at the BlueStore meeting shortly.
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>>> ----- Original message -----
>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Monday, 4 February 2019 16:04:38
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> Thanks Igor, 
>>>>> 
>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>> startup and once you get high latency. 
>>>>>>> 
>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>> I'm already monitoring with
>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>>>> 
>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>> 
>>>>> (but I have bluestore_fragmentation_micros) 
>>>>> 
>>>>> 
>>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>>> patch to track latency and some other internal allocator's parameter to
>>>>>>> make sure it's degraded and learn more details.
>>>>> Sorry, it's a critical production cluster, so I can't test on it :(
>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
>>>>> 
>>>>> 
>>>>> 
>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>> and try the difference... 
>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Monday, 4 February 2019 15:51:30
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> Hi Alexandre, 
>>>>> 
>>>>> looks like a bug in StupidAllocator. 
>>>>> 
>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>> startup and once you get high latency. 
>>>>> 
>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>> 
>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>> patch to track latency and some other internal allocator's parameter to
>>>>> make sure it's degraded and learn more details.
>>>>> 
>>>>> 
>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>> and try the difference... 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>> 
>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>> Hi again, 
>>>>>> 
>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>>
>>>>>>
>>>>>> I have noticed something using a simple "perf top".
>>>>>>
>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>>
>>>>>> when latency is bad, perf top gives me:
>>>>>>
>>>>>> StupidAllocator::_aligned_len
>>>>>> and
>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>
>>>>>> (around 10-20% of the time for both)
>>>>>>
>>>>>>
>>>>>> When latency is good, I don't see them at all.
>>>>>>
>>>>>>
>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>> 
>>>>>> 
>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>>>> 
>>>>>> 
>>>>>> + 100.00% clone 
>>>>>> + 100.00% start_thread 
>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message -----
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> Some news:
>>>>>>
>>>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>>>
>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>>>
>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.)
>>>>>>
>>>>>>
>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB);
>>>>>> my other clusters use 1.6TB SSDs.
>>>>>>
>>>>>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe drive, to see if it helps.
>>>>>>
>>>>>>
>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>>> 
>>>>>> 
>>>>>> Regards, 
>>>>>> 
>>>>>> Alexandre 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message -----
>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> 
>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>> 
>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.)
>>>>>>
>>>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message -----
>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>> Hi Stefan, 
>>>>>>> 
>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>>> It may be bluestore related (I don't have filestore anymore to compare).
>>>>>>> I need to compare with bigger latencies.
>>>>>>>
>>>>>>> Here is an example, with all OSDs at 20-50ms before the restart, then 1ms after the restart (at 21:15):
>>>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>>>
>>>>>>> I observe the latency in my guest VMs too, as disk iowait.
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>>>
>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>> Here are my InfluxDB queries.
>>>>>>>
>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> 
>>>>>>> 
>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
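The queries above divide the growth of a counter's "sum" by the growth of its "avgcount" over an interval. A minimal Python sketch of the same calculation on two successive perf dump samples is given here for illustration; the function name and the sample values are assumptions, not part of the original monitoring setup.

```python
def interval_avg_latency(prev, curr):
    """Average latency (seconds) of the ops completed between two samples.

    prev and curr are the {"avgcount": ..., "sum": ...} dicts of one
    latency counter, taken from two successive perf dumps.
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    # Counters restart from zero when the OSD restarts (or after a perf
    # reset); skip such intervals, much like non_negative_derivative()
    # drops negative rates.
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Made-up sample values, for illustration only.
before = {"avgcount": 1000, "sum": 1.0}
after = {"avgcount": 1500, "sum": 2.0}
print(interval_avg_latency(before, after))
```

Using the lifetime "avgtime" field instead would average over everything since the OSD started, which hides a slow drift; the delta form shows the latency of the most recent interval only.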
>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>> op_r_latency but instead op_latency? 
>>>>>> 
>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>> 
>>>>>> greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> Here are some new results,
>>>>>>>> from a different OSD on a different cluster.
>>>>>>>>
>>>>>>>> Before the OSD restart, latency was between 2-5ms;
>>>>>>>> after the OSD restart, it is around 1-1.5ms.
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>>>
>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>> 
>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>> 
>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>> 
>>>>>>> Greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message -----
>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>> 
>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>> 
>>>>>>>> Thanks! 
>>>>>>>> sage 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>> 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> I see a strange behaviour of my OSDs, on multiple clusters.
>>>>>>>>>
>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives.
>>>>>>>>> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>>>>>>>>>
>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>>>
>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>>>> values like 20-200ms.
>>>>>>>>>
>>>>>>>>> Some example graphs:
>>>>>>>>>
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>>>
>>>>>>>>> All OSDs have this behaviour, in all clusters.
>>>>>>>>>
>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from being fully loaded.)
>>>>>>>>>
>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms.
>>>>>>>>>
>>>>>>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug?
>>>>>>>>>
>>>>>>>>> Any hints for counters/logs to check?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Regards, 
>>>>>>>>> 
>>>>>>>>> Alexandre 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> _______________________________________________ 
>>>>>>>> ceph-users mailing list 
>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <622347904.1243911.1550305749920.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                             ` <622347904.1243911.1550305749920.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-19 10:12                                                                                               ` Igor Fedotov
       [not found]                                                                                                 ` <76764043-4d0d-bb46-2e2e-0b4261963a98-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-19 10:12 UTC (permalink / raw)
  To: Alexandre DERUMIER, Wido den Hollander; +Cc: ceph-users, ceph-devel

Hi Alexandre,

I think op_w_process_latency includes replication times, not 100% sure 
though.

So restarting other nodes might affect latencies at this specific OSD.


Thanks,

Igor

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote:
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are
>>> running with memory target on 6G right now to make sure there is no
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD
>>> so it will max out on 80GB leaving 16GB as spare.
> Thanks Wido. I'll send results on Monday with my increased memory.
>
>
>
> @Igor:
>
> I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an OSD on node1 (restarted 12h ago, for example),
> restarting the OSDs on other nodes (last restarted some days ago, so with bigger latency) reduces the latency on the node1 OSD too.
>
> Does the "op_w_process_latency" counter include replication time?
>
> ----- Original message -----
> From: "Wido den Hollander" <wido@42on.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, 15 February 2019 14:59:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>>>> OSDs as well. Over time their latency increased until we started to
>>>> notice I/O-wait inside VMs.
>> I also notice it in the VMs. BTW, what is your NVMe disk size?
> Samsung PM983 3.84TB SSDs in both clusters.
>
>>
>>>> A restart fixed it. We also increased memory target from 4G to 6G on
>>>> these OSDs as the memory would allow it.
>> I set the memory target to 6GB this morning, with 2 OSDs of 3TB each on a 6TB NVMe drive.
>> (My last test was 8GB with 1 OSD of 6TB, but that didn't help.)
> There are 10 OSDs in these systems with 96GB of memory in total. We are
> running with memory target on 6G right now to make sure there is no
> leakage. If this runs fine for a longer period we will go to 8GB per OSD
> so it will max out on 80GB leaving 16GB as spare.
>
> As these OSDs were all restarted earlier this week I can't tell how it
> will hold up over a longer period. Monitoring (Zabbix) shows the latency
> is fine at the moment.
>
> Wido
>
>>
>> ----- Original message -----
>> From: "Wido den Hollander" <wido@42on.com>
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Friday, 15 February 2019 14:50:34
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:
>>> Thanks Igor.
>>>
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different.
>>>
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there.
>>>
>>>
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>> OSDs as well. Over time their latency increased until we started to
>> notice I/O-wait inside VMs.
>>
>> A restart fixed it. We also increased memory target from 4G to 6G on
>> these OSDs as the memory would allow it.
>>
>> But we noticed this on two different 12.2.10/11 clusters.
>>
>> A restart made the latency drop. Not only the numbers, but the
>> real-world latency as experienced by a VM as well.
>>
>> Wido
>>
>>>
>>>
>>>
>>>
>>> ----- Original message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday, 15 February 2019 13:47:57
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>
>>> Hi Alexandre,
>>>
>>> I've read through your reports, nothing obvious so far.
>>>
>>> I can only see a several-fold increase in average latency for OSD write ops
>>> (in seconds):
>>> 0.002040060 (first hour) vs.
>>>
>>> 0.002483516 (last 24 hours) vs.
>>> 0.008382087 (last hour)
>>>
>>> subop_w_latency:
>>> 0.000478934 (first hour) vs.
>>> 0.000537956 (last 24 hours) vs.
>>> 0.003073475 (last hour)
>>>
>>> and OSD read ops, osd_r_latency:
>>>
>>> 0.000408595 (first hour)
>>> 0.000709031 (24 hours)
>>> 0.004979540 (last hour)
>>>
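To put a number on "several times", the first-hour and last-hour averages listed above can be compared directly. This is only an illustrative sketch: the dictionary labels are taken from the wording of the mail, and `slowdown` is a hypothetical helper name.

```python
# Average latencies in seconds, copied from the figures above:
# (first hour, last 24 hours, last hour).
latencies = {
    "OSD write ops": (0.002040060, 0.002483516, 0.008382087),
    "subop_w_latency": (0.000478934, 0.000537956, 0.003073475),
    "OSD read ops": (0.000408595, 0.000709031, 0.004979540),
}


def slowdown(first_hour, last_hour):
    """How many times slower the last hour is than the first hour."""
    return last_hour / first_hour


for name, (first_hour, _last_24h, last_hour) in latencies.items():
    print(f"{name}: {slowdown(first_hour, last_hour):.1f}x")
```

The last hour comes out at roughly 4x, 6x and 12x the first hour for writes, sub-op writes and reads respectively, while the 24-hour averages stay much closer to the first-hour values.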
>>> What's interesting is that such latency differences aren't observed at
>>> either the BlueStore level (any _lat params under the "bluestore" section)
>>> or the rocksdb one.
>>>
>>> Which probably means that the issue is rather somewhere above BlueStore.
>>>
>>> I suggest proceeding with perf dump collection to see if the picture
>>> stays the same.
>>>
>>> W.r.t. the memory usage you observed, I see nothing suspicious so far: no
>>> decrease in the reported RSS is a known artifact that seems to be safe.
>>>
>>> Thanks,
>>> Igor
>>>
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
>>>> Hi Igor,
>>>>
>>>> Thanks again for helping !
>>>>
>>>>
>>>>
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning
>>>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.)
>>>>
>>>>
>>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process (to see RSS memory) at different hours;
>>>> here are the reports for osd.0:
>>>>
>>>> http://odisoweb1.odiso.net/perfanalysis/
>>>>
>>>>
>>>> The OSD was started on 12-02-2019 at 08:00.
>>>>
>>>> First report, after 1h running:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>>>>
>>>>
>>>>
>>>> Report after 24h, before the counter reset:
>>>>
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>>>>
>>>> Report 1h after the counter reset:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>>>>
>>>>
>>>>
>>>>
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00:
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>>> After that, it slowly decreases.
>>>>
>>>>
>>>> Another strange thing:
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>>>
>>>> I've been graphing the mempool counters too since yesterday, so I'll be able to track them over time.
>>>>
>>>> ----- Original message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 11 February 2019 12:03:17
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
>>>>> another mempool dump after 1h run. (latency ok)
>>>>>
>>>>> Biggest difference:
>>>>>
>>>>> before restart
>>>>> -------------
>>>>> "bluestore_cache_other": {
>>>>> "items": 48661920,
>>>>> "bytes": 1539544228
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 54,
>>>>> "bytes": 643072
>>>>> },
>>>>> (other caches seem to be quite low too, like bluestore_cache_other takes all the memory)
>>>>>
>>>>> After restart
>>>>> -------------
>>>>> "bluestore_cache_other": {
>>>>> "items": 12432298,
>>>>> "bytes": 500834899
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 40084,
>>>>> "bytes": 1056235520
>>>>> },
>>>>>
>>>> This is fine, as the cache is warming up after the restart and some
>>>> rebalancing between data and metadata might occur.
>>>>
>>>> What relates to the allocator, and most probably to fragmentation growth, is:
>>>>
>>>> "bluestore_alloc": {
>>>> "items": 165053952,
>>>> "bytes": 165053952
>>>> },
>>>>
>>>> which was higher before the restart (if I got the order of these dumps
>>>> right):
>>>>
>>>> "bluestore_alloc": {
>>>> "items": 210243456,
>>>> "bytes": 210243456
>>>> },
>>>>
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge
>>>> latency increase...
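A before/after comparison of two dump_mempools outputs like the ones quoted in this thread can be scripted. The sketch below is only an illustration: the function name is hypothetical, and the two dicts are trimmed to three pools using the byte counts from the dumps in this thread.

```python
def mempool_bytes_diff(before, after):
    """Per-pool byte difference (after - before), largest change first."""
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    diff = {
        name: pools_after[name]["bytes"]
        - pools_before.get(name, {"bytes": 0})["bytes"]
        for name in pools_after
    }
    return sorted(diff.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Trimmed to three pools; byte counts copied from the dumps in this thread.
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

for pool, delta in mempool_bytes_diff(before_restart, after_restart):
    print(f"{pool}: {delta:+d} bytes")
```

In these two dumps the cache pools swing by about 1 GB in opposite directions, while bluestore_alloc shrinks by about 45 MB after the restart.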
>>>>
>>>> Do you have perf counters dump after the restart?
>>>>
>>>> Could you collect some more dumps - for both mempool and perf counters?
>>>>
>>>> So ideally I'd like to have:
>>>>
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK)
>>>>
>>>> 2) mempool/perf counters dumps in 24+ hours after restart
>>>>
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD
>>>> restart) and dump mempool/perf counters again.
>>>>
>>>> So we'll be able to learn both allocator mem usage growth and operation
>>>> latency distribution for the following periods:
>>>>
>>>> a) 1st hour after restart
>>>>
>>>> b) 25th hour.
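The three collection steps above could be scripted roughly as follows. This is a sketch under assumptions: it relies on the admin socket commands "perf dump", "dump_mempools" and "perf reset all" being available through "ceph daemon osd.N", and the function names are hypothetical.

```python
import subprocess


def admin_socket_cmd(osd_id, *args):
    """Build a 'ceph daemon osd.N ...' admin socket command line."""
    return ["ceph", "daemon", f"osd.{osd_id}", *args]


def collect(osd_id, run=subprocess.check_output):
    """Return the raw perf counter and mempool dumps for one OSD."""
    return {
        "perf": run(admin_socket_cmd(osd_id, "perf", "dump")),
        "dump_mempools": run(admin_socket_cmd(osd_id, "dump_mempools")),
    }


def reset_perf_counters(osd_id, run=subprocess.check_output):
    """Zero the perf counters without restarting the OSD."""
    return run(admin_socket_cmd(osd_id, "perf", "reset", "all"))
```

The idea would be to call collect() about 1 hour after the restart, again 24+ hours after the restart, then call reset_perf_counters(), wait 1 hour without restarting the OSD, and call collect() a third time. The `run` parameter is only there so that the command runner can be replaced, for example in a dry run.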
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>>> full mempool dump after restart
>>>>> -------------------------------
>>>>>
>>>>> {
>>>>> "mempool": {
>>>>> "by_pool": {
>>>>> "bloom_filter": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_alloc": {
>>>>> "items": 165053952,
>>>>> "bytes": 165053952
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 40084,
>>>>> "bytes": 1056235520
>>>>> },
>>>>> "bluestore_cache_onode": {
>>>>> "items": 22225,
>>>>> "bytes": 14935200
>>>>> },
>>>>> "bluestore_cache_other": {
>>>>> "items": 12432298,
>>>>> "bytes": 500834899
>>>>> },
>>>>> "bluestore_fsck": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_txc": {
>>>>> "items": 11,
>>>>> "bytes": 8184
>>>>> },
>>>>> "bluestore_writing_deferred": {
>>>>> "items": 5047,
>>>>> "bytes": 22673736
>>>>> },
>>>>> "bluestore_writing": {
>>>>> "items": 91,
>>>>> "bytes": 1662976
>>>>> },
>>>>> "bluefs": {
>>>>> "items": 1907,
>>>>> "bytes": 95600
>>>>> },
>>>>> "buffer_anon": {
>>>>> "items": 19664,
>>>>> "bytes": 25486050
>>>>> },
>>>>> "buffer_meta": {
>>>>> "items": 46189,
>>>>> "bytes": 2956096
>>>>> },
>>>>> "osd": {
>>>>> "items": 243,
>>>>> "bytes": 3089016
>>>>> },
>>>>> "osd_mapbl": {
>>>>> "items": 17,
>>>>> "bytes": 214366
>>>>> },
>>>>> "osd_pglog": {
>>>>> "items": 889673,
>>>>> "bytes": 367160400
>>>>> },
>>>>> "osdmap": {
>>>>> "items": 3803,
>>>>> "bytes": 224552
>>>>> },
>>>>> "osdmap_mapping": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "pgmap": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "mds_co": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_1": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_2": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> }
>>>>> },
>>>>> "total": {
>>>>> "items": 178515204,
>>>>> "bytes": 2160630547
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> ----- Original message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday, 8 February 2019 16:14:54
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> I'm just seeing
>>>>>
>>>>> StupidAllocator::_aligned_len
>>>>> and
>>>>>
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>>>> on 1 OSD, both at 10%.
>>>>>
>>>>> Here is the dump_mempools output:
>>>>>
>>>>> {
>>>>> "mempool": {
>>>>> "by_pool": {
>>>>> "bloom_filter": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_alloc": {
>>>>> "items": 210243456,
>>>>> "bytes": 210243456
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 54,
>>>>> "bytes": 643072
>>>>> },
>>>>> "bluestore_cache_onode": {
>>>>> "items": 105637,
>>>>> "bytes": 70988064
>>>>> },
>>>>> "bluestore_cache_other": {
>>>>> "items": 48661920,
>>>>> "bytes": 1539544228
>>>>> },
>>>>> "bluestore_fsck": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_txc": {
>>>>> "items": 12,
>>>>> "bytes": 8928
>>>>> },
>>>>> "bluestore_writing_deferred": {
>>>>> "items": 406,
>>>>> "bytes": 4792868
>>>>> },
>>>>> "bluestore_writing": {
>>>>> "items": 66,
>>>>> "bytes": 1085440
>>>>> },
>>>>> "bluefs": {
>>>>> "items": 1882,
>>>>> "bytes": 93600
>>>>> },
>>>>> "buffer_anon": {
>>>>> "items": 138986,
>>>>> "bytes": 24983701
>>>>> },
>>>>> "buffer_meta": {
>>>>> "items": 544,
>>>>> "bytes": 34816
>>>>> },
>>>>> "osd": {
>>>>> "items": 243,
>>>>> "bytes": 3089016
>>>>> },
>>>>> "osd_mapbl": {
>>>>> "items": 36,
>>>>> "bytes": 179308
>>>>> },
>>>>> "osd_pglog": {
>>>>> "items": 952564,
>>>>> "bytes": 372459684
>>>>> },
>>>>> "osdmap": {
>>>>> "items": 3639,
>>>>> "bytes": 224664
>>>>> },
>>>>> "osdmap_mapping": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "pgmap": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "mds_co": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_1": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_2": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> }
>>>>> },
>>>>> "total": {
>>>>> "items": 260109445,
>>>>> "bytes": 2228370845
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> and the perf dump
>>>>>
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump
>>>>> {
>>>>> "AsyncMessenger::Worker-0": {
>>>>> "msgr_recv_messages": 22948570,
>>>>> "msgr_send_messages": 22561570,
>>>>> "msgr_recv_bytes": 333085080271,
>>>>> "msgr_send_bytes": 261798871204,
>>>>> "msgr_created_connections": 6152,
>>>>> "msgr_active_connections": 2701,
>>>>> "msgr_running_total_time": 1055.197867330,
>>>>> "msgr_running_send_time": 352.764480121,
>>>>> "msgr_running_recv_time": 499.206831955,
>>>>> "msgr_running_fast_dispatch_time": 130.982201607
>>>>> },
>>>>> "AsyncMessenger::Worker-1": {
>>>>> "msgr_recv_messages": 18801593,
>>>>> "msgr_send_messages": 18430264,
>>>>> "msgr_recv_bytes": 306871760934,
>>>>> "msgr_send_bytes": 192789048666,
>>>>> "msgr_created_connections": 5773,
>>>>> "msgr_active_connections": 2721,
>>>>> "msgr_running_total_time": 816.821076305,
>>>>> "msgr_running_send_time": 261.353228926,
>>>>> "msgr_running_recv_time": 394.035587911,
>>>>> "msgr_running_fast_dispatch_time": 104.012155720
>>>>> },
>>>>> "AsyncMessenger::Worker-2": {
>>>>> "msgr_recv_messages": 18463400,
>>>>> "msgr_send_messages": 18105856,
>>>>> "msgr_recv_bytes": 187425453590,
>>>>> "msgr_send_bytes": 220735102555,
>>>>> "msgr_created_connections": 5897,
>>>>> "msgr_active_connections": 2605,
>>>>> "msgr_running_total_time": 807.186854324,
>>>>> "msgr_running_send_time": 296.834435839,
>>>>> "msgr_running_recv_time": 351.364389691,
>>>>> "msgr_running_fast_dispatch_time": 101.215776792
>>>>> },
>>>>> "bluefs": {
>>>>> "gift_bytes": 0,
>>>>> "reclaim_bytes": 0,
>>>>> "db_total_bytes": 256050724864,
>>>>> "db_used_bytes": 12413042688,
>>>>> "wal_total_bytes": 0,
>>>>> "wal_used_bytes": 0,
>>>>> "slow_total_bytes": 0,
>>>>> "slow_used_bytes": 0,
>>>>> "num_files": 209,
>>>>> "log_bytes": 10383360,
>>>>> "log_compactions": 14,
>>>>> "logged_bytes": 336498688,
>>>>> "files_written_wal": 2,
>>>>> "files_written_sst": 4499,
>>>>> "bytes_written_wal": 417989099783,
>>>>> "bytes_written_sst": 213188750209
>>>>> },
>>>>> "bluestore": {
>>>>> "kv_flush_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 26.734038497,
>>>>> "avgtime": 0.000001013
>>>>> },
>>>>> "kv_commit_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 3397.491150603,
>>>>> "avgtime": 0.000128829
>>>>> },
>>>>> "kv_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 3424.225189100,
>>>>> "avgtime": 0.000129843
>>>>> },
>>>>> "state_prepare_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3689.542105337,
>>>>> "avgtime": 0.000121028
>>>>> },
>>>>> "state_aio_wait_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 509.864546111,
>>>>> "avgtime": 0.000016725
>>>>> },
>>>>> "state_io_done_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 24.534052953,
>>>>> "avgtime": 0.000000804
>>>>> },
>>>>> "state_kv_queued_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3488.338424238,
>>>>> "avgtime": 0.000114428
>>>>> },
>>>>> "state_kv_commiting_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 5660.437003432,
>>>>> "avgtime": 0.000185679
>>>>> },
>>>>> "state_kv_done_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 7.763511500,
>>>>> "avgtime": 0.000000254
>>>>> },
>>>>> "state_deferred_queued_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 666071.296856696,
>>>>> "avgtime": 0.025281557
>>>>> },
>>>>> "state_deferred_aio_wait_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 1755.660547071,
>>>>> "avgtime": 0.000066638
>>>>> },
>>>>> "state_deferred_cleanup_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 185465.151653703,
>>>>> "avgtime": 0.007039558
>>>>> },
>>>>> "state_finishing_lat": {
>>>>> "avgcount": 30484920,
>>>>> "sum": 3.046847481,
>>>>> "avgtime": 0.000000099
>>>>> },
>>>>> "state_done_lat": {
>>>>> "avgcount": 30484920,
>>>>> "sum": 13193.362685280,
>>>>> "avgtime": 0.000432783
>>>>> },
>>>>> "throttle_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 14.634269979,
>>>>> "avgtime": 0.000000480
>>>>> },
>>>>> "submit_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3873.883076148,
>>>>> "avgtime": 0.000127075
>>>>> },
>>>>> "commit_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 13376.492317331,
>>>>> "avgtime": 0.000438790
>>>>> },
>>>>> "read_lat": {
>>>>> "avgcount": 5873923,
>>>>> "sum": 1817.167582057,
>>>>> "avgtime": 0.000309361
>>>>> },
>>>>> "read_onode_meta_lat": {
>>>>> "avgcount": 19608201,
>>>>> "sum": 146.770464482,
>>>>> "avgtime": 0.000007485
>>>>> },
>>>>> "read_wait_aio_lat": {
>>>>> "avgcount": 13734278,
>>>>> "sum": 2532.578077242,
>>>>> "avgtime": 0.000184398
>>>>> },
>>>>> "compress_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "decompress_lat": {
>>>>> "avgcount": 1346945,
>>>>> "sum": 26.227575896,
>>>>> "avgtime": 0.000019471
>>>>> },
>>>>> "csum_lat": {
>>>>> "avgcount": 28020392,
>>>>> "sum": 149.587819041,
>>>>> "avgtime": 0.000005338
>>>>> },
>>>>> "compress_success_count": 0,
>>>>> "compress_rejected_count": 0,
>>>>> "write_pad_bytes": 352923605,
>>>>> "deferred_write_ops": 24373340,
>>>>> "deferred_write_bytes": 216791842816,
>>>>> "write_penalty_read_ops": 8062366,
>>>>> "bluestore_allocated": 3765566013440,
>>>>> "bluestore_stored": 4186255221852,
>>>>> "bluestore_compressed": 39981379040,
>>>>> "bluestore_compressed_allocated": 73748348928,
>>>>> "bluestore_compressed_original": 165041381376,
>>>>> "bluestore_onodes": 104232,
>>>>> "bluestore_onode_hits": 71206874,
>>>>> "bluestore_onode_misses": 1217914,
>>>>> "bluestore_onode_shard_hits": 260183292,
>>>>> "bluestore_onode_shard_misses": 22851573,
>>>>> "bluestore_extents": 3394513,
>>>>> "bluestore_blobs": 2773587,
>>>>> "bluestore_buffers": 0,
>>>>> "bluestore_buffer_bytes": 0,
>>>>> "bluestore_buffer_hit_bytes": 62026011221,
>>>>> "bluestore_buffer_miss_bytes": 995233669922,
>>>>> "bluestore_write_big": 5648815,
>>>>> "bluestore_write_big_bytes": 552502214656,
>>>>> "bluestore_write_big_blobs": 12440992,
>>>>> "bluestore_write_small": 35883770,
>>>>> "bluestore_write_small_bytes": 223436965719,
>>>>> "bluestore_write_small_unused": 408125,
>>>>> "bluestore_write_small_deferred": 34961455,
>>>>> "bluestore_write_small_pre_read": 34961455,
>>>>> "bluestore_write_small_new": 514190,
>>>>> "bluestore_txc": 30484924,
>>>>> "bluestore_onode_reshard": 5144189,
>>>>> "bluestore_blob_split": 60104,
>>>>> "bluestore_extent_compress": 53347252,
>>>>> "bluestore_gc_merged": 21142528,
>>>>> "bluestore_read_eio": 0,
>>>>> "bluestore_fragmentation_micros": 67
>>>>> },
>>>>> "finisher-defered_finisher": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "finisher-finisher-0": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 26625163,
>>>>> "sum": 1057.506990951,
>>>>> "avgtime": 0.000039718
>>>>> }
>>>>> },
>>>>> "finisher-objecter-finisher-0": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.0::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.0::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.1::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.1::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.2::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.2::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.3::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.3::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.4::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.4::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.5::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.5::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.6::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.6::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.7::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.7::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "objecter": {
>>>>> "op_active": 0,
>>>>> "op_laggy": 0,
>>>>> "op_send": 0,
>>>>> "op_send_bytes": 0,
>>>>> "op_resend": 0,
>>>>> "op_reply": 0,
>>>>> "op": 0,
>>>>> "op_r": 0,
>>>>> "op_w": 0,
>>>>> "op_rmw": 0,
>>>>> "op_pg": 0,
>>>>> "osdop_stat": 0,
>>>>> "osdop_create": 0,
>>>>> "osdop_read": 0,
>>>>> "osdop_write": 0,
>>>>> "osdop_writefull": 0,
>>>>> "osdop_writesame": 0,
>>>>> "osdop_append": 0,
>>>>> "osdop_zero": 0,
>>>>> "osdop_truncate": 0,
>>>>> "osdop_delete": 0,
>>>>> "osdop_mapext": 0,
>>>>> "osdop_sparse_read": 0,
>>>>> "osdop_clonerange": 0,
>>>>> "osdop_getxattr": 0,
>>>>> "osdop_setxattr": 0,
>>>>> "osdop_cmpxattr": 0,
>>>>> "osdop_rmxattr": 0,
>>>>> "osdop_resetxattrs": 0,
>>>>> "osdop_tmap_up": 0,
>>>>> "osdop_tmap_put": 0,
>>>>> "osdop_tmap_get": 0,
>>>>> "osdop_call": 0,
>>>>> "osdop_watch": 0,
>>>>> "osdop_notify": 0,
>>>>> "osdop_src_cmpxattr": 0,
>>>>> "osdop_pgls": 0,
>>>>> "osdop_pgls_filter": 0,
>>>>> "osdop_other": 0,
>>>>> "linger_active": 0,
>>>>> "linger_send": 0,
>>>>> "linger_resend": 0,
>>>>> "linger_ping": 0,
>>>>> "poolop_active": 0,
>>>>> "poolop_send": 0,
>>>>> "poolop_resend": 0,
>>>>> "poolstat_active": 0,
>>>>> "poolstat_send": 0,
>>>>> "poolstat_resend": 0,
>>>>> "statfs_active": 0,
>>>>> "statfs_send": 0,
>>>>> "statfs_resend": 0,
>>>>> "command_active": 0,
>>>>> "command_send": 0,
>>>>> "command_resend": 0,
>>>>> "map_epoch": 105913,
>>>>> "map_full": 0,
>>>>> "map_inc": 828,
>>>>> "osd_sessions": 0,
>>>>> "osd_session_open": 0,
>>>>> "osd_session_close": 0,
>>>>> "osd_laggy": 0,
>>>>> "omap_wr": 0,
>>>>> "omap_rd": 0,
>>>>> "omap_del": 0
>>>>> },
>>>>> "osd": {
>>>>> "op_wip": 0,
>>>>> "op": 16758102,
>>>>> "op_in_bytes": 238398820586,
>>>>> "op_out_bytes": 165484999463,
>>>>> "op_latency": {
>>>>> "avgcount": 16758102,
>>>>> "sum": 38242.481640842,
>>>>> "avgtime": 0.002282029
>>>>> },
>>>>> "op_process_latency": {
>>>>> "avgcount": 16758102,
>>>>> "sum": 28644.906310687,
>>>>> "avgtime": 0.001709316
>>>>> },
>>>>> "op_prepare_latency": {
>>>>> "avgcount": 16761367,
>>>>> "sum": 3489.856599934,
>>>>> "avgtime": 0.000208208
>>>>> },
>>>>> "op_r": 6188565,
>>>>> "op_r_out_bytes": 165484999463,
>>>>> "op_r_latency": {
>>>>> "avgcount": 6188565,
>>>>> "sum": 4507.365756792,
>>>>> "avgtime": 0.000728337
>>>>> },
>>>>> "op_r_process_latency": {
>>>>> "avgcount": 6188565,
>>>>> "sum": 942.363063429,
>>>>> "avgtime": 0.000152274
>>>>> },
>>>>> "op_r_prepare_latency": {
>>>>> "avgcount": 6188644,
>>>>> "sum": 982.866710389,
>>>>> "avgtime": 0.000158817
>>>>> },
>>>>> "op_w": 10546037,
>>>>> "op_w_in_bytes": 238334329494,
>>>>> "op_w_latency": {
>>>>> "avgcount": 10546037,
>>>>> "sum": 33160.719998316,
>>>>> "avgtime": 0.003144377
>>>>> },
>>>>> "op_w_process_latency": {
>>>>> "avgcount": 10546037,
>>>>> "sum": 27668.702029030,
>>>>> "avgtime": 0.002623611
>>>>> },
>>>>> "op_w_prepare_latency": {
>>>>> "avgcount": 10548652,
>>>>> "sum": 2499.688609173,
>>>>> "avgtime": 0.000236967
>>>>> },
>>>>> "op_rw": 23500,
>>>>> "op_rw_in_bytes": 64491092,
>>>>> "op_rw_out_bytes": 0,
>>>>> "op_rw_latency": {
>>>>> "avgcount": 23500,
>>>>> "sum": 574.395885734,
>>>>> "avgtime": 0.024442378
>>>>> },
>>>>> "op_rw_process_latency": {
>>>>> "avgcount": 23500,
>>>>> "sum": 33.841218228,
>>>>> "avgtime": 0.001440051
>>>>> },
>>>>> "op_rw_prepare_latency": {
>>>>> "avgcount": 24071,
>>>>> "sum": 7.301280372,
>>>>> "avgtime": 0.000303322
>>>>> },
>>>>> "op_before_queue_op_lat": {
>>>>> "avgcount": 57892986,
>>>>> "sum": 1502.117718889,
>>>>> "avgtime": 0.000025946
>>>>> },
>>>>> "op_before_dequeue_op_lat": {
>>>>> "avgcount": 58091683,
>>>>> "sum": 45194.453254037,
>>>>> "avgtime": 0.000777984
>>>>> },
>>>>> "subop": 19784758,
>>>>> "subop_in_bytes": 547174969754,
>>>>> "subop_latency": {
>>>>> "avgcount": 19784758,
>>>>> "sum": 13019.714424060,
>>>>> "avgtime": 0.000658067
>>>>> },
>>>>> "subop_w": 19784758,
>>>>> "subop_w_in_bytes": 547174969754,
>>>>> "subop_w_latency": {
>>>>> "avgcount": 19784758,
>>>>> "sum": 13019.714424060,
>>>>> "avgtime": 0.000658067
>>>>> },
>>>>> "subop_pull": 0,
>>>>> "subop_pull_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "subop_push": 0,
>>>>> "subop_push_in_bytes": 0,
>>>>> "subop_push_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "pull": 0,
>>>>> "push": 2003,
>>>>> "push_out_bytes": 5560009728,
>>>>> "recovery_ops": 1940,
>>>>> "loadavg": 118,
>>>>> "buffer_bytes": 0,
>>>>> "history_alloc_Mbytes": 0,
>>>>> "history_alloc_num": 0,
>>>>> "cached_crc": 0,
>>>>> "cached_crc_adjusted": 0,
>>>>> "missed_crc": 0,
>>>>> "numpg": 243,
>>>>> "numpg_primary": 82,
>>>>> "numpg_replica": 161,
>>>>> "numpg_stray": 0,
>>>>> "numpg_removing": 0,
>>>>> "heartbeat_to_peers": 10,
>>>>> "map_messages": 7013,
>>>>> "map_message_epochs": 7143,
>>>>> "map_message_epoch_dups": 6315,
>>>>> "messages_delayed_for_map": 0,
>>>>> "osd_map_cache_hit": 203309,
>>>>> "osd_map_cache_miss": 33,
>>>>> "osd_map_cache_miss_low": 0,
>>>>> "osd_map_cache_miss_low_avg": {
>>>>> "avgcount": 0,
>>>>> "sum": 0
>>>>> },
>>>>> "osd_map_bl_cache_hit": 47012,
>>>>> "osd_map_bl_cache_miss": 1681,
>>>>> "stat_bytes": 6401248198656,
>>>>> "stat_bytes_used": 3777979072512,
>>>>> "stat_bytes_avail": 2623269126144,
>>>>> "copyfrom": 0,
>>>>> "tier_promote": 0,
>>>>> "tier_flush": 0,
>>>>> "tier_flush_fail": 0,
>>>>> "tier_try_flush": 0,
>>>>> "tier_try_flush_fail": 0,
>>>>> "tier_evict": 0,
>>>>> "tier_whiteout": 1631,
>>>>> "tier_dirty": 22360,
>>>>> "tier_clean": 0,
>>>>> "tier_delay": 0,
>>>>> "tier_proxy_read": 0,
>>>>> "tier_proxy_write": 0,
>>>>> "agent_wake": 0,
>>>>> "agent_skip": 0,
>>>>> "agent_flush": 0,
>>>>> "agent_evict": 0,
>>>>> "object_ctx_cache_hit": 16311156,
>>>>> "object_ctx_cache_total": 17426393,
>>>>> "op_cache_hit": 0,
>>>>> "osd_tier_flush_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_tier_promote_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_tier_r_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_pg_info": 30483113,
>>>>> "osd_pg_fastinfo": 29619885,
>>>>> "osd_pg_biginfo": 81703
>>>>> },
>>>>> "recoverystate_perf": {
>>>>> "initial_latency": {
>>>>> "avgcount": 243,
>>>>> "sum": 6.869296500,
>>>>> "avgtime": 0.028268709
>>>>> },
>>>>> "started_latency": {
>>>>> "avgcount": 1125,
>>>>> "sum": 13551384.917335850,
>>>>> "avgtime": 12045.675482076
>>>>> },
>>>>> "reset_latency": {
>>>>> "avgcount": 1368,
>>>>> "sum": 1101.727799040,
>>>>> "avgtime": 0.805356578
>>>>> },
>>>>> "start_latency": {
>>>>> "avgcount": 1368,
>>>>> "sum": 0.002014799,
>>>>> "avgtime": 0.000001472
>>>>> },
>>>>> "primary_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 4575560.638823428,
>>>>> "avgtime": 9024.774435549
>>>>> },
>>>>> "peering_latency": {
>>>>> "avgcount": 550,
>>>>> "sum": 499.372283616,
>>>>> "avgtime": 0.907949606
>>>>> },
>>>>> "backfilling_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "waitremotebackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "waitlocalbackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "notbackfilling_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "repnotrecovering_latency": {
>>>>> "avgcount": 1009,
>>>>> "sum": 8975301.082274411,
>>>>> "avgtime": 8895.243887288
>>>>> },
>>>>> "repwaitrecoveryreserved_latency": {
>>>>> "avgcount": 420,
>>>>> "sum": 99.846056520,
>>>>> "avgtime": 0.237728706
>>>>> },
>>>>> "repwaitbackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "reprecovering_latency": {
>>>>> "avgcount": 420,
>>>>> "sum": 241.682764382,
>>>>> "avgtime": 0.575435153
>>>>> },
>>>>> "activating_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 16.893347339,
>>>>> "avgtime": 0.033320211
>>>>> },
>>>>> "waitlocalrecoveryreserved_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 672.335512769,
>>>>> "avgtime": 3.378570415
>>>>> },
>>>>> "waitremoterecoveryreserved_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 213.536439363,
>>>>> "avgtime": 1.073047433
>>>>> },
>>>>> "recovering_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 79.007696479,
>>>>> "avgtime": 0.397023600
>>>>> },
>>>>> "recovered_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 14.000732748,
>>>>> "avgtime": 0.027614857
>>>>> },
>>>>> "clean_latency": {
>>>>> "avgcount": 395,
>>>>> "sum": 4574325.900371083,
>>>>> "avgtime": 11580.571899673
>>>>> },
>>>>> "active_latency": {
>>>>> "avgcount": 425,
>>>>> "sum": 4575107.630123680,
>>>>> "avgtime": 10764.959129702
>>>>> },
>>>>> "replicaactive_latency": {
>>>>> "avgcount": 589,
>>>>> "sum": 8975184.499049954,
>>>>> "avgtime": 15238.004242869
>>>>> },
>>>>> "stray_latency": {
>>>>> "avgcount": 818,
>>>>> "sum": 800.729455666,
>>>>> "avgtime": 0.978886865
>>>>> },
>>>>> "getinfo_latency": {
>>>>> "avgcount": 550,
>>>>> "sum": 15.085667048,
>>>>> "avgtime": 0.027428485
>>>>> },
>>>>> "getlog_latency": {
>>>>> "avgcount": 546,
>>>>> "sum": 3.482175693,
>>>>> "avgtime": 0.006377611
>>>>> },
>>>>> "waitactingchange_latency": {
>>>>> "avgcount": 39,
>>>>> "sum": 35.444551284,
>>>>> "avgtime": 0.908834648
>>>>> },
>>>>> "incomplete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "down_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "getmissing_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 6.702129624,
>>>>> "avgtime": 0.013219190
>>>>> },
>>>>> "waitupthru_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 474.098261727,
>>>>> "avgtime": 0.935105052
>>>>> },
>>>>> "notrecovering_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "rocksdb": {
>>>>> "get": 28320977,
>>>>> "submit_transaction": 30484924,
>>>>> "submit_transaction_sync": 26371957,
>>>>> "get_latency": {
>>>>> "avgcount": 28320977,
>>>>> "sum": 325.900908733,
>>>>> "avgtime": 0.000011507
>>>>> },
>>>>> "submit_latency": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 1835.888692371,
>>>>> "avgtime": 0.000060222
>>>>> },
>>>>> "submit_sync_latency": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 1431.555230628,
>>>>> "avgtime": 0.000054283
>>>>> },
>>>>> "compact": 0,
>>>>> "compact_range": 0,
>>>>> "compact_queue_merge": 0,
>>>>> "compact_queue_len": 0,
>>>>> "rocksdb_write_wal_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_memtable_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_delay_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_pre_and_post_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> }
>>>>> }
>>>>>
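Each latency counter in the perf dump above is a running total since the OSD started, so avgtime is simply sum / avgcount. A minimal Python sketch of that arithmetic, using a few values copied from the "bluestore" section; the helper names average_latencies and slowest are illustrative and are not part of Ceph:

```python
import json

# A few counters copied from the "bluestore" section of the perf dump above.
PERF_DUMP = json.loads("""
{
  "bluestore": {
    "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331, "avgtime": 0.000438790},
    "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603, "avgtime": 0.000128829},
    "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696, "avgtime": 0.025281557},
    "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703, "avgtime": 0.007039558},
    "compress_lat": {"avgcount": 0, "sum": 0.0, "avgtime": 0.0},
    "bluestore_fragmentation_micros": 67
  }
}
""")


def average_latencies(perf_dump):
    """Map (section, counter) to sum / avgcount, in seconds since OSD start."""
    result = {}
    for section, counters in perf_dump.items():
        for name, value in counters.items():
            # Plain integer counters (e.g. bluestore_fragmentation_micros) are skipped.
            if isinstance(value, dict) and "avgcount" in value and "sum" in value:
                count = value["avgcount"]
                result[(section, name)] = value["sum"] / count if count else 0.0
    return result


def slowest(perf_dump, top=3):
    """Return the `top` counters with the highest lifetime average latency."""
    ranked = sorted(average_latencies(perf_dump).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


for (section, name), seconds in slowest(PERF_DUMP):
    print(f"{section}.{name}: {seconds * 1000:.3f} ms")
```

Because these are lifetime averages, a slow drift over days is diluted; comparing two samples taken some time apart shows the current latency more clearly.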
>>>>> ----- Original Message -----
>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>,
>>>>>     "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>,
>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Tuesday, 5 February 2019 18:56:51
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>>>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>>>>> (but I have bluestore_fragmentation_micros)
>>>>>> ok, this is the same
>>>>>>
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>>>>
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros
>>>>>> and latency:
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
>>>>> Hmm, so fragmentation grows over time and drops on OSD restart, doesn't
>>>>> it? Is it the same for other OSDs?
>>>>>
>>>>> This proves there is some issue with the allocator: fragmentation may
>>>>> generally grow, but it shouldn't reset on restart. It looks like some
>>>>> intervals aren't properly merged at run time.
>>>>>
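To make the hypothesis above concrete: if adjacent free intervals are left unmerged, the extent count (and so the fragmentation figure) stays high until something rebuilds the free list. The following Python sketch is not Ceph code. It applies the counter description quoted earlier, (free extents / max possible number of free extents) * 1000, under the assumption that "max possible" means free bytes divided by the allocation unit; the extent list and the 4096-byte unit are made up for illustration.

```python
def coalesce(extents):
    """Merge adjacent or overlapping (offset, length) free extents."""
    merged = []
    for offset, length in sorted(extents):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            # Extend the previous extent so it also covers this one.
            merged[-1] = (last_off, max(last_len, offset + length - last_off))
        else:
            merged.append((offset, length))
    return merged


def fragmentation_score(extents, alloc_unit):
    """(free extents / max possible number of free extents) * 1000.

    Assumption: the maximum possible count is one extent per allocation unit.
    """
    free_bytes = sum(length for _, length in extents)
    max_extents = free_bytes // alloc_unit
    if max_extents == 0:
        return 0.0
    return len(extents) / max_extents * 1000


# Hypothetical free list: four adjacent 4 KiB pieces that were never merged,
# plus one separate 16 KiB extent.
FREE = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096), (65536, 16384)]

print(fragmentation_score(FREE, 4096))            # unmerged free list
print(fragmentation_score(coalesce(FREE), 4096))  # same space after merging
```

The free space is identical in both calls; only the number of intervals tracking it differs, which is the kind of change a restart could produce if the free list is rebuilt in merged form.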
>>>>> On the other hand, I'm not completely sure that the latency degradation
>>>>> is caused by that: the fragmentation growth is relatively small, and I
>>>>> don't see how it could impact performance that much.
>>>>>
>>>>> I wonder if you have OSD mempool monitoring reports (dump_mempools
>>>>> command output on the admin socket)? Do you have any historic data?
>>>>>
>>>>> If not, may I have the current output and, say, a couple more samples at
>>>>> 8-12 hour intervals?
>>>>>
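Comparing two dump_mempools samples comes down to a per-pool difference of the items and bytes fields. A small Python sketch, assuming each sample has already been reduced to a {pool: {"items": ..., "bytes": ...}} mapping like the per-pool entries shown earlier in this thread; the "before" numbers are copied from that dump, while the "after" numbers are invented purely for illustration:

```python
def pool_growth(before, after):
    """Return per-pool deltas between two samples, largest byte growth first."""
    growth = {}
    for pool, stats in after.items():
        old = before.get(pool, {"items": 0, "bytes": 0})
        growth[pool] = {
            "items": stats["items"] - old["items"],
            "bytes": stats["bytes"] - old["bytes"],
        }
    return dict(sorted(growth.items(),
                       key=lambda kv: kv[1]["bytes"], reverse=True))


# "before" values come from the dump quoted in this thread.
before = {
    "bluefs": {"items": 1882, "bytes": 93600},
    "buffer_anon": {"items": 138986, "bytes": 24983701},
    "osd_pglog": {"items": 952564, "bytes": 372459684},
}
# "after" values are hypothetical, standing in for a sample taken hours later.
after = {
    "bluefs": {"items": 1882, "bytes": 93600},
    "buffer_anon": {"items": 140000, "bytes": 25983701},
    "osd_pglog": {"items": 952564, "bytes": 372459684},
}

for pool, delta in pool_growth(before, after).items():
    print(pool, delta)
```

Pools whose item or byte counts keep climbing between samples are the ones worth looking at first.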
>>>>>
>>>>> With regard to backporting the bitmap allocator to mimic: we haven't had
>>>>> such plans so far, but I'll discuss this at the BlueStore meeting shortly.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>,
>>>>>>     "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>,
>>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 16:04:38
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> Thanks Igor,
>>>>>>
>>>>>>>> Could you please collect BlueStore performance counters right after
>>>>>>>> OSD startup and once you get high latency.
>>>>>>>>
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>>>> I'm already monitoring with
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all
>>>>>> counters),
>>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>>>
>>>>>> (but I have bluestore_fragmentation_micros)
>>>>>>
>>>>>>
>>>>>>>> Also, if you're able to rebuild the code, I can probably make a simple
>>>>>>>> patch to track latency and some other internal allocator parameters
>>>>>>>> to make sure it's degraded and learn more details.
>>>>>> Sorry, it's a critical production cluster, so I can't test on it :(
>>>>>> But I have a test cluster; maybe I can try to put some load on it
>>>>>> and try to reproduce.
>>>>>>
>>>>>>
>>>>>>>> A more vigorous fix would be to backport the bitmap allocator from
>>>>>>>> Nautilus and try the difference...
>>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>>>> The perf results of the new bitmap allocator seem very promising, from
>>>>>> what I've seen in the PR.
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>,
>>>>>>     "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>,
>>>>>>     "Mark Nelson" <mnelson@redhat.com>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>,
>>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 15:51:30
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> Hi Alexandre,
>>>>>>
>>>>>> This looks like a bug in StupidAllocator.
>>>>>>
>>>>>> Could you please collect BlueStore performance counters right after
>>>>>> OSD startup and once you get high latency.
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>>>>
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple
>>>>>> patch to track latency and some other internal allocator parameters
>>>>>> to make sure it's degraded and learn more details.
>>>>>>
>>>>>>
>>>>>> A more vigorous fix would be to backport the bitmap allocator from
>>>>>> Nautilus and try the difference...
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>>
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>>>>>>> Hi again,
>>>>>>>
>>>>>>> I spoke too soon: the problem has occurred again, so it's not
>>>>>>> related to the tcmalloc cache size.
>>>>>>>
>>>>>>> I have noticed something using a simple "perf top".
>>>>>>>
>>>>>>> Each time I have this problem (I have seen exactly the same
>>>>>>> behaviour 4 times), when latency is bad, perf top shows:
>>>>>>>
>>>>>>> StupidAllocator::_aligned_len
>>>>>>> and
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>>
>>>>>>> (around 10-20% of the time for both)
>>>>>>>
>>>>>>>
>>>>>>> When latency is good, I don't see them at all.
>>>>>>>
>>>>>>>
>>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>>>>>
>>>>>>>
>>>>>>> Here is an extract of the thread with btree::btree_iterator and
>>>>>>> StupidAllocator::_aligned_len:
>>>>>>>
>>>>>>> + 100.00% clone
>>>>>>> + 100.00% start_thread
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>,
>>>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi,
>>>>>>>
>>>>>>> Some news:
>>>>>>>
>>>>>>> I have tried different transparent hugepage values (madvise,
>>>>>>> never): no change.
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>>>>
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to
>>>>>>> 256MB: it seems to help; after 24h I'm still around 1.5ms (I need to
>>>>>>> wait a few more days to be sure).
>>>>>>>
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days)
>>>>>>> on my big NVMe drives (6TB);
>>>>>>> my other clusters use 1.6TB SSDs.
>>>>>>>
>>>>>>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than
>>>>>>> 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe drive to
>>>>>>> see if it helps.
>>>>>>>
>>>>>>> BTW, has anybody already tested Ceph without tcmalloc, with
>>>>>>> glibc >= 2.26 (which also has a thread cache)?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Alexandre
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>,
>>>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency,
>>>>>>>>> but not op_r_latency?
>>>>>>>>>
>>>>>>>>> Also, why do you monitor op_w_process_latency but not
>>>>>>>>> op_r_process_latency?
>>>>>>> I monitor reads too (I have all the metrics from the OSD sockets, and a
>>>>>>> lot of graphs).
>>>>>>> I just don't see a latency difference on reads (or it is very
>>>>>>> small compared to the write latency increase).
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>,
>>>>>>>     "ceph-users" <ceph-users@lists.ceph.com>,
>>>>>>>     "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>>> Hi Stefan,
>>>>>>>>
>>>>>>>>>> Currently I'm in the process of switching back from jemalloc to
>>>>>>>>>> tcmalloc, as suggested. This report makes me a little nervous
>>>>>>>>>> about my change.
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>>>> Maybe it's BlueStore-related (I don't have filestore anymore to compare).
>>>>>>>> I need to compare with bigger latencies.
>>>>>>>>
>>>>>>>> Here is an example where all OSDs are at 20-50ms before restart, then
>>>>>>>> at 1ms after restart (at 21:15):
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>>>>
>>>>>>>> I observe the latency in my guest vm too, on disks iowait.
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>>>>
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>>> Here are my influxdb queries.
>>>>>>>>
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>>>
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
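
[Editor's sketch] The queries above divide the growth of a counter's "sum" by the growth of its "avgcount", which gives the average latency of only the operations completed between two samples. The same arithmetic can be sketched in Python. This is an illustration under assumptions, not part of the original monitoring setup: the function name and the sample values are hypothetical, and only the {"avgcount", "sum"} shape is taken from the perf dump quoted in this thread.

```python
def interval_avg_latency(prev, curr):
    """Average latency (seconds) of the ops completed between two samples.

    Each sample is a dict like {"avgcount": int, "sum": float}, the shape
    of the latency counters in the perf dump quoted in this thread.
    Returns None when no op completed in the interval or when a counter
    went backwards (for example after an OSD restart or a counter reset),
    which mirrors the intent of non_negative_derivative().
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical samples taken a short time apart (made-up values).
prev = {"avgcount": 1000, "sum": 2.0}
curr = {"avgcount": 1500, "sum": 3.5}
latency = interval_avg_latency(prev, curr)
print(latency)  # 0.003, i.e. 3 ms on average for the 500 ops in between
```

Dividing the lifetime "sum" by the lifetime "avgcount" instead would give the average since the OSD started (or since the last counter reset), which hides a slow drift like the one discussed here.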
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>>>>> op_r_latency but instead op_latency?
>>>>>>>
>>>>>>> Also why do you monitor op_w_process_latency? but not
>>>>>>> op_r_process_latency?
>>>>>>> greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> here are some new results,
>>>>>>>>> from a different osd / different cluster
>>>>>>>>>
>>>>>>>>> before the osd restart, latency was between 2-5ms
>>>>>>>>> after the osd restart, it is around 1-1.5ms
>>>>>>>>>
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>>>>
>>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>>>
>>>>>>>> I would like to check if i see the same behaviour.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I see a strange behaviour of my osds, on multiple clusters.
>>>>>>>>>>
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives.
>>>>>>>>>> The workload is rbd only, with qemu-kvm vms running with librbd + snapshot / rbd export-diff / snapshot delete each day for backup.
>>>>>>>>>>
>>>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>>>>
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>>>>> values like 20-200ms.
>>>>>>>>>>
>>>>>>>>>> Some example graphs:
>>>>>>>>>>
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>>>>
>>>>>>>>>> All osds show this behaviour, in all clusters.
>>>>>>>>>>
>>>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.)
>>>>>>>>>>
>>>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>>>>
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>>>>
>>>>>>>>>> Any hints for counters/logs to check?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Alexandre
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>
>>>>
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
>>>> Hi Igor,
>>>>
>>>> Thanks again for helping !
>>>>
>>>>
>>>>
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning,
>>>> I have set osd_memory_target to 8G. (my nvme drives are 6TB)
>>>>
>>>>
>>>> I have taken a lot of perf dumps, mempool dumps and ps outputs of the process (to see rss memory) at different hours.
>>>> Here are the reports for osd.0:
>>>>
>>>> http://odisoweb1.odiso.net/perfanalysis/
>>>>
>>>>
>>>> The osd was started on 12-02-2019 at 08:00.
>>>>
>>>> First report, after 1h running:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>>>>
>>>>
>>>>
>>>> Report after 24h, before the counter reset:
>>>>
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>>>>
>>>> report 1h after counter reset
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>>>>
>>>>
>>>>
>>>>
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>>> Then after that, slowly decreasing.
>>>>
>>>>
>>>> Another strange thing:
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>>>
>>>>
>>>> I have been graphing the mempool counters too since yesterday, so I'll be able to track them over time.
>>>>
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 11 February 2019 12:03:17
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
>>>>> Another mempool dump, after a 1h run (latency ok).
>>>>>
>>>>> Biggest difference:
>>>>>
>>>>> before restart
>>>>> -------------
>>>>> "bluestore_cache_other": {
>>>>> "items": 48661920,
>>>>> "bytes": 1539544228
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 54,
>>>>> "bytes": 643072
>>>>> },
>>>>> (the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)
>>>>>
>>>>>
>>>>> After restart
>>>>> -------------
>>>>> "bluestore_cache_other": {
>>>>> "items": 12432298,
>>>>> "bytes": 500834899
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 40084,
>>>>> "bytes": 1056235520
>>>>> },
>>>>>
>>>> This is fine as cache is warming after restart and some rebalancing
>>>> between data and metadata might occur.
>>>>
>>>> What does relate to the allocator, and most probably to fragmentation growth, is:
>>>>
>>>> "bluestore_alloc": {
>>>> "items": 165053952,
>>>> "bytes": 165053952
>>>> },
>>>>
>>>> which had been higher before the reset (if I have the order of these
>>>> dumps right)
>>>>
>>>> "bluestore_alloc": {
>>>> "items": 210243456,
>>>> "bytes": 210243456
>>>> },
>>>>
>>>> But as I mentioned - I'm not 100% sure this could cause such a huge
>>>> latency increase...
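
[Editor's sketch] The comparison made above for bluestore_alloc can be repeated for every pool by subtracting one dump_mempools output from another. The snippet below is a minimal illustration, assuming the JSON layout shown in the dumps quoted in this thread; the function name is hypothetical, and only three pools are copied in (values taken from the "before restart" and "after restart" dumps) to keep it short.

```python
def mempool_byte_deltas(before, after):
    """Per-pool change in bytes between two dump_mempools outputs.

    Returns (pool, delta) pairs sorted by the size of the change,
    largest first. Pools missing from either dump are skipped.
    """
    pools_before = before["mempool"]["by_pool"]
    pools_after = after["mempool"]["by_pool"]
    deltas = {
        name: pools_after[name]["bytes"] - pools_before[name]["bytes"]
        for name in pools_before
        if name in pools_after
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Three pools copied from the dumps quoted in this thread.
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}

deltas = mempool_byte_deltas(before_restart, after_restart)
for name, delta in deltas:
    print(f"{name}: {delta:+d} bytes")
```

With these inputs the two cache pools move by roughly 1 GB in opposite directions, which matches the cache rebalancing described above, while bluestore_alloc shrinks by about 45 MB.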
>>>>
>>>> Do you have perf counters dump after the restart?
>>>>
>>>> Could you collect some more dumps - for both mempool and perf counters?
>>>>
>>>> So ideally I'd like to have:
>>>>
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK)
>>>>
>>>> 2) mempool/perf counters dumps 24+ hours after restart
>>>>
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD
>>>> restart) and dump mempool/perf counters again.
>>>>
>>>> So we'll be able to learn both allocator mem usage growth and operation
>>>> latency distribution for the following periods:
>>>>
>>>> a) 1st hour after restart
>>>>
>>>> b) 25th hour.
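
[Editor's sketch] The three collection steps requested above could be scripted along the following lines. This is only a sketch under assumptions: it expects to run on the OSD host with access to the admin socket, it uses the "ceph daemon osd.N perf dump" form shown later in this thread together with the "dump_mempools" and "perf reset all" admin-socket commands, and the helper names (admin_cmd, snapshot, reset_perf_counters) are hypothetical.

```python
import json
import subprocess


def admin_cmd(osd_id, *args):
    """Build the admin-socket command line for one OSD."""
    return ["ceph", "daemon", f"osd.{osd_id}", *args]


def snapshot(osd_id, run=subprocess.check_output):
    """Collect one perf-counter dump and one mempool dump as parsed JSON.

    `run` is injectable so the function can be exercised without a cluster.
    """
    return {
        "perf": json.loads(run(admin_cmd(osd_id, "perf", "dump"))),
        "mempools": json.loads(run(admin_cmd(osd_id, "dump_mempools"))),
    }


def reset_perf_counters(osd_id, run=subprocess.check_output):
    """Reset the perf counters without restarting the OSD (step 3)."""
    run(admin_cmd(osd_id, "perf", "reset", "all"))


# Intended order, following steps 1) to 3) above:
#   snapshot(0)             about 1 hour after the restart
#   snapshot(0)             24+ hours after the restart
#   reset_perf_counters(0)  then wait 1 hour, then snapshot(0) again
```

Keeping each snapshot as parsed JSON makes it straightforward to compare the first hour with the 25th hour afterwards.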
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>>> full mempool dump after restart
>>>>> -------------------------------
>>>>>
>>>>> {
>>>>> "mempool": {
>>>>> "by_pool": {
>>>>> "bloom_filter": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_alloc": {
>>>>> "items": 165053952,
>>>>> "bytes": 165053952
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 40084,
>>>>> "bytes": 1056235520
>>>>> },
>>>>> "bluestore_cache_onode": {
>>>>> "items": 22225,
>>>>> "bytes": 14935200
>>>>> },
>>>>> "bluestore_cache_other": {
>>>>> "items": 12432298,
>>>>> "bytes": 500834899
>>>>> },
>>>>> "bluestore_fsck": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_txc": {
>>>>> "items": 11,
>>>>> "bytes": 8184
>>>>> },
>>>>> "bluestore_writing_deferred": {
>>>>> "items": 5047,
>>>>> "bytes": 22673736
>>>>> },
>>>>> "bluestore_writing": {
>>>>> "items": 91,
>>>>> "bytes": 1662976
>>>>> },
>>>>> "bluefs": {
>>>>> "items": 1907,
>>>>> "bytes": 95600
>>>>> },
>>>>> "buffer_anon": {
>>>>> "items": 19664,
>>>>> "bytes": 25486050
>>>>> },
>>>>> "buffer_meta": {
>>>>> "items": 46189,
>>>>> "bytes": 2956096
>>>>> },
>>>>> "osd": {
>>>>> "items": 243,
>>>>> "bytes": 3089016
>>>>> },
>>>>> "osd_mapbl": {
>>>>> "items": 17,
>>>>> "bytes": 214366
>>>>> },
>>>>> "osd_pglog": {
>>>>> "items": 889673,
>>>>> "bytes": 367160400
>>>>> },
>>>>> "osdmap": {
>>>>> "items": 3803,
>>>>> "bytes": 224552
>>>>> },
>>>>> "osdmap_mapping": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "pgmap": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "mds_co": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_1": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_2": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> }
>>>>> },
>>>>> "total": {
>>>>> "items": 178515204,
>>>>> "bytes": 2160630547
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday, 8 February 2019 16:14:54
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> I'm just seeing
>>>>>
>>>>> StupidAllocator::_aligned_len
>>>>> and
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>>>>
>>>>> on 1 osd, both at 10%.
>>>>>
>>>>> Here is the dump_mempools output:
>>>>>
>>>>> {
>>>>> "mempool": {
>>>>> "by_pool": {
>>>>> "bloom_filter": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_alloc": {
>>>>> "items": 210243456,
>>>>> "bytes": 210243456
>>>>> },
>>>>> "bluestore_cache_data": {
>>>>> "items": 54,
>>>>> "bytes": 643072
>>>>> },
>>>>> "bluestore_cache_onode": {
>>>>> "items": 105637,
>>>>> "bytes": 70988064
>>>>> },
>>>>> "bluestore_cache_other": {
>>>>> "items": 48661920,
>>>>> "bytes": 1539544228
>>>>> },
>>>>> "bluestore_fsck": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "bluestore_txc": {
>>>>> "items": 12,
>>>>> "bytes": 8928
>>>>> },
>>>>> "bluestore_writing_deferred": {
>>>>> "items": 406,
>>>>> "bytes": 4792868
>>>>> },
>>>>> "bluestore_writing": {
>>>>> "items": 66,
>>>>> "bytes": 1085440
>>>>> },
>>>>> "bluefs": {
>>>>> "items": 1882,
>>>>> "bytes": 93600
>>>>> },
>>>>> "buffer_anon": {
>>>>> "items": 138986,
>>>>> "bytes": 24983701
>>>>> },
>>>>> "buffer_meta": {
>>>>> "items": 544,
>>>>> "bytes": 34816
>>>>> },
>>>>> "osd": {
>>>>> "items": 243,
>>>>> "bytes": 3089016
>>>>> },
>>>>> "osd_mapbl": {
>>>>> "items": 36,
>>>>> "bytes": 179308
>>>>> },
>>>>> "osd_pglog": {
>>>>> "items": 952564,
>>>>> "bytes": 372459684
>>>>> },
>>>>> "osdmap": {
>>>>> "items": 3639,
>>>>> "bytes": 224664
>>>>> },
>>>>> "osdmap_mapping": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "pgmap": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "mds_co": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_1": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> },
>>>>> "unittest_2": {
>>>>> "items": 0,
>>>>> "bytes": 0
>>>>> }
>>>>> },
>>>>> "total": {
>>>>> "items": 260109445,
>>>>> "bytes": 2228370845
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> and the perf dump
>>>>>
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump
>>>>> {
>>>>> "AsyncMessenger::Worker-0": {
>>>>> "msgr_recv_messages": 22948570,
>>>>> "msgr_send_messages": 22561570,
>>>>> "msgr_recv_bytes": 333085080271,
>>>>> "msgr_send_bytes": 261798871204,
>>>>> "msgr_created_connections": 6152,
>>>>> "msgr_active_connections": 2701,
>>>>> "msgr_running_total_time": 1055.197867330,
>>>>> "msgr_running_send_time": 352.764480121,
>>>>> "msgr_running_recv_time": 499.206831955,
>>>>> "msgr_running_fast_dispatch_time": 130.982201607
>>>>> },
>>>>> "AsyncMessenger::Worker-1": {
>>>>> "msgr_recv_messages": 18801593,
>>>>> "msgr_send_messages": 18430264,
>>>>> "msgr_recv_bytes": 306871760934,
>>>>> "msgr_send_bytes": 192789048666,
>>>>> "msgr_created_connections": 5773,
>>>>> "msgr_active_connections": 2721,
>>>>> "msgr_running_total_time": 816.821076305,
>>>>> "msgr_running_send_time": 261.353228926,
>>>>> "msgr_running_recv_time": 394.035587911,
>>>>> "msgr_running_fast_dispatch_time": 104.012155720
>>>>> },
>>>>> "AsyncMessenger::Worker-2": {
>>>>> "msgr_recv_messages": 18463400,
>>>>> "msgr_send_messages": 18105856,
>>>>> "msgr_recv_bytes": 187425453590,
>>>>> "msgr_send_bytes": 220735102555,
>>>>> "msgr_created_connections": 5897,
>>>>> "msgr_active_connections": 2605,
>>>>> "msgr_running_total_time": 807.186854324,
>>>>> "msgr_running_send_time": 296.834435839,
>>>>> "msgr_running_recv_time": 351.364389691,
>>>>> "msgr_running_fast_dispatch_time": 101.215776792
>>>>> },
>>>>> "bluefs": {
>>>>> "gift_bytes": 0,
>>>>> "reclaim_bytes": 0,
>>>>> "db_total_bytes": 256050724864,
>>>>> "db_used_bytes": 12413042688,
>>>>> "wal_total_bytes": 0,
>>>>> "wal_used_bytes": 0,
>>>>> "slow_total_bytes": 0,
>>>>> "slow_used_bytes": 0,
>>>>> "num_files": 209,
>>>>> "log_bytes": 10383360,
>>>>> "log_compactions": 14,
>>>>> "logged_bytes": 336498688,
>>>>> "files_written_wal": 2,
>>>>> "files_written_sst": 4499,
>>>>> "bytes_written_wal": 417989099783,
>>>>> "bytes_written_sst": 213188750209
>>>>> },
>>>>> "bluestore": {
>>>>> "kv_flush_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 26.734038497,
>>>>> "avgtime": 0.000001013
>>>>> },
>>>>> "kv_commit_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 3397.491150603,
>>>>> "avgtime": 0.000128829
>>>>> },
>>>>> "kv_lat": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 3424.225189100,
>>>>> "avgtime": 0.000129843
>>>>> },
>>>>> "state_prepare_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3689.542105337,
>>>>> "avgtime": 0.000121028
>>>>> },
>>>>> "state_aio_wait_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 509.864546111,
>>>>> "avgtime": 0.000016725
>>>>> },
>>>>> "state_io_done_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 24.534052953,
>>>>> "avgtime": 0.000000804
>>>>> },
>>>>> "state_kv_queued_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3488.338424238,
>>>>> "avgtime": 0.000114428
>>>>> },
>>>>> "state_kv_commiting_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 5660.437003432,
>>>>> "avgtime": 0.000185679
>>>>> },
>>>>> "state_kv_done_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 7.763511500,
>>>>> "avgtime": 0.000000254
>>>>> },
>>>>> "state_deferred_queued_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 666071.296856696,
>>>>> "avgtime": 0.025281557
>>>>> },
>>>>> "state_deferred_aio_wait_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 1755.660547071,
>>>>> "avgtime": 0.000066638
>>>>> },
>>>>> "state_deferred_cleanup_lat": {
>>>>> "avgcount": 26346134,
>>>>> "sum": 185465.151653703,
>>>>> "avgtime": 0.007039558
>>>>> },
>>>>> "state_finishing_lat": {
>>>>> "avgcount": 30484920,
>>>>> "sum": 3.046847481,
>>>>> "avgtime": 0.000000099
>>>>> },
>>>>> "state_done_lat": {
>>>>> "avgcount": 30484920,
>>>>> "sum": 13193.362685280,
>>>>> "avgtime": 0.000432783
>>>>> },
>>>>> "throttle_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 14.634269979,
>>>>> "avgtime": 0.000000480
>>>>> },
>>>>> "submit_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 3873.883076148,
>>>>> "avgtime": 0.000127075
>>>>> },
>>>>> "commit_lat": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 13376.492317331,
>>>>> "avgtime": 0.000438790
>>>>> },
>>>>> "read_lat": {
>>>>> "avgcount": 5873923,
>>>>> "sum": 1817.167582057,
>>>>> "avgtime": 0.000309361
>>>>> },
>>>>> "read_onode_meta_lat": {
>>>>> "avgcount": 19608201,
>>>>> "sum": 146.770464482,
>>>>> "avgtime": 0.000007485
>>>>> },
>>>>> "read_wait_aio_lat": {
>>>>> "avgcount": 13734278,
>>>>> "sum": 2532.578077242,
>>>>> "avgtime": 0.000184398
>>>>> },
>>>>> "compress_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "decompress_lat": {
>>>>> "avgcount": 1346945,
>>>>> "sum": 26.227575896,
>>>>> "avgtime": 0.000019471
>>>>> },
>>>>> "csum_lat": {
>>>>> "avgcount": 28020392,
>>>>> "sum": 149.587819041,
>>>>> "avgtime": 0.000005338
>>>>> },
>>>>> "compress_success_count": 0,
>>>>> "compress_rejected_count": 0,
>>>>> "write_pad_bytes": 352923605,
>>>>> "deferred_write_ops": 24373340,
>>>>> "deferred_write_bytes": 216791842816,
>>>>> "write_penalty_read_ops": 8062366,
>>>>> "bluestore_allocated": 3765566013440,
>>>>> "bluestore_stored": 4186255221852,
>>>>> "bluestore_compressed": 39981379040,
>>>>> "bluestore_compressed_allocated": 73748348928,
>>>>> "bluestore_compressed_original": 165041381376,
>>>>> "bluestore_onodes": 104232,
>>>>> "bluestore_onode_hits": 71206874,
>>>>> "bluestore_onode_misses": 1217914,
>>>>> "bluestore_onode_shard_hits": 260183292,
>>>>> "bluestore_onode_shard_misses": 22851573,
>>>>> "bluestore_extents": 3394513,
>>>>> "bluestore_blobs": 2773587,
>>>>> "bluestore_buffers": 0,
>>>>> "bluestore_buffer_bytes": 0,
>>>>> "bluestore_buffer_hit_bytes": 62026011221,
>>>>> "bluestore_buffer_miss_bytes": 995233669922,
>>>>> "bluestore_write_big": 5648815,
>>>>> "bluestore_write_big_bytes": 552502214656,
>>>>> "bluestore_write_big_blobs": 12440992,
>>>>> "bluestore_write_small": 35883770,
>>>>> "bluestore_write_small_bytes": 223436965719,
>>>>> "bluestore_write_small_unused": 408125,
>>>>> "bluestore_write_small_deferred": 34961455,
>>>>> "bluestore_write_small_pre_read": 34961455,
>>>>> "bluestore_write_small_new": 514190,
>>>>> "bluestore_txc": 30484924,
>>>>> "bluestore_onode_reshard": 5144189,
>>>>> "bluestore_blob_split": 60104,
>>>>> "bluestore_extent_compress": 53347252,
>>>>> "bluestore_gc_merged": 21142528,
>>>>> "bluestore_read_eio": 0,
>>>>> "bluestore_fragmentation_micros": 67
>>>>> },
>>>>> "finisher-defered_finisher": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "finisher-finisher-0": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 26625163,
>>>>> "sum": 1057.506990951,
>>>>> "avgtime": 0.000039718
>>>>> }
>>>>> },
>>>>> "finisher-objecter-finisher-0": {
>>>>> "queue_len": 0,
>>>>> "complete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.0::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.0::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.1::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.1::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.2::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.2::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.3::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.3::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.4::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.4::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.5::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.5::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.6::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.6::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.7::sdata_wait_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "mutex-OSDShard.7::shard_lock": {
>>>>> "wait": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "objecter": {
>>>>> "op_active": 0,
>>>>> "op_laggy": 0,
>>>>> "op_send": 0,
>>>>> "op_send_bytes": 0,
>>>>> "op_resend": 0,
>>>>> "op_reply": 0,
>>>>> "op": 0,
>>>>> "op_r": 0,
>>>>> "op_w": 0,
>>>>> "op_rmw": 0,
>>>>> "op_pg": 0,
>>>>> "osdop_stat": 0,
>>>>> "osdop_create": 0,
>>>>> "osdop_read": 0,
>>>>> "osdop_write": 0,
>>>>> "osdop_writefull": 0,
>>>>> "osdop_writesame": 0,
>>>>> "osdop_append": 0,
>>>>> "osdop_zero": 0,
>>>>> "osdop_truncate": 0,
>>>>> "osdop_delete": 0,
>>>>> "osdop_mapext": 0,
>>>>> "osdop_sparse_read": 0,
>>>>> "osdop_clonerange": 0,
>>>>> "osdop_getxattr": 0,
>>>>> "osdop_setxattr": 0,
>>>>> "osdop_cmpxattr": 0,
>>>>> "osdop_rmxattr": 0,
>>>>> "osdop_resetxattrs": 0,
>>>>> "osdop_tmap_up": 0,
>>>>> "osdop_tmap_put": 0,
>>>>> "osdop_tmap_get": 0,
>>>>> "osdop_call": 0,
>>>>> "osdop_watch": 0,
>>>>> "osdop_notify": 0,
>>>>> "osdop_src_cmpxattr": 0,
>>>>> "osdop_pgls": 0,
>>>>> "osdop_pgls_filter": 0,
>>>>> "osdop_other": 0,
>>>>> "linger_active": 0,
>>>>> "linger_send": 0,
>>>>> "linger_resend": 0,
>>>>> "linger_ping": 0,
>>>>> "poolop_active": 0,
>>>>> "poolop_send": 0,
>>>>> "poolop_resend": 0,
>>>>> "poolstat_active": 0,
>>>>> "poolstat_send": 0,
>>>>> "poolstat_resend": 0,
>>>>> "statfs_active": 0,
>>>>> "statfs_send": 0,
>>>>> "statfs_resend": 0,
>>>>> "command_active": 0,
>>>>> "command_send": 0,
>>>>> "command_resend": 0,
>>>>> "map_epoch": 105913,
>>>>> "map_full": 0,
>>>>> "map_inc": 828,
>>>>> "osd_sessions": 0,
>>>>> "osd_session_open": 0,
>>>>> "osd_session_close": 0,
>>>>> "osd_laggy": 0,
>>>>> "omap_wr": 0,
>>>>> "omap_rd": 0,
>>>>> "omap_del": 0
>>>>> },
>>>>> "osd": {
>>>>> "op_wip": 0,
>>>>> "op": 16758102,
>>>>> "op_in_bytes": 238398820586,
>>>>> "op_out_bytes": 165484999463,
>>>>> "op_latency": {
>>>>> "avgcount": 16758102,
>>>>> "sum": 38242.481640842,
>>>>> "avgtime": 0.002282029
>>>>> },
>>>>> "op_process_latency": {
>>>>> "avgcount": 16758102,
>>>>> "sum": 28644.906310687,
>>>>> "avgtime": 0.001709316
>>>>> },
>>>>> "op_prepare_latency": {
>>>>> "avgcount": 16761367,
>>>>> "sum": 3489.856599934,
>>>>> "avgtime": 0.000208208
>>>>> },
>>>>> "op_r": 6188565,
>>>>> "op_r_out_bytes": 165484999463,
>>>>> "op_r_latency": {
>>>>> "avgcount": 6188565,
>>>>> "sum": 4507.365756792,
>>>>> "avgtime": 0.000728337
>>>>> },
>>>>> "op_r_process_latency": {
>>>>> "avgcount": 6188565,
>>>>> "sum": 942.363063429,
>>>>> "avgtime": 0.000152274
>>>>> },
>>>>> "op_r_prepare_latency": {
>>>>> "avgcount": 6188644,
>>>>> "sum": 982.866710389,
>>>>> "avgtime": 0.000158817
>>>>> },
>>>>> "op_w": 10546037,
>>>>> "op_w_in_bytes": 238334329494,
>>>>> "op_w_latency": {
>>>>> "avgcount": 10546037,
>>>>> "sum": 33160.719998316,
>>>>> "avgtime": 0.003144377
>>>>> },
>>>>> "op_w_process_latency": {
>>>>> "avgcount": 10546037,
>>>>> "sum": 27668.702029030,
>>>>> "avgtime": 0.002623611
>>>>> },
>>>>> "op_w_prepare_latency": {
>>>>> "avgcount": 10548652,
>>>>> "sum": 2499.688609173,
>>>>> "avgtime": 0.000236967
>>>>> },
>>>>> "op_rw": 23500,
>>>>> "op_rw_in_bytes": 64491092,
>>>>> "op_rw_out_bytes": 0,
>>>>> "op_rw_latency": {
>>>>> "avgcount": 23500,
>>>>> "sum": 574.395885734,
>>>>> "avgtime": 0.024442378
>>>>> },
>>>>> "op_rw_process_latency": {
>>>>> "avgcount": 23500,
>>>>> "sum": 33.841218228,
>>>>> "avgtime": 0.001440051
>>>>> },
>>>>> "op_rw_prepare_latency": {
>>>>> "avgcount": 24071,
>>>>> "sum": 7.301280372,
>>>>> "avgtime": 0.000303322
>>>>> },
>>>>> "op_before_queue_op_lat": {
>>>>> "avgcount": 57892986,
>>>>> "sum": 1502.117718889,
>>>>> "avgtime": 0.000025946
>>>>> },
>>>>> "op_before_dequeue_op_lat": {
>>>>> "avgcount": 58091683,
>>>>> "sum": 45194.453254037,
>>>>> "avgtime": 0.000777984
>>>>> },
>>>>> "subop": 19784758,
>>>>> "subop_in_bytes": 547174969754,
>>>>> "subop_latency": {
>>>>> "avgcount": 19784758,
>>>>> "sum": 13019.714424060,
>>>>> "avgtime": 0.000658067
>>>>> },
>>>>> "subop_w": 19784758,
>>>>> "subop_w_in_bytes": 547174969754,
>>>>> "subop_w_latency": {
>>>>> "avgcount": 19784758,
>>>>> "sum": 13019.714424060,
>>>>> "avgtime": 0.000658067
>>>>> },
>>>>> "subop_pull": 0,
>>>>> "subop_pull_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "subop_push": 0,
>>>>> "subop_push_in_bytes": 0,
>>>>> "subop_push_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "pull": 0,
>>>>> "push": 2003,
>>>>> "push_out_bytes": 5560009728,
>>>>> "recovery_ops": 1940,
>>>>> "loadavg": 118,
>>>>> "buffer_bytes": 0,
>>>>> "history_alloc_Mbytes": 0,
>>>>> "history_alloc_num": 0,
>>>>> "cached_crc": 0,
>>>>> "cached_crc_adjusted": 0,
>>>>> "missed_crc": 0,
>>>>> "numpg": 243,
>>>>> "numpg_primary": 82,
>>>>> "numpg_replica": 161,
>>>>> "numpg_stray": 0,
>>>>> "numpg_removing": 0,
>>>>> "heartbeat_to_peers": 10,
>>>>> "map_messages": 7013,
>>>>> "map_message_epochs": 7143,
>>>>> "map_message_epoch_dups": 6315,
>>>>> "messages_delayed_for_map": 0,
>>>>> "osd_map_cache_hit": 203309,
>>>>> "osd_map_cache_miss": 33,
>>>>> "osd_map_cache_miss_low": 0,
>>>>> "osd_map_cache_miss_low_avg": {
>>>>> "avgcount": 0,
>>>>> "sum": 0
>>>>> },
>>>>> "osd_map_bl_cache_hit": 47012,
>>>>> "osd_map_bl_cache_miss": 1681,
>>>>> "stat_bytes": 6401248198656,
>>>>> "stat_bytes_used": 3777979072512,
>>>>> "stat_bytes_avail": 2623269126144,
>>>>> "copyfrom": 0,
>>>>> "tier_promote": 0,
>>>>> "tier_flush": 0,
>>>>> "tier_flush_fail": 0,
>>>>> "tier_try_flush": 0,
>>>>> "tier_try_flush_fail": 0,
>>>>> "tier_evict": 0,
>>>>> "tier_whiteout": 1631,
>>>>> "tier_dirty": 22360,
>>>>> "tier_clean": 0,
>>>>> "tier_delay": 0,
>>>>> "tier_proxy_read": 0,
>>>>> "tier_proxy_write": 0,
>>>>> "agent_wake": 0,
>>>>> "agent_skip": 0,
>>>>> "agent_flush": 0,
>>>>> "agent_evict": 0,
>>>>> "object_ctx_cache_hit": 16311156,
>>>>> "object_ctx_cache_total": 17426393,
>>>>> "op_cache_hit": 0,
>>>>> "osd_tier_flush_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_tier_promote_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_tier_r_lat": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "osd_pg_info": 30483113,
>>>>> "osd_pg_fastinfo": 29619885,
>>>>> "osd_pg_biginfo": 81703
>>>>> },
>>>>> "recoverystate_perf": {
>>>>> "initial_latency": {
>>>>> "avgcount": 243,
>>>>> "sum": 6.869296500,
>>>>> "avgtime": 0.028268709
>>>>> },
>>>>> "started_latency": {
>>>>> "avgcount": 1125,
>>>>> "sum": 13551384.917335850,
>>>>> "avgtime": 12045.675482076
>>>>> },
>>>>> "reset_latency": {
>>>>> "avgcount": 1368,
>>>>> "sum": 1101.727799040,
>>>>> "avgtime": 0.805356578
>>>>> },
>>>>> "start_latency": {
>>>>> "avgcount": 1368,
>>>>> "sum": 0.002014799,
>>>>> "avgtime": 0.000001472
>>>>> },
>>>>> "primary_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 4575560.638823428,
>>>>> "avgtime": 9024.774435549
>>>>> },
>>>>> "peering_latency": {
>>>>> "avgcount": 550,
>>>>> "sum": 499.372283616,
>>>>> "avgtime": 0.907949606
>>>>> },
>>>>> "backfilling_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "waitremotebackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "waitlocalbackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "notbackfilling_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "repnotrecovering_latency": {
>>>>> "avgcount": 1009,
>>>>> "sum": 8975301.082274411,
>>>>> "avgtime": 8895.243887288
>>>>> },
>>>>> "repwaitrecoveryreserved_latency": {
>>>>> "avgcount": 420,
>>>>> "sum": 99.846056520,
>>>>> "avgtime": 0.237728706
>>>>> },
>>>>> "repwaitbackfillreserved_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "reprecovering_latency": {
>>>>> "avgcount": 420,
>>>>> "sum": 241.682764382,
>>>>> "avgtime": 0.575435153
>>>>> },
>>>>> "activating_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 16.893347339,
>>>>> "avgtime": 0.033320211
>>>>> },
>>>>> "waitlocalrecoveryreserved_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 672.335512769,
>>>>> "avgtime": 3.378570415
>>>>> },
>>>>> "waitremoterecoveryreserved_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 213.536439363,
>>>>> "avgtime": 1.073047433
>>>>> },
>>>>> "recovering_latency": {
>>>>> "avgcount": 199,
>>>>> "sum": 79.007696479,
>>>>> "avgtime": 0.397023600
>>>>> },
>>>>> "recovered_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 14.000732748,
>>>>> "avgtime": 0.027614857
>>>>> },
>>>>> "clean_latency": {
>>>>> "avgcount": 395,
>>>>> "sum": 4574325.900371083,
>>>>> "avgtime": 11580.571899673
>>>>> },
>>>>> "active_latency": {
>>>>> "avgcount": 425,
>>>>> "sum": 4575107.630123680,
>>>>> "avgtime": 10764.959129702
>>>>> },
>>>>> "replicaactive_latency": {
>>>>> "avgcount": 589,
>>>>> "sum": 8975184.499049954,
>>>>> "avgtime": 15238.004242869
>>>>> },
>>>>> "stray_latency": {
>>>>> "avgcount": 818,
>>>>> "sum": 800.729455666,
>>>>> "avgtime": 0.978886865
>>>>> },
>>>>> "getinfo_latency": {
>>>>> "avgcount": 550,
>>>>> "sum": 15.085667048,
>>>>> "avgtime": 0.027428485
>>>>> },
>>>>> "getlog_latency": {
>>>>> "avgcount": 546,
>>>>> "sum": 3.482175693,
>>>>> "avgtime": 0.006377611
>>>>> },
>>>>> "waitactingchange_latency": {
>>>>> "avgcount": 39,
>>>>> "sum": 35.444551284,
>>>>> "avgtime": 0.908834648
>>>>> },
>>>>> "incomplete_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "down_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "getmissing_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 6.702129624,
>>>>> "avgtime": 0.013219190
>>>>> },
>>>>> "waitupthru_latency": {
>>>>> "avgcount": 507,
>>>>> "sum": 474.098261727,
>>>>> "avgtime": 0.935105052
>>>>> },
>>>>> "notrecovering_latency": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> },
>>>>> "rocksdb": {
>>>>> "get": 28320977,
>>>>> "submit_transaction": 30484924,
>>>>> "submit_transaction_sync": 26371957,
>>>>> "get_latency": {
>>>>> "avgcount": 28320977,
>>>>> "sum": 325.900908733,
>>>>> "avgtime": 0.000011507
>>>>> },
>>>>> "submit_latency": {
>>>>> "avgcount": 30484924,
>>>>> "sum": 1835.888692371,
>>>>> "avgtime": 0.000060222
>>>>> },
>>>>> "submit_sync_latency": {
>>>>> "avgcount": 26371957,
>>>>> "sum": 1431.555230628,
>>>>> "avgtime": 0.000054283
>>>>> },
>>>>> "compact": 0,
>>>>> "compact_range": 0,
>>>>> "compact_queue_merge": 0,
>>>>> "compact_queue_len": 0,
>>>>> "rocksdb_write_wal_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_memtable_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_delay_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> },
>>>>> "rocksdb_write_pre_and_post_time": {
>>>>> "avgcount": 0,
>>>>> "sum": 0.000000000,
>>>>> "avgtime": 0.000000000
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Tuesday, 5 February 2019 18:56:51
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:
>>>>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>>>>> (but I have bluestore_fragmentation_micros)
>>>>>> ok, this is the same
>>>>>>
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");
>>>>>>
>>>>>>
>>>>>> Here a graph on last month, with bluestore_fragmentation_micros and latency,
>>>>>>
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
>>>>> hmm, so fragmentation grows over time and drops on OSD restarts, doesn't
>>>>> it? Is it the same for other OSDs?
>>>>>
>>>>> This proves some issue with the allocator - generally fragmentation
>>>>> might grow but it shouldn't reset on restart. It looks like some intervals
>>>>> aren't properly merged at run time.
>>>>>
>>>>> On the other hand, I'm not completely sure that the latency degradation is
>>>>> caused by that - fragmentation growth is relatively small - I don't see
>>>>> how this might impact performance that much.
>>>>>
>>>>> Wondering if you have OSD mempool monitoring (dump_mempools command
>>>>> output on admin socket) reports? Do you have any historic data?
>>>>>
>>>>> If not may I have current output and say a couple more samples with
>>>>> 8-12 hours interval?
>>>>>
>>>>>
>>>>> Wrt to backporting bitmap allocator to mimic - we haven't had such plans
>>>>> before that but I'll discuss this at BlueStore meeting shortly.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 16:04:38
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Thanks Igor,
>>>>>>
>>>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>>>> startup and once you get high latency.
>>>>>>>>
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>>>> I'm already monitoring with
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters)
>>>>>>
>>>>>> but I don't see l_bluestore_fragmentation counter.
>>>>>>
>>>>>> (but I have bluestore_fragmentation_micros)
>>>>>>
>>>>>>
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>>>> make sure it's degraded and learn more details.
>>>>>> Sorry, It's a critical production cluster, I can't test on it :(
>>>>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>>>>> and try the difference...
>>>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus)
>>>>>> perf results of new bitmap allocator seem very promising from what I've seen in PR.
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday, 4 February 2019 15:51:30
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>
>>>>>> Hi Alexandre,
>>>>>>
>>>>>> looks like a bug in StupidAllocator.
>>>>>>
>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>> startup and once you get high latency.
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>>>>
>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>> make sure it's degraded and learn more details.
>>>>>>
>>>>>>
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>>> and try the difference...
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>>
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
>>>>>>> Hi again,
>>>>>>>
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>>>
>>>>>>>
>>>>>>> I have noticed something using a simple "perf top":
>>>>>>>
>>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>>>
>>>>>>> when latency is bad, perf top gives me:
>>>>>>>
>>>>>>> StupidAllocator::_aligned_len
>>>>>>> and
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>>
>>>>>>> (around 10-20% time for both)
>>>>>>>
>>>>>>>
>>>>>>> when latency is good, I don't see them at all.
>>>>>>>
>>>>>>>
>>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt
>>>>>>>
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>>>>>>>
>>>>>>>
>>>>>>> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len
>>>>>>>
>>>>>>>
>>>>>>> + 100.00% clone
>>>>>>> + 100.00% start_thread
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> some news:
>>>>>>>
>>>>>>> I have tried with different transparent hugepage values (madvise, never) : no change
>>>>>>>
>>>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change
>>>>>>>
>>>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait some more days to be sure)
>>>>>>>
>>>>>>>
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB);
>>>>>>> my other clusters use 1.6TB ssds.
>>>>>>>
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps.
>>>>>>>
>>>>>>>
>>>>>>> BTW, has somebody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Alexandre
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>>>>>>> op_r_latency but instead op_latency?
>>>>>>>>>
>>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs).
>>>>>>>
>>>>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>>> Hi Stefan,
>>>>>>>>
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>>>> Well,I'm really not sure that it's a tcmalloc bug.
>>>>>>>> maybe bluestore related (don't have filestore anymore to compare)
>>>>>>>> I need to compare with bigger latencies
>>>>>>>>
>>>>>>>> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png
>>>>>>>>
>>>>>>>> I observe the latency in my guest vm too, on disks iowait.
>>>>>>>>
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png
>>>>>>>>
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>>> here my influxdb queries:
>>>>>>>>
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>>>
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not
>>>>>>> op_r_latency but instead op_latency?
>>>>>>>
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency?
>>>>>>>
>>>>>>> greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> here some new results,
>>>>>>>>> different osd/ different cluster
>>>>>>>>>
>>>>>>>>> before osd restart latency was between 2-5ms
>>>>>>>>> after osd restart is around 1-1.5ms
>>>>>>>>>
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt
>>>>>>>>>
>>>>>>>>>  From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
>>>>>>>>> (I'm using tcmalloc 2.5-2.2)
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc
>>>>>>>> like suggested. This report makes me a little nervous about my change.
>>>>>>>>
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which
>>>>>>>> exact values out of the daemon do you use for bluestore?
>>>>>>>>
>>>>>>>> I would like to check if i see the same behaviour.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>>>
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters,
>>>>>>>>>>
>>>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers,
>>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup
>>>>>>>>>>
>>>>>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms.
>>>>>>>>>>
>>>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy
>>>>>>>>>> values like 20-200ms.
>>>>>>>>>>
>>>>>>>>>> Some example graphs:
>>>>>>>>>>
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png
>>>>>>>>>>
>>>>>>>>>> All osds have this behaviour, in all clusters.
>>>>>>>>>>
>>>>>>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded)
>>>>>>>>>>
>>>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms.
>>>>>>>>>>
>>>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ?
>>>>>>>>>>
>>>>>>>>>> Any Hints for counters/logs to check ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Alexandre
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <76764043-4d0d-bb46-2e2e-0b4261963a98-l3A5Bk7waGM@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                                 ` <76764043-4d0d-bb46-2e2e-0b4261963a98-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-19 16:03                                                                                                   ` Alexandre DERUMIER
       [not found]                                                                                                     ` <121987882.59219.1550592238495.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-19 16:03 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>>
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared it with sub_op_latency.

I have changed my graph to clearly identify the OSDs where the latency is high.
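
As a rough illustration of the comparison above: perf dump latency counters are cumulative (an op count in "avgcount" and a total time in seconds in "sum"), so a per-interval average comes from the difference between two samples, which is what the derivative-based graph queries quoted further down compute. The sketch below is only a minimal example under that assumption; the helper name interval_avg and all sample values are invented for illustration and are not taken from a real OSD.

```python
from typing import Optional


def interval_avg(before: dict, after: dict) -> Optional[float]:
    """Average latency (seconds) of the ops completed between two samples.

    Each sample is one latency counter from a perf dump, for example
    {"avgcount": 1368, "sum": 1101.7}: a cumulative op count and a
    cumulative total time in seconds.
    """
    d_count = after["avgcount"] - before["avgcount"]
    d_sum = after["sum"] - before["sum"]
    if d_count <= 0 or d_sum < 0:
        # No ops in the interval, or the counters were reset (OSD restart).
        return None
    return d_sum / d_count


# Hypothetical samples taken some time apart (values invented).
t0 = {
    "op_w_process_latency": {"avgcount": 1000, "sum": 1.0},
    "subop_w_latency": {"avgcount": 2000, "sum": 1.0},
}
t1 = {
    "op_w_process_latency": {"avgcount": 1500, "sum": 5.0},
    "subop_w_latency": {"avgcount": 3000, "sum": 1.5},
}

result = {name: interval_avg(t0[name], t1[name]) for name in t0}
for name, value in result.items():
    print(f"{name}: {value * 1000:.2f} ms")
```

With these made-up numbers the primary-side write latency is high while the sub-op latency on the same OSD stays low, which is the kind of pattern that would point at other OSDs rather than this one.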


I have made some changes to my setup:
- 2 OSDs per NVMe (2 x 3TB), with 6GB memory each (instead of 1 OSD of 6TB with 12GB memory).
- transparent hugepages disabled

After 24h, latencies are still low (between 0.7 and 1.2 ms).

I'm also seeing that total memory used (as reported by free) is lower than before: 48GB (8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB).

I'll send more stats tomorrow.

Alexandre


----- Original Message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "Alexandre Derumier" <aderumier@odiso.com>, "Wido den Hollander" <wido@42on.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, 19 February 2019 11:12:43
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 
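
A toy model of why that could happen, assuming (as suggested above, without certainty) that the measured write latency on a primary includes the wait for its replicas: in a replicated pool the primary acknowledges a write only once its own commit and every replica sub-op have completed, so the slowest participant dominates. The function name and every latency value below are hypothetical, and network time is ignored.

```python
from typing import Sequence


def primary_write_latency(local_commit: float,
                          replica_subops: Sequence[float]) -> float:
    """Toy model: a write completes when the slowest participant is done."""
    return max([local_commit, *replica_subops])


# Hypothetical latencies in seconds (invented for illustration).
fresh_primary = 0.001  # primary OSD restarted recently, local commit is fast

# Replicas that have been running for days and have become slow.
before = primary_write_latency(fresh_primary, [0.020, 0.015])

# The same write after those replica OSDs have been restarted.
after = primary_write_latency(fresh_primary, [0.001, 0.0012])

print(f"before: {before * 1000:.1f} ms, after: {after * 1000:.1f} ms")
```

Under this simplification, restarting the slow replicas lowers the latency reported by the untouched primary, which would match the observation quoted below.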


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> running with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I'll send results on Monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also noticed that sometimes I have bad latency (op_w_process_latency) on an osd on node1 (restarted 12h ago, for example). 
> 
> If I restart osds on other nodes (last restarted some days ago, so with higher latency), it reduces latency on the osd of node1 too. 
> 
> Does the "op_w_process_latency" counter include replication time? 
> 
> ----- Original Message ----- 
> From: "Wido den Hollander" <wido@42on.com> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday, 15 February 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I also notice it in the VMs. BTW, what is your nvme disk size? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> ----- Original Message ----- 
>> From: "Wido den Hollander" <wido@42on.com> 
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday, 15 February 2019 14:50:34 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Friday, 15 February 2019 13:47:57 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see a several-fold increase in average latency for OSD write ops 
>>> (in seconds): 
>>> 0.002040060 (first hour) vs. 
>>> 0.002483516 (last 24 hours) vs. 
>>> 0.008382087 (last hour) 
>>> 
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 
>>> 0.003073475 (last hour) 
>>> 
>>> and OSD read ops, osd_r_latency: 
>>> 
>>> 0.000408595 (first hour) 
>>> 0.000709031 (24 hours) 
>>> 0.004979540 (last hour) 
>>> 
>>> What's interesting is that such latency differences aren't observed at 
>>> either the BlueStore level (any _lat params under the "bluestore" section) or 
>>> the rocksdb one. 
>>> 
>>> Which probably means that the issue is rather somewhere above BlueStore. 
>>> 
>>> Suggest to proceed with perf dumps collection to see if the picture 
>>> stays the same. 
>>> 
>>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>>> decrease in RSS report is a known artifact that seems to be safe. 
>>> 
>>> Thanks, 
>>> Igor 
>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend and, with the new memory autotuning, 
>>>> I have set osd_memory_target to 8G. (my nvme are 6TB) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process 
>>>> to see RSS memory at different hours; 
>>>> here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> osd has been started the 12-02-2019 at 08:00 
>>>> 
>>>> first report after 1h running 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> report after 24h, before counter reset 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> report 1h after counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
>>>> around 12-02-2019 at 14:00 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> Then after that, slowly decreasing. 
>>>> 
>>>> 
>>>> Another strange thing, 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is 
>>>> still at 8G 
>>>> 
>>>> I have been graphing mempool counters too since yesterday, so I'll be able to 
>>>> track them over time. 
>>>> ----- Original Message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (other caches seem to be quite low too, as if bluestore_cache_other 
>>>>> takes all the memory) 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got these dumps' order 
>>>> properly) 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
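The collection plan above could be scripted roughly as follows. This is a sketch under assumptions: it must run on the OSD host with access to the admin socket, the helper names (`ceph_daemon`, `snapshot`, `save`) and the output file naming are hypothetical, and `perf reset all` is assumed to be available on the admin socket of the Ceph release in use.

```python
import json
import subprocess
import time


def ceph_daemon(osd_id, *args, run=subprocess.check_output):
    """Run `ceph daemon osd.<id> <args...>` and return its raw output."""
    return run(["ceph", "daemon", f"osd.{osd_id}", *args])


def snapshot(osd_id, label, run=subprocess.check_output):
    """Collect one labelled mempool + perf counter sample from an OSD."""
    return {
        "label": label,
        "taken_at": time.time(),
        "mempools": json.loads(ceph_daemon(osd_id, "dump_mempools", run=run)),
        "perf": json.loads(ceph_daemon(osd_id, "perf", "dump", run=run)),
    }


def save(snap, osd_id):
    """Write a sample to osd<id>-<label>.json in the current directory."""
    path = f"osd{osd_id}-{snap['label']}.json"
    with open(path, "w") as fh:
        json.dump(snap, fh, indent=2)
    return path


# Intended sequence for osd.4, run by hand at the right times:
#   1) one hour after the restart:   save(snapshot(4, "1h"), 4)
#   2) 24+ hours after the restart:  save(snapshot(4, "24h"), 4)
#   3) then reset the counters:      ceph_daemon(4, "perf", "reset", "all")
#      wait one hour, no restart:    save(snapshot(4, "25h"), 4)
```

The `run` parameter exists only so the command runner can be swapped out for testing; in normal use the default `subprocess.check_output` is what executes the `ceph` command.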
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> on 1 OSD, both at 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
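The latency entries in this dump are cumulative since the counters were last reset: `avgtime` is `sum / avgcount` over the whole period. To see how latency behaves during a particular hour, two dumps can be subtracted. A minimal sketch, assuming both dumps come from the same OSD with no reset in between; the function name `interval_avg` is hypothetical:

```python
def interval_avg(earlier, later):
    """Average latency (seconds) of the operations that completed
    between two samples of one cumulative perf counter."""
    count = later["avgcount"] - earlier["avgcount"]
    if count <= 0:
        return 0.0  # no new operations, or the counters were reset
    return (later["sum"] - earlier["sum"]) / count


# "commit_lat" from the bluestore section of the dump above.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}

# With an all-zero baseline (the state right after a counter reset),
# the result is the lifetime average that the dump reports as "avgtime".
baseline = {"avgcount": 0, "sum": 0.0}
lifetime_avg = interval_avg(baseline, commit_lat)
print(f"commit_lat average: {lifetime_avg * 1000:.3f} ms")
```

Passing the same counter from two real dumps (for example hour 24 and hour 25) gives the average for that hour only, which is more sensitive to gradual degradation than the lifetime average.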
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator: fragmentation may 
>>>>> generally grow, but it shouldn't reset on restart. It looks like some 
>>>>> intervals aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation 
>>>>> is caused by that: the fragmentation growth is relatively small, and I 
>>>>> don't see how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples 
>>>>> at 8-12 hour intervals? 
>>>>> 
>>>>> 
>>>>> With regard to backporting the bitmap allocator to mimic: we haven't had 
>>>>> such plans so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> but I don't see an l_bluestore_fragmentation counter 
>>>>>> 
>>>>>> (though I do have bluestore_fragmentation_micros). 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> This looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect the BlueStore performance counters right after 
>>>>>> OSD startup and again once you get high latency? 
>>>>>> 
>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters, to 
>>>>>> make sure it has degraded and to learn more details. 
>>>>>> 
>>>>>> 
>>>>>> A more radical fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and see whether it makes a difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top": 
>>>>>>> 
>>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>>>>> 
>>>>>>> + 100.00% clone
>>>>>>> + 100.00% start_thread
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*)
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long)
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Monday, 4 February 2019 09:38:11
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change.
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change.
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait a few more days to be sure).
>>>>>>> 
>>>>>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB);
>>>>>>> my other clusters use 1.6TB SSDs.
>>>>>>> 
>>>>>>> Currently I'm using only 1 OSD per NVMe drive (I don't have more than 5000 IOPS per OSD), but this week I'll try 2 OSDs per NVMe drive to see if it helps.
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> I monitor reads too (I have all the metrics from the OSD sockets, and a lot of graphs).
>>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase).
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare).
>>>>>>>> I need to compare with bigger latencies.
>>>>>>>> 
>>>>>>>> Here is an example where all OSDs are at 20-50ms before restart, then at 1ms after restart (at 21:15):
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest VMs too, in disk iowait.
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. 
>>> Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries.
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
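The queries above divide the rate of change of a counter's cumulative `sum` by the rate of change of its `avgcount`. A rough Python sketch of the same arithmetic on two consecutive samples follows; the function name and the sample values are hypothetical, and returning `None` on a reset is only an approximation of what `non_negative_derivative` does when it drops negative rates.

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds per op between two samples of a Ceph
    latency counter, each a dict with cumulative "sum" and "avgcount"."""
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    # A negative delta means the counters were reset (for example by an OSD
    # restart); a zero count delta means no op completed in the interval.
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count

# Hypothetical samples, not taken from a real OSD.
earlier = {"avgcount": 1000, "sum": 2.0}
later = {"avgcount": 1500, "sum": 3.0}
print(interval_avg_latency(earlier, later))
```

Using deltas rather than the `avgtime` field matters here: `avgtime` is the average since the counters were last reset, so a slow drift over days is heavily smoothed in it.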
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here some new results, 
>>>>>>>>> different osd/ different cluster 
>>>>>>>>> 
>>>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, 
>>> but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>>> 
>>>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or 
>>> nvme drivers, 
>>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + 
>>> snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>>>> When the osd are refreshly started, the commit latency is 
>>> between 0,5-1ms. 
>>>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by 
>>> day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of physical disks is ok. (Clusters are far to be 
>>> full loaded) 
>>>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a 
>>> bluestore memory bug ? 
>>>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend and, with the new memory autotuning,
>>>> I have set osd_memory_target to 8G (my NVMe drives are 6TB).
>>>> 
>>>> 
>>>> I have taken many perf dumps, mempool dumps and ps snapshots of the process to see RSS memory at different hours.
>>>> Here are the reports for osd.0:
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The OSD was started on 12-02-2019 at 08:00.
>>>> 
>>>> First report, after 1h running:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset:
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> report 1h after counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes increase up to 4G around 14:00 on 12-02-2019:
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>>> After that, it slowly decreases.
>>>> 
>>>> 
>>>> Another strange thing:
>>>> I'm seeing the mempool total bytes at 5G on 12-02-2019 at 13:30:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>>> It then decreases over time (around 3.7G this morning), but RSS is still at 8G.
>>>> 
>>>> 
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>>>> 
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 11 February 2019 12:03:17
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory)
>>>>> 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is:
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which was higher before the reset (if I got the order of these dumps
>>>> right):
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned, I'm not 100% sure this could cause such a huge
>>>> latency increase...
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday, 8 February 2019 16:14:54
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> 
>>>>> on 1 osd, both 10%. 
>>>>> 
>>>>> Here is the dump_mempools output:
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves some issue with the allocator - generally fragmentation 
>>>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>>>> aren't properly merged in run-time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that - fragmentation growth is relatively small - I don't see 
>>>>> how this might impact performance that much. 
>>>>> 
>>>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>>>> output on admin socket) reports? Do you have any historic data? 
>>>>> 
>>>>> If not may I have current output and say a couple more samples with 
>>>>> 8-12 hours interval? 
>>>>> 
>>>>> 
>>>>> Wrt backporting the bitmap allocator to mimic - we haven't had such plans 
>>>>> so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters) 
>>>>>> 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, It's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
>>>>>> perf results of new bitmap allocator seem very promising from what I've seen in PR. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len 
>>>>>>> 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> some news: 
>>>>>>> 
>>>>>>> I have tried with different transparent hugepage values (madvise, never) : no change 
>>>>>>> 
>>>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change 
>>>>>>> 
>>>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait some more days to be sure). 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>>>>> my other clusters use 1.6TB ssds. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps. 
>>>>>>> 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 
>>>>>>> 
>>>>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well,I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> maybe bluestore related (don't have filestore anymore to compare) 
>>>>>>>> I need to compare with bigger latencies 
>>>>>>>> 
>>>>>>>> here is an example, where all osds are at 20-50ms before restart, then at 1ms after restart (at 21:15) 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest vm too, on disks iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> here are my influxdb queries: 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here some new results, 
>>>>>>>>> different osd/ different cluster 
>>>>>>>>> 
>>>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>>> 
>>>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>>>> 
>>>>>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? 
>>>>>>>>>> 
>>>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <121987882.59219.1550592238495.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                                     ` <121987882.59219.1550592238495.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-20 10:39                                                                                                       ` Alexandre DERUMIER
       [not found]                                                                                                         ` <190289279.94469.1550659174801.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-20 10:39 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Hi,

I have hit the bug again, but this time only on 1 osd.

Here are some graphs:
http://odisoweb1.odiso.net/osd8.png

latency was good until 01:00

Then I'm seeing onode misses, and the number of bluestore onodes increases (which seems to be normal);
after that, latency slowly increases from 1ms to 3-5ms.

After osd restart, I'm between 0.7-1ms.


----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, 19 February 2019 17:03:58
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared with subop_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 
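
A minimal sketch of this kind of comparison, assuming two successive outputs of "ceph daemon osd.x perf dump" have already been parsed into dicts, and that each latency counter exposes "sum" and "avgcount" as in the dump quoted earlier in this thread. The function names are hypothetical. Since the counters are cumulative since OSD start, the latency over an interval is the delta of "sum" divided by the delta of "avgcount":

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds for ops completed between two samples.

    prev and curr are the {"avgcount": ..., "sum": ...} objects of one
    latency counter, taken from two successive perf dumps.
    Returns None if no op completed in the interval, or if the counters
    went backwards (for example after an OSD restart).
    """
    d_count = curr["avgcount"] - prev["avgcount"]
    d_sum = curr["sum"] - prev["sum"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


def compare_counters(prev_osd, curr_osd,
                     names=("op_w_process_latency", "subop_w_latency")):
    """Interval latency for several counters of the "osd" section."""
    return {
        name: interval_avg_latency(prev_osd[name], curr_osd[name])
        for name in names
    }
```

If op_w_process_latency rises on an OSD while its own subop_w_latency stays flat, that may hint that the extra time is spent waiting on replicas rather than on the local store; this is only a reading aid, not a diagnosis.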


I have made some changes to my setup: 
- 2 osds per nvme (2 x 3TB osds), with 6GB memory each (instead of 1 osd of 6TB with 12GB memory). 
- disabled transparent hugepages 

After 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (as reported by free) is lower than before: 48GB (8 osds x 6GB) vs 56GB (4 osds x 12GB). 
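
As a rough sanity check of these figures, here is a sketch that only multiplies the OSD count by the per-OSD memory target. The helper name and the optional node RAM argument are assumptions for illustration, not Ceph settings:

```python
def memory_budget(num_osds, target_gb_per_osd, node_ram_gb=None):
    """Return (committed_gb, spare_gb) for one node.

    committed_gb is the sum of the per-OSD memory targets.
    spare_gb is what remains of the node RAM, or None if the
    node RAM is not given.
    """
    committed = num_osds * target_gb_per_osd
    spare = None if node_ram_gb is None else node_ram_gb - committed
    return committed, spare


new_layout = memory_budget(8, 6)    # 2 osds per nvme, 6GB each
old_layout = memory_budget(4, 12)   # 1 osd per nvme, 12GB each
```

Both layouts come out at the same nominal 48GB, so the 56GB figure presumably reflects measured usage above the target rather than a configured value.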

I'll send more stats tomorrow. 

Alexandre 


----- Original Message ----- 
From: "Igor Fedotov" <ifedotov@suse.de> 
To: "Alexandre Derumier" <aderumier@odiso.com>, "Wido den Hollander" <wido@42on.com> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Tuesday, 19 February 2019 11:12:43 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> running with memory target on 6G right now to make sure there is no
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I'll send results on Monday with my increased memory.
> 
> 
> 
> @Igor: 
> 
> I have also noticed that sometimes I have bad latency (op_w_process_latency) on an OSD on node1 (restarted 12h ago, for example).
>
> If I restart OSDs on other nodes (last restarted some days ago, so with higher latency), it reduces latency on the node1 OSD too.
>
> Does the "op_w_process_latency" counter include replication time?
> 
> ----- Original Message -----
> From: "Wido den Hollander" <wido@42on.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, 15 February 2019 14:59:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I'm also noticing it in the VMs. BTW, what is your NVMe disk size?
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> ----- Original Message -----
>> From: "Wido den Hollander" <wido@42on.com>
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Friday, 15 February 2019 14:50:34
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different.
>>>
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there.
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday, 15 February 2019 13:47:57
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see a several-fold increase in average latency for OSD write ops
>>> (in seconds):
>>> 0.002040060 (first hour) vs.
>>> 0.002483516 (last 24 hours) vs.
>>> 0.008382087 (last hour)
>>> 
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 
>>> 0.003073475 (last hour) 
>>> 
>>> and OSD read ops, osd_r_latency: 
>>> 
>>> 0.000408595 (first hour) 
>>> 0.000709031 (24 hours) 
>>> 0.004979540 (last hour) 
>>> 
>>> What's interesting is that such latency differences aren't observed at
>>> either the BlueStore level (any _lat params under the "bluestore" section) or
>>> the rocksdb one.
>>> 
>>> Which probably means that the issue is rather somewhere above BlueStore. 
>>> 
>>> Suggest to proceed with perf dumps collection to see if the picture 
>>> stays the same. 
>>> 
>>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>>> decrease in RSS report is a known artifact that seems to be safe. 
>>> 
>>> Thanks, 
>>> Igor 
>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning
>>>> I have set osd_memory_target to 8G (my NVMe drives are 6TB).
>>>>
>>>>
>>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process
>>>> to see RSS memory at different hours;
>>>> here are the reports for osd.0:
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The OSD was started on 12-02-2019 at 08:00.
>>>>
>>>> First report, after 1h running:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset:
>>>>
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>>>> 
>>>> Report 1h after the counter reset:
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increase up to 4G
>>>> around 12-02-2019 at 14:00:
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>>> After that, it slowly decreases.
>>>> 
>>>> 
>>>> Another strange thing:
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>>> Then it decreases over time (around 3.7G this morning), but RSS is
>>>> still at 8G.
>>>>
>>>> I have also been graphing mempool counters since yesterday, so I'll be able to
>>>> track them over time.
>>>> ----- Original Message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 11 February 2019 12:03:17
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (other caches seem to be quite low too, as if bluestore_cache_other
>>>>> takes all the memory)
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What relates to the allocator, and most probably to fragmentation growth, is:
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got these dumps' order 
>>>> properly) 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday, 8 February 2019 16:14:54
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> I'm just seeing
>>>>>
>>>>> StupidAllocator::_aligned_len
>>>>> and
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo
>>>>> on 1 osd, both 10%.
>>>>> 
>>>>> here the dump_mempools 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
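In the perf dump above, each latency counter carries "sum" (total seconds) and "avgcount" (number of samples), and "avgtime" is their ratio. A minimal Python sketch of that relationship, assuming the dump has been parsed as JSON; the helper name avg_latencies is hypothetical, and the sample values are copied from the dump above:

```python
import json


def avg_latencies(perf_dump):
    """Return {"section.counter": average seconds} for every latency counter.

    A latency counter is any value that is a dict holding both "sum" and
    "avgcount"; plain integer counters (e.g. "get") are skipped.
    """
    result = {}
    for section, counters in perf_dump.items():
        if not isinstance(counters, dict):
            continue
        for name, value in counters.items():
            if isinstance(value, dict) and "sum" in value and "avgcount" in value:
                count = value["avgcount"]
                # Counters that never fired report 0 samples; avoid dividing by zero.
                result[f"{section}.{name}"] = value["sum"] / count if count else 0.0
    return result


# Values copied from the perf dump quoted above.
sample = json.loads("""
{
  "recoverystate_perf": {
    "started_latency": {"avgcount": 1125, "sum": 13551384.917335850, "avgtime": 12045.675482076},
    "backfilling_latency": {"avgcount": 0, "sum": 0.0, "avgtime": 0.0}
  },
  "rocksdb": {
    "get": 28320977,
    "get_latency": {"avgcount": 28320977, "sum": 325.900908733, "avgtime": 0.000011507}
  }
}
""")

averages = avg_latencies(sample)
for key, seconds in sorted(averages.items()):
    print(f"{key}: {seconds:.9f} s")
```

Note that these are averages since the counters were last reset (or since OSD start), so a slow drift is diluted by the older samples.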
>>>>> ----- Original message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
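The counter description above reads as a simple ratio. A rough Python sketch of that ratio, assuming "max possible number of free extents" means the free space divided by the allocation unit (that interpretation, the function name and the example numbers are assumptions, not taken from the Ceph source):

```python
def fragmentation_micros(free_extents, free_bytes, alloc_unit):
    """(free extents / max possible number of free extents) * 1000.

    Assumption: the maximum possible number of free extents is reached when
    every free allocation unit is its own extent, i.e. ceil(free_bytes / alloc_unit).
    """
    max_extents = -(-free_bytes // alloc_unit)  # ceiling division
    if max_extents == 0:
        return 0.0
    return free_extents * 1000 / max_extents


# Hypothetical example: 1000 free allocation units of 4 KiB each.
one_big_extent = fragmentation_micros(1, 4096 * 1000, 4096)
fully_fragmented = fragmentation_micros(1000, 4096 * 1000, 4096)
print(one_big_extent, fully_fragmented)
```

Under this reading, a value near 0 means the free space is made of a few large extents, and a value near 1000 means almost every free allocation unit is isolated.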
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator: generally, fragmentation 
>>>>> might grow, but it shouldn't reset on restart. It looks like some intervals 
>>>>> aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that. The fragmentation growth is relatively small, and I don't see 
>>>>> how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
>>>>> 
>>>>> Wrt backporting the bitmap allocator to mimic: we haven't had such plans 
>>>>> before, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> but I don't see an l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (But I do have bluestore_fragmentation_micros.) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and again once you see high latency? 
>>>>>> 
>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too fast: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> When latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms (I need to wait some more days to be sure). 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>>>> my other clusters use 1.6TB SSDs. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps. 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all the metrics from the OSD sockets, and a lot of graphs.) 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example where all OSDs are at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest VMs too, in disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my InfluxDB queries. 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
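The queries above divide the rate of change of "sum" by the rate of change of "avgcount", which gives the average latency over the sampling interval rather than since OSD start. A minimal Python sketch of the same arithmetic on two consecutive samples of one counter; the function name and the sample values are hypothetical:

```python
def interval_latency(prev, curr):
    """Average latency (seconds per op) between two samples of one counter.

    Each sample is a dict with "sum" and "avgcount", as found in
    "ceph daemon osd.x perf dump". Returns None when no new ops were
    recorded or when the counters went backwards (e.g. after an OSD
    restart), which is roughly what non_negative_derivative filters out.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count


# Hypothetical samples: 500 new ops took 0.5 s in total, so 1 ms per op.
earlier = {"sum": 10.0, "avgcount": 1000}
later = {"sum": 10.5, "avgcount": 1500}
print(interval_latency(earlier, later))

# After a restart the counters start again from zero; the pair is skipped.
print(interval_latency(later, {"sum": 0.2, "avgcount": 100}))
```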
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> Here are some new results, 
>>>>>>>>> from a different OSD on a different cluster: 
>>>>>>>>> 
>>>>>>>>> before the OSD restart, latency was between 2-5ms 
>>>>>>>>> after the OSD restart, it is around 1-1.5ms 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I am seeing strange behaviour from my OSDs, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives. 
>>>>>>>>>> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> 
>>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All OSDs show this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug? 
>>>>>>>>>> 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 
>>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning 
>>>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process (to see RSS memory) at different hours. 
>>>> Here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The OSD was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h of running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> report 1h after counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00: 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> After that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>>> 
>>>> 
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> Another mempool dump, after a 1h run (latency OK). 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (The other caches seem to be quite low too, as if bluestore_cache_other takes all the memory.) 
>>>>> 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got these dumps' order 
>>>> properly) 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
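Comparing two dump_mempools outputs, as requested above, comes down to a per-pool difference of the "bytes" fields. A minimal Python sketch, assuming both dumps have been parsed from JSON into the structure shown in this thread; the function name mempool_growth is hypothetical, and the byte values are the ones quoted in the thread (1h after restart vs. before restart):

```python
def mempool_growth(fresh, aged):
    """Return [(pool, byte_delta)] from `fresh` to `aged`, largest change first."""
    pools_fresh = fresh["mempool"]["by_pool"]
    pools_aged = aged["mempool"]["by_pool"]
    deltas = {
        name: pool["bytes"] - pools_fresh.get(name, {"bytes": 0})["bytes"]
        for name, pool in pools_aged.items()
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Byte counts quoted in the thread: 1h after restart (fresh) ...
fresh_dump = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}
# ... and before the restart, when latency was bad (aged).
aged_dump = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}

growth = mempool_growth(fresh_dump, aged_dump)
for pool, delta in growth:
    print(f"{pool}: {delta:+d} bytes")
```

With these numbers, bluestore_alloc is about 45 MB larger on the aged OSD, while the two cache pools mostly trade space with each other.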
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> 
>>>>> on 1 OSD, both at 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
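Each latency counter in the perf dump above is cumulative since OSD start: "sum" is total seconds and "avgcount" is the number of samples, so "avgtime" is a lifetime average that hides recent degradation. A minimal sketch of deriving the average over an interval from two samples follows. The commit_lat figures are taken from the dump; the second sample and the helper name are hypothetical, for illustration only.

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds for ops completed between two samples.

    Returns None when no ops completed or the counters went backwards
    (for example after an OSD restart reset them).
    """
    ops = curr["avgcount"] - prev["avgcount"]
    if ops <= 0:
        return None
    return (curr["sum"] - prev["sum"]) / ops


# commit_lat as reported in the perf dump above.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}

# Hypothetical later sample: 1000 more ops that took 20 s in total.
later = {"avgcount": 30485924, "sum": 13396.492317331}

lifetime_avg = commit_lat["sum"] / commit_lat["avgcount"]
recent_avg = interval_avg_latency(commit_lat, later)

print(f"lifetime average: {lifetime_avg * 1000:.3f} ms")
print(f"interval average: {recent_avg * 1000:.3f} ms")
```

With these made-up numbers the lifetime average stays well under 1 ms while the interval average is 20 ms, which is why graphs built on deltas show the degradation and "avgtime" does not.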
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator - fragmentation 
>>>>> may generally grow, but it shouldn't reset on restart. It looks like some 
>>>>> intervals aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that - the fragmentation growth is relatively small, and I don't see 
>>>>> how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historical data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
>>>>> 
>>>>> Regarding backporting the bitmap allocator to mimic - we haven't had such plans 
>>>>> before, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
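Comparing mempool samples taken hours apart comes down to a per-pool difference of the "items" and "bytes" fields. The sketch below assumes each sample has already been parsed into a dict of pool name to {"items", "bytes"}, matching the per-pool entries in the dump_mempools output quoted above. The function name and all numbers are hypothetical.

```python
def mempool_growth(before, after):
    """Return (pool, delta_items, delta_bytes) rows, largest byte growth first."""
    rows = []
    for pool, stats in after.items():
        # A pool missing from the earlier sample is treated as empty.
        old = before.get(pool, {"items": 0, "bytes": 0})
        rows.append((pool,
                     stats["items"] - old["items"],
                     stats["bytes"] - old["bytes"]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Two made-up samples, e.g. taken 8-12 hours apart.
sample_1 = {
    "bluestore_alloc": {"items": 100, "bytes": 1000},
    "mds_co": {"items": 0, "bytes": 0},
}
sample_2 = {
    "bluestore_alloc": {"items": 150, "bytes": 1600},
    "mds_co": {"items": 0, "bytes": 0},
}

for pool, d_items, d_bytes in mempool_growth(sample_1, sample_2):
    print(f"{pool:20s} items {d_items:+d}  bytes {d_bytes:+d}")
```

A pool whose item count keeps climbing between samples while the workload is steady would be the first place to look.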
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, so I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plans to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
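The profile above shows allocation time going into iterating free extents (increment_slow) and computing their aligned length (_aligned_len). As a simplified illustration of why that can get expensive, and not a copy of the StupidAllocator code, the sketch below does a first-fit scan over a list of (start, length) free extents. The function names, extent layout, and sizes are all assumptions chosen for the example.

```python
def aligned_len(start, length, alloc_unit):
    """Usable length of a free extent once its start is rounded up to alloc_unit."""
    skew = start % alloc_unit
    if skew:
        skew = alloc_unit - skew  # bytes lost to reach the next aligned offset
    return 0 if skew > length else length - skew


def first_fit(free_extents, want, alloc_unit):
    """Scan extents in order; return (extents_examined, aligned_start or None)."""
    examined = 0
    for start, length in free_extents:
        examined += 1
        if aligned_len(start, length, alloc_unit) >= want:
            aligned_start = -(-start // alloc_unit) * alloc_unit  # round up
            return examined, aligned_start
    return examined, None


UNIT = 65536

# One large free extent: the first candidate fits.
compact = [(0, 1 << 20)]

# 1000 small unaligned fragments ahead of the only usable extent.
fragmented = [(i * UNIT + 4096, 8192) for i in range(1000)]
fragmented.append((1000 * UNIT, 1 << 20))

print(first_fit(compact, UNIT, UNIT))
print(first_fit(fragmented, UNIT, UNIT))
```

In this toy layout every small fragment has to be visited and rejected before the usable extent is reached, so the cost of one allocation grows with the number of unmerged free intervals.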
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.) 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>>>> my other clusters use 1.6TB SSDs. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try 2 OSDs per NVMe this week to see if it helps. 
>>>>>>> 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency and 
>>>>>>>>> op_latency, but not op_r_latency? 
>>>>>>>>> 
>>>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>>>>>>> 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example, with all OSDs at 20-50ms before restart, then 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my InfluxDB queries. 
>>>>>>>> 
>>>>>>>> They take op_latency.sum / op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
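As a rough illustration of what these queries compute: the .sum and .avgcount fields are cumulative counters, so dividing the change in one by the change in the other gives the average latency over the sampling interval. A minimal Python sketch, assuming two successive samples of one counter; the function name and the sample values are made up for illustration and are not taken from any cluster in this thread:

```python
def interval_average(prev, curr):
    """Average latency in seconds between two cumulative samples.

    Each sample is a dict with the cumulative "sum" (seconds) and
    "avgcount" (number of ops), as found in a perf dump latency counter.
    Returns None when no ops completed or when a counter went backwards
    (e.g. after a daemon restart or a counter reset).
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count


# Hypothetical samples of op_w_latency taken some time apart.
prev_sample = {"avgcount": 1000, "sum": 1.0}
curr_sample = {"avgcount": 1500, "sum": 2.0}

avg = interval_average(prev_sample, curr_sample)
print(f"{avg * 1000:.3f} ms")  # 500 ops took 1.0 s in total
```

Returning None for a negative delta plays the same role as non_negative_derivative in the queries: a restart zeroes the counters and that interval is simply skipped.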
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here are some new results, 
>>>>>>>>> from a different OSD on a different cluster: 
>>>>>>>>> 
>>>>>>>>> before the OSD restart, latency was between 2-5ms; 
>>>>>>>>> after the OSD restart, it is around 1-1.5ms. 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I'm seeing strange behaviour from my OSDs, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>>>>>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> 
>>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All OSDs show this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug? 
>>>>>>>>>> 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <190289279.94469.1550659174801.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                                         ` <190289279.94469.1550659174801.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-20 11:09                                                                                                           ` Alexandre DERUMIER
       [not found]                                                                                                             ` <1938718399.96269.1550660948828.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-20 11:09 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Something interesting: 

when I restarted osd.8 at 11:20, 

I saw the latency of another OSD, osd.1, decrease at exactly the same time (without restarting that OSD). 

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other also go down for osd.1 at that time. 




----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday 20 February 2019 11:39:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi, 

I have hit the bug again, but this time on only 1 OSD. 

Here are some graphs: 
http://odisoweb1.odiso.net/osd8.png 

Latency was good until 01:00. 

Then I see onode misses, and the number of bluestore onodes increasing (which seems normal); 
after that, latency slowly increases from 1ms to 3-5ms. 

After the OSD restart, I'm between 0.7-1ms. 


----- Original message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Igor Fedotov" <ifedotov@suse.de> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Tuesday 19 February 2019 17:03:58 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared with sub_op_latency. 

I have changed my graph to clearly identify the OSDs where the latency is high. 


I have made some changes to my setup: 
- 2 OSDs per NVMe (2 x 3TB), with 6GB memory per OSD (instead of 1 OSD of 6TB with 12GB memory). 
- disabled transparent hugepages 

After 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free) is lower than before (48GB (8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB)). 

I'll send more stats tomorrow. 

Alexandre 


----- Original message ----- 
From: "Igor Fedotov" <ifedotov@suse.de> 
To: "Alexandre Derumier" <aderumier@odiso.com>, "Wido den Hollander" <wido@42on.com> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Tuesday 19 February 2019 11:12:43 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> running with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I'll send results on Monday with my increased memory. 
> 
> 
> 
> @Igor: 
> 
> I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an OSD on node1 (restarted 12h ago, for example), 
> restarting OSDs on other nodes (last restarted some days ago, so with higher latency) reduces the latency on the node1 OSD too. 
> 
> Does the "op_w_process_latency" counter include replication time? 
> 
> ----- Original message ----- 
> From: "Wido den Hollander" <wido@42on.com> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday 15 February 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I also notice it in the VMs. BTW, what is your NVMe disk size? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> ----- Original message ----- 
>> From: "Wido den Hollander" <wido@42on.com> 
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday 15 February 2019 14:50:34 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Friday 15 February 2019 13:47:57 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see a several-fold increase in average latency for OSD write ops 
>>> (in seconds): 
>>> 0.002040060 (first hour) vs. 
>>> 
>>> 0.002483516 (last 24 hours) vs. 
>>> 0.008382087 (last hour) 
>>> 
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 
>>> 0.003073475 (last hour) 
>>> 
>>> and OSD read ops, osd_r_latency: 
>>> 
>>> 0.000408595 (first hour) 
>>> 0.000709031 (24 hours) 
>>> 0.004979540 (last hour) 
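To put the figures above in proportion, the last-hour averages can be divided by the first-hour ones. A small Python sketch using only the numbers quoted in this message; the labels are the ones used in the message, and the dictionary layout is just an illustration:

```python
# (first hour, last 24 hours, last hour), all in seconds, as quoted above.
samples = {
    "OSD write ops":   (0.002040060, 0.002483516, 0.008382087),
    "subop_w_latency": (0.000478934, 0.000537956, 0.003073475),
    "osd_r_latency":   (0.000408595, 0.000709031, 0.004979540),
}

ratios = {}
for name, (first_hour, last_24h, last_hour) in samples.items():
    # How many times slower the most recent hour is than the first hour.
    ratios[name] = last_hour / first_hour
    print(f"{name:16s} {first_hour * 1e3:.3f} ms -> {last_hour * 1e3:.3f} ms "
          f"(x{ratios[name]:.1f})")
```

This gives roughly a 4x increase for write ops, 6x for sub-op writes and 12x for reads between the first and the last hour.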
>>> 
>>> What's interesting is that such latency differences are observed at 
>>> neither the BlueStore level (any _lat params under the "bluestore" section) nor 
>>> the rocksdb one. 
>>> 
>>> Which probably means that the issue is somewhere above BlueStore. 
>>> 
>>> I suggest proceeding with perf dump collection to see if the picture 
>>> stays the same. 
>>> 
>>> W.r.t. the memory usage you observed, I see nothing suspicious so far: no 
>>> decrease in reported RSS is a known artifact that seems to be safe. 
>>> 
>>> Thanks, 
>>> Igor 
>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning 
>>>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps outputs of the process (to see RSS memory) at different hours; 
>>>> here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The OSD was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> Report after 24h, before the counter reset: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> Report 1h after the counter reset: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00: 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> After that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>>> 
>>>> I have also been graphing mempool counters since yesterday, so I'll be able to track them over time. 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (other caches seem to be quite low too, as if bluestore_cache_other takes all the memory) 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got these dumps' order 
>>>> properly) 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
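One way to compare two mempool dumps like the ones quoted in this thread is to rank the pools by how much their byte counts changed. A minimal Python sketch, assuming the dump_mempools JSON layout shown in the thread (mempool -> by_pool -> pool -> items/bytes); the function names are hypothetical, and only five pools are copied from the two dumps to keep it short:

```python
def pool_bytes(dump):
    """Map pool name -> bytes for a dump_mempools-style dictionary."""
    return {name: pool["bytes"] for name, pool in dump["mempool"]["by_pool"].items()}


def rank_changes(before, after):
    """Return (pool, delta_bytes) pairs, largest absolute change first."""
    b, a = pool_bytes(before), pool_bytes(after)
    deltas = {name: a.get(name, 0) - b.get(name, 0) for name in set(b) | set(a)}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Subset of the dump taken before the restart (high latency).
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_onode": {"items": 105637, "bytes": 70988064},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
    "osd_pglog": {"items": 952564, "bytes": 372459684},
}}}

# Same pools from the dump taken 1h after the restart (latency ok).
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_onode": {"items": 22225, "bytes": 14935200},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
    "osd_pglog": {"items": 889673, "bytes": 367160400},
}}}

ranked = rank_changes(before_restart, after_restart)
for name, delta in ranked:
    print(f"{name:24s} {delta / 2**20:+10.1f} MiB")
```

With these values the two cache pools dominate the change (data up, other down by about the same amount), while bluestore_alloc shrinks by 45189504 bytes, which matches the allocator difference discussed above.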
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> on 1 osd, both 10%. 
>>>>> 
>>>>> here the dump_mempools 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
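The avgtime fields in the dump above are lifetime averages (sum / avgcount since the counters were last reset), so they can hide a recent degradation. As a rough sketch, the latency-style counters of one dump can be ranked like this; the function name latency_counters is made up for illustration, and the sample values are copied from the dump above.

```python
import json


def latency_counters(perf_dump):
    """Yield (section, counter, avgcount, mean_seconds) for each latency-style counter.

    A latency-style counter is assumed to be a dict with "avgcount" and "sum" keys,
    as in the dump quoted above; plain integer counters are skipped.
    """
    for section, counters in perf_dump.items():
        for name, value in counters.items():
            if isinstance(value, dict) and "avgcount" in value and "sum" in value:
                count = value["avgcount"]
                mean = value["sum"] / count if count else 0.0
                yield section, name, count, mean


# Small excerpt of the dump above (values copied from it).
sample = json.loads("""
{
  "osd": {
    "op_r_latency": {"avgcount": 6188565, "sum": 4507.365756792, "avgtime": 0.000728337},
    "op_w_latency": {"avgcount": 10546037, "sum": 33160.719998316, "avgtime": 0.003144377},
    "op_rw_latency": {"avgcount": 23500, "sum": 574.395885734, "avgtime": 0.024442378},
    "numpg": 243
  },
  "rocksdb": {
    "submit_latency": {"avgcount": 30484924, "sum": 1835.888692371, "avgtime": 0.000060222}
  }
}
""")

# Sort by lifetime mean latency, slowest first.
ranked = sorted(latency_counters(sample), key=lambda row: row[3], reverse=True)
for section, name, count, mean in ranked:
    print(f"{section}.{name}: {mean * 1000:.3f} ms over {count} ops")
```

On a full dump, the same loop could be fed the parsed output of "ceph daemon osd.x perf dump".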
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator - generally fragmentation 
>>>>> might grow but it shouldn't reset on restart. It looks like some intervals 
>>>>> aren't properly merged at run time. 
>>>>> 
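To illustrate the hypothesis above (free intervals not being merged at run time, then merged again when the free map is rebuilt at startup), here is a toy free-extent map. It is only a sketch of the general coalescing idea and is not Ceph's StupidAllocator code; the class FreeExtents and its methods are hypothetical.

```python
import bisect


class FreeExtents:
    """Toy free-space map holding sorted, non-overlapping (offset, length) extents."""

    def __init__(self):
        self.offsets = []   # sorted start offsets of free extents
        self.lengths = {}   # start offset -> extent length

    def release(self, offset, length, merge=True):
        """Return [offset, offset + length) to the map, optionally coalescing neighbours."""
        if merge:
            i = bisect.bisect_left(self.offsets, offset)
            # Absorb the following extent if it starts exactly where this one ends.
            if i < len(self.offsets) and self.offsets[i] == offset + length:
                nxt = self.offsets.pop(i)
                length += self.lengths.pop(nxt)
            # Extend the preceding extent if it ends exactly where this one starts.
            if i > 0:
                prev = self.offsets[i - 1]
                if prev + self.lengths[prev] == offset:
                    self.lengths[prev] += length
                    return
        bisect.insort(self.offsets, offset)
        self.lengths[offset] = length

    def count(self):
        """Number of separate free extents (a crude fragmentation measure)."""
        return len(self.offsets)


# Free eight adjacent 4 KiB blocks in arbitrary order.
blocks = [(i * 4096, 4096) for i in (3, 0, 7, 1, 5, 2, 6, 4)]
merged, unmerged = FreeExtents(), FreeExtents()
for off, ln in blocks:
    merged.release(off, ln)                 # neighbours coalesced on every release
    unmerged.release(off, ln, merge=False)  # simulates the suspected missing merge

# "Restart": rebuild a fresh map from the unmerged extents in sorted order.
rebuilt = FreeExtents()
for off in unmerged.offsets:
    rebuilt.release(off, unmerged.lengths[off])

print(merged.count(), unmerged.count(), rebuilt.count())
```

In this toy model the unmerged map keeps one extent per freed block, so any scan over the free map gets longer over time, while the rebuilt map collapses back to a single extent, which matches the "drops on restart" pattern only in spirit.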
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that - the fragmentation growth is relatively small - I don't see 
>>>>> how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historical data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
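Once a few dump_mempools samples exist, comparing them pool by pool shows which pool is growing. The sketch below assumes each sample has already been reduced to a flat mapping of pool name to {"items", "bytes"} (the exact nesting of the admin-socket JSON is not shown in this thread, so that extraction step is left out); the function name mempool_growth and the numbers are invented for illustration.

```python
def mempool_growth(before, after):
    """Return [(pool, delta_bytes, delta_items)], largest byte growth first.

    Both arguments map pool name -> {"items": int, "bytes": int}.
    Pools missing from one sample are treated as empty in that sample.
    """
    empty = {"items": 0, "bytes": 0}
    rows = []
    for pool in sorted(set(before) | set(after)):
        b = before.get(pool, empty)
        a = after.get(pool, empty)
        rows.append((pool, a["bytes"] - b["bytes"], a["items"] - b["items"]))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Invented sample values, only to show the shape of the comparison.
before = {
    "bluestore_cache_other": {"items": 1000, "bytes": 40000},
    "bluestore_alloc": {"items": 500, "bytes": 8000},
    "osd_pglog": {"items": 200, "bytes": 60000},
}
after = {
    "bluestore_cache_other": {"items": 1500, "bytes": 90000},
    "bluestore_alloc": {"items": 4500, "bytes": 72000},
    "osd_pglog": {"items": 210, "bytes": 61000},
}

growth = mempool_growth(before, after)
for pool, d_bytes, d_items in growth:
    print(f"{pool}: {d_bytes:+d} bytes, {d_items:+d} items")
```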
>>>>> 
>>>>> With regard to backporting the bitmap allocator to mimic - we haven't had such 
>>>>> plans before, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> but I don't see an l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top": 
>>>>>>> 
>>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms (I need to wait some more days to be sure). 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>>>>> my other clusters use 1.6TB ssds. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps. 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> I monitor reads too (I have all metrics from the osd sockets, and a lot of graphs). 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example, with all osds at 20-50ms before restart, then 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest vms too, as disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries: 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
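The queries above divide the change in "sum" by the change in "avgcount" between consecutive samples, which gives the mean latency of only the ops completed in that interval rather than the lifetime average. A minimal sketch of the same arithmetic outside InfluxDB follows; interval_latency is a hypothetical helper, the "curr" sample reuses the op_w_latency values from the perf dump earlier in this thread, and the "prev" sample is invented.

```python
def interval_latency(prev, curr):
    """Mean latency in seconds of the ops completed between two perf dump samples.

    Each sample is a latency counter dict with "sum" (seconds) and "avgcount" (ops).
    Returns None when no ops completed or the counters went backwards (for example
    after an OSD restart or a counter reset), similar in spirit to how
    non_negative_derivative discards negative rates.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count


# "curr" is op_w_latency from the dump quoted earlier; "prev" is an invented
# earlier sample with 1000 fewer ops and 20 seconds less accumulated latency.
curr = {"avgcount": 10546037, "sum": 33160.719998316}
prev = {"avgcount": 10545037, "sum": 33140.719998316}

recent = interval_latency(prev, curr)
lifetime = curr["sum"] / curr["avgcount"]
print(f"interval: {recent * 1000:.2f} ms, lifetime: {lifetime * 1000:.2f} ms")

# A restart resets the counters, so the difference goes negative and is skipped.
after_restart = {"avgcount": 120, "sum": 0.09}
print(interval_latency(curr, after_restart))
```

With these numbers the interval average is far above the lifetime average, which is why graphing the derivative ratio exposes the slow drift that the raw avgtime field smooths out.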
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here are some new results, 
>>>>>>>>> from a different osd / different cluster: 
>>>>>>>>> 
>>>>>>>>> before the osd restart, latency was between 2-5ms 
>>>>>>>>> after the osd restart, it is around 1-1.5ms 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives, 
>>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is ok. (Clusters are far from fully loaded) 
>>>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug? 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 
>>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning, 
>>>> I have set osd_memory_target to 8G. (my nvme are 6TB) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process (to see rss memory) at different hours; 
>>>> here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The osd was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> Report 1h after the counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> After that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G 
>>>> 
>>>> 
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory) 
>>>>> 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which was higher before the reset (if I got the order of these dumps 
>>>> right): 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned, I'm not 100% sure this could cause such a huge 
>>>> latency increase... 
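
The comparison of the two `bluestore_alloc` figures above can be sketched in Python. This is only a minimal illustration, assuming two `dump_mempools` outputs already parsed into dictionaries; the helper name `mempool_byte_delta` is hypothetical and not part of Ceph. The numbers are the ones quoted in this thread.

```python
def mempool_byte_delta(smaller, larger):
    """Return {pool: larger_bytes - smaller_bytes} for pools present in both dumps."""
    pools_a = smaller["mempool"]["by_pool"]
    pools_b = larger["mempool"]["by_pool"]
    return {
        name: pools_b[name]["bytes"] - pools_a[name]["bytes"]
        for name in pools_a
        if name in pools_b
    }


# bluestore_alloc as quoted above: after the restart vs. before it.
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952}}}}
before_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456}}}}

delta = mempool_byte_delta(after_restart, before_restart)
print(delta["bluestore_alloc"])  # 45189504 bytes, roughly 43 MiB
```

Run against full dumps, the same helper would list the difference for every pool, which makes it easier to spot which pools moved between two samples.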
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps 24+ hours after the restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
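
The three collection steps requested above could be scripted roughly as follows. This is a sketch under assumptions: it shells out to the `ceph daemon osd.N ...` admin-socket commands used elsewhere in this thread (`perf dump`, `dump_mempools`, `perf reset all`), and the helper names `admin_socket_cmd`, `collect` and `reset_perf` are made up for illustration. The `run` parameter exists only so the command runner can be swapped out for testing.

```python
import json
import subprocess
from datetime import datetime


def admin_socket_cmd(osd_id, *args):
    """Build the argv for an OSD admin-socket command."""
    return ["ceph", "daemon", f"osd.{osd_id}", *args]


def collect(osd_id, label, run=subprocess.check_output):
    """Take one perf dump and one mempool dump for an OSD and return them parsed."""
    sample = {
        "label": label,
        "taken_at": datetime.now().isoformat(timespec="seconds"),
    }
    for name, args in (("perf", ("perf", "dump")),
                       ("mempools", ("dump_mempools",))):
        sample[name] = json.loads(run(admin_socket_cmd(osd_id, *args)))
    return sample


def reset_perf(osd_id, run=subprocess.check_output):
    """Reset all perf counters of an OSD without restarting it."""
    run(admin_socket_cmd(osd_id, "perf", "reset", "all"))


# Intended sequence, mirroring steps 1) to 3) above:
#   collect(0, "1h-after-restart")
#   collect(0, "24h-after-restart"); reset_perf(0)
#   (one hour later, no OSD restart) collect(0, "1h-after-reset")
```

Keeping the label and timestamp with each sample makes it straightforward to compare period a) with period b) afterwards.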
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> 
>>>>> on 1 osd, both at 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
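
Each latency counter in the dump above is a pair of `avgcount` and `sum`, with `avgtime` being their ratio. The sketch below assumes that relationship and shows how an average over an interval could be derived from two samples of the same counter, which is useful when the counters have not been reset. The function name `interval_avg` is hypothetical; the `commit_lat` values are copied from the dump.

```python
def interval_avg(earlier, later):
    """Average seconds per op between two samples of an {avgcount, sum} counter."""
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        # No new ops, or the counter was reset between the two samples.
        return None
    return (later["sum"] - earlier["sum"]) / ops


# commit_lat as shown in the dump; the zero sample stands in for a fresh
# start or a counter reset.
zero = {"avgcount": 0, "sum": 0.0}
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}

avg = interval_avg(zero, commit_lat)
print(f"{avg * 1000:.3f} ms per commit")
```

With two real samples taken an hour apart, the same call gives the average latency for that hour only, rather than the average since the OSD started.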
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
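
Read literally, the counter description quoted above is a ratio scaled by 1000. The sketch below only restates that description string in Python; it is not the allocator's actual code, and how "max possible number of free extents" is determined inside BlueStore is not shown in this thread, so both inputs are plain assumed parameters.

```python
def fragmentation_score(free_extents, max_free_extents, scale=1000):
    """(free extents / max possible number of free extents) * scale, as an integer."""
    if max_free_extents <= 0:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    return free_extents * scale // max_free_extents


# Hypothetical inputs chosen only to illustrate the scale of the result.
print(fragmentation_score(67, 1000))  # 67
```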
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for the other OSDs? 
>>>>> 
>>>>> This proves some issue with the allocator - generally fragmentation 
>>>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>>>> aren't properly merged at run-time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that - the fragmentation growth is relatively small - I don't see 
>>>>> how it could impact performance that much. 
>>>>> 
>>>>> I wonder if you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
>>>>> 
>>>>> Wrt backporting the bitmap allocator to mimic - we haven't had such plans 
>>>>> so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% of the time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> When latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.) 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>>>>> my other clusters use 1.6TB ssds. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but this week I'll try 2 osds per nvme, to see if it helps. 
>>>>>>> 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all the metrics from the osd sockets, and a lot of graphs.) 
>>>>>>> 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it is bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example, with all osds at 20-50ms before the restart, then 1ms after the restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest vms too, as iowait on the disks. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries. 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second: 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>>>> 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> Here are some new results, 
>>>>>>>>> from a different osd / different cluster. 
>>>>>>>>> 
>>>>>>>>> Before the osd restart, latency was between 2-5ms; 
>>>>>>>>> after the osd restart, it is around 1-1.5ms. 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I'm seeing strange behaviour from my osds, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives; 
>>>>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> 
>>>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All osds show this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>>> 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 

_______________________________________________ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1938718399.96269.1550660948828.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                                             ` <1938718399.96269.1550660948828.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-20 13:43                                                                                                               ` Alexandre DERUMIER
       [not found]                                                                                                                 ` <1979343949.99892.1550670199633.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-20 13:43 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

On osd.8, at 01:20, when the latency begins to increase, I have a scrub running:

2019-02-20 01:16:08.851 7f84d24d9700  0 log_channel(cluster) log [DBG] : 5.52 scrub starts
2019-02-20 01:17:18.019 7f84ce4d1700  0 log_channel(cluster) log [DBG] : 5.52 scrub ok
2019-02-20 01:20:31.944 7f84f036e700  0 -- 10.5.0.106:6820/2900 >> 10.5.0.79:0/2442367265 conn(0x7e120300 :6820 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2019-02-20 01:28:35.421 7f84d34db700  0 log_channel(cluster) log [DBG] : 5.c8 scrub starts
2019-02-20 01:29:45.553 7f84cf4d3700  0 log_channel(cluster) log [DBG] : 5.c8 scrub ok
2019-02-20 01:32:45.737 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 scrub starts
2019-02-20 01:33:56.137 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 scrub ok


I'll try a test with scrubbing disabled (currently it runs at night, between 01:00 and 05:00).
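
To compare the scrub windows against the latency graph, the duration of each scrub can be computed from the log lines above. This is only a minimal sketch: the regular expression is an assumption based on the six scrub lines quoted in this mail (other Ceph versions may format them differently), and the name scrub_durations is hypothetical.

```python
import re
from datetime import datetime

# The six scrub lines quoted in this mail (the unrelated messenger line is omitted).
LOG = """\
2019-02-20 01:16:08.851 7f84d24d9700  0 log_channel(cluster) log [DBG] : 5.52 scrub starts
2019-02-20 01:17:18.019 7f84ce4d1700  0 log_channel(cluster) log [DBG] : 5.52 scrub ok
2019-02-20 01:28:35.421 7f84d34db700  0 log_channel(cluster) log [DBG] : 5.c8 scrub starts
2019-02-20 01:29:45.553 7f84cf4d3700  0 log_channel(cluster) log [DBG] : 5.c8 scrub ok
2019-02-20 01:32:45.737 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 scrub starts
2019-02-20 01:33:56.137 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 scrub ok
"""

# Assumed line layout: "<timestamp> ... : <pgid> scrub starts|ok"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*: "
    r"(?P<pg>[0-9a-f]+\.[0-9a-f]+) scrub (?P<event>starts|ok)$"
)


def scrub_durations(log_text):
    """Return a list of (pgid, seconds) for every start/ok pair found."""
    started = {}      # pgid -> start time of the scrub in progress
    durations = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a scrub line
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        pg = match["pg"]
        if match["event"] == "starts":
            started[pg] = ts
        elif pg in started:
            durations.append((pg, (ts - started.pop(pg)).total_seconds()))
    return durations


if __name__ == "__main__":
    for pg, seconds in scrub_durations(LOG):
        print(f"pg {pg}: scrub took {seconds:.3f}s")
```

On these lines each scrub lasts roughly 70 seconds, and the 5.52 scrub finishes about three minutes before 01:20.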

----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, 20 February 2019 12:09:08
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Something interesting: 

when I restarted osd.8 at 11:20, 

I saw the latency of another osd (osd.1) decrease at exactly the same time (without restarting that osd). 

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other also go down for osd.1 at this time. 




----- Original message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Igor Fedotov" <ifedotov@suse.de> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Wednesday, 20 February 2019 11:39:34 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

Hi, 

I have hit the bug again, but this time only on 1 osd. 

Here are some graphs: 
http://odisoweb1.odiso.net/osd8.png 

Latency was good until 01:00. 

Then I'm seeing onode misses, and the number of bluestore onodes is increasing (which seems to be normal); 
after that, latency is slowly increasing from 1ms to 3-5ms. 

After the osd restart, I'm between 0.7-1ms. 


----- Original message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Igor Fedotov" <ifedotov@suse.de> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Tuesday, 19 February 2019 17:03:58 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared with sub_op_latency. 

I have changed my graphs to clearly identify the osd where the latency is high. 
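
For reference, the comparison can be reproduced from two "perf dump" samples by dividing the growth of a counter's sum by the growth of its avgcount, which is the same ratio the influxdb queries earlier in this thread compute. This is only a sketch: the sample values below are made up for illustration, the helper name interval_avg is hypothetical, and the "osd" section with the op_w_process_latency / subop_w_latency counter names is assumed from the counters mentioned in this thread.

```python
def interval_avg(prev, curr, counter, section="osd"):
    """Average latency (seconds) of `counter` between two perf dump samples.

    Each sample is the parsed JSON of a perf dump; a latency counter is
    assumed to carry a cumulative "sum" (seconds) and "avgcount" (ops).
    Returns None when no op completed in the interval or the counters
    went backwards (for example after an osd restart).
    """
    before = prev[section][counter]
    after = curr[section][counter]
    delta_count = after["avgcount"] - before["avgcount"]
    delta_sum = after["sum"] - before["sum"]
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count


# Made-up samples, for illustration only (not taken from the cluster).
sample_t0 = {"osd": {
    "op_w_process_latency": {"avgcount": 1000, "sum": 1.0},
    "subop_w_latency": {"avgcount": 1000, "sum": 0.5},
}}
sample_t1 = {"osd": {
    "op_w_process_latency": {"avgcount": 3000, "sum": 5.0},
    "subop_w_latency": {"avgcount": 3000, "sum": 1.5},
}}

if __name__ == "__main__":
    for name in ("op_w_process_latency", "subop_w_latency"):
        avg = interval_avg(sample_t0, sample_t1, name)
        print(f"{name}: {avg * 1000:.3f} ms")
```

An osd whose op_w_process_latency is high while its own subop_w_latency stays low is probably waiting on its replicas rather than being slow itself.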


I have made some changes to my setup: 
- 2 osds per nvme (2 x 3TB), with 6GB memory each (instead of 1 osd of 6TB with 12GB memory). 
- disabled transparent hugepages 

After 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free) is lower than before: 48GB (8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


----- Original message ----- 
From: "Igor Fedotov" <ifedotov@suse.de> 
To: "Alexandre Derumier" <aderumier@odiso.com>, "Wido den Hollander" <wido@42on.com> 
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Tuesday, 19 February 2019 11:12:43 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> running with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I'll send results on Monday with my increased memory. 
> 
> 
> 
> @Igor: 
> 
> I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an osd on node1 (restarted 12h ago, for example), 
> restarting osds on other nodes (last restarted some days ago, so with higher latency) reduces the latency on the osd of node1 too. 
> 
> Does the "op_w_process_latency" counter include replication time? 
> 
> ----- Original message ----- 
> From: "Wido den Hollander" <wido@42on.com> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday, 15 February 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I also notice it in the vms. BTW, what is your nvme disk size? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I set memory to 6GB this morning, with 2 osds of 3TB per 6TB nvme. 
>> (My last test was 8GB with 1 osd of 6TB, but that didn't help.) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> ----- Original message ----- 
>> From: "Wido den Hollander" <wido@42on.com> 
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Sent: Friday, 15 February 2019 14:50:34 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple osds per nvme disk (6TB) to see if the behaviour is different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original message ----- 
>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>> Sent: Friday, 15 February 2019 13:47:57 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see several times average latency increase for OSD write ops 
>>> (in seconds) 
>>> 0.002040060 (first hour) vs. 
>>> 
>>> 0.002483516 (last 24 hours) vs. 
>>> 0.008382087 (last hour) 
>>> 
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 
>>> 0.003073475 (last hour) 
>>> 
>>> and OSD read ops, osd_r_latency: 
>>> 
>>> 0.000408595 (first hour) 
>>> 0.000709031 (24 hours) 
>>> 0.004979540 (last hour) 
>>> 
>>> What's interesting is that such latency differences are observed at 
>>> neither the BlueStore level (any _lat params under the "bluestore" section) nor 
>>> the rocksdb one. 
>>> 
>>> Which probably means that the issue is rather somewhere above BlueStore. 
>>> 
>>> Suggest to proceed with perf dumps collection to see if the picture 
>>> stays the same. 
>>> 
>>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>>> decrease in RSS report is a known artifact that seems to be safe. 
>>> 
>>> Thanks, 
>>> Igor 
>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning 
>>>> I have set osd_memory_target to 8G. (my nvme are 6TB) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process 
>>>> (to see the RSS memory) at different hours. 
>>>> Here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The osd was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> Report 1h after the counter reset: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00: 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> Then, after that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>>> 
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (other caches seem to be quite low too, like bluestore_cache_other take all the memory) 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What relates to allocator and most probably to fragmentation growth is : 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got these dumps' order 
>>>> properly) 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
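To compare two dump_mempools samples like the ones in this thread, a short script can list which pools changed the most. This is only a sketch: the function name `mempool_deltas` is made up for illustration, and it assumes the JSON layout shown in the dumps here (`mempool` -> `by_pool` -> pool name -> `items`/`bytes`). The sample values are copied from the two dumps in this message; all other pools are omitted.

```python
def mempool_deltas(before, after):
    """Return (pool, bytes_delta, items_delta) sorted by absolute byte change.

    Both arguments are parsed dump_mempools outputs with the layout
    {"mempool": {"by_pool": {name: {"items": ..., "bytes": ...}}}}.
    """
    old = before["mempool"]["by_pool"]
    new = after["mempool"]["by_pool"]
    rows = []
    for pool in sorted(set(old) | set(new)):
        o = old.get(pool, {"items": 0, "bytes": 0})
        n = new.get(pool, {"items": 0, "bytes": 0})
        rows.append((pool, n["bytes"] - o["bytes"], n["items"] - o["items"]))
    # Largest absolute byte change first.
    rows.sort(key=lambda r: abs(r[1]), reverse=True)
    return rows

# Dump taken after the restart (two pools only).
after_restart = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}
# Dump from the 8 February message (same two pools).
sample_8feb = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}

for pool, dbytes, ditems in mempool_deltas(after_restart, sample_8feb):
    print(f"{pool:24s} bytes {dbytes:+d}  items {ditems:+d}")
```

Loading real samples with `json.load()` and passing them to the same function would cover every pool at once.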
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
>>> long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> on one OSD, both at 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> And here is the perf dump: 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
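The latency counters in a perf dump are cumulative since OSD start, and each `avgtime` is simply `sum / avgcount`. To compare periods such as the first hour and the 25th hour after a restart, one can subtract two samples and divide the differences. The sketch below shows that arithmetic; the helper names are made up for illustration, the first sample is the `commit_lat` counter from the dump in this message, and the second sample is invented.

```python
def avgtime(counter):
    """Lifetime average latency in seconds for one {"avgcount", "sum"} counter."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0

def interval_avgtime(earlier, later):
    """Average latency over the operations completed between two samples."""
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        return 0.0
    return (later["sum"] - earlier["sum"]) / ops

# commit_lat as reported in the perf dump in this thread.
commit_lat = {"avgcount": 30484924, "sum": 13376.492317331}
print(f"lifetime average: {avgtime(commit_lat) * 1000:.3f} ms")

# Hypothetical later sample (invented): 1000 more ops taking 2 s in total.
later = {"avgcount": 30484924 + 1000, "sum": 13376.492317331 + 2.0}
print(f"interval average: {interval_avgtime(commit_lat, later) * 1000:.3f} ms")
```

The lifetime average hides a recent degradation once the counter has accumulated millions of operations, which is why the interval form is more useful for tracking drift.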
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> OK, this is the same counter: 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
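As a worked example of the formula in that counter description, the sketch below computes (free extents / max possible number of free extents) * 1000. It follows only the description string quoted here, not the allocator source; the function name and the input numbers are invented for illustration.

```python
def fragmentation_score(free_extents, max_free_extents):
    """(free extents / max possible number of free extents) * 1000,
    following only the counter description quoted in this thread."""
    if max_free_extents <= 0:
        return 0.0
    return free_extents / max_free_extents * 1000

# Invented numbers: 6,700 free extents where at most 100,000 could exist
# correspond to a score of about 67, the value reported in the perf dump
# in this thread.
print(fragmentation_score(6_700, 100_000))
```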
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restart, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator: fragmentation may 
>>>>> grow in general, but it shouldn't reset on restart. It looks like some 
>>>>> intervals aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation 
>>>>> is caused by that. The fragmentation growth is relatively small, and I 
>>>>> don't see how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples 
>>>>> at 8-12 hour intervals? 
>>>>> 
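One way to collect such periodic samples is a small script that runs the admin socket commands and appends each result to a file. This is a sketch under assumptions: it must run on the OSD host with access to the admin socket, the helper names and file layout are invented, and `run` and `sleep` are injectable only so the logic can be exercised without a cluster. The commands themselves are the `ceph daemon osd.N perf dump` and `dump_mempools` calls mentioned in this thread.

```python
import json
import subprocess
import time

def run_admin_command(osd_id, *command):
    """Run `ceph daemon osd.<id> <command...>` and return the parsed JSON."""
    argv = ["ceph", "daemon", f"osd.{osd_id}", *command]
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def collect_sample(osd_id, run=run_admin_command, now=time.time):
    """Take one timestamped sample of mempool and perf counters."""
    return {
        "time": now(),
        "osd": osd_id,
        "mempools": run(osd_id, "dump_mempools"),
        "perf": run(osd_id, "perf", "dump"),
    }

def append_sample(path, sample):
    """Append one sample to `path` as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(sample) + "\n")

def sample_loop(osd_id, path, interval_s, count,
                run=run_admin_command, sleep=time.sleep):
    """Collect `count` samples, waiting `interval_s` seconds between them."""
    for i in range(count):
        append_sample(path, collect_sample(osd_id, run=run))
        if i + 1 < count:
            sleep(interval_s)
```

For example, `sample_loop(4, "osd4-samples.jsonl", 8 * 3600, 3)` would take three samples of osd.4 eight hours apart; the file name is arbitrary.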
>>>>> 
>>>>> With regard to backporting the bitmap allocator to mimic: we haven't had 
>>>>> such plans so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and again once you get high latency? 
>>>>>>>> 
>>>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> but I don't see an l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (But I do have bluestore_fragmentation_micros.) 
>>>>>> 
>>>>>> 
>>>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, so I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it 
>>>>>> and try to reproduce the problem. 
>>>>>> 
>>>>>> 
>>>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what 
>>>>>> I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> This looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and again once you get high latency? 
>>>>>> 
>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related 
>>>>>>> to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 
>>>>>>> 4 times), when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% of the time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Some news: 
>>>>>>> 
>>>>>>> I tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> I tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms (I need to wait a few more days to be sure). 
>>>>>>> 
>>>>>>> Note that this behaviour seems to appear much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>>>> my other clusters use 1.6TB SSDs. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try 2 OSDs per NVMe this week to see if it helps. 
>>>>>>> 
>>>>>>> BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared with the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> It may be bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example where all OSDs were at 20-50ms before restart, then at 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. 
>>> Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my InfluxDB queries. 
>>>>>>>> 
>>>>>>>> Each one computes op_latency.sum / op_latency.avgcount from per-second derivatives of the two counters. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 
>>> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" 
>>> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter 
>>> GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 
>>> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM 
>>> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ 
>>> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
>>> fill(previous) 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
>>> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) 
>>> FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" 
>>> =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
>>> fill(previous) 
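
The queries above divide the rate of change of a counter's "sum" by the rate of change of its "avgcount". A minimal Python sketch of the same arithmetic on raw samples follows; the function name, the sample values and the choice to return None for empty or reset intervals are assumptions made for illustration, not part of the original setup.

```python
def interval_latency(samples):
    """Average latency per interval from cumulative (sum_seconds, avgcount) samples.

    `samples` is ordered oldest first. For each pair of consecutive samples the
    result is delta(sum) / delta(avgcount). Intervals with no completed ops, or
    where the counters went backwards (e.g. after an OSD restart), yield None.
    """
    result = []
    for (sum0, count0), (sum1, count1) in zip(samples, samples[1:]):
        d_sum = sum1 - sum0
        d_count = count1 - count0
        if d_count <= 0 or d_sum < 0:
            result.append(None)
        else:
            result.append(d_sum / d_count)
    return result


if __name__ == "__main__":
    # Hypothetical samples of op_w_latency.sum and op_w_latency.avgcount.
    samples = [(10.0, 10000), (10.5, 10500), (12.5, 11000)]
    for latency in interval_latency(samples):
        print(latency)
```

Working on deltas rather than on the cumulative "avgtime" matters here: "avgtime" averages over the whole uptime of the OSD, so a slow drift would be diluted by the earlier, faster operations.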
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here are some new results, 
>>>>>>>>> from a different OSD on a different cluster: 
>>>>>>>>> 
>>>>>>>>> before the OSD restart, latency was between 2-5ms; 
>>>>>>>>> after the OSD restart, it is around 1-1.5ms. 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, 
>>> but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I see strange behaviour from my OSDs, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives. 
>>>>>>>>>> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All OSDs show this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning, 
>>>> I have set osd_memory_target to 8G. (My NVMe drives are 6TB.) 
>>>> 
>>>> 
>>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process, to see RSS memory at different hours. 
>>>> Here are the reports for osd.0: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The OSD was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> report 1h after counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00: 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> After that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> It then decreases over time (around 3.7G this morning), but RSS is still at 8G. 
>>>> 
>>>> 
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time. 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday, 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> Another mempool dump after a 1h run (latency OK). 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (The other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory.) 
>>>>> 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine, as the cache is warming up after the restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the restart (if I got the order of these dumps 
>>>> right): 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>>>> latency increase... 
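
The comparison above can be repeated for every pool at once. The sketch below is a minimal illustration that assumes two dump_mempools outputs have already been parsed into Python dictionaries; the function name `pool_deltas` and the trimmed-down sample data (three pools copied from the dumps in this thread) are choices made for the example.

```python
def pool_deltas(fresh, aged):
    """Return (pool, byte_delta) pairs, largest absolute change first.

    `fresh` and `aged` are parsed dump_mempools outputs; only pools present
    in both dumps are compared.
    """
    fresh_pools = fresh["mempool"]["by_pool"]
    aged_pools = aged["mempool"]["by_pool"]
    deltas = [
        (name, aged_pools[name]["bytes"] - fresh_pools[name]["bytes"])
        for name in fresh_pools
        if name in aged_pools
    ]
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


# Three pools taken from the two dumps quoted in this thread.
FRESH = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}
AGED = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}

if __name__ == "__main__":
    for pool, delta in pool_deltas(FRESH, AGED):
        print(f"{pool}: {delta:+d} bytes")
```

The two cache pools move by about 1 GB in opposite directions, which fits the rebalancing explanation, while bluestore_alloc grows by roughly 45 MB (about 27%) between the freshly restarted and the long-running OSD.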
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps in 24+ hours after restart 
>>>> 
>>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> So we'll be able to learn both allocator mem usage growth and operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
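
Because step 3 resets the perf counters, a dump taken one hour later covers only that hour, so sum / avgcount in that dump is the average for the period. A minimal sketch for ranking the latency counters of such a dump is shown below; it assumes the dump has been parsed into a Python dictionary, the helper names are hypothetical, and the sample values are a handful of "bluestore" counters copied from the perf dump quoted further down.

```python
def latency_counters(perf, prefix=""):
    """Yield (name, avgcount, average_seconds) for each latency-style counter.

    A latency-style counter is taken to be a dict with both "avgcount" and
    "sum" keys; other nested dicts are walked recursively, plain values skipped.
    """
    for key, value in perf.items():
        name = f"{prefix}.{key}" if prefix else key
        if not isinstance(value, dict):
            continue
        if "avgcount" in value and "sum" in value:
            count = value["avgcount"]
            average = value["sum"] / count if count else 0.0
            yield name, count, average
        else:
            yield from latency_counters(value, name)


def slowest(perf, n=3):
    """Return the n counters with the highest average latency."""
    return sorted(latency_counters(perf), key=lambda c: c[2], reverse=True)[:n]


# Excerpt of the "bluestore" section from the perf dump in this thread.
SAMPLE = {"bluestore": {
    "kv_commit_lat": {"avgcount": 26371957, "sum": 3397.491150603},
    "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696},
    "state_deferred_cleanup_lat": {"avgcount": 26346134, "sum": 185465.151653703},
    "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331},
    "bluestore_txc": 30484924,
}}

if __name__ == "__main__":
    for name, count, average in slowest(SAMPLE):
        print(f"{name}: {average * 1000:.3f} ms over {count} samples")
```

Running the same ranking on the dump from the first hour and on the dump from the 25th hour gives a like-for-like view of which stages slowed down.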
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday, 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> 
>>>>> on 1 OSD, both at around 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> And here is the perf dump: 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
>>>>>> 
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator: fragmentation may 
>>>>> generally grow, but it shouldn't reset on restart. It looks like some 
>>>>> intervals aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that: the fragmentation growth is relatively small, and I don't 
>>>>> see how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
>>>>> 
>>>>> With regard to backporting the bitmap allocator to mimic: we haven't had 
>>>>> such plans before, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> 
>>>>>> but I don't see an l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (But I do have bluestore_fragmentation_micros.) 
>>>>>> 
>>>>>> 
>>>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> This looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and again once you get high latency? 
>>>>>> 
>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too fast: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% of the time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> When latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.) 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe drives (6TB); 
>>>>>>> my other clusters use 1.6TB SSDs. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps. 
>>>>>>> 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency 
>>>>>>>>> but not op_r_latency? 
>>>>>>>>> 
>>>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all metrics from the OSD sockets, and a lot of graphs.) 
>>>>>>> 
>>>>>>> I just don't see a latency difference on reads (or it is very, very small compared with the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example, with all OSDs at 20-50ms before restart, then 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest VMs too, as disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries. 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency 
>>>>>>> but not op_r_latency? 
>>>>>>> 
>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>>>>>>> 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> Here are some new results, 
>>>>>>>>> from a different OSD on a different cluster. 
>>>>>>>>> 
>>>>>>>>> Before the OSD restart, latency was between 2-5ms; 
>>>>>>>>> after the OSD restart, it is around 1-1.5ms. 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc, 
>>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>>> 
>>>>>>>> Also, I'm currently only monitoring latency for filestore OSDs. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if I see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I see strange behaviour on my OSDs, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>>>>>>>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> 
>>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All OSDs have this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug? 
>>>>>>>>>> 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <1979343949.99892.1550670199633.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                                                 ` <1979343949.99892.1550670199633.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
@ 2019-02-21 16:27                                                                                                                   ` Alexandre DERUMIER
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-02-21 16:27 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-users, ceph-devel

Disabling scrubbing doesn't help.

After some analysis, it seems that the buffer increase happens while the backups are running at night
(snapshot, rbd diff export/import, snapshot deletion),
so this seems normal.



I have just run another test, lowering osd_memory_target to 2G.

And this is very strange:
after 3h of running, I have never seen such low latency (0.3-0.4ms vs 0.7-1ms).

Are we sure that latency doesn't increase when we have more objects in the different buffers?
(Maybe lookups take more time?)

Here are some dumps from the current osd.0, with 0.2ms latency:


#  ceph daemon osd.0 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 20814720,
                "bytes": 20814720
            },
            "bluestore_cache_data": {
                "items": 10013,
                "bytes": 155406336
            },
            "bluestore_cache_onode": {
                "items": 19938,
                "bytes": 13398336
            },
            "bluestore_cache_other": {
                "items": 6537134,
                "bytes": 294022739
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 4,
                "bytes": 2976
            },
            "bluestore_writing_deferred": {
                "items": 87,
                "bytes": 349495
            },
            "bluestore_writing": {
                "items": 11,
                "bytes": 45056
            },
            "bluefs": {
                "items": 1057,
                "bytes": 39840
            },
            "buffer_anon": {
                "items": 23731,
                "bytes": 6014669
            },
            "buffer_meta": {
                "items": 10115,
                "bytes": 647360
            },
            "osd": {
                "items": 130,
                "bytes": 1652560
            },
            "osd_mapbl": {
                "items": 39,
                "bytes": 196038
            },
            "osd_pglog": {
                "items": 503242,
                "bytes": 198648328
            },
            "osdmap": {
                "items": 6597,
                "bytes": 266832
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 27926818,
            "bytes": 691505285
        }
    }
}
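
To compare several dump_mempools samples over time, it can help to rank the pools by size. The Python sketch below assumes the JSON layout shown above ("mempool" -> "by_pool" -> pool -> "items"/"bytes"). The helper name is hypothetical, and the embedded sample is only a subset of the dump above.

```python
import json


def top_pools(dump, n=3):
    """Return the n largest mempools as (name, bytes) pairs, biggest first."""
    pools = dump["mempool"]["by_pool"]
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, stats["bytes"]) for name, stats in ranked[:n]]


# Subset of the osd.0 dump above; in practice the text would be the
# output of "ceph daemon osd.0 dump_mempools".
sample = json.loads("""
{
  "mempool": {
    "by_pool": {
      "bluestore_alloc":       {"items": 20814720, "bytes": 20814720},
      "bluestore_cache_data":  {"items": 10013,    "bytes": 155406336},
      "bluestore_cache_onode": {"items": 19938,    "bytes": 13398336},
      "bluestore_cache_other": {"items": 6537134,  "bytes": 294022739},
      "osd_pglog":             {"items": 503242,   "bytes": 198648328}
    }
  }
}
""")

for name, size in top_pools(sample):
    print(f"{name:24s} {size / 2**20:8.1f} MiB")
```

Running the same ranking on samples taken a few hours apart would show which pools grow between a fresh start and the high-latency state.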


# ceph daemon osd.0 perf dump
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 2557489,
        "msgr_send_messages": 2533141,
        "msgr_recv_bytes": 25264135555,
        "msgr_send_bytes": 47847305751,
        "msgr_created_connections": 676,
        "msgr_active_connections": 365,
        "msgr_running_total_time": 97.965132543,
        "msgr_running_send_time": 38.022118647,
        "msgr_running_recv_time": 38.882905041,
        "msgr_running_fast_dispatch_time": 13.429645439
    },
    "AsyncMessenger::Worker-1": {
        "msgr_recv_messages": 1001909,
        "msgr_send_messages": 986004,
        "msgr_recv_bytes": 24618390855,
        "msgr_send_bytes": 13914036810,
        "msgr_created_connections": 872,
        "msgr_active_connections": 557,
        "msgr_running_total_time": 56.609283292,
        "msgr_running_send_time": 16.813572315,
        "msgr_running_recv_time": 29.344681543,
        "msgr_running_fast_dispatch_time": 6.702812008
    },
    "AsyncMessenger::Worker-2": {
        "msgr_recv_messages": 1135772,
        "msgr_send_messages": 1120243,
        "msgr_recv_bytes": 26963500670,
        "msgr_send_bytes": 16789461917,
        "msgr_created_connections": 610,
        "msgr_active_connections": 360,
        "msgr_running_total_time": 63.853954690,
        "msgr_running_send_time": 19.841675961,
        "msgr_running_recv_time": 32.433031830,
        "msgr_running_fast_dispatch_time": 7.293915661
    },
    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 128003866624,
        "db_used_bytes": 4123000832,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 0,
        "slow_used_bytes": 0,
        "num_files": 77,
        "log_bytes": 9093120,
        "log_compactions": 2,
        "logged_bytes": 210989056,
        "files_written_wal": 2,
        "files_written_sst": 237,
        "bytes_written_wal": 21125355408,
        "bytes_written_sst": 11368467488
    },
    "bluestore": {
        "kv_flush_lat": {
            "avgcount": 1789133,
            "sum": 1.951295111,
            "avgtime": 0.000001090
        },
        "kv_commit_lat": {
            "avgcount": 1789133,
            "sum": 180.819752608,
            "avgtime": 0.000101065
        },
        "kv_lat": {
            "avgcount": 1789133,
            "sum": 182.771047719,
            "avgtime": 0.000102156
        },
        "state_prepare_lat": {
            "avgcount": 1876202,
            "sum": 167.455854249,
            "avgtime": 0.000089252
        },
        "state_aio_wait_lat": {
            "avgcount": 1876202,
            "sum": 43.313698143,
            "avgtime": 0.000023085
        },
        "state_io_done_lat": {
            "avgcount": 1876202,
            "sum": 2.113144402,
            "avgtime": 0.000001126
        },
        "state_kv_queued_lat": {
            "avgcount": 1876202,
            "sum": 83.261967214,
            "avgtime": 0.000044377
        },
        "state_kv_commiting_lat": {
            "avgcount": 1876202,
            "sum": 213.357616341,
            "avgtime": 0.000113717
        },
        "state_kv_done_lat": {
            "avgcount": 1876202,
            "sum": 0.602926028,
            "avgtime": 0.000000321
        },
        "state_deferred_queued_lat": {
            "avgcount": 1619364,
            "sum": 85184.639501267,
            "avgtime": 0.052603762
        },
        "state_deferred_aio_wait_lat": {
            "avgcount": 1619364,
            "sum": 112.044269827,
            "avgtime": 0.000069190
        },
        "state_deferred_cleanup_lat": {
            "avgcount": 1619364,
            "sum": 20995.833517937,
            "avgtime": 0.012965481
        },
        "state_finishing_lat": {
            "avgcount": 1876188,
            "sum": 0.204556717,
            "avgtime": 0.000000109
        },
        "state_done_lat": {
            "avgcount": 1876188,
            "sum": 2109.986993627,
            "avgtime": 0.001124613
        },
        "throttle_lat": {
            "avgcount": 1876202,
            "sum": 0.868563450,
            "avgtime": 0.000000462
        },
        "submit_lat": {
            "avgcount": 1876202,
            "sum": 182.836610559,
            "avgtime": 0.000097450
        },
        "commit_lat": {
            "avgcount": 1876202,
            "sum": 509.787789482,
            "avgtime": 0.000271712
        },
        "read_lat": {
            "avgcount": 1441152,
            "sum": 170.377365108,
            "avgtime": 0.000118223
        },
        "read_onode_meta_lat": {
            "avgcount": 3548665,
            "sum": 1.705679589,
            "avgtime": 0.000000480
        },
        "read_wait_aio_lat": {
            "avgcount": 2107513,
            "sum": 210.183431662,
            "avgtime": 0.000099730
        },
        "compress_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "decompress_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "csum_lat": {
            "avgcount": 1462432,
            "sum": 5.431547650,
            "avgtime": 0.000003714
        },
        "compress_success_count": 0,
        "compress_rejected_count": 0,
        "write_pad_bytes": 17154172,
        "deferred_write_ops": 1347835,
        "deferred_write_bytes": 12240068608,
        "write_penalty_read_ops": 666365,
        "bluestore_allocated": 1845050884096,
        "bluestore_stored": 1954031783171,
        "bluestore_compressed": 0,
        "bluestore_compressed_allocated": 0,
        "bluestore_compressed_original": 0,
        "bluestore_onodes": 19955,
        "bluestore_onode_hits": 5259374,
        "bluestore_onode_misses": 38227,
        "bluestore_onode_shard_hits": 4668782,
        "bluestore_onode_shard_misses": 109639,
        "bluestore_extents": 826739,
        "bluestore_blobs": 698021,
        "bluestore_buffers": 10095,
        "bluestore_buffer_bytes": 155717632,
        "bluestore_buffer_hit_bytes": 10550018805,
        "bluestore_buffer_miss_bytes": 25468580806,
        "bluestore_write_big": 358557,
        "bluestore_write_big_bytes": 60145041408,
        "bluestore_write_big_blobs": 1153323,
        "bluestore_write_small": 2131832,
        "bluestore_write_small_bytes": 12582989208,
        "bluestore_write_small_unused": 24433,
        "bluestore_write_small_deferred": 2072022,
        "bluestore_write_small_pre_read": 2072022,
        "bluestore_write_small_new": 35377,
        "bluestore_txc": 1876202,
        "bluestore_onode_reshard": 33902,
        "bluestore_blob_split": 221,
        "bluestore_extent_compress": 3298474,
        "bluestore_gc_merged": 0,
        "bluestore_read_eio": 0,
        "bluestore_reads_with_retries": 0,
        "bluestore_fragmentation_micros": 13
    },
    "finisher-defered_finisher": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "finisher-finisher-0": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 1804127,
            "sum": 48.740572944,
            "avgtime": 0.000027016
        }
    },
    "finisher-objecter-finisher-0": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.0::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.0::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.1::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.1::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.2::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.2::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.3::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.3::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.4::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.4::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.5::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.5::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.6::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.6::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.7::sdata_wait_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "mutex-OSDShard.7::shard_lock": {
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "objecter": {
        "op_active": 0,
        "op_laggy": 0,
        "op_send": 0,
        "op_send_bytes": 0,
        "op_resend": 0,
        "op_reply": 0,
        "op": 0,
        "op_r": 0,
        "op_w": 0,
        "op_rmw": 0,
        "op_pg": 0,
        "osdop_stat": 0,
        "osdop_create": 0,
        "osdop_read": 0,
        "osdop_write": 0,
        "osdop_writefull": 0,
        "osdop_writesame": 0,
        "osdop_append": 0,
        "osdop_zero": 0,
        "osdop_truncate": 0,
        "osdop_delete": 0,
        "osdop_mapext": 0,
        "osdop_sparse_read": 0,
        "osdop_clonerange": 0,
        "osdop_getxattr": 0,
        "osdop_setxattr": 0,
        "osdop_cmpxattr": 0,
        "osdop_rmxattr": 0,
        "osdop_resetxattrs": 0,
        "osdop_tmap_up": 0,
        "osdop_tmap_put": 0,
        "osdop_tmap_get": 0,
        "osdop_call": 0,
        "osdop_watch": 0,
        "osdop_notify": 0,
        "osdop_src_cmpxattr": 0,
        "osdop_pgls": 0,
        "osdop_pgls_filter": 0,
        "osdop_other": 0,
        "linger_active": 0,
        "linger_send": 0,
        "linger_resend": 0,
        "linger_ping": 0,
        "poolop_active": 0,
        "poolop_send": 0,
        "poolop_resend": 0,
        "poolstat_active": 0,
        "poolstat_send": 0,
        "poolstat_resend": 0,
        "statfs_active": 0,
        "statfs_send": 0,
        "statfs_resend": 0,
        "command_active": 0,
        "command_send": 0,
        "command_resend": 0,
        "map_epoch": 127779,
        "map_full": 0,
        "map_inc": 90,
        "osd_sessions": 0,
        "osd_session_open": 0,
        "osd_session_close": 0,
        "osd_laggy": 0,
        "omap_wr": 0,
        "omap_rd": 0,
        "omap_del": 0
    },
    "osd": {
        "op_wip": 0,
        "op": 2013806,
        "op_in_bytes": 20556717224,
        "op_out_bytes": 34677030904,
        "op_latency": {
            "avgcount": 2013806,
            "sum": 1078.246283925,
            "avgtime": 0.000535427
        },
        "op_process_latency": {
            "avgcount": 2013806,
            "sum": 633.116694830,
            "avgtime": 0.000314388
        },
        "op_prepare_latency": {
            "avgcount": 2013890,
            "sum": 324.773977236,
            "avgtime": 0.000161266
        },
        "op_r": 1505501,
        "op_r_out_bytes": 34677030904,
        "op_r_latency": {
            "avgcount": 1505501,
            "sum": 513.764361891,
            "avgtime": 0.000341258
        },
        "op_r_process_latency": {
            "avgcount": 1505501,
            "sum": 190.181935259,
            "avgtime": 0.000126324
        },
        "op_r_prepare_latency": {
            "avgcount": 1505558,
            "sum": 198.971770997,
            "avgtime": 0.000132158
        },
        "op_w": 507748,
        "op_w_in_bytes": 20555028420,
        "op_w_latency": {
            "avgcount": 507748,
            "sum": 563.090245527,
            "avgtime": 0.001108995
        },
        "op_w_process_latency": {
            "avgcount": 507748,
            "sum": 442.256649477,
            "avgtime": 0.000871016
        },
        "op_w_prepare_latency": {
            "avgcount": 507748,
            "sum": 125.468116188,
            "avgtime": 0.000247107
        },
        "op_rw": 557,
        "op_rw_in_bytes": 1688804,
        "op_rw_out_bytes": 0,
        "op_rw_latency": {
            "avgcount": 557,
            "sum": 1.391676507,
            "avgtime": 0.002498521
        },
        "op_rw_process_latency": {
            "avgcount": 557,
            "sum": 0.678110094,
            "avgtime": 0.001217432
        },
        "op_rw_prepare_latency": {
            "avgcount": 584,
            "sum": 0.334090051,
            "avgtime": 0.000572072
        },
        "op_before_queue_op_lat": {
            "avgcount": 4384663,
            "sum": 138.696342460,
            "avgtime": 0.000031632
        },
        "op_before_dequeue_op_lat": {
            "avgcount": 4385319,
            "sum": 642.355724412,
            "avgtime": 0.000146478
        },
        "subop": 1360827,
        "subop_in_bytes": 52033083038,
        "subop_latency": {
            "avgcount": 1360827,
            "sum": 561.053285535,
            "avgtime": 0.000412288
        },
        "subop_w": 1360827,
        "subop_w_in_bytes": 52033083038,
        "subop_w_latency": {
            "avgcount": 1360827,
            "sum": 561.053285535,
            "avgtime": 0.000412288
        },
        "subop_pull": 0,
        "subop_pull_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "subop_push": 0,
        "subop_push_in_bytes": 0,
        "subop_push_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "pull": 0,
        "push": 0,
        "push_out_bytes": 0,
        "recovery_ops": 133,
        "loadavg": 142,
        "buffer_bytes": 0,
        "history_alloc_Mbytes": 0,
        "history_alloc_num": 0,
        "cached_crc": 0,
        "cached_crc_adjusted": 0,
        "missed_crc": 0,
        "numpg": 130,
        "numpg_primary": 52,
        "numpg_replica": 78,
        "numpg_stray": 0,
        "numpg_removing": 0,
        "heartbeat_to_peers": 17,
        "map_messages": 1308,
        "map_message_epochs": 1317,
        "map_message_epoch_dups": 1227,
        "messages_delayed_for_map": 0,
        "osd_map_cache_hit": 12097,
        "osd_map_cache_miss": 27,
        "osd_map_cache_miss_low": 0,
        "osd_map_cache_miss_low_avg": {
            "avgcount": 0,
            "sum": 0
        },
        "osd_map_bl_cache_hit": 2993,
        "osd_map_bl_cache_miss": 201,
        "stat_bytes": 3200084082688,
        "stat_bytes_used": 1849173377024,
        "stat_bytes_avail": 1350910705664,
        "copyfrom": 0,
        "tier_promote": 0,
        "tier_flush": 0,
        "tier_flush_fail": 0,
        "tier_try_flush": 0,
        "tier_try_flush_fail": 0,
        "tier_evict": 0,
        "tier_whiteout": 66,
        "tier_dirty": 3016,
        "tier_clean": 0,
        "tier_delay": 0,
        "tier_proxy_read": 0,
        "tier_proxy_write": 0,
        "agent_wake": 0,
        "agent_skip": 0,
        "agent_flush": 0,
        "agent_evict": 0,
        "object_ctx_cache_hit": 1989670,
        "object_ctx_cache_total": 2017363,
        "op_cache_hit": 0,
        "osd_tier_flush_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_tier_promote_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_tier_r_lat": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "osd_pg_info": 1876006,
        "osd_pg_fastinfo": 1822291,
        "osd_pg_biginfo": 5524
    },
    "recoverystate_perf": {
        "initial_latency": {
            "avgcount": 130,
            "sum": 4.322786154,
            "avgtime": 0.033252201
        },
        "started_latency": {
            "avgcount": 32,
            "sum": 16.075945549,
            "avgtime": 0.502373298
        },
        "reset_latency": {
            "avgcount": 162,
            "sum": 298.484347831,
            "avgtime": 1.842495974
        },
        "start_latency": {
            "avgcount": 162,
            "sum": 0.000352294,
            "avgtime": 0.000002174
        },
        "primary_latency": {
            "avgcount": 16,
            "sum": 3.601677363,
            "avgtime": 0.225104835
        },
        "peering_latency": {
            "avgcount": 68,
            "sum": 55.339698902,
            "avgtime": 0.813819101
        },
        "backfilling_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "waitremotebackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "waitlocalbackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "notbackfilling_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "repnotrecovering_latency": {
            "avgcount": 65,
            "sum": 501.110311451,
            "avgtime": 7.709389406
        },
        "repwaitrecoveryreserved_latency": {
            "avgcount": 64,
            "sum": 4.359683576,
            "avgtime": 0.068120055
        },
        "repwaitbackfillreserved_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "reprecovering_latency": {
            "avgcount": 64,
            "sum": 3.249195235,
            "avgtime": 0.050768675
        },
        "activating_latency": {
            "avgcount": 52,
            "sum": 1.686665368,
            "avgtime": 0.032435872
        },
        "waitlocalrecoveryreserved_latency": {
            "avgcount": 32,
            "sum": 96.754219086,
            "avgtime": 3.023569346
        },
        "waitremoterecoveryreserved_latency": {
            "avgcount": 32,
            "sum": 4.120144914,
            "avgtime": 0.128754528
        },
        "recovering_latency": {
            "avgcount": 32,
            "sum": 2.095248969,
            "avgtime": 0.065476530
        },
        "recovered_latency": {
            "avgcount": 52,
            "sum": 0.000304260,
            "avgtime": 0.000005851
        },
        "clean_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "active_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "replicaactive_latency": {
            "avgcount": 1,
            "sum": 7.287655289,
            "avgtime": 7.287655289
        },
        "stray_latency": {
            "avgcount": 94,
            "sum": 84.075226820,
            "avgtime": 0.894417306
        },
        "getinfo_latency": {
            "avgcount": 68,
            "sum": 3.931055694,
            "avgtime": 0.057809642
        },
        "getlog_latency": {
            "avgcount": 52,
            "sum": 0.712006388,
            "avgtime": 0.013692430
        },
        "waitactingchange_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "incomplete_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "down_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "getmissing_latency": {
            "avgcount": 52,
            "sum": 0.000239823,
            "avgtime": 0.000004611
        },
        "waitupthru_latency": {
            "avgcount": 52,
            "sum": 50.695811005,
            "avgtime": 0.974919442
        },
        "notrecovering_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "rocksdb": {
        "get": 314083,
        "submit_transaction": 1876202,
        "submit_transaction_sync": 1789133,
        "get_latency": {
            "avgcount": 314083,
            "sum": 8.789246629,
            "avgtime": 0.000027983
        },
        "submit_latency": {
            "avgcount": 1876202,
            "sum": 73.821809569,
            "avgtime": 0.000039346
        },
        "submit_sync_latency": {
            "avgcount": 1789133,
            "sum": 97.339320959,
            "avgtime": 0.000054405
        },
        "compact": 0,
        "compact_range": 0,
        "compact_queue_merge": 0,
        "compact_queue_len": 0,
        "rocksdb_write_wal_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_memtable_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_delay_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },
        "rocksdb_write_pre_and_post_time": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}
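
In the dump above, each latency counter's "avgtime" is its "sum" divided by its "avgcount", accumulated since the OSD started or since the counters were last reset. The following is a minimal Python sketch, not part of the original report, that recomputes those averages for a few of the bluestore counters and ranks them. The function names (avgtime, rank_latencies, interval_avgtime) are hypothetical helpers; the figures are copied from the dump. interval_avgtime assumes two dumps taken at different times from the same OSD with no counter reset in between.

```python
import json

# Excerpt of the "bluestore" section of the perf dump above. Only "sum" and
# "avgcount" are kept; "avgtime" is recomputed from them.
PERF_DUMP_EXCERPT = """
{
  "bluestore": {
    "kv_commit_lat": {"avgcount": 1789133, "sum": 180.819752608},
    "commit_lat": {"avgcount": 1876202, "sum": 509.787789482},
    "state_deferred_queued_lat": {"avgcount": 1619364, "sum": 85184.639501267},
    "state_deferred_cleanup_lat": {"avgcount": 1619364, "sum": 20995.833517937},
    "state_done_lat": {"avgcount": 1876188, "sum": 2109.986993627},
    "compress_lat": {"avgcount": 0, "sum": 0.0}
  }
}
"""


def avgtime(counter):
    """Average seconds per sample; 0.0 when nothing was sampled."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0


def rank_latencies(section):
    """Return (name, avgtime) pairs for the latency counters, slowest first."""
    pairs = [
        (name, avgtime(value))
        for name, value in section.items()
        if isinstance(value, dict) and "avgcount" in value
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def interval_avgtime(earlier, later):
    """Average over the window between two dumps of the same counter.

    Assumes the counter was not reset between the two dumps.
    """
    count = later["avgcount"] - earlier["avgcount"]
    return (later["sum"] - earlier["sum"]) / count if count > 0 else 0.0


if __name__ == "__main__":
    bluestore = json.loads(PERF_DUMP_EXCERPT)["bluestore"]
    for name, seconds in rank_latencies(bluestore):
        print(f"{name:32s} {seconds * 1000:10.3f} ms")
```

Because the counters are cumulative, a slow drift is easier to see with interval_avgtime over two consecutive dumps than with the lifetime average of a single dump.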


----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, 20 February 2019 14:43:19
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On osd.8, at 01:20, when latency begins to increase, I have a scrub running:

2019-02-20 01:16:08.851 7f84d24d9700 0 log_channel(cluster) log [DBG] : 5.52 scrub starts 
2019-02-20 01:17:18.019 7f84ce4d1700 0 log_channel(cluster) log [DBG] : 5.52 scrub ok 
2019-02-20 01:20:31.944 7f84f036e700 0 -- 10.5.0.106:6820/2900 >> 10.5.0.79:0/2442367265 conn(0x7e120300 :6820 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1) 
2019-02-20 01:28:35.421 7f84d34db700 0 log_channel(cluster) log [DBG] : 5.c8 scrub starts 
2019-02-20 01:29:45.553 7f84cf4d3700 0 log_channel(cluster) log [DBG] : 5.c8 scrub ok 
2019-02-20 01:32:45.737 7f84d14d7700 0 log_channel(cluster) log [DBG] : 5.c4 scrub starts 
2019-02-20 01:33:56.137 7f84d14d7700 0 log_channel(cluster) log [DBG] : 5.c4 scrub ok 


I'll try a test with scrubbing disabled (currently it runs at night, between 01:00 and 05:00).
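
To line scrub activity up against a latency graph, it can help to pull the scrub windows out of the OSD log. The following is a minimal Python sketch, not something from the original thread, that pairs "scrub starts" and "scrub ok" lines per PG and reports how long each scrub took. The helper name scrub_durations and the regular expression are assumptions based only on the log lines quoted above; other Ceph versions may format these lines differently.

```python
import re
from datetime import datetime

# Log lines copied from the osd.8 excerpt above.
LOG_LINES = """\
2019-02-20 01:16:08.851 7f84d24d9700 0 log_channel(cluster) log [DBG] : 5.52 scrub starts
2019-02-20 01:17:18.019 7f84ce4d1700 0 log_channel(cluster) log [DBG] : 5.52 scrub ok
2019-02-20 01:20:31.944 7f84f036e700 0 -- 10.5.0.106:6820/2900 >> 10.5.0.79:0/2442367265 conn(0x7e120300 :6820 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2019-02-20 01:28:35.421 7f84d34db700 0 log_channel(cluster) log [DBG] : 5.c8 scrub starts
2019-02-20 01:29:45.553 7f84cf4d3700 0 log_channel(cluster) log [DBG] : 5.c8 scrub ok
2019-02-20 01:32:45.737 7f84d14d7700 0 log_channel(cluster) log [DBG] : 5.c4 scrub starts
2019-02-20 01:33:56.137 7f84d14d7700 0 log_channel(cluster) log [DBG] : 5.c4 scrub ok
"""

# Matches only plain "<pgid> scrub starts|ok" lines; "deep-scrub" lines and
# unrelated messages are skipped.
SCRUB_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .* : (\S+) scrub (starts|ok)\s*$"
)


def scrub_durations(lines):
    """Return (pgid, start_time, seconds) for each completed scrub, in log order."""
    started = {}
    durations = []
    for line in lines:
        match = SCRUB_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        pgid, event = match.group(2), match.group(3)
        if event == "starts":
            started[pgid] = stamp
        elif pgid in started:
            begin = started.pop(pgid)
            durations.append((pgid, begin, (stamp - begin).total_seconds()))
    return durations


if __name__ == "__main__":
    for pgid, begin, seconds in scrub_durations(LOG_LINES.splitlines()):
        print(f"pg {pgid}: started {begin:%H:%M:%S}, took {seconds:.1f}s")
```

On the excerpt above, each of the three scrubs takes roughly 70 seconds, so the windows are short compared with the period over which the latency climbs.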

----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, 20 February 2019 12:09:08
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Something interesting:

when I restarted osd.8 at 11:20,

I saw the latency of another OSD, osd.1, decrease at exactly the same time (without restarting that OSD).

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other also go down for osd.1 at that time.




----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, 20 February 2019 11:39:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi, 

I have hit the bug again, but this time only on one OSD.

Here are some graphs:
http://odisoweb1.odiso.net/osd8.png

Latency was good until 01:00.

Then I see onode misses and the number of bluestore onodes increasing (which seems normal);
after that, latency slowly increases from 1ms to 3-5ms.

After the OSD restart, I'm back between 0.7 and 1ms.


----- Original message -----
From: "aderumier" <aderumier@odiso.com>
To: "Igor Fedotov" <ifedotov@suse.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, 19 February 2019 17:03:58
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared with subop_latency.

I have changed my graph to clearly identify the OSD where the latency is high.


I have made some changes to my setup:
- 2 OSDs per NVMe (3TB per OSD), each with 6GB memory (instead of 1 OSD of 6TB with 12GB memory).
- transparent hugepages disabled

After 24h, latencies are still low (between 0.7 and 1.2ms).

I also see that total memory used (# free) is lower than before: 48GB (8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB).

I'll send more stats tomorrow.

Alexandre 


----- Original message -----
From: "Igor Fedotov" <ifedotov@suse.de>
To: "Alexandre Derumier" <aderumier@odiso.com>, "Wido den Hollander" <wido@42on.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, 19 February 2019 11:12:43
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are
>>> running with memory target on 6G right now to make sure there is no
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD
>>> so it will max out on 80GB leaving 16GB as spare.
> Thanks Wido. I'll send results on Monday with my increased memory.
>
>
>
> @Igor:
>
> I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an OSD on node1
> (restarted 12h ago, for example), and I restart the OSDs on other nodes (last restarted some days ago,
> so with higher latency), it reduces the latency on the node1 OSD too.
>
> Does the "op_w_process_latency" counter include replication time?
> 
> ----- Original message -----
> From: "Wido den Hollander" <wido@42on.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, 15 February 2019 14:59:30
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I also notice it in the VMs. BTW, what is your NVMe disk size?
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> ----- Original message -----
>> From: "Wido den Hollander" <wido@42on.com>
>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Friday, 15 February 2019 14:50:34
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is different.
>>>
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see this latency problem there.
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original message -----
>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Sent: Friday, 15 February 2019 13:47:57
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see a several-fold increase in average latency for OSD write ops
>>> (in seconds):
>>> 0.002040060 (first hour) vs.
>>> 0.002483516 (last 24 hours) vs.
>>> 0.008382087 (last hour)
>>>
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 
>>> 0.003073475 (last hour) 
>>> 
>>> and OSD read ops, op_r_latency:
>>> 
>>> 0.000408595 (first hour) 
>>> 0.000709031 (24 hours) 
>>> 0.004979540 (last hour) 
>>> 
>>> What's interesting is that such latency differences are observed at
>>> neither the BlueStore level (any _lat params under the "bluestore" section) nor
>>> the rocksdb one.
>>>
>>> Which probably means that the issue is somewhere above BlueStore.
>>>
>>> I suggest continuing to collect perf dumps to see if the picture
>>> stays the same.
>>>
>>> W.r.t. the memory usage you observed, I see nothing suspicious so far; the lack
>>> of decrease in the RSS report is a known artifact that seems to be safe.
>>> 
>>> Thanks, 
>>> Igor 
>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning,
>>>> I have set osd_memory_target to 8G. (my nvme are 6TB)
>>>>
>>>>
>>>> I have taken a lot of perf dumps, mempool dumps and ps output of the process to
>>>> see RSS memory at different hours;
>>>> here are the reports for osd.0:
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The osd was started on 12-02-2019 at 08:00. 
>>>> 
>>>> First report, after 1h of running: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> Report after 24h, before the counter reset: 
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> Report 1h after the counter reset: 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increase up to 4G 
>>>> around 12-02-2019 at 14:00: 
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>>> After that, it slowly decreases. 
>>>> 
>>>> 
>>>> Another strange thing: 
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>>>> Then it decreases over time (around 3.7G this morning), but RSS is 
>>>> still at 8G. 
>>>> 
>>>> I've been graphing the mempool counters too since yesterday, so I'll be able to 
>>>> track them over time. 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>> To: "Alexandre Derumier" <aderumier@odiso.com> 
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>> Sent: Monday 11 February 2019 12:03:17 
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> Another mempool dump after a 1h run (latency ok). 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory) 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine, as the cache is warming up after the restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What does relate to the allocator, and most probably to fragmentation growth, is: 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which was higher before the restart (if I got the order of these dumps 
>>>> right): 
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned, I'm not 100% sure this could cause such a huge 
>>>> latency increase... 
>>>> 
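The comparison above can be made explicit with a short Python sketch that diffs the pools highlighted in the two mempool dumps. The byte counts are copied from the dumps in this thread; the function name `pool_deltas` is only illustrative.

```python
# "bytes" values for the three pools discussed above, copied from the two dumps.
before_restart = {
    "bluestore_alloc": 210243456,
    "bluestore_cache_other": 1539544228,
    "bluestore_cache_data": 643072,
}
after_restart = {
    "bluestore_alloc": 165053952,
    "bluestore_cache_other": 500834899,
    "bluestore_cache_data": 1056235520,
}


def pool_deltas(before, after):
    """Return {pool: after_bytes - before_bytes} for pools present in both dumps."""
    return {name: after[name] - before[name] for name in before if name in after}


deltas = pool_deltas(before_restart, after_restart)

# How much larger bluestore_alloc was before the restart, relative to after it.
alloc_growth = (before_restart["bluestore_alloc"] - after_restart["bluestore_alloc"]) \
    / after_restart["bluestore_alloc"]

for name, delta in sorted(deltas.items()):
    print(f"{name}: {delta / 2**20:+.1f} MiB after restart")
print(f"bluestore_alloc was {alloc_growth:.1%} larger before the restart")
```

The allocator pool differs by roughly 43 MiB between the two dumps, which is small next to the shift between the two cache pools.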
>>>> Do you have a perf counters dump from after the restart? 
>>>> 
>>>> Could you collect some more dumps, for both mempool and perf counters? 
>>>> 
>>>> Ideally I'd like to have: 
>>>> 
>>>> 1) mempool/perf counters dumps after the restart (1 hour is OK) 
>>>> 
>>>> 2) mempool/perf counters dumps 24+ hours after the restart 
>>>> 
>>>> 3) reset the perf counters after 2), wait for 1 hour (without an OSD 
>>>> restart) and dump mempool/perf counters again. 
>>>> 
>>>> That way we'll be able to learn both the allocator memory usage growth and the operation 
>>>> latency distribution for the following periods: 
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
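The procedure above resets the counters so that the last dump covers only the 25th hour. Because each latency counter in a perf dump carries a cumulative "sum" and "avgcount", the same per-interval average can also be derived by differencing two dumps. The sketch below assumes that counter layout (as seen in the dumps in this thread); the snapshot values are made up for illustration.

```python
def interval_avg(earlier, later):
    """Average latency (seconds) of the operations completed between two
    cumulative snapshots of the same counter.

    Returns None when no operations completed in the interval or when the
    counter went backwards (for example after a reset or an OSD restart).
    """
    ops = later["avgcount"] - earlier["avgcount"]
    if ops <= 0:
        return None
    return (later["sum"] - earlier["sum"]) / ops


# Hypothetical snapshots of one counter, taken one hour apart.
snapshot_24h = {"avgcount": 1000, "sum": 2.0}
snapshot_25h = {"avgcount": 1500, "sum": 6.0}

last_hour_avg = interval_avg(snapshot_24h, snapshot_25h)
print(f"average latency over the interval: {last_hour_avg:.6f} s")
```

This avoids touching the counters on a production OSD, at the cost of having to keep the earlier snapshot around.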
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Friday 8 February 2019 16:14:54 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> on 1 osd, both at 10%. 
>>>>> 
>>>>> Here is the dump_mempools output: 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> And the perf dump: 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
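A dump of this size is easier to read once the latency counters are ranked. The following Python sketch walks a perf dump, recomputes each average as sum / avgcount, and sorts the result. It assumes the two-level layout shown above (section, then counter); the excerpt below copies five counters from the dump, and in practice the whole JSON document would be loaded with `json.load`.

```python
# Excerpt of the perf dump above (values copied verbatim).
perf_dump = {
    "bluestore": {
        "commit_lat": {"avgcount": 30484924, "sum": 13376.492317331, "avgtime": 0.000438790},
        "state_deferred_queued_lat": {"avgcount": 26346134, "sum": 666071.296856696, "avgtime": 0.025281557},
        "bluestore_txc": 30484924,
    },
    "osd": {
        "op_w_latency": {"avgcount": 10546037, "sum": 33160.719998316, "avgtime": 0.003144377},
        "op_rw_latency": {"avgcount": 23500, "sum": 574.395885734, "avgtime": 0.024442378},
    },
    "rocksdb": {
        "submit_latency": {"avgcount": 30484924, "sum": 1835.888692371, "avgtime": 0.000060222},
    },
}


def latency_counters(dump):
    """Yield ("section.counter", average_seconds) for every counter that
    has a "sum" and a non-zero "avgcount"; plain integer counters are skipped."""
    for section, counters in dump.items():
        for name, value in counters.items():
            if isinstance(value, dict) and "sum" in value and value.get("avgcount", 0) > 0:
                yield f"{section}.{name}", value["sum"] / value["avgcount"]


# Highest average latency first.
ranked = sorted(latency_counters(perf_dump), key=lambda item: item[1], reverse=True)

for name, avg in ranked:
    print(f"{name:45s} {avg * 1000:10.3f} ms")
```

Counters with an avgcount of 0 are left out so that unused subsystems (tiering, backfill, and so on) do not clutter the ranking.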
>>>>> ----- Original message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>>>> 
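Read literally, the counter description above is a ratio scaled by 1000. The Python sketch below implements only that description; it is not taken from the Ceph source, and the function name and sample extent counts are hypothetical.

```python
def fragmentation_score(free_extents, max_free_extents):
    """(free extents / max possible number of free extents) * 1000,
    computed with integer arithmetic to avoid floating-point rounding."""
    if max_free_extents <= 0:
        raise ValueError("max_free_extents must be positive")
    return free_extents * 1000 // max_free_extents


# Hypothetical extent counts chosen to reproduce the value 67 seen in the perf dump.
score = fragmentation_score(67, 1000)
print(score)
```

Under this reading, 0 means the free space is one contiguous extent and 1000 means it is split into as many extents as possible.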
>>>>>> Here is a graph of the last month, with bluestore_fragmentation_micros and latency: 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restarts, doesn't 
>>>>> it? Is it the same for the other OSDs? 
>>>>> 
>>>>> This proves there is some issue with the allocator: fragmentation may 
>>>>> generally grow, but it shouldn't reset on restart. It looks like some intervals 
>>>>> aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation is 
>>>>> caused by that. The fragmentation growth is relatively small, and I don't see 
>>>>> how it could impact performance that much. 
>>>>> 
>>>>> I wonder whether you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples at 
>>>>> 8-12 hour intervals? 
>>>>> 
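Collecting the samples requested above can be scripted. This Python sketch takes one timestamped mempool sample through the admin socket, using the same `ceph daemon osd.N ...` form that appears elsewhere in this thread. The helper names, the injectable `runner` parameter, and the choice to keep only per-pool byte counts are assumptions made for illustration.

```python
import json
import subprocess
import time


def run_ceph(osd_id, command):
    """Run `ceph daemon osd.<id> <command>` and return the parsed JSON output."""
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", command],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def sample_mempools(osd_id, runner=run_ceph, now=time.time):
    """Return one timestamped sample of per-pool byte counts for an OSD."""
    dump = runner(osd_id, "dump_mempools")
    pools = dump["mempool"]["by_pool"]
    return {
        "timestamp": now(),
        "bytes": {name: pool["bytes"] for name, pool in pools.items()},
        "total_bytes": dump["mempool"]["total"]["bytes"],
    }


# Example use on an OSD host (not executed here):
#   sample = sample_mempools(4)
#   print(json.dumps(sample))
```

Running this from cron every 8 to 12 hours and appending each sample to a file would give the history asked for; the `runner` parameter exists only so the function can be exercised without a live OSD.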
>>>>> 
>>>>> With regard to backporting the bitmap allocator to mimic: we haven't had such 
>>>>> plans before, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>>>> startup and once you get high latency.
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with "ceph daemon osd.x perf dump" (I have 2 months of history with all counters),
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>>>> make sure it's degraded and learn more details.
>>>>>> Sorry, it's a critical production cluster, I can't test on it :(
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce.
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus
>>>>>>>> and try the difference...
>>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.)
>>>>>> The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
>>>>>> 
>>>>>> 
>>>>>> ----- Original message -----
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>> Sent: Monday 4 February 2019 15:51:30
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD
>>>>>> startup and once you get high latency.
>>>>>>
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest.
>>>>>>
>>>>>> Also if you're able to rebuild the code I can probably make a simple
>>>>>> patch to track latency and some other internal allocator parameters to
>>>>>> make sure it's degraded and learn more details.
>>>>>> 
>>>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size.
>>>>>>>
>>>>>>> I have noticed something using a simple "perf top":
>>>>>>>
>>>>>>> each time I have this problem (I have seen exactly the same behaviour 4 times),
>>>>>>> when latency is bad, perf top gives me:
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow()
>>>>>>> 
>>>>>>> (around 10-20% time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> when latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results:
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator and StupidAllocator::_aligned_len:
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
>>> ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, 
>>> ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, 
>>> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% 
>>> PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
>>> ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% 
>>> ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% 
>>> ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to 
>>> PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, 
>>> std::allocator<ObjectStore::Transaction> >&, 
>>> boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% 
>>> BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, 
>>> std::vector<ObjectStore::Transaction, 
>>> std::allocator<ObjectStore::Transaction> >&, 
>>> boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% 
>>> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>>> ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
>>> boost::intrusive_ptr<BlueStore::Collection>&, 
>>> boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, 
>>> ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
>>> boost::intrusive_ptr<BlueStore::Collection>&, 
>>> boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, 
>>> ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% 
>>> BlueStore::_do_alloc_write(BlueStore::TransContext*, 
>>> boost::intrusive_ptr<BlueStore::Collection>, 
>>> boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, 
>>> unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, 
>>> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, 
>>> unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
>>> long, unsigned long, std::less<unsigned long>, 
>>> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
>>> long const, unsigned long> >, 256> >, std::pair<unsigned long const, 
>>> unsigned long>&, std::pair<unsigned long const, unsigned 
>>> long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% 
>>> StupidAllocator::_aligned_len(interval_set<unsigned long, 
>>> btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, 
>>> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
>>> long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message -----
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Monday 4 February 2019 09:38:11
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change
>>>>>>>
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help, after 24h I'm still around 1.5ms. (I need to wait some more days to be sure)
>>>>>>>
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB);
>>>>>>> my other clusters use 1.6TB ssds.
>>>>>>>
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps.
>>>>>>>
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message -----
>>>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday 30 January 2019 19:58:15
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>>>> op_r_latency but instead op_latency? 
>>>>>>>>> 
>>>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all the metrics from the osd sockets, and a lot of graphs.)
>>>>>>> I just don't see a latency difference on reads. (Or it is very, very small compared with the write latency increase.)
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original message -----
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>> Sent: Wednesday 30 January 2019 19:50:20
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote:
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug.
>>>>>>>> Maybe it is bluestore related (I don't have filestore anymore to compare).
>>>>>>>> I need to compare with bigger latencies.
>>>>>>>>
>>>>>>>> Here is an example, with all osds at 20-50ms before restart, then 1ms after restart (at 21:15):
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest vms too, as disk iowait.
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. 
>>> Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries:
>>>>>>>>
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
>>>>>>>>
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)
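The queries above divide the change in a counter's "sum" by the change in its "avgcount", which gives the average latency of only the operations completed between two samples. A minimal Python sketch of the same arithmetic follows; the function name and the sample values are hypothetical, and returning None for a negative delta is an assumption meant to mimic how non_negative_derivative drops the point after a counter reset or OSD restart.

```python
def interval_avg_latency(prev, curr):
    """Average latency in seconds of the ops completed between two samples
    of a Ceph latency counter, each shaped like {"sum": ..., "avgcount": ...}."""
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    # A counter that went backwards (restart or reset) or saw no new ops
    # yields no data point for this interval.
    if d_sum < 0 or d_count <= 0:
        return None
    return d_sum / d_count


# Hypothetical samples: 1000 new ops that took 1.0 s in total.
before = {"sum": 100.0, "avgcount": 1000}
after = {"sum": 101.0, "avgcount": 2000}
print(interval_avg_latency(before, after))
```

Using the lifetime "avgtime" field instead would average over everything since the OSD started, which hides a slow drift like the one discussed in this thread.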
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not 
>>> op_r_process_latency? 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original message -----
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net>
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>> Sent: Wednesday 30 January 2019 08:45:33
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> here some new results, 
>>>>>>>>> different osd/ different cluster 
>>>>>>>>> 
>>>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, 
>>> but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> currently i'm in the process of switching back from jemalloc to 
>>> tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my 
>>> change. 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if i see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original message -----
>>>>>>>>> From: "Sage Weil" <sage@newdream.net>
>>>>>>>>> To: "aderumier" <aderumier@odiso.com>
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>>>>>> Sent: Friday 25 January 2019 10:49:02
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is
>>>>>>>>> going on one of the OSDs with a high latency?
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives,
>>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup
>>>>>>>>>> When the osds are freshly started, the commit latency is between 0.5-1ms.
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy
>>>>>>>>>> values like 20-200ms.
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is ok. (Clusters are far from being fully loaded)
>>>>>>>>>> And if I restart the osd, the latency comes back to 0.5-1ms.
>>>>>>>>>>
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug?
>>>>>>>>>> Any hints for counters/logs to check?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>> 
>>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>>> Hi Igor, 
>>>> 
>>>> Thanks again for helping ! 
>>>> 
>>>> 
>>>> 
>>>> I upgraded to the latest mimic this weekend, and with the new memory autotuning,
>>>> I have set osd_memory_target to 8G. (my nvme are 6TB)
>>>>
>>>>
>>>> I have taken a lot of perf dumps, mempool dumps and ps outputs of the process to see the rss memory at different hours;
>>>> here are the reports for osd.0:
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>> 
>>>> 
>>>> The osd was started on 12-02-2019 at 08:00
>>>>
>>>> first report, after 1h running
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> report after 24h, before counter reset
>>>> 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>> 
>>>> report 1h after counter reset 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00
>>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
>>>> Then after that, slowly decreasing.
>>>>
>>>>
>>>> Another strange thing:
>>>> I'm seeing total bytes at 5G at 12-02-2018.13:30
>>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>>>> Then it decreases over time (around 3.7G this morning), but RSS is still at 8G
>>>>
>>>>
>>>> I have also been graphing the mempool counters since yesterday, so I'll be able to track them over time.
>>>> 
>>>> ----- Original message -----
>>>> From: "Igor Fedotov" <ifedotov@suse.de>
>>>> To: "Alexandre Derumier" <aderumier@odiso.com>
>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Sent: Monday 11 February 2019 12:03:17
>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>> 
>>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>>>> another mempool dump after 1h run. (latency ok) 
>>>>> 
>>>>> Biggest difference: 
>>>>> 
>>>>> before restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> (the other caches seem to be quite low too; it looks like bluestore_cache_other takes all the memory)
>>>>> 
>>>>> 
>>>>> After restart 
>>>>> ------------- 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> 
>>>> This is fine as cache is warming after restart and some rebalancing 
>>>> between data and metadata might occur. 
>>>> 
>>>> What relates to the allocator, and most probably to fragmentation growth, is:
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 165053952, 
>>>> "bytes": 165053952 
>>>> }, 
>>>> 
>>>> which had been higher before the reset (if I got the order of these
>>>> dumps right)
>>>> 
>>>> "bluestore_alloc": { 
>>>> "items": 210243456, 
>>>> "bytes": 210243456 
>>>> }, 
>>>> 
>>>> But as I mentioned, I'm not 100% sure this could cause such a huge
>>>> latency increase...
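The comparison above can be done for every pool at once by subtracting two dump_mempools outputs. The following is a minimal sketch, assuming the JSON layout shown in the dumps in this thread ("mempool" -> "by_pool" -> pool -> "bytes"); the function name is hypothetical, and only three pools from the two dumps are copied in to keep the example short.

```python
def mempool_growth(fresh, aged):
    """Bytes gained (positive) or lost (negative) per pool between a dump
    taken shortly after restart and a dump taken from the aged OSD."""
    fresh_pools = fresh["mempool"]["by_pool"]
    aged_pools = aged["mempool"]["by_pool"]
    return {
        name: aged_pools[name]["bytes"] - fresh_pools[name]["bytes"]
        for name in aged_pools
        if name in fresh_pools
    }


# Values taken from the two dumps quoted in this thread.
fresh = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 165053952, "bytes": 165053952},
    "bluestore_cache_data": {"items": 40084, "bytes": 1056235520},
    "bluestore_cache_other": {"items": 12432298, "bytes": 500834899},
}}}
aged = {"mempool": {"by_pool": {
    "bluestore_alloc": {"items": 210243456, "bytes": 210243456},
    "bluestore_cache_data": {"items": 54, "bytes": 643072},
    "bluestore_cache_other": {"items": 48661920, "bytes": 1539544228},
}}}

growth = mempool_growth(fresh, aged)
# List the pools with the largest absolute change first.
for name, delta in sorted(growth.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {delta:+d} bytes")
```

On real output the two dictionaries would come from json.load() on the saved dump files rather than from literals.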
>>>> 
>>>> Do you have perf counters dump after the restart? 
>>>> 
>>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> 
>>>> So ideally I'd like to have:
>>>>
>>>> 1) mempool/perf counter dumps after the restart (1 hour is OK)
>>>>
>>>> 2) mempool/perf counter dumps 24+ hours after the restart
>>>>
>>>> 3) reset perf counters after 2), wait for 1 hour (without an OSD
>>>> restart) and dump mempool/perf counters again.
>>>>
>>>> So we'll be able to learn both the allocator memory usage growth and the
>>>> operation latency distribution for the following periods:
>>>> 
>>>> a) 1st hour after restart 
>>>> 
>>>> b) 25th hour. 
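The three collection steps above could be scripted roughly as follows. This is a sketch under assumptions: "perf dump" and "dump_mempools" are the admin-socket commands named in this thread, "perf reset all" is assumed to be the reset command meant in step 3, and the helper names and file-name pattern are hypothetical.

```python
import subprocess


def dump_commands(osd_id):
    """Admin-socket commands for one sample: perf counters and mempools."""
    daemon = f"osd.{osd_id}"
    return {
        "perf": ["ceph", "daemon", daemon, "perf", "dump"],
        "dump_mempools": ["ceph", "daemon", daemon, "dump_mempools"],
    }


def reset_command(osd_id):
    """Command that zeroes the perf counters without restarting the OSD
    (assumed to be what step 3 refers to)."""
    return ["ceph", "daemon", f"osd.{osd_id}", "perf", "reset", "all"]


def collect(osd_id, label, run=subprocess.check_output):
    """Run both dumps and return {suggested file name: raw output}."""
    return {
        f"osd.{osd_id}.{label}.{kind}.txt": run(argv)
        for kind, argv in dump_commands(osd_id).items()
    }
```

On the OSD host this would be used as collect(0, "1h") one hour after the restart, collect(0, "24h") a day later, then subprocess.check_call(reset_command(0)), and finally collect(0, "25h") one hour after the reset.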
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>>> full mempool dump after restart 
>>>>> ------------------------------- 
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 165053952, 
>>>>> "bytes": 165053952 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 40084, 
>>>>> "bytes": 1056235520 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 22225, 
>>>>> "bytes": 14935200 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 12432298, 
>>>>> "bytes": 500834899 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 11, 
>>>>> "bytes": 8184 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 5047, 
>>>>> "bytes": 22673736 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 91, 
>>>>> "bytes": 1662976 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1907, 
>>>>> "bytes": 95600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 19664, 
>>>>> "bytes": 25486050 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 46189, 
>>>>> "bytes": 2956096 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 17, 
>>>>> "bytes": 214366 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 889673, 
>>>>> "bytes": 367160400 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3803, 
>>>>> "bytes": 224552 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 178515204, 
>>>>> "bytes": 2160630547 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: "aderumier" <aderumier@odiso.com>
>>>>> To: "Igor Fedotov" <ifedotov@suse.de>
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>>>> Sent: Friday 8 February 2019 16:14:54
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>>>> 
>>>>> I'm just seeing 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> 
>>>>> on 1 osd, both 10%. 
>>>>> 
>>>>> here is the dump_mempools output:
>>>>> 
>>>>> { 
>>>>> "mempool": { 
>>>>> "by_pool": { 
>>>>> "bloom_filter": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_alloc": { 
>>>>> "items": 210243456, 
>>>>> "bytes": 210243456 
>>>>> }, 
>>>>> "bluestore_cache_data": { 
>>>>> "items": 54, 
>>>>> "bytes": 643072 
>>>>> }, 
>>>>> "bluestore_cache_onode": { 
>>>>> "items": 105637, 
>>>>> "bytes": 70988064 
>>>>> }, 
>>>>> "bluestore_cache_other": { 
>>>>> "items": 48661920, 
>>>>> "bytes": 1539544228 
>>>>> }, 
>>>>> "bluestore_fsck": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "bluestore_txc": { 
>>>>> "items": 12, 
>>>>> "bytes": 8928 
>>>>> }, 
>>>>> "bluestore_writing_deferred": { 
>>>>> "items": 406, 
>>>>> "bytes": 4792868 
>>>>> }, 
>>>>> "bluestore_writing": { 
>>>>> "items": 66, 
>>>>> "bytes": 1085440 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "items": 1882, 
>>>>> "bytes": 93600 
>>>>> }, 
>>>>> "buffer_anon": { 
>>>>> "items": 138986, 
>>>>> "bytes": 24983701 
>>>>> }, 
>>>>> "buffer_meta": { 
>>>>> "items": 544, 
>>>>> "bytes": 34816 
>>>>> }, 
>>>>> "osd": { 
>>>>> "items": 243, 
>>>>> "bytes": 3089016 
>>>>> }, 
>>>>> "osd_mapbl": { 
>>>>> "items": 36, 
>>>>> "bytes": 179308 
>>>>> }, 
>>>>> "osd_pglog": { 
>>>>> "items": 952564, 
>>>>> "bytes": 372459684 
>>>>> }, 
>>>>> "osdmap": { 
>>>>> "items": 3639, 
>>>>> "bytes": 224664 
>>>>> }, 
>>>>> "osdmap_mapping": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "pgmap": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "mds_co": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_1": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> }, 
>>>>> "unittest_2": { 
>>>>> "items": 0, 
>>>>> "bytes": 0 
>>>>> } 
>>>>> }, 
>>>>> "total": { 
>>>>> "items": 260109445, 
>>>>> "bytes": 2228370845 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> 
>>>>> and the perf dump 
>>>>> 
>>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>>>> { 
>>>>> "AsyncMessenger::Worker-0": { 
>>>>> "msgr_recv_messages": 22948570, 
>>>>> "msgr_send_messages": 22561570, 
>>>>> "msgr_recv_bytes": 333085080271, 
>>>>> "msgr_send_bytes": 261798871204, 
>>>>> "msgr_created_connections": 6152, 
>>>>> "msgr_active_connections": 2701, 
>>>>> "msgr_running_total_time": 1055.197867330, 
>>>>> "msgr_running_send_time": 352.764480121, 
>>>>> "msgr_running_recv_time": 499.206831955, 
>>>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-1": { 
>>>>> "msgr_recv_messages": 18801593, 
>>>>> "msgr_send_messages": 18430264, 
>>>>> "msgr_recv_bytes": 306871760934, 
>>>>> "msgr_send_bytes": 192789048666, 
>>>>> "msgr_created_connections": 5773, 
>>>>> "msgr_active_connections": 2721, 
>>>>> "msgr_running_total_time": 816.821076305, 
>>>>> "msgr_running_send_time": 261.353228926, 
>>>>> "msgr_running_recv_time": 394.035587911, 
>>>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>>>> }, 
>>>>> "AsyncMessenger::Worker-2": { 
>>>>> "msgr_recv_messages": 18463400, 
>>>>> "msgr_send_messages": 18105856, 
>>>>> "msgr_recv_bytes": 187425453590, 
>>>>> "msgr_send_bytes": 220735102555, 
>>>>> "msgr_created_connections": 5897, 
>>>>> "msgr_active_connections": 2605, 
>>>>> "msgr_running_total_time": 807.186854324, 
>>>>> "msgr_running_send_time": 296.834435839, 
>>>>> "msgr_running_recv_time": 351.364389691, 
>>>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>>>> }, 
>>>>> "bluefs": { 
>>>>> "gift_bytes": 0, 
>>>>> "reclaim_bytes": 0, 
>>>>> "db_total_bytes": 256050724864, 
>>>>> "db_used_bytes": 12413042688, 
>>>>> "wal_total_bytes": 0, 
>>>>> "wal_used_bytes": 0, 
>>>>> "slow_total_bytes": 0, 
>>>>> "slow_used_bytes": 0, 
>>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, 
>>>>> "log_compactions": 14, 
>>>>> "logged_bytes": 336498688, 
>>>>> "files_written_wal": 2, 
>>>>> "files_written_sst": 4499, 
>>>>> "bytes_written_wal": 417989099783, 
>>>>> "bytes_written_sst": 213188750209 
>>>>> }, 
>>>>> "bluestore": { 
>>>>> "kv_flush_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 26.734038497, 
>>>>> "avgtime": 0.000001013 
>>>>> }, 
>>>>> "kv_commit_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3397.491150603, 
>>>>> "avgtime": 0.000128829 
>>>>> }, 
>>>>> "kv_lat": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 3424.225189100, 
>>>>> "avgtime": 0.000129843 
>>>>> }, 
>>>>> "state_prepare_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3689.542105337, 
>>>>> "avgtime": 0.000121028 
>>>>> }, 
>>>>> "state_aio_wait_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 509.864546111, 
>>>>> "avgtime": 0.000016725 
>>>>> }, 
>>>>> "state_io_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 24.534052953, 
>>>>> "avgtime": 0.000000804 
>>>>> }, 
>>>>> "state_kv_queued_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3488.338424238, 
>>>>> "avgtime": 0.000114428 
>>>>> }, 
>>>>> "state_kv_commiting_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 5660.437003432, 
>>>>> "avgtime": 0.000185679 
>>>>> }, 
>>>>> "state_kv_done_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 7.763511500, 
>>>>> "avgtime": 0.000000254 
>>>>> }, 
>>>>> "state_deferred_queued_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 666071.296856696, 
>>>>> "avgtime": 0.025281557 
>>>>> }, 
>>>>> "state_deferred_aio_wait_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 1755.660547071, 
>>>>> "avgtime": 0.000066638 
>>>>> }, 
>>>>> "state_deferred_cleanup_lat": { 
>>>>> "avgcount": 26346134, 
>>>>> "sum": 185465.151653703, 
>>>>> "avgtime": 0.007039558 
>>>>> }, 
>>>>> "state_finishing_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 3.046847481, 
>>>>> "avgtime": 0.000000099 
>>>>> }, 
>>>>> "state_done_lat": { 
>>>>> "avgcount": 30484920, 
>>>>> "sum": 13193.362685280, 
>>>>> "avgtime": 0.000432783 
>>>>> }, 
>>>>> "throttle_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 14.634269979, 
>>>>> "avgtime": 0.000000480 
>>>>> }, 
>>>>> "submit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 3873.883076148, 
>>>>> "avgtime": 0.000127075 
>>>>> }, 
>>>>> "commit_lat": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 13376.492317331, 
>>>>> "avgtime": 0.000438790 
>>>>> }, 
>>>>> "read_lat": { 
>>>>> "avgcount": 5873923, 
>>>>> "sum": 1817.167582057, 
>>>>> "avgtime": 0.000309361 
>>>>> }, 
>>>>> "read_onode_meta_lat": { 
>>>>> "avgcount": 19608201, 
>>>>> "sum": 146.770464482, 
>>>>> "avgtime": 0.000007485 
>>>>> }, 
>>>>> "read_wait_aio_lat": { 
>>>>> "avgcount": 13734278, 
>>>>> "sum": 2532.578077242, 
>>>>> "avgtime": 0.000184398 
>>>>> }, 
>>>>> "compress_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "decompress_lat": { 
>>>>> "avgcount": 1346945, 
>>>>> "sum": 26.227575896, 
>>>>> "avgtime": 0.000019471 
>>>>> }, 
>>>>> "csum_lat": { 
>>>>> "avgcount": 28020392, 
>>>>> "sum": 149.587819041, 
>>>>> "avgtime": 0.000005338 
>>>>> }, 
>>>>> "compress_success_count": 0, 
>>>>> "compress_rejected_count": 0, 
>>>>> "write_pad_bytes": 352923605, 
>>>>> "deferred_write_ops": 24373340, 
>>>>> "deferred_write_bytes": 216791842816, 
>>>>> "write_penalty_read_ops": 8062366, 
>>>>> "bluestore_allocated": 3765566013440, 
>>>>> "bluestore_stored": 4186255221852, 
>>>>> "bluestore_compressed": 39981379040, 
>>>>> "bluestore_compressed_allocated": 73748348928, 
>>>>> "bluestore_compressed_original": 165041381376, 
>>>>> "bluestore_onodes": 104232, 
>>>>> "bluestore_onode_hits": 71206874, 
>>>>> "bluestore_onode_misses": 1217914, 
>>>>> "bluestore_onode_shard_hits": 260183292, 
>>>>> "bluestore_onode_shard_misses": 22851573, 
>>>>> "bluestore_extents": 3394513, 
>>>>> "bluestore_blobs": 2773587, 
>>>>> "bluestore_buffers": 0, 
>>>>> "bluestore_buffer_bytes": 0, 
>>>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>>>> "bluestore_write_big": 5648815, 
>>>>> "bluestore_write_big_bytes": 552502214656, 
>>>>> "bluestore_write_big_blobs": 12440992, 
>>>>> "bluestore_write_small": 35883770, 
>>>>> "bluestore_write_small_bytes": 223436965719, 
>>>>> "bluestore_write_small_unused": 408125, 
>>>>> "bluestore_write_small_deferred": 34961455, 
>>>>> "bluestore_write_small_pre_read": 34961455, 
>>>>> "bluestore_write_small_new": 514190, 
>>>>> "bluestore_txc": 30484924, 
>>>>> "bluestore_onode_reshard": 5144189, 
>>>>> "bluestore_blob_split": 60104, 
>>>>> "bluestore_extent_compress": 53347252, 
>>>>> "bluestore_gc_merged": 21142528, 
>>>>> "bluestore_read_eio": 0, 
>>>>> "bluestore_fragmentation_micros": 67 
>>>>> }, 
>>>>> "finisher-defered_finisher": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "finisher-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 26625163, 
>>>>> "sum": 1057.506990951, 
>>>>> "avgtime": 0.000039718 
>>>>> } 
>>>>> }, 
>>>>> "finisher-objecter-finisher-0": { 
>>>>> "queue_len": 0, 
>>>>> "complete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.0::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.1::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.2::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.3::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.4::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.5::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.6::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "mutex-OSDShard.7::shard_lock": { 
>>>>> "wait": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "objecter": { 
>>>>> "op_active": 0, 
>>>>> "op_laggy": 0, 
>>>>> "op_send": 0, 
>>>>> "op_send_bytes": 0, 
>>>>> "op_resend": 0, 
>>>>> "op_reply": 0, 
>>>>> "op": 0, 
>>>>> "op_r": 0, 
>>>>> "op_w": 0, 
>>>>> "op_rmw": 0, 
>>>>> "op_pg": 0, 
>>>>> "osdop_stat": 0, 
>>>>> "osdop_create": 0, 
>>>>> "osdop_read": 0, 
>>>>> "osdop_write": 0, 
>>>>> "osdop_writefull": 0, 
>>>>> "osdop_writesame": 0, 
>>>>> "osdop_append": 0, 
>>>>> "osdop_zero": 0, 
>>>>> "osdop_truncate": 0, 
>>>>> "osdop_delete": 0, 
>>>>> "osdop_mapext": 0, 
>>>>> "osdop_sparse_read": 0, 
>>>>> "osdop_clonerange": 0, 
>>>>> "osdop_getxattr": 0, 
>>>>> "osdop_setxattr": 0, 
>>>>> "osdop_cmpxattr": 0, 
>>>>> "osdop_rmxattr": 0, 
>>>>> "osdop_resetxattrs": 0, 
>>>>> "osdop_tmap_up": 0, 
>>>>> "osdop_tmap_put": 0, 
>>>>> "osdop_tmap_get": 0, 
>>>>> "osdop_call": 0, 
>>>>> "osdop_watch": 0, 
>>>>> "osdop_notify": 0, 
>>>>> "osdop_src_cmpxattr": 0, 
>>>>> "osdop_pgls": 0, 
>>>>> "osdop_pgls_filter": 0, 
>>>>> "osdop_other": 0, 
>>>>> "linger_active": 0, 
>>>>> "linger_send": 0, 
>>>>> "linger_resend": 0, 
>>>>> "linger_ping": 0, 
>>>>> "poolop_active": 0, 
>>>>> "poolop_send": 0, 
>>>>> "poolop_resend": 0, 
>>>>> "poolstat_active": 0, 
>>>>> "poolstat_send": 0, 
>>>>> "poolstat_resend": 0, 
>>>>> "statfs_active": 0, 
>>>>> "statfs_send": 0, 
>>>>> "statfs_resend": 0, 
>>>>> "command_active": 0, 
>>>>> "command_send": 0, 
>>>>> "command_resend": 0, 
>>>>> "map_epoch": 105913, 
>>>>> "map_full": 0, 
>>>>> "map_inc": 828, 
>>>>> "osd_sessions": 0, 
>>>>> "osd_session_open": 0, 
>>>>> "osd_session_close": 0, 
>>>>> "osd_laggy": 0, 
>>>>> "omap_wr": 0, 
>>>>> "omap_rd": 0, 
>>>>> "omap_del": 0 
>>>>> }, 
>>>>> "osd": { 
>>>>> "op_wip": 0, 
>>>>> "op": 16758102, 
>>>>> "op_in_bytes": 238398820586, 
>>>>> "op_out_bytes": 165484999463, 
>>>>> "op_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 38242.481640842, 
>>>>> "avgtime": 0.002282029 
>>>>> }, 
>>>>> "op_process_latency": { 
>>>>> "avgcount": 16758102, 
>>>>> "sum": 28644.906310687, 
>>>>> "avgtime": 0.001709316 
>>>>> }, 
>>>>> "op_prepare_latency": { 
>>>>> "avgcount": 16761367, 
>>>>> "sum": 3489.856599934, 
>>>>> "avgtime": 0.000208208 
>>>>> }, 
>>>>> "op_r": 6188565, 
>>>>> "op_r_out_bytes": 165484999463, 
>>>>> "op_r_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 4507.365756792, 
>>>>> "avgtime": 0.000728337 
>>>>> }, 
>>>>> "op_r_process_latency": { 
>>>>> "avgcount": 6188565, 
>>>>> "sum": 942.363063429, 
>>>>> "avgtime": 0.000152274 
>>>>> }, 
>>>>> "op_r_prepare_latency": { 
>>>>> "avgcount": 6188644, 
>>>>> "sum": 982.866710389, 
>>>>> "avgtime": 0.000158817 
>>>>> }, 
>>>>> "op_w": 10546037, 
>>>>> "op_w_in_bytes": 238334329494, 
>>>>> "op_w_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 33160.719998316, 
>>>>> "avgtime": 0.003144377 
>>>>> }, 
>>>>> "op_w_process_latency": { 
>>>>> "avgcount": 10546037, 
>>>>> "sum": 27668.702029030, 
>>>>> "avgtime": 0.002623611 
>>>>> }, 
>>>>> "op_w_prepare_latency": { 
>>>>> "avgcount": 10548652, 
>>>>> "sum": 2499.688609173, 
>>>>> "avgtime": 0.000236967 
>>>>> }, 
>>>>> "op_rw": 23500, 
>>>>> "op_rw_in_bytes": 64491092, 
>>>>> "op_rw_out_bytes": 0, 
>>>>> "op_rw_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 574.395885734, 
>>>>> "avgtime": 0.024442378 
>>>>> }, 
>>>>> "op_rw_process_latency": { 
>>>>> "avgcount": 23500, 
>>>>> "sum": 33.841218228, 
>>>>> "avgtime": 0.001440051 
>>>>> }, 
>>>>> "op_rw_prepare_latency": { 
>>>>> "avgcount": 24071, 
>>>>> "sum": 7.301280372, 
>>>>> "avgtime": 0.000303322 
>>>>> }, 
>>>>> "op_before_queue_op_lat": { 
>>>>> "avgcount": 57892986, 
>>>>> "sum": 1502.117718889, 
>>>>> "avgtime": 0.000025946 
>>>>> }, 
>>>>> "op_before_dequeue_op_lat": { 
>>>>> "avgcount": 58091683, 
>>>>> "sum": 45194.453254037, 
>>>>> "avgtime": 0.000777984 
>>>>> }, 
>>>>> "subop": 19784758, 
>>>>> "subop_in_bytes": 547174969754, 
>>>>> "subop_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_w": 19784758, 
>>>>> "subop_w_in_bytes": 547174969754, 
>>>>> "subop_w_latency": { 
>>>>> "avgcount": 19784758, 
>>>>> "sum": 13019.714424060, 
>>>>> "avgtime": 0.000658067 
>>>>> }, 
>>>>> "subop_pull": 0, 
>>>>> "subop_pull_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "subop_push": 0, 
>>>>> "subop_push_in_bytes": 0, 
>>>>> "subop_push_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "pull": 0, 
>>>>> "push": 2003, 
>>>>> "push_out_bytes": 5560009728, 
>>>>> "recovery_ops": 1940, 
>>>>> "loadavg": 118, 
>>>>> "buffer_bytes": 0, 
>>>>> "history_alloc_Mbytes": 0, 
>>>>> "history_alloc_num": 0, 
>>>>> "cached_crc": 0, 
>>>>> "cached_crc_adjusted": 0, 
>>>>> "missed_crc": 0, 
>>>>> "numpg": 243, 
>>>>> "numpg_primary": 82, 
>>>>> "numpg_replica": 161, 
>>>>> "numpg_stray": 0, 
>>>>> "numpg_removing": 0, 
>>>>> "heartbeat_to_peers": 10, 
>>>>> "map_messages": 7013, 
>>>>> "map_message_epochs": 7143, 
>>>>> "map_message_epoch_dups": 6315, 
>>>>> "messages_delayed_for_map": 0, 
>>>>> "osd_map_cache_hit": 203309, 
>>>>> "osd_map_cache_miss": 33, 
>>>>> "osd_map_cache_miss_low": 0, 
>>>>> "osd_map_cache_miss_low_avg": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0 
>>>>> }, 
>>>>> "osd_map_bl_cache_hit": 47012, 
>>>>> "osd_map_bl_cache_miss": 1681, 
>>>>> "stat_bytes": 6401248198656, 
>>>>> "stat_bytes_used": 3777979072512, 
>>>>> "stat_bytes_avail": 2623269126144, 
>>>>> "copyfrom": 0, 
>>>>> "tier_promote": 0, 
>>>>> "tier_flush": 0, 
>>>>> "tier_flush_fail": 0, 
>>>>> "tier_try_flush": 0, 
>>>>> "tier_try_flush_fail": 0, 
>>>>> "tier_evict": 0, 
>>>>> "tier_whiteout": 1631, 
>>>>> "tier_dirty": 22360, 
>>>>> "tier_clean": 0, 
>>>>> "tier_delay": 0, 
>>>>> "tier_proxy_read": 0, 
>>>>> "tier_proxy_write": 0, 
>>>>> "agent_wake": 0, 
>>>>> "agent_skip": 0, 
>>>>> "agent_flush": 0, 
>>>>> "agent_evict": 0, 
>>>>> "object_ctx_cache_hit": 16311156, 
>>>>> "object_ctx_cache_total": 17426393, 
>>>>> "op_cache_hit": 0, 
>>>>> "osd_tier_flush_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_promote_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_tier_r_lat": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "osd_pg_info": 30483113, 
>>>>> "osd_pg_fastinfo": 29619885, 
>>>>> "osd_pg_biginfo": 81703 
>>>>> }, 
>>>>> "recoverystate_perf": { 
>>>>> "initial_latency": { 
>>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, 
>>>>> "avgtime": 0.028268709 
>>>>> }, 
>>>>> "started_latency": { 
>>>>> "avgcount": 1125, 
>>>>> "sum": 13551384.917335850, 
>>>>> "avgtime": 12045.675482076 
>>>>> }, 
>>>>> "reset_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 1101.727799040, 
>>>>> "avgtime": 0.805356578 
>>>>> }, 
>>>>> "start_latency": { 
>>>>> "avgcount": 1368, 
>>>>> "sum": 0.002014799, 
>>>>> "avgtime": 0.000001472 
>>>>> }, 
>>>>> "primary_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 4575560.638823428, 
>>>>> "avgtime": 9024.774435549 
>>>>> }, 
>>>>> "peering_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 499.372283616, 
>>>>> "avgtime": 0.907949606 
>>>>> }, 
>>>>> "backfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitremotebackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "waitlocalbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "notbackfilling_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "repnotrecovering_latency": { 
>>>>> "avgcount": 1009, 
>>>>> "sum": 8975301.082274411, 
>>>>> "avgtime": 8895.243887288 
>>>>> }, 
>>>>> "repwaitrecoveryreserved_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 99.846056520, 
>>>>> "avgtime": 0.237728706 
>>>>> }, 
>>>>> "repwaitbackfillreserved_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "reprecovering_latency": { 
>>>>> "avgcount": 420, 
>>>>> "sum": 241.682764382, 
>>>>> "avgtime": 0.575435153 
>>>>> }, 
>>>>> "activating_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 16.893347339, 
>>>>> "avgtime": 0.033320211 
>>>>> }, 
>>>>> "waitlocalrecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 672.335512769, 
>>>>> "avgtime": 3.378570415 
>>>>> }, 
>>>>> "waitremoterecoveryreserved_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 213.536439363, 
>>>>> "avgtime": 1.073047433 
>>>>> }, 
>>>>> "recovering_latency": { 
>>>>> "avgcount": 199, 
>>>>> "sum": 79.007696479, 
>>>>> "avgtime": 0.397023600 
>>>>> }, 
>>>>> "recovered_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 14.000732748, 
>>>>> "avgtime": 0.027614857 
>>>>> }, 
>>>>> "clean_latency": { 
>>>>> "avgcount": 395, 
>>>>> "sum": 4574325.900371083, 
>>>>> "avgtime": 11580.571899673 
>>>>> }, 
>>>>> "active_latency": { 
>>>>> "avgcount": 425, 
>>>>> "sum": 4575107.630123680, 
>>>>> "avgtime": 10764.959129702 
>>>>> }, 
>>>>> "replicaactive_latency": { 
>>>>> "avgcount": 589, 
>>>>> "sum": 8975184.499049954, 
>>>>> "avgtime": 15238.004242869 
>>>>> }, 
>>>>> "stray_latency": { 
>>>>> "avgcount": 818, 
>>>>> "sum": 800.729455666, 
>>>>> "avgtime": 0.978886865 
>>>>> }, 
>>>>> "getinfo_latency": { 
>>>>> "avgcount": 550, 
>>>>> "sum": 15.085667048, 
>>>>> "avgtime": 0.027428485 
>>>>> }, 
>>>>> "getlog_latency": { 
>>>>> "avgcount": 546, 
>>>>> "sum": 3.482175693, 
>>>>> "avgtime": 0.006377611 
>>>>> }, 
>>>>> "waitactingchange_latency": { 
>>>>> "avgcount": 39, 
>>>>> "sum": 35.444551284, 
>>>>> "avgtime": 0.908834648 
>>>>> }, 
>>>>> "incomplete_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "down_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "getmissing_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 6.702129624, 
>>>>> "avgtime": 0.013219190 
>>>>> }, 
>>>>> "waitupthru_latency": { 
>>>>> "avgcount": 507, 
>>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 
>>>>> }, 
>>>>> "notrecovering_latency": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> }, 
>>>>> "rocksdb": { 
>>>>> "get": 28320977, 
>>>>> "submit_transaction": 30484924, 
>>>>> "submit_transaction_sync": 26371957, 
>>>>> "get_latency": { 
>>>>> "avgcount": 28320977, 
>>>>> "sum": 325.900908733, 
>>>>> "avgtime": 0.000011507 
>>>>> }, 
>>>>> "submit_latency": { 
>>>>> "avgcount": 30484924, 
>>>>> "sum": 1835.888692371, 
>>>>> "avgtime": 0.000060222 
>>>>> }, 
>>>>> "submit_sync_latency": { 
>>>>> "avgcount": 26371957, 
>>>>> "sum": 1431.555230628, 
>>>>> "avgtime": 0.000054283 
>>>>> }, 
>>>>> "compact": 0, 
>>>>> "compact_range": 0, 
>>>>> "compact_queue_merge": 0, 
>>>>> "compact_queue_len": 0, 
>>>>> "rocksdb_write_wal_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_memtable_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_delay_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> }, 
>>>>> "rocksdb_write_pre_and_post_time": { 
>>>>> "avgcount": 0, 
>>>>> "sum": 0.000000000, 
>>>>> "avgtime": 0.000000000 
>>>>> } 
>>>>> } 
>>>>> } 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>> Sent: Tuesday, 5 February 2019 18:56:51 
>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>>>> (but I have bluestore_fragmentation_micros) 
>>>>>> ok, this is the same 
>>>>>> 
>>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
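The counter description quoted above can be read as a simple ratio. Below is a minimal Python sketch of that reading only. It assumes (this is not stated in the thread) that the "max possible number of free extents" is the free space divided by the allocation unit. The function name and parameters are hypothetical, and this is not Ceph's actual implementation.

```python
def fragmentation_score(free_extents, free_bytes, alloc_unit):
    """(free extents / max possible number of free extents) * 1000.

    Assumption for illustration: the maximum possible number of free
    extents is reached when every allocation unit of free space is its
    own extent, i.e. ceil(free_bytes / alloc_unit).
    """
    max_extents = -(-free_bytes // alloc_unit)  # ceiling division
    if max_extents == 0:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    return free_extents * 1000 // max_extents


# Hypothetical numbers: 1000 free allocation units of 4 KiB each.
contiguous = fragmentation_score(1, 1000 * 4096, 4096)    # one big free extent
scattered = fragmentation_score(100, 1000 * 4096, 4096)   # split into 100 extents
print(contiguous, scattered)
```

Under this reading, a value that drops back after an OSD restart would mean the same free space is described by fewer extents once the free list is rebuilt.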
>>>>>> 
>>>>>> 
>>>>>> Here is a graph of the last month, showing bluestore_fragmentation_micros and latency: 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>>>> Hmm, so fragmentation grows over time and drops on OSD restart, doesn't 
>>>>> it? Is it the same for other OSDs? 
>>>>> 
>>>>> This indicates some issue with the allocator: fragmentation might 
>>>>> generally grow, but it shouldn't reset on restart. It looks like some 
>>>>> intervals aren't properly merged at run time. 
>>>>> 
>>>>> On the other hand, I'm not completely sure that the latency degradation 
>>>>> is caused by that: the fragmentation growth is relatively small, and I 
>>>>> don't see how it could impact performance that much. 
>>>>> 
>>>>> Do you have OSD mempool monitoring reports (dump_mempools command 
>>>>> output on the admin socket)? Do you have any historic data? 
>>>>> 
>>>>> If not, may I have the current output and, say, a couple more samples 
>>>>> at 8-12 hour intervals? 
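Collecting the requested mempool samples could be scripted roughly as follows. This is a minimal sketch, assuming the `ceph daemon osd.<id> dump_mempools` admin socket command is run on the OSD host. The helper names, the output file and the interval are hypothetical choices, not something specified in the thread.

```python
import json
import subprocess
import time


def run_admin_socket(osd_id, command):
    """Run an admin socket command for a local OSD and return the parsed JSON."""
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", command])
    return json.loads(out)


def collect_sample(osd_id, runner=run_admin_socket):
    """Take one timestamped dump_mempools sample."""
    return {
        "time": time.time(),
        "osd": osd_id,
        "mempools": runner(osd_id, "dump_mempools"),
    }


def append_sample(path, sample):
    """Append the sample as one JSON line so samples can be compared later."""
    with open(path, "a") as f:
        f.write(json.dumps(sample) + "\n")


def collect_loop(osd_id, path, interval_s, count,
                 runner=run_admin_socket, sleep=time.sleep):
    """Collect `count` samples, waiting `interval_s` seconds between them."""
    for i in range(count):
        append_sample(path, collect_sample(osd_id, runner))
        if i < count - 1:
            sleep(interval_s)
```

For example, `collect_loop(0, "mempools-osd0.jsonl", 10 * 3600, 3)` would record three samples for osd.0 roughly ten hours apart. The `runner` and `sleep` parameters exist only so the loop can be exercised without a running cluster.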
>>>>> 
>>>>> 
>>>>> Regarding backporting the bitmap allocator to mimic: we haven't had such 
>>>>> plans so far, but I'll discuss this at the BlueStore meeting shortly. 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Igor 
>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>> To: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com>, "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 16:04:38 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Thanks Igor, 
>>>>>> 
>>>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>>>> startup and once you get high latency. 
>>>>>>>> 
>>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> I'm already monitoring with 
>>>>>> "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), 
>>>>>> 
>>>>>> but I don't see an l_bluestore_fragmentation counter. 
>>>>>> 
>>>>>> (I do have bluestore_fragmentation_micros, though.) 
>>>>>> 
>>>>>> 
>>>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>>>> patch to track latency and some other internal allocator's parameter to 
>>>>>>>> make sure it's degraded and learn more details. 
>>>>>> Sorry, it's a critical production cluster, so I can't test on it :( 
>>>>>> But I have a test cluster; maybe I can try to put some load on it and try to reproduce. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>>>> and try the difference... 
>>>>>> Is there any plan to backport it to mimic? (But I can wait for Nautilus.) 
>>>>>> The perf results of the new bitmap allocator seem very promising, from what I've seen in the PR. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "Igor Fedotov" <ifedotov@suse.de> 
>>>>>> To: "Alexandre Derumier" <aderumier@odiso.com>, "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>, "Mark Nelson" <mnelson@redhat.com> 
>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>> Sent: Monday, 4 February 2019 15:51:30 
>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> looks like a bug in StupidAllocator. 
>>>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and again once you see high latency? 
>>>>>> 
>>>>>> Specifically, the 'l_bluestore_fragmentation' parameter is of interest. 
>>>>>> 
>>>>>> Also, if you're able to rebuild the code, I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator parameters, to 
>>>>>> make sure it's degraded and learn more details. 
>>>>>> 
>>>>>> 
>>>>>> A more vigorous fix would be to backport the bitmap allocator from Nautilus 
>>>>>> and compare the results... 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Igor 
>>>>>> 
>>>>>> 
>>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>>>> Hi again, 
>>>>>>> 
>>>>>>> I spoke too soon: the problem has occurred again, so it's not related to the tcmalloc cache size. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have noticed something using a simple "perf top". 
>>>>>>> 
>>>>>>> Each time I have this problem (I have seen exactly the same behaviour 4 times), 
>>>>>>> 
>>>>>>> when latency is bad, perf top gives me: 
>>>>>>> 
>>>>>>> StupidAllocator::_aligned_len 
>>>>>>> and 
>>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> 
>>>>>>> (around 10-20% of the time for both) 
>>>>>>> 
>>>>>>> 
>>>>>>> When latency is good, I don't see them at all. 
>>>>>>> 
>>>>>>> 
>>>>>>> I have used Mark's wallclock profiler; here are the results: 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>>>> 
>>>>>>> 
>>>>>>> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len: 
>>>>>>> 
>>>>>>> 
>>>>>>> + 100.00% clone 
>>>>>>> + 100.00% start_thread 
>>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Alexandre Derumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Monday, 4 February 2019 09:38:11 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> Some news: 
>>>>>>> 
>>>>>>> I have tried different transparent hugepage values (madvise, never): no change. 
>>>>>>> 
>>>>>>> I have tried increasing bluestore_cache_size_ssd to 8G: no change. 
>>>>>>> 
>>>>>>> I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it seems to help; after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.) 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that this behaviour seems to happen much faster (< 2 days) on my big nvme drives (6TB); 
>>>>>>> my other clusters use 1.6TB ssds. 
>>>>>>> 
>>>>>>> Currently I'm using only 1 osd per nvme (I don't have more than 5000 iops per osd), but I'll try this week with 2 osds per nvme, to see if it helps. 
>>>>>>> 
>>>>>>> 
>>>>>>> BTW, has anybody already tested ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)? 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards, 
>>>>>>> 
>>>>>>> Alexandre 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "aderumier" <aderumier@odiso.com> 
>>>>>>> To: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:58:15 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency, 
>>>>>>>>> but not op_r_latency? 
>>>>>>>>> 
>>>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>>>>>>> I monitor reads too. (I have all metrics from the osd sockets, and a lot of graphs.) 
>>>>>>> 
>>>>>>> I just don't see a latency difference on reads (or it is very small compared to the write latency increase). 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>> Cc: "Sage Weil" <sage@newdream.net>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>> Sent: Wednesday, 30 January 2019 19:50:20 
>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> On 30.01.19 at 14:59, Alexandre DERUMIER wrote: 
>>>>>>>> Hi Stefan, 
>>>>>>>> 
>>>>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc, 
>>>>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well, I'm really not sure that it's a tcmalloc bug. 
>>>>>>>> Maybe it's bluestore related (I don't have filestore anymore to compare). 
>>>>>>>> I need to compare with bigger latencies. 
>>>>>>>> 
>>>>>>>> Here is an example, with all osds at 20-50ms before restart, then 1ms after restart (at 21:15): 
>>>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>>>> 
>>>>>>>> I observe the latency in my guest vms too, in disk iowait. 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>>>> 
>>>>>>>>>> Also, I'm currently only monitoring latency for filestore osds. Which 
>>>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> Here are my influxdb queries. 
>>>>>>>> 
>>>>>>>> They take op_latency.sum/op_latency.avgcount over the last second. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
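The queries above divide the change in a counter's `sum` by the change in its `avgcount`, which gives the average latency of only the ops completed in that interval. A minimal Python sketch of the same calculation on two `perf dump` samples follows. The function name is hypothetical, the first sample uses the op_w_latency figures from the perf dump quoted earlier in this thread, and the second sample is made up for illustration.

```python
def interval_avg_latency(prev, curr):
    """Average latency (seconds per op) between two samples of one counter.

    Each sample is a dict with the cumulative "sum" (seconds) and
    "avgcount" (number of ops), as they appear in the perf dump output.
    Returns None when no op completed in the interval or when the counters
    went backwards (for example after an OSD restart), similar in spirit
    to non_negative_derivative() discarding negative rates.
    """
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# First sample: op_w_latency from the perf dump quoted in this thread.
prev = {"avgcount": 10546037, "sum": 33160.719998316}
# Second sample: hypothetical, 1000 more writes taking 20 s in total.
curr = {"avgcount": 10547037, "sum": 33180.719998316}

lifetime_avg = prev["sum"] / prev["avgcount"]   # what "avgtime" reports
recent_avg = interval_avg_latency(prev, curr)   # what the graphs plot
print(f"lifetime avg: {lifetime_avg * 1000:.3f} ms")
print(f"interval avg: {recent_avg * 1000:.3f} ms")
```

The lifetime `avgtime` is dominated by the millions of earlier ops, so a recent degradation is only visible in the interval average.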
>>>>>>> Thanks. Is there any reason you monitor op_w_latency and op_latency, 
>>>>>>> but not op_r_latency? 
>>>>>>> 
>>>>>>> Also, why do you monitor op_w_process_latency but not op_r_process_latency? 
>>>>>>> 
>>>>>>> greets, 
>>>>>>> Stefan 
>>>>>>> 
>>>>>>>> ----- Original Message ----- 
>>>>>>>> From: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag> 
>>>>>>>> To: "aderumier" <aderumier@odiso.com>, "Sage Weil" <sage@newdream.net> 
>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>> Sent: Wednesday, 30 January 2019 08:45:33 
>>>>>>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> On 30.01.19 at 08:33, Alexandre DERUMIER wrote: 
>>>>>>>>> Hi, 
>>>>>>>>> 
>>>>>>>>> Here are some new results, 
>>>>>>>>> from a different osd on a different cluster. 
>>>>>>>>> 
>>>>>>>>> Before the osd restart, latency was between 2-5ms; 
>>>>>>>>> after the osd restart, it is around 1-1.5ms. 
>>>>>>>>> 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>>>> 
>>>>>>>>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>>>> Currently I'm in the process of switching back from jemalloc to tcmalloc, 
>>>>>>>> as suggested. This report makes me a little nervous about my change. 
>>>>>>>> 
>>>>>>>> Also, I'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>>>> 
>>>>>>>> I would like to check if I see the same behaviour. 
>>>>>>>> 
>>>>>>>> Greets, 
>>>>>>>> Stefan 
>>>>>>>> 
>>>>>>>>> ----- Original Message ----- 
>>>>>>>>> From: "Sage Weil" <sage@newdream.net> 
>>>>>>>>> To: "aderumier" <aderumier@odiso.com> 
>>>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
>>>>>>>>> Sent: Friday, 25 January 2019 10:49:02 
>>>>>>>>> Subject: Re: ceph osd commit latency increase over time, until restart 
>>>>>>>>> 
>>>>>>>>> Can you capture a perf top or perf record to see where the CPU time is 
>>>>>>>>> going on one of the OSDs with a high latency? 
>>>>>>>>> 
>>>>>>>>> Thanks! 
>>>>>>>>> sage 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>>>> 
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> I see strange behaviour from my OSDs, on multiple clusters. 
>>>>>>>>>> 
>>>>>>>>>> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives; 
>>>>>>>>>> the workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>>>>>>>>> 
>>>>>>>>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>>>>>>>>> values like 20-200ms. 
>>>>>>>>>> 
>>>>>>>>>> Some example graphs: 
>>>>>>>>>> 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>>>> 
>>>>>>>>>> All OSDs show this behaviour, in all clusters. 
>>>>>>>>>> 
>>>>>>>>>> The latency of the physical disks is ok. (The clusters are far from fully loaded.) 
>>>>>>>>>> 
>>>>>>>>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>>>>>>>>> 
>>>>>>>>>> That reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
>>>>>>>>>> 
>>>>>>>>>> Any hints for counters/logs to check? 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> 
>>>>>>>>>> Alexandre 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________ 
>>>>>>>>> ceph-users mailing list 
>>>>>>>>> ceph-users@lists.ceph.com 
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                 ` <056c13b4-fbcf-787f-cfbe-bb37044161f8-fspyXLx8qC4@public.gmane.org>
  2019-02-15 13:54                                                                                   ` Alexandre DERUMIER
@ 2019-02-28 20:57                                                                                   ` Stefan Kooman
       [not found]                                                                                     ` <20190228205705.GB31731-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Stefan Kooman @ 2019-02-28 20:57 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-users, ceph-devel

Quoting Wido den Hollander (wido-fspyXLx8qC4@public.gmane.org):
 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
> OSDs as well. Over time their latency increased until we started to
> notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info-68+x73Hep80@public.gmane.org

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <20190228205705.GB31731-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                     ` <20190228205705.GB31731-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
@ 2019-02-28 22:00                                                                                       ` Igor Fedotov
       [not found]                                                                                         ` <392d66bb-5647-9b19-c17b-5259f4ed6749-l3A5Bk7waGM@public.gmane.org>
  2019-03-01  8:29                                                                                       ` Alexandre DERUMIER
  1 sibling, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-28 22:00 UTC (permalink / raw)
  To: Stefan Kooman, Wido den Hollander; +Cc: ceph-users, ceph-devel

I wonder if somebody would be able to apply a simple patch that 
periodically resets StupidAllocator?

Just to verify or disprove the hypothesis that it's allocator related.

On 2/28/2019 11:57 PM, Stefan Kooman wrote:
> Quoting Wido den Hollander (wido-fspyXLx8qC4@public.gmane.org):
>   
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>> OSDs as well. Over time their latency increased until we started to
>> notice I/O-wait inside VMs.
> On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
> guess. After restarting the OSD servers the latency would drop to normal
> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>
> Reboots were finished at ~ 19:00.
>
> Gr. Stefan
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <392d66bb-5647-9b19-c17b-5259f4ed6749-l3A5Bk7waGM@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                         ` <392d66bb-5647-9b19-c17b-5259f4ed6749-l3A5Bk7waGM@public.gmane.org>
@ 2019-02-28 22:01                                                                                           ` Igor Fedotov
       [not found]                                                                                             ` <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA@mail.gmail.com>
  0 siblings, 1 reply; 42+ messages in thread
From: Igor Fedotov @ 2019-02-28 22:01 UTC (permalink / raw)
  To: Stefan Kooman, Wido den Hollander; +Cc: ceph-users, ceph-devel

Also I think it makes sense to create a ticket at this point. Any 
volunteers?

On 3/1/2019 1:00 AM, Igor Fedotov wrote:
> I wonder if somebody would be able to apply a simple patch that 
> periodically resets StupidAllocator?
>
> Just to verify or disprove the hypothesis that it's allocator related.
>
> On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>> Quoting Wido den Hollander (wido-fspyXLx8qC4@public.gmane.org):
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>>> OSDs as well. Over time their latency increased until we started to
>>> notice I/O-wait inside VMs.
>> On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
>> guess. After restarting the OSD servers the latency would drop to normal
>> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>>
>> Reboots were finished at ~ 19:00.
>>
>> Gr. Stefan
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA@mail.gmail.com>]

[parent not found: <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                               ` <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-03-01 10:24                                                                                                 ` Igor Fedotov
  2019-03-01 10:26                                                                                                 ` Igor Fedotov
  1 sibling, 0 replies; 42+ messages in thread
From: Igor Fedotov @ 2019-03-01 10:24 UTC (permalink / raw)
  To: Xiaoxi Chen; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 3853 bytes --]

Hi Chen,

thanks for the update. I will prepare a patch to periodically reset 
StupidAllocator today.

And just to let you know, below is an e-mail from AdamK from RH which 
might explain the issue with the allocator.

Also please note that StupidAllocator might not perform full 
defragmentation at run time. That would explain why we observed (mentioned 
somewhere in the thread) fragmentation growing while the OSD is running and 
dropping on restart: a restart rebuilds the internal tree and eliminates 
the defragmentation flaws. Maybe that's the case here.


Thanks,

Igor
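
As an illustration of the defragmentation hypothesis above (a toy model, not
code taken from Ceph): if free extents that were released one by one at run
time are never merged with their neighbours, a rebuild that sorts them by
offset and coalesces adjacent ones ends up with fewer, larger extents. The
names Extent and rebuild_free_list are hypothetical and exist only for this
sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical free-space record: a run of free bytes on the device.
struct Extent {
  std::uint64_t offset;
  std::uint64_t length;
};

// Sort free extents by offset and merge every pair that touches or overlaps.
// This models what a "rebuild on restart" could do; it is an assumption, not
// a description of StupidAllocator internals.
std::vector<Extent> rebuild_free_list(std::vector<Extent> extents) {
  std::sort(extents.begin(), extents.end(),
            [](const Extent& a, const Extent& b) { return a.offset < b.offset; });
  std::vector<Extent> merged;
  for (const Extent& e : extents) {
    if (!merged.empty() &&
        merged.back().offset + merged.back().length >= e.offset) {
      // The new extent starts inside or right after the previous one: extend it.
      const std::uint64_t end = std::max(
          merged.back().offset + merged.back().length, e.offset + e.length);
      merged.back().length = end - merged.back().offset;
    } else {
      merged.push_back(e);
    }
  }
  return merged;
}
```

For example, four 4096-byte extents at offsets 8192, 0, 4096 and 65536 come
back as two extents: one of 12288 bytes at offset 0 and one of 4096 bytes at
offset 65536.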

-------- Forwarded Message --------

Subject: 	High CPU in StupidAllocator
Date: 	Tue, 12 Feb 2019 10:24:37 +0100
From: 	Adam Kupczyk <akupczyk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: 	IGOR FEDOTOV <ifed75-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>



Hi Igor,

I have observed that StupidAllocator can burn a lot of CPU in 
StupidAllocator::allocate_int().
This comes from loops like this one:
while (p != free[bin].end()) {
     if (_aligned_len(p, alloc_unit) >= want_size) {
       goto found;
     }
     ++p;
}

It happens when want_size is close to the upper size limit of the bin.
For example, free[5] contains sizes 8192..16383.
When requesting a size like 16000, it is quite likely that multiple chunks 
must be checked.

I have made an attempt to improve this by increasing the number of buckets.
It is done in aclamk/wip-bs-stupid-allocator-2.

Best regards,

Adam Kupczyk
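
The effect described in the forwarded message can be pictured with a small
stand-alone sketch. It is a toy model under stated assumptions: the bin
mapping in toy_bin_for is hypothetical and was chosen only so that bin 5
covers 8192..16383 as in the example, and chunks_inspected simply mirrors the
linear walk of the quoted loop. Neither function exists in Ceph.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bin mapping (an assumption): bin i holds free chunks whose length lies
// in [256 << i, (256 << (i + 1)) - 1], capped at bin 9.
inline int toy_bin_for(std::uint64_t len) {
  int bin = 0;
  while (bin < 9 && (std::uint64_t{256} << (bin + 1)) <= len) {
    ++bin;
  }
  return bin;
}

// Walk one bin front to back, like the quoted loop, and return how many
// chunks were inspected. 'found' reports whether a chunk of at least
// want_size bytes was reached.
inline std::size_t chunks_inspected(const std::vector<std::uint64_t>& bin_chunks,
                                    std::uint64_t want_size, bool& found) {
  std::size_t inspected = 0;
  found = false;
  for (std::uint64_t chunk_len : bin_chunks) {
    ++inspected;
    if (chunk_len >= want_size) {
      found = true;
      break;
    }
  }
  return inspected;
}
```

With a bin holding chunks of 8192, 9000, 12288, 10000 and 16383 bytes, a
request for 8192 is satisfied by the first chunk, while a request for 16000
has to inspect all five. The cost grows with the number of chunks in the bin,
which is why a request near the top of a bin's size range is the bad case.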



On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
> Igor,
>    I can test the patch if we have a package.
>    My environment and workload can consistently reproduce the latency 
> 2-3 days after restarting.
>    Sage told me to try the bitmap allocator to make sure the stupid 
> allocator is the bad guy. I have some OSDs on luminous + bitmap and 
> some OSDs on 14.1.0 + bitmap. Both look positive so far, but I need 
> more time to be sure.
>    The perf, log and admin socket analysis leads to the theory that 
> the loop in alloc_int sometimes takes a long time with the allocator locks 
> held. This blocks the release part called from _txc_finish in 
> kv_finalize_thread; this thread is also the one that calculates 
> state_kv_committing_lat and the overall commit_lat. You can see from the 
> admin socket that state_done_latency has a similar trend to commit_latency.
>    But we cannot find a theory to explain why a reboot helps: the 
> allocator btree will be rebuilt from the freelist manager, and it should be 
> exactly the same as it was prior to the reboot. Is anything related to pg 
> recovery?
>
>    Anyway, as I have a live environment and workload, I am more than willing 
> to work with you on further investigation.
>
> -Xiaoxi
>
> Igor Fedotov <ifedotov-l3A5Bk7waGM@public.gmane.org> wrote on 
> Fri, 1 Mar 2019 at 6:21 AM:
>
>     Also I think it makes sense to create a ticket at this point. Any
>     volunteers?
>
>     On 3/1/2019 1:00 AM, Igor Fedotov wrote:
>     > I wonder if somebody would be able to apply a simple patch that
>     > periodically resets StupidAllocator?
>     >
>     > Just to verify or disprove the hypothesis that it's allocator related.
>     >
>     > On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>     >> Quoting Wido den Hollander (wido-fspyXLx8qC4@public.gmane.org):
>     >>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>     >>> OSDs as well. Over time their latency increased until we started to
>     >>> notice I/O-wait inside VMs.
>     >> On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
>     >> guess. After restarting the OSD servers the latency would drop to normal
>     >> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>     >>
>     >> Reboots were finished at ~ 19:00.
>     >>
>     >> Gr. Stefan
>     >>
>

[-- Attachment #1.2: Type: text/html, Size: 7138 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                               ` <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2019-03-01 10:24                                                                                                 ` Igor Fedotov
@ 2019-03-01 10:26                                                                                                 ` Igor Fedotov
  1 sibling, 0 replies; 42+ messages in thread
From: Igor Fedotov @ 2019-03-01 10:26 UTC (permalink / raw)
  To: Xiaoxi Chen; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 3941 bytes --]

Resending, as I am not sure the previous e-mail reached the mailing list...



[-- Attachment #1.2: Type: text/html, Size: 6597 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]                                                                                     ` <20190228205705.GB31731-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
  2019-02-28 22:00                                                                                       ` Igor Fedotov
@ 2019-03-01  8:29                                                                                       ` Alexandre DERUMIER
  1 sibling, 0 replies; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-03-01  8:29 UTC (permalink / raw)
  To: Stefan Kooman; +Cc: ceph-users, ceph-devel

Hi,

some news: it finally seems to have been stable for me for 1 week (around 0.7 ms average commit latency).

http://odisoweb1.odiso.net/osdstable.png

The biggest change was on 18/02, when I finished rebuilding all my OSDs, with 2 OSDs of 3 TB per 6 TB NVMe drive.

(Previously I had only done it on 1 node, so maybe with replication I didn't see the benefit.)

I have also raised bluestore_cache_kv_max to 1G, kept osd_memory_target at its default, and disabled THP.

The different buffers seem to be more constant too.

But clearly, 2 smaller 3 TB OSDs with a 3G osd_memory_target vs 1 big 6 TB OSD with a 6G osd_memory_target behave differently.
(Maybe fragmentation, maybe rocksdb, maybe the number of objects in cache, I really don't know.)

----- Original message -----
From: "Stefan Kooman" <stefan@bit.nl>
To: "Wido den Hollander" <wido@42on.com>
Cc: "aderumier" <aderumier@odiso.com>, "Igor Fedotov" <ifedotov@suse.de>, "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Thursday 28 February 2019 21:57:05
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Quoting Wido den Hollander (wido@42on.com): 

> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I 
guess. After restarting the OSD servers the latency would drop to normal 
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj 

Reboots were finished at ~ 19:00. 

Gr. Stefan 

-- 
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 
| GPG: 0xD14839C6 +31 318 648 688 / info@bit.nl 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ceph osd commit latency increase over time, until restart
       [not found]             ` <1548181710.219518.1548833599717.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  2019-01-30  7:45               ` Stefan Priebe - Profihost AG
@ 2019-01-30 13:33               ` Sage Weil
       [not found]                 ` <alpine.DEB.2.11.1901301331580.5535-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Sage Weil @ 2019-01-30 13:33 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-users, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2659 bytes --]

On Wed, 30 Jan 2019, Alexandre DERUMIER wrote:
> Hi,
> 
> here are some new results,
> from a different OSD on a different cluster.
> 
> Before the OSD restart, latency was between 2-5 ms;
> after the OSD restart, it is around 1-1.5 ms.
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt  (2-5ms)
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
> http://odisoweb1.odiso.net/cephperf2/diff.txt

I don't see any smoking gun here... :/

The main difference between a warm OSD and a cold one is that on startup 
the bluestore cache is empty.  You might try setting the bluestore cache 
size to something much smaller and see if that has an effect on the CPU 
utilization?

Note that this doesn't necessarily mean that's what you want.  Maybe the 
reason why the CPU utilization is higher is because the cache is warm and 
the OSD is serving more requests per second...

sage



> 
> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
> 
> (I'm using tcmalloc 2.5-2.2)
> 
> 
> ----- Original message -----
> From: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> Cc: "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Sent: Friday 25 January 2019 10:49:02
> Subject: Re: ceph osd commit latency increase over time, until restart
> 
> Can you capture a perf top or perf record to see where the CPU time is 
> going on one of the OSDs with a high latency? 
> 
> Thanks! 
> sage 
> 
> 
> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
> 
> > 
> > Hi, 
> > 
> > I am seeing strange behaviour on my OSDs, on multiple clusters. 
> > 
> > All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
> > the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
> > 
> > When the OSDs are freshly started, the commit latency is between 0.5 and 1 ms. 
> > 
> > But over time, this latency increases slowly (maybe around 1 ms per day), until reaching crazy 
> > values like 20-200 ms. 
> > 
> > Some example graphs: 
> > 
> > http://odisoweb1.odiso.net/osdlatency1.png 
> > http://odisoweb1.odiso.net/osdlatency2.png 
> > 
> > All OSDs show this behaviour, in all clusters. 
> > 
> > The latency of the physical disks is OK. (The clusters are far from being fully loaded.) 
> > 
> > And if I restart the OSD, the latency comes back to 0.5-1 ms. 
> > 
> > This reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
> > 
> > Any hints for counters/logs to check? 
> > 
> > 
> > Regards, 
> > 
> > Alexandre 
> > 
> > 
> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

[parent not found: <alpine.DEB.2.11.1901301331580.5535-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>]

* Re: ceph osd commit latency increase over time, until restart
       [not found]                 ` <alpine.DEB.2.11.1901301331580.5535-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2019-01-30 13:45                   ` Alexandre DERUMIER
  0 siblings, 0 replies; 42+ messages in thread
From: Alexandre DERUMIER @ 2019-01-30 13:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

>>I don't see any smoking gun here... :/ 

I need to run the comparison again once the latency gets very high, but for that I need to wait some more days/weeks.


>>The main difference between a warm OSD and a cold one is that on startup 
>>the bluestore cache is empty. You might try setting the bluestore cache 
>>size to something much smaller and see if that has an effect on the CPU 
>>utilization? 

I will try to test that. I also wonder whether the new automatic memory tuning from Mark could help too?
(I'm still on mimic 13.2.1, planning to update to 13.2.5 next month.)

Also, could I check some bluestore-related counters? (onodes, rocksdb, bluestore cache, ...)

>>Note that this doesn't necessarily mean that's what you want. Maybe the 
>>reason why the CPU utilization is higher is because the cache is warm and 
>>the OSD is serving more requests per second... 

Well, currently, the server is really quiet:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           2,00   515,00   48,00 1182,00   304,00 11216,00    18,73     0,01    0,00    0,00    0,00   0,01   1,20

%Cpu(s):  1,5 us,  1,0 sy,  0,0 ni, 97,2 id,  0,2 wa,  0,0 hi,  0,1 si,  0,0 st

And this is only with writes, not reads



----- Original message -----
From: "Sage Weil" <sage@newdream.net>
To: "aderumier" <aderumier@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday 30 January 2019 14:33:23
Subject: Re: ceph osd commit latency increase over time, until restart

On Wed, 30 Jan 2019, Alexandre DERUMIER wrote: 
> Hi, 
> 
> here are some new results, 
> from a different OSD on a different cluster. 
> 
> Before the OSD restart, latency was between 2-5 ms; 
> after the OSD restart, it is around 1-1.5 ms. 
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
> http://odisoweb1.odiso.net/cephperf2/diff.txt 

I don't see any smoking gun here... :/ 

The main difference between a warm OSD and a cold one is that on startup 
the bluestore cache is empty. You might try setting the bluestore cache 
size to something much smaller and see if that has an effect on the CPU 
utilization? 

Note that this doesn't necessarily mean that's what you want. Maybe the 
reason why the CPU utilization is higher is because the cache is warm and 
the OSD is serving more requests per second... 

sage 



> 
> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
> 
> (I'm using tcmalloc 2.5-2.2) 
> 
> 
> ----- Original message ----- 
> From: "Sage Weil" <sage@newdream.net> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Sent: Friday 25 January 2019 10:49:02 
> Subject: Re: ceph osd commit latency increase over time, until restart 
> 
> Can you capture a perf top or perf record to see where the CPU time is 
> going on one of the OSDs with a high latency? 
> 
> Thanks! 
> sage 
> 
> 
> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
> 
> > 
> > Hi, 
> > 
> > I am seeing strange behaviour on my OSDs, on multiple clusters. 
> > 
> > All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
> > the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
> > 
> > When the OSDs are freshly started, the commit latency is between 0.5 and 1 ms. 
> > 
> > But over time, this latency increases slowly (maybe around 1 ms per day), until reaching crazy 
> > values like 20-200 ms. 
> > 
> > Some example graphs: 
> > 
> > http://odisoweb1.odiso.net/osdlatency1.png 
> > http://odisoweb1.odiso.net/osdlatency2.png 
> > 
> > All OSDs show this behaviour, in all clusters. 
> > 
> > The latency of the physical disks is OK. (The clusters are far from being fully loaded.) 
> > 
> > And if I restart the OSD, the latency comes back to 0.5-1 ms. 
> > 
> > This reminds me of the old tcmalloc bug, but could it be a bluestore memory bug? 
> > 
> > Any hints for counters/logs to check? 
> > 
> > 
> > Regards, 
> > 
> > Alexandre 
> > 
> > 
> 
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2019-03-01 10:26 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <395511117.2665.1548405853447.JavaMail.zimbra@oxygem.tv>
     [not found] ` <395511117.2665.1548405853447.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-25  9:14   ` ceph osd commit latency increase over time, until restart Alexandre DERUMIER
     [not found]     ` <387140705.12275.1548407699184.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-25  9:49       ` Sage Weil
     [not found]         ` <alpine.DEB.2.11.1901250948390.1384-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2019-01-25 10:06           ` Alexandre DERUMIER
     [not found]             ` <837655257.15253.1548410811958.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-25 16:32               ` Alexandre DERUMIER
     [not found]                 ` <787014196.28895.1548433922173.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-25 16:40                   ` Alexandre DERUMIER
2019-01-30  7:33           ` Alexandre DERUMIER
     [not found]             ` <1548181710.219518.1548833599717.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-30  7:45               ` Stefan Priebe - Profihost AG
     [not found]                 ` <e81456d6-8361-5ca5-2b98-7a90948c0218-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
2019-01-30 13:59                   ` Alexandre DERUMIER
     [not found]                     ` <317086845.245472.1548856741512.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-01-30 18:50                       ` Stefan Priebe - Profihost AG
     [not found]                         ` <85320911-75f8-0e9d-af71-151391839153-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
2019-01-30 18:58                           ` Alexandre DERUMIER
     [not found]                             ` <1814646360.255765.1548874695212.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-04  8:38                               ` Alexandre DERUMIER
     [not found]                                 ` <494474215.139609.1549269491013.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-04 14:17                                   ` Alexandre DERUMIER
     [not found]                                     ` <229754897.167048.1549289833437.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-04 14:51                                       ` Igor Fedotov
     [not found]                                         ` <0ab7d2b9-3611-c380-cbf6-c39cec0e673d-l3A5Bk7waGM@public.gmane.org>
2019-02-04 15:04                                           ` Alexandre DERUMIER
     [not found]                                             ` <1323366475.173629.1549292678511.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-04 15:40                                               ` Alexandre DERUMIER
     [not found]                                                 ` <2062110719.174905.1549294821422.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-05 17:56                                                   ` Igor Fedotov
     [not found]                                                     ` <d4558d4b-b1c9-211a-626a-0c14df3e29b9-l3A5Bk7waGM@public.gmane.org>
2019-02-08 15:08                                                       ` Alexandre DERUMIER
2019-02-08 15:14                                                       ` Alexandre DERUMIER
     [not found]                                                         ` <825077993.841032.1549638894023.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-08 15:57                                                           ` Alexandre DERUMIER
     [not found]                                                             ` <2132634351.842536.1549641461010.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-11 11:03                                                               ` Igor Fedotov
     [not found]                                                                 ` <c26e0eca-1a1c-3354-bff6-4560e3aea4c5-l3A5Bk7waGM@public.gmane.org>
2019-02-13  8:42                                                                   ` Alexandre DERUMIER
     [not found]                                                                     ` <1554220830.1076801.1550047328269.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2019-02-15 12:46                                                                       ` Igor Fedotov
2019-02-15 12:47                                                                       ` Igor Fedotov
     [not found]                                                                         ` <f97b81e4-265d-cd8e-3053-321d988720c4-l3A5Bk7waGM@public.gmane.org>
2019-02-15 13:31                                                                           ` Alexandre DERUMIER
     [not found]                                                                             ` <19368722.1223708.1550237472044.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-15 13:50                                                                               ` Wido den Hollander
     [not found]                                                                                 ` <056c13b4-fbcf-787f-cfbe-bb37044161f8-fspyXLx8qC4@public.gmane.org>
2019-02-15 13:54                                                                                   ` Alexandre DERUMIER
     [not found]                                                                                     ` <1345632100.1225626.1550238886648.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-15 13:59                                                                                       ` Wido den Hollander
     [not found]                                                                                         ` <fdd3eaa2-567b-8e02-aadb-64a19c78bc23-fspyXLx8qC4@public.gmane.org>
2019-02-16  8:29                                                                                           ` Alexandre DERUMIER
     [not found]                                                                                             ` <622347904.1243911.1550305749920.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-19 10:12                                                                                               ` Igor Fedotov
     [not found]                                                                                                 ` <76764043-4d0d-bb46-2e2e-0b4261963a98-l3A5Bk7waGM@public.gmane.org>
2019-02-19 16:03                                                                                                   ` Alexandre DERUMIER
     [not found]                                                                                                     ` <121987882.59219.1550592238495.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-20 10:39                                                                                                       ` Alexandre DERUMIER
     [not found]                                                                                                         ` <190289279.94469.1550659174801.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-20 11:09                                                                                                           ` Alexandre DERUMIER
     [not found]                                                                                                             ` <1938718399.96269.1550660948828.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-20 13:43                                                                                                               ` Alexandre DERUMIER
     [not found]                                                                                                                 ` <1979343949.99892.1550670199633.JavaMail.zimbra-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
2019-02-21 16:27                                                                                                                   ` Alexandre DERUMIER
2019-02-28 20:57                                                                                   ` Stefan Kooman
     [not found]                                                                                     ` <20190228205705.GB31731-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
2019-02-28 22:00                                                                                       ` Igor Fedotov
     [not found]                                                                                         ` <392d66bb-5647-9b19-c17b-5259f4ed6749-l3A5Bk7waGM@public.gmane.org>
2019-02-28 22:01                                                                                           ` Igor Fedotov
     [not found]                                                                                             ` <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA@mail.gmail.com>
     [not found]                                                                                               ` <CAEYCsVJRqJDsS7iMXuk68ecFpPS9_qivuNPihXhy7E55o+GvoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2019-03-01 10:24                                                                                                 ` Igor Fedotov
2019-03-01 10:26                                                                                                 ` Igor Fedotov
2019-03-01  8:29                                                                                       ` Alexandre DERUMIER
2019-01-30 13:33               ` Sage Weil
     [not found]                 ` <alpine.DEB.2.11.1901301331580.5535-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2019-01-30 13:45                   ` Alexandre DERUMIER
