rbd_cache, limiting read on high iops around 40k

* rbd_cache, limiting read on high iops around 40k
From: Alexandre DERUMIER @ 2015-06-09  5:51 UTC (permalink / raw)
  To: ceph-devel, ceph-users

Hi,

I'm benchmarking (ceph master branch) with 4k randread at qdepth=32,
and rbd_cache=true seems to limit the iops to around 40k.


no cache
--------
1 client - rbd_cache=false - 1osd : 38300 iops
1 client - rbd_cache=false - 2osd : 69073 iops
1 client - rbd_cache=false - 3osd : 78292 iops


cache
-----
1 client - rbd_cache=true - 1osd : 38100 iops
1 client - rbd_cache=true - 2osd : 42457 iops
1 client - rbd_cache=true - 3osd : 45823 iops



Is this expected?
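
To make the gap in the two tables above easier to see, here is a minimal Python sketch that computes how each configuration scales relative to its 1-OSD run, and what fraction of IOPS is lost with rbd_cache=true. The numbers are copied from the tables; the function names are only illustrative and are not part of any Ceph or fio tooling.

```python
# IOPS copied from the tables above (1 client, 4k randread, qdepth=32).
no_cache = {1: 38300, 2: 69073, 3: 78292}
cache = {1: 38100, 2: 42457, 3: 45823}


def speedup(results):
    """Return IOPS relative to the 1-OSD run for each OSD count."""
    base = results[1]
    return {osds: iops / base for osds, iops in results.items()}


def cache_penalty(with_cache, without_cache):
    """Return the fraction of IOPS lost when rbd_cache is enabled."""
    return {
        osds: 1 - with_cache[osds] / without_cache[osds]
        for osds in without_cache
    }


for osds in sorted(no_cache):
    print(
        f"{osds} osd: no-cache x{speedup(no_cache)[osds]:.2f}, "
        f"cache x{speedup(cache)[osds]:.2f}, "
        f"lost with cache {cache_penalty(cache, no_cache)[osds]:.1%}"
    )
```

On these figures the uncached run roughly doubles from 1 to 3 OSDs, while the cached run gains only about 20%.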



fio result rbd_cache=false 3 osd
--------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.1.11
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun  9 07:48:42 2015
  read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
    slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
    clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
     lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
    clat percentiles (usec):
     |  1.00th=[  173],  5.00th=[  209], 10.00th=[  231], 20.00th=[  262],
     | 30.00th=[  282], 40.00th=[  302], 50.00th=[  322], 60.00th=[  346],
     | 70.00th=[  370], 80.00th=[  402], 90.00th=[  454], 95.00th=[  506],
     | 99.00th=[  628], 99.50th=[  692], 99.90th=[  860], 99.95th=[  948],
     | 99.99th=[ 1176]
    bw (KB  /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
    lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
    lat (msec) : 2=0.03%, 4=0.01%
  cpu          : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
     issued    : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec

Disk stats (read/write):
    dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%




fio result rbd_cache=true 3osd
------------------------------

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.1.11
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun  9 07:47:30 2015
  read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
    slat (usec): min=7, max=805, avg=21.26, stdev=15.84
    clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
     lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
    clat percentiles (usec):
     |  1.00th=[  227],  5.00th=[  274], 10.00th=[  306], 20.00th=[  350],
     | 30.00th=[  390], 40.00th=[  430], 50.00th=[  470], 60.00th=[  506],
     | 70.00th=[  548], 80.00th=[  596], 90.00th=[  660], 95.00th=[  724],
     | 99.00th=[  844], 99.50th=[  908], 99.90th=[ 1112], 99.95th=[ 1288],
     | 99.99th=[ 2192]
    bw (KB  /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
    lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
    lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
  cpu          : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec

Disk stats (read/write):
    dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
  sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
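
As a sanity check on the two fio reports above, the sketch below applies Little's law (mean requests in flight = throughput x mean latency) to the reported iops and average total latency, and recomputes bandwidth from iops at a 4 KB block size. The helper names are illustrative, and the reading of the result is an assumption rather than a diagnosis.

```python
def in_flight(iops, avg_lat_usec):
    """Little's law: mean requests in flight = throughput * mean latency."""
    return iops * avg_lat_usec / 1_000_000


def bandwidth_kb(iops, block_kb=4):
    """Bandwidth in KB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_kb


# (iops, avg "lat" in usec) taken from the fio output above.
runs = {
    "rbd_cache=false": (78292, 347.84),
    "rbd_cache=true": (45823, 499.80),
}

for name, (iops, lat) in runs.items():
    print(
        f"{name}: ~{in_flight(iops, lat):.1f} requests in flight, "
        f"~{bandwidth_kb(iops)} KB/s"
    )
```

Both runs average fewer than the configured 32 requests in flight (about 27 without cache and about 23 with it), which seems consistent with the "IO depths" lines, where most samples fall in the 16 bucket.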

* Re: rbd_cache, limiting read on high iops around 40k
From: pushpesh sharma @ 2015-06-09  7:21 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, ceph-users


Hi Alexandre,

We have also seen something very similar on Hammer (0.94-1). We were doing
some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno).
Each Ubuntu VM has one RBD as its root disk and one RBD as additional storage.
For some strange reason, we were not able to scale 4K random-read (RR) iops on
each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck.
However, increasing the number of VMs to 4 on a single hypervisor did scale to
some extent. Beyond that, we got little benefit from adding more VMs.

Here is the trend we have seen; the x-axis is the number of hypervisors, each
hypervisor has 4 VMs, and each VM has 1 RBD:




VDbench was used as the benchmarking tool. We were not saturating the network
or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the
hypervisors either, and that is where we suspected some throttling effect.
However, we haven't set any such limits from the nova or kvm end. We tried
some CPU pinning and other KVM-related tuning as well, but no luck.

We tried the same experiment on bare metal. There, 4K RR iops scaled from
40K (1 RBD) to 180K (4 RBDs). But beyond that point, rather than scaling
further, the numbers actually degraded. (Single pipe, more congestion effect.)

We never suspected that enabling rbd cache could be detrimental to
performance. It would be nice to root-cause the problem if that is the case.



-- 
-Pushpesh

[-- Attachment #1.2: Scale.png --]
[-- Type: image/png, Size: 31172 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

* Re: rbd_cache, limiting read on high iops around 40k
From: Alexandre DERUMIER @ 2015-06-09  7:28 UTC (permalink / raw)
  To: pushpesh sharma; +Cc: ceph-devel, ceph-users


Hi,

>> We tried adding more RBDs to single VM, but no luck.

If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign one iothread per disk (works with virtio-blk).
This works for me; I can scale by adding more disks.
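
As a rough illustration of "one iothread per disk", here is a small Python sketch that builds the matching QEMU command-line arguments. The `-object iothread`, `-drive ...,if=none` and `-device virtio-blk-pci,...,iothread=` options are standard QEMU options; the helper name, the id naming scheme and the pool/image names are hypothetical, and a real VM (or a libvirt/nova setup) would need more options than shown here.

```python
def iothread_args(images):
    """Build QEMU arguments giving each RBD image its own iothread.

    `images` is a list of "pool/image" strings (hypothetical names).
    """
    args = []
    for n, image in enumerate(images):
        # One dedicated iothread object per disk.
        args += ["-object", f"iothread,id=iothread{n}"]
        # The RBD-backed drive, not yet attached to any device.
        args += ["-drive", f"file=rbd:{image},format=raw,if=none,id=drive{n}"]
        # A virtio-blk device bound to both the drive and its iothread.
        args += [
            "-device",
            f"virtio-blk-pci,drive=drive{n},iothread=iothread{n}",
        ]
    return args


print(" ".join(iothread_args(["rbd/vm-disk0", "rbd/vm-disk1"])))
```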


My benchmarks here are done with fio-rbd on the host.
I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and to around 250k iops with 10 clients and rbd_cache=on.


I just wonder why I don't see a performance decrease around 30k iops with 1 OSD.

I'm going to check whether this tracker issue
http://tracker.ceph.com/issues/11056

could be the cause.

(My master build was done some weeks ago)



* Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
From: Alexandre DERUMIER @ 2015-06-09  8:36 UTC (permalink / raw)
  To: pushpesh sharma; +Cc: ceph-devel, ceph-users

It seems that the limit mainly shows up at high queue depths (roughly > 16).

Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
With queue depth < 16, rbd_cache gives almost the same results as no cache.


cache
-----
qd1: 1651
qd2: 3482
qd4: 7958
qd8: 17912
qd16: 36020
qd32: 42765
qd64: 46169

no cache
--------
qd1: 1748
qd2: 3570
qd4: 8356
qd8: 17732
qd16: 41396
qd32: 78633
qd64: 79063
qd128: 79550
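
To put a number on "almost the same", the sketch below computes the cache / no-cache IOPS ratio at each queue depth present in both tables above. The data is copied from the tables; the function name is only illustrative.

```python
# IOPS per queue depth, copied from the two tables above.
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {
    1: 1748, 2: 3570, 4: 8356, 8: 17732,
    16: 41396, 32: 78633, 64: 79063, 128: 79550,
}


def cache_ratio(with_cache, without_cache):
    """Return cache IOPS / no-cache IOPS for queue depths found in both."""
    return {
        qd: with_cache[qd] / without_cache[qd]
        for qd in sorted(with_cache)
        if qd in without_cache
    }


for qd, ratio in cache_ratio(cache, no_cache).items():
    print(f"qd{qd}: cache reaches {ratio:.0%} of no-cache iops")
```

On these figures the ratio stays within a few percent of 1 up to qd8, drops to about 87% at qd16, and falls to roughly 54-58% at qd32 and qd64.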


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: rbd_cache, limiting read on high iops around 40k
From: Mark Nelson @ 2015-06-09 11:36 UTC (permalink / raw)
  To: Alexandre DERUMIER, pushpesh sharma; +Cc: ceph-devel, ceph-users

Hi All,

In the past we've hit some performance issues with RBD cache that we've 
fixed, but we've never really tried pushing a single VM beyond 40+K read 
IOPS in testing (or at least I never have).  I suspect there's a couple 
of possibilities as to why it might be slower, but perhaps joshd can 
chime in as he's more familiar with what that code looks like.

Frankly, I'm a little impressed that without RBD cache we can hit 80K 
IOPS from 1 VM!  How fast are the SSDs in those 3 OSDs?

Mark

On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
> It's seem that the limit is mainly going in high queue depth (+- > 16)
>
> Here the result in iops with 1client- 4krandread- 3osd - with differents queue depth size.
> rbd_cache is almost the same than without cache with queue depth <16
>
>
> cache
> -----
> qd1: 1651
> qd2: 3482
> qd4: 7958
> qd8: 17912
> qd16: 36020
> qd32: 42765
> qd64: 46169
>
> no cache
> --------
> qd1: 1748
> qd2: 3570
> qd4: 8356
> qd8: 17732
> qd16: 41396
> qd32: 78633
> qd64: 79063
> qd128: 79550
>
>
> ----- Mail original -----
> De: "aderumier" <aderumier@odiso.com>
> À: "pushpesh sharma" <pushpesh.eck@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Envoyé: Mardi 9 Juin 2015 09:28:21
> Objet: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi,
>
>>> We tried adding more RBDs to single VM, but no luck.
>
> If you want to scale with more disks in a single qemu vm, you need to use iothread feature from qemu and assign 1 iothread by disk (works with virtio-blk).
> It's working for me, I can scale with adding more disks.
>
>
> My bench here are done with fio-rbd on host.
> I can scale up to 400k iops with 10clients-rbd_cache=off on a single host and around 250kiops 10clients-rbdcache=on.
>
>
> I just wonder why I don't have performance decrease around 30k iops with 1osd.
>
> I'm going to see if this tracker
> http://tracker.ceph.com/issues/11056
>
> could be the cause.
>
> (My master build was done some week ago)
>
>
>
> ----- Mail original -----
> De: "pushpesh sharma" <pushpesh.eck@gmail.com>
> À: "aderumier" <aderumier@odiso.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Envoyé: Mardi 9 Juin 2015 09:21:04
> Objet: Re: rbd_cache, limiting read on high iops around 40k
>
> Hi Alexandre,
>
> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason we were not able to scale 4K RR iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we did not get much benefit from adding more VMs.
>
> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>
> [chart missing from the archived message]
>
> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors either, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>
> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect)
>
> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
  2015-06-09 11:36             ` Mark Nelson
@ 2015-06-09 12:02               ` Alexandre DERUMIER
       [not found]                 ` <1208111516.1790161.1433851367996.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  2015-06-09 13:39               ` [ceph-users] " Jason Dillaman
  1 sibling, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-09 12:02 UTC (permalink / raw)
  To: Mark Nelson; +Cc: pushpesh sharma, ceph-devel, ceph-users

>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>IOPS from 1 VM!

Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
(I'm planning to send qemu results soon)

>>How fast are the SSDs in those 3 OSDs? 

These results are with the data in the buffer memory of the OSD nodes.

When reading fully from SSD (Intel S3500),

For 1 client, 

I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
I'm around 55k iops without cache and 38k iops with cache, with 3 osd.

With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer.

(server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon)



small tip : 
I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%

LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...

as a lot of time is spent in malloc/free 
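
The same preload trick can be scripted when several benchmark runs have to be launched in a row. This is only a sketch: the library path is the one quoted above and differs between distributions, and the helper names are invented for the example.

```python
import os
import subprocess
from typing import Dict, List, Optional

# Path taken from the message above; it differs between distributions.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(lib: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with `lib` placed first in LD_PRELOAD."""
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return env


def run_preloaded(cmd: List[str], lib: str = TCMALLOC) -> int:
    """Run `cmd` with the allocator preloaded and return its exit status."""
    if not os.path.exists(lib):
        raise FileNotFoundError(f"allocator library not found: {lib}")
    return subprocess.run(cmd, env=preload_env(lib)).returncode
```

Something like run_preloaded(["fio", "job.fio"]) would then be the equivalent of the shell lines above, where job.fio stands in for whatever job file is being used.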


(qemu has also supported tcmalloc for some months; I'll bench it too:
  https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html)



I'll try to send full bench results soon, from 1 to 18 ssd osd.




----- Original Message -----
From: "Mark Nelson" <mnelson@redhat.com>
To: "aderumier" <aderumier@odiso.com>, "pushpesh sharma" <pushpesh.eck@gmail.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Tuesday, 9 June 2015 13:36:31
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi All, 

In the past we've hit some performance issues with RBD cache that we've 
fixed, but we've never really tried pushing a single VM beyond 40+K read 
IOPS in testing (or at least I never have). I suspect there's a couple 
of possibilities as to why it might be slower, but perhaps joshd can 
chime in as he's more familiar with what that code looks like. 

Frankly, I'm a little impressed that without RBD cache we can hit 80K 
IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 

Mark 

On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
> It seems that the limit mainly shows up at high queue depths (roughly > 16).
>
> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
> With rbd_cache the results are almost the same as without cache at queue depths < 16.
> 
> 
> cache 
> ----- 
> qd1: 1651 
> qd2: 3482 
> qd4: 7958 
> qd8: 17912 
> qd16: 36020 
> qd32: 42765 
> qd64: 46169 
> 
> no cache 
> -------- 
> qd1: 1748 
> qd2: 3570 
> qd4: 8356 
> qd8: 17732 
> qd16: 41396 
> qd32: 78633 
> qd64: 79063 
> qd128: 79550 
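
To make the divergence in the two tables above easier to see, the following sketch computes the cache/no-cache ratio at each queue depth. The numbers are copied from the tables; the function name is just for illustration. With these figures the ratio stays roughly between 0.94 and 1.01 up to qd8, drops to about 0.87 at qd16, and sits around 0.54 to 0.58 at qd32 and qd64.

```python
# IOPS per queue depth, copied from the tables above
# (1 client, 4k randread, 3 OSDs).
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(qd: int) -> float:
    """IOPS with rbd_cache=true divided by IOPS with rbd_cache=false."""
    return cache[qd] / no_cache[qd]


if __name__ == "__main__":
    for qd in sorted(cache):
        print(f"qd{qd:<3} cache/no-cache = {cache_ratio(qd):.2f}")
```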
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <1208111516.1790161.1433851367996.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                 ` <1208111516.1790161.1433851367996.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-09 16:00                   ` Robert LeBlanc
  2015-06-09 16:47                     ` [ceph-users] " Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: Robert LeBlanc @ 2015-06-09 16:00 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, pushpesh sharma, ceph-users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I also saw a similar performance increase by using alternative memory
allocators. What I found was that Ceph OSDs performed well with either
tcmalloc or jemalloc (except when RocksDB was built with jemalloc
instead of tcmalloc, I'm still working to dig into why that might be
the case).

However, I found that tcmalloc with QEMU/KVM was very detrimental to
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
better for QEMU/KVM in the tests that we ran. [1]

I'm currently looking into I/O bottlenecks around the 16KB range and
I'm seeing a lot of time in thread creation and destruction, the
memory allocators are quite a bit down the list (both fio with
ioengine rbd and on the OSDs). I wonder what the difference can be.
I've tried using the async messenger but there wasn't a huge
difference. [2]

Further down the rabbit hole....

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8
unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU
YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2
S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3
vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51
9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO
qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3
Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b
6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13
R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ
1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4
oSJX
=k281
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
  2015-06-09 16:00                   ` Robert LeBlanc
@ 2015-06-09 16:47                     ` Alexandre DERUMIER
       [not found]                       ` <1058039366.2034449.1433868447253.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-09 16:47 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Mark Nelson, ceph-devel, pushpesh sharma, ceph-users

Hi Robert,

>>What I found was that Ceph OSDs performed well with either 
>>tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>instead of tcmalloc, I'm still working to dig into why that might be 
>>the case). 
Yes, from my tests, for the OSD tcmalloc is a little faster (but only slightly) than jemalloc.



>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>better for QEMU/KVM in the tests that we ran. [1]


I have just done a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
With qemu iothread, tcmalloc gives a speed increase over glibc.
With qemu iothread, jemalloc gives a speed decrease.

Without iothread, jemalloc gives a big speed increase.

this is with 
-qemu 2.3
-tcmalloc 2.2.1
-jemalloc 3.6
-libc6 2.19


qemu : no-iothread : glibc    : iops=33395
qemu : no-iothread : tcmalloc : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)

qemu : iothread    : glibc    : iops=34516
qemu : iothread    : tcmalloc : iops=38676 (+12%)
qemu : iothread    : jemalloc : iops=28023 (-19%)
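
The percentages in the table are relative to the glibc run in the same iothread mode. A small sketch of that arithmetic, with the IOPS values copied from the table and a helper name chosen only for this example:

```python
# IOPS from the table above, grouped by iothread mode.
results = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def pct_vs_glibc(mode: str, allocator: str) -> float:
    """Percentage change in IOPS relative to glibc in the same mode."""
    base = results[mode]["glibc"]
    return 100.0 * (results[mode][allocator] - base) / base


if __name__ == "__main__":
    for mode, by_alloc in results.items():
        for allocator in ("tcmalloc", "jemalloc"):
            print(f"{mode:12} {allocator:9} {pct_vs_glibc(mode, allocator):+.0f}%")
```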


(The benefit of iothreads is that we can scale with more disks in one VM)


fio results:
------------

qemu : iothread : tcmalloc : iops=38676
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun  9 18:16:53 2015
  read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
    slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
    clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
     lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
    clat percentiles (usec):
     |  1.00th=[  402],  5.00th=[  466], 10.00th=[  510], 20.00th=[  572],
     | 30.00th=[  636], 40.00th=[  716], 50.00th=[  780], 60.00th=[  852],
     | 70.00th=[  932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
     | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
     | 99.99th=[ 3888]
    bw (KB  /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
    lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
    lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
  cpu          : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec

Disk stats (read/write):
  vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
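
A quick sanity check on the numbers in this fio run, as a minimal Python sketch (the helper names are illustrative, not from fio). It assumes the queue stays full, which the `IO depths : 32=100.0%` line suggests, so Little's law (IOPS is roughly iodepth divided by average latency) should land close to the reported IOPS. Bandwidth divided by the 4 KB block size should agree as well.

```python
def iops_from_bandwidth(bw_kb_per_s: float, block_kb: float = 4.0) -> float:
    """IOPS implied by the reported bandwidth and a fixed block size."""
    return bw_kb_per_s / block_kb


def iops_from_latency(iodepth: int, avg_lat_usec: float) -> float:
    """Little's law: throughput = requests in flight / average latency."""
    return iodepth / (avg_lat_usec / 1_000_000.0)


if __name__ == "__main__":
    # Values copied from the run above: bw=154707KB/s, lat avg=826.10 usec,
    # iodepth=32, reported iops=38676.
    reported = 38676
    by_bw = iops_from_bandwidth(154707)
    by_lat = iops_from_latency(32, 826.10)
    print(f"from bandwidth: {by_bw:.0f} iops")
    print(f"from latency:   {by_lat:.0f} iops")
    print(f"latency estimate is off by {abs(by_lat - reported) / reported:.2%}")
```

If the latency-based estimate were far from the reported IOPS, that would hint that the queue was not kept full for the whole run.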



qemu : no-iothread : tcmalloc : iops=34516
---------------------------------------------
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun  9 18:19:08 2015
  read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
    slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
    clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
     lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
    clat percentiles (usec):
     |  1.00th=[  434],  5.00th=[  510], 10.00th=[  564], 20.00th=[  652],
     | 30.00th=[  732], 40.00th=[  812], 50.00th=[  876], 60.00th=[  940],
     | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
     | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
     | 99.99th=[ 4320]
    bw (KB  /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
    lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
    lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
  cpu          : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec

Disk stats (read/write):
  vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%



qemu : iothread : glibc : iops=34446
-------------------------------------

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun  9 18:24:01 2015
  read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
    slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
    clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
     lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
    clat percentiles (usec):
     |  1.00th=[  506],  5.00th=[  564], 10.00th=[  596], 20.00th=[  652],
     | 30.00th=[  724], 40.00th=[  804], 50.00th=[  884], 60.00th=[  964],
     | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
     | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
     | 99.99th=[ 3984]
    bw (KB  /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
    lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
    lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
  cpu          : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec

Disk stats (read/write):
  vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%



qemu : no-iothread : glibc : iops=33395
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun  9 18:27:18 2015
  read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
    slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
    clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
     lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
    clat percentiles (usec):
     |  1.00th=[  516],  5.00th=[  564], 10.00th=[  596], 20.00th=[  652],
     | 30.00th=[  724], 40.00th=[  820], 50.00th=[  924], 60.00th=[  996],
     | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
     | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
     | 99.99th=[ 4832]
    bw (KB  /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
    lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
    lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
  cpu          : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec

Disk stats (read/write):
  vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%



qemu : iothread : jemalloc : iops=28023
----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun  9 18:30:26 2015
  read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
    slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
    clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
     lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
    clat percentiles (usec):
     |  1.00th=[  510],  5.00th=[  628], 10.00th=[  700], 20.00th=[  820],
     | 30.00th=[  924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
     | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
     | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
     | 99.99th=[ 3760]
    bw (KB  /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
    lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
    lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
  cpu          : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec

Disk stats (read/write):
  vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%



qemu : no-iothread : jemalloc : iops=42226
--------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun  9 18:34:11 2015
  read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
    slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
    clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
     lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
    clat percentiles (usec):
     |  1.00th=[  354],  5.00th=[  422], 10.00th=[  462], 20.00th=[  516],
     | 30.00th=[  572], 40.00th=[  628], 50.00th=[  684], 60.00th=[  740],
     | 70.00th=[  804], 80.00th=[  884], 90.00th=[ 1004], 95.00th=[ 1128],
     | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
     | 99.99th=[ 2608]
    bw (KB  /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
    lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
    lat (msec) : 2=10.30%, 4=0.07%
  cpu          : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec

Disk stats (read/write):
  vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%



----- Original Message -----
From: "Robert LeBlanc" <robert@leblancnet.us>
To: "aderumier" <aderumier@odiso.com>
Cc: "Mark Nelson" <mnelson@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Tuesday, June 9, 2015 18:00:29
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

-----BEGIN PGP SIGNED MESSAGE----- 
Hash: SHA256 

I also saw a similar performance increase by using alternative memory 
allocators. What I found was that Ceph OSDs performed well with either 
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
instead of tcmalloc, I'm still working to dig into why that might be 
the case). 

However, I found that tcmalloc with QEMU/KVM was very detrimental to 
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
better for QEMU/KVM in the tests that we ran. [1] 

I'm currently looking into I/O bottlenecks around the 16KB range and 
I'm seeing a lot of time in thread creation and destruction, the 
memory allocators are quite a bit down the list (both fio with 
ioengine rbd and on the OSDs). I wonder what the difference can be. 
I've tried using the async messenger but there wasn't a huge 
difference. [2] 

Further down the rabbit hole.... 

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
-----BEGIN PGP SIGNATURE----- 
Version: Mailvelope v0.13.1 
Comment: https://www.mailvelope.com 

wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
oSJX 
=k281 
-----END PGP SIGNATURE----- 
---------------- 
Robert LeBlanc 
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote: 
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>IOPS from 1 VM! 
> 
> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have additional overhead. 
> (I'm planning to send qemu results soon.) 
> 
>>>How fast are the SSDs in those 3 OSDs? 
> 
> These results are with the data in the buffer cache of the OSD nodes. 
> 
> When reading fully from SSD (Intel S3500), 
> 
> for 1 client, 
> 
> I'm around 33k iops without cache and 32k iops with cache, with 1 osd. 
> I'm around 55k iops without cache and 38k iops with cache, with 3 osd. 
> 
> With multiple client jobs, I can reach around 70k iops per osd, and 250k iops per osd when the data is in the buffer cache. 
> 
> (server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon) 
> 
> 
> 
> Small tip: 
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%, 
> 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
> 
> as a lot of time is spent in malloc/free. 
> 
> 
> (qemu has also supported tcmalloc for some months; I'll bench it too: 
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html) 
> 
> 
> 
> I'll try to send full bench results soon, from 1 to 18 SSD OSDs. 
> 
> 
> 
> 
> ----- Original Message ----- 
> From: "Mark Nelson" <mnelson@redhat.com> 
> To: "aderumier" <aderumier@odiso.com>, "pushpesh sharma" <pushpesh.eck@gmail.com> 
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
> Sent: Tuesday, June 9, 2015 13:36:31 
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
> 
> Hi All, 
> 
> In the past we've hit some performance issues with RBD cache that we've 
> fixed, but we've never really tried pushing a single VM beyond 40+K read 
> IOPS in testing (or at least I never have). I suspect there's a couple 
> of possibilities as to why it might be slower, but perhaps joshd can 
> chime in as he's more familiar with what that code looks like. 
> 
> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
> 
> Mark 
> 
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>> It seems that the limit mainly appears at high queue depths (roughly > 16). 
>> 
>> Here are the results in iops with 1 client, 4k randread, 3 osd, at different queue depths. 
>> With queue depth < 16, rbd_cache gives almost the same results as no cache. 
>> 
>> 
>> cache 
>> ----- 
>> qd1: 1651 
>> qd2: 3482 
>> qd4: 7958 
>> qd8: 17912 
>> qd16: 36020 
>> qd32: 42765 
>> qd64: 46169 
>> 
>> no cache 
>> -------- 
>> qd1: 1748 
>> qd2: 3570 
>> qd4: 8356 
>> qd8: 17732 
>> qd16: 41396 
>> qd32: 78633 
>> qd64: 79063 
>> qd128: 79550 
>> 
>> 
>> ----- Original Message ----- 
>> From: "aderumier" <aderumier@odiso.com> 
>> To: "pushpesh sharma" <pushpesh.eck@gmail.com> 
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Tuesday, June 9, 2015 09:28:21 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi, 
>> 
>>>> We tried adding more RBDs to single VM, but no luck. 
>> 
>> If you want to scale with more disks in a single qemu VM, you need to use the iothread feature of qemu and assign 1 iothread per disk (works with virtio-blk). 
>> It's working for me; I can scale by adding more disks. 
>> 
>> 
>> My benchmarks here are done with fio-rbd on the host. 
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on. 
>> 
>> 
>> I just wonder why I don't see a performance decrease at around 30k iops with 1 osd. 
>> 
>> I'm going to check whether this tracker issue 
>> http://tracker.ceph.com/issues/11056 
>> 
>> could be the cause. 
>> 
>> (My master build was done some weeks ago.) 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Tuesday, June 9, 2015 09:21:04 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi Alexandre, 
>> 
>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, we were not able to scale 4K random-read (RR) iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs. 
>> 
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD: 
>> 
>> 
>> 
>> 
>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 
>> 
>> We tried the same experiment on bare metal. There, 4K RR iops scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded (single pipe, more congestion effect). 
>> 
>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>> 
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>> 
>> 
>> Hi, 
>> 
>> I'm running a benchmark (ceph master branch) with randread 4k, qdepth=32, 
>> and rbd_cache=true seems to limit the iops to around 40k. 
>> 
>> 
>> no cache 
>> -------- 
>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>> 
>> 
>> cache 
>> ----- 
>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>> 
>> 
>> 
>> Is this expected? 
>> 
>> 
>> 
>> fio result rbd_cache=false 3 osd 
>> -------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>> clat percentiles (usec): 
>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>> | 99.99th=[ 1176] 
>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>> lat (msec) : 2=0.03%, 4=0.01% 
>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>> 
>> 
>> 
>> 
>> fio result rbd_cache=true 3osd 
>> ------------------------------ 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>> clat percentiles (usec): 
>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>> | 99.99th=[ 2192] 
>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>> 


* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                       ` <1058039366.2034449.1433868447253.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-10  4:10                         ` Alexandre DERUMIER
       [not found]                           ` <284297771.2095666.1433909407567.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-10  4:10 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel, pushpesh sharma, ceph-users

Hi,

I have tested qemu with the latest tcmalloc (2.4), and the improvement with iothread is huge: 50k iops (+45%)!



qemu : no-iothread : glibc            : iops=33395
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
qemu : no-iothread : jemalloc         : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4)   : iops=35974 (+7%)


qemu : iothread    : glibc            : iops=34446
qemu : iothread    : tcmalloc (2.2.1) : iops=38676 (+12%)
qemu : iothread    : jemalloc         : iops=28023 (-19%)
qemu : iothread    : tcmalloc (2.4)   : iops=50276 (+45%)





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
------------------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
  read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
    slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
    clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
     lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
    clat percentiles (usec):
     |  1.00th=[  318],  5.00th=[  378], 10.00th=[  418], 20.00th=[  474],
     | 30.00th=[  516], 40.00th=[  564], 50.00th=[  612], 60.00th=[  652],
     | 70.00th=[  700], 80.00th=[  756], 90.00th=[  860], 95.00th=[  980],
     | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
     | 99.99th=[ 3760]
    bw (KB  /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
    lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
    lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
  cpu          : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec

Disk stats (read/write):
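  vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%



qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
------------------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32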
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
  read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
    slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
    clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
     lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
    clat percentiles (usec):
     |  1.00th=[  462],  5.00th=[  516], 10.00th=[  548], 20.00th=[  596],
     | 30.00th=[  652], 40.00th=[  764], 50.00th=[  868], 60.00th=[  940],
     | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
     | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
     | 99.99th=[ 3632]
    bw (KB  /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
    lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
    lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
  cpu          : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec

Disk stats (read/write):
  vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
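
To compare runs without copying numbers by hand, the fio `read :` summary line can be parsed with a small script. This is a minimal Python sketch, assuming the fio 2.1.11 output format shown in this thread (MB and KB/s units); the regular expression and helper names are illustrative and may need adjusting for other fio versions. As a consistency check it also recomputes the runtime from the total I/O and the bandwidth.

```python
import re

# Matches the fio 2.1.x summary line, for example:
#   read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
READ_LINE = re.compile(
    r"read\s*:\s*io=(?P<io_mb>[\d.]+)MB,\s*bw=(?P<bw_kb>\d+)KB/s,"
    r"\s*iops=(?P<iops>\d+),\s*runt=\s*(?P<runt_ms>\d+)msec"
)


def parse_read_line(line: str) -> dict:
    """Extract io (MB), bw (KB/s), iops and runtime (ms) from one line."""
    match = READ_LINE.search(line)
    if match is None:
        raise ValueError("not a fio read summary line")
    return {
        "io_mb": float(match["io_mb"]),
        "bw_kb": int(match["bw_kb"]),
        "iops": int(match["iops"]),
        "runt_ms": int(match["runt_ms"]),
    }


def expected_runtime_s(stats: dict) -> float:
    """Runtime implied by total I/O and bandwidth (1 MB = 1024 KB)."""
    return stats["io_mb"] * 1024.0 / stats["bw_kb"]


# Line copied from the no-iothread tcmalloc 2.4 run above.
SAMPLE = "  read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec"

if __name__ == "__main__":
    stats = parse_read_line(SAMPLE)
    print(stats)
    print(f"runtime implied by io/bw: {expected_runtime_s(stats):.3f} s")
```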


----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Robert LeBlanc" <robert@leblancnet.us>
Cc: "Mark Nelson" <mnelson@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Tuesday, June 9, 2015 18:47:27
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi Robert, 

>>What I found was that Ceph OSDs performed well with either 
>>tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>instead of tcmalloc, I'm still working to dig into why that might be 
>>the case). 
Yes, in my tests, tcmalloc is slightly faster than jemalloc for the OSDs (but only very slightly). 



>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>better for QEMU/KVM in the tests that we ran. [1] 


I just ran a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc. 
With qemu iothread, tcmalloc gives a speed increase over glibc. 
With qemu iothread, jemalloc gives a speed decrease. 

Without iothread, jemalloc gives a big speed increase. 

This is with: 
- qemu 2.3 
- tcmalloc 2.2.1 
- jemalloc 3.6 
- libc6 2.19 


qemu : no-iothread : glibc : iops=33395 
qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
qemu : no-iothread : jemalloc : iops=42226 (+26%) 

qemu : iothread : glibc : iops=34446 
qemu : iothread : tcmalloc : iops=38676 (+12%) 
qemu : iothread : jemalloc : iops=28023 (-19%) 


(The benefit of iothreads is that we can scale with more disks in one VM.) 


fio results: 
------------ 

qemu : iothread : tcmalloc : iops=38676 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
clat percentiles (usec): 
| 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
| 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
| 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
| 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
| 99.99th=[ 3888] 
bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 

Disk stats (read/write): 
vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 



qemu : no-iothread : tcmalloc : iops=34516 
--------------------------------------------- 
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
clat percentiles (usec): 
| 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
| 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
| 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
| 99.99th=[ 4320] 
bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 

Disk stats (read/write): 
vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
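
With a fixed queue depth, Little's law gives a quick sanity check on these figures: sustained iops should be close to the queue depth divided by the mean total latency. A small Python sketch using the numbers from the run above; the function name is hypothetical:

```python
def expected_iops(queue_depth: int, mean_latency_us: float) -> float:
    """Little's law for a closed-loop benchmark: throughput = in-flight I/Os / latency."""
    return queue_depth / (mean_latency_us * 1e-6)

# no-iothread + tcmalloc run: depth=32, "lat (usec): ... avg=925.77", iops=34516
predicted = expected_iops(32, 925.77)
reported = 34516
error_pct = abs(predicted - reported) / reported * 100
print(f"predicted {predicted:.0f} iops, reported {reported}, off by {error_pct:.2f}%")
```

The prediction lands within one percent of the reported value, so at iodepth=32 the differences between these runs come down to differences in per-request latency.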



qemu : iothread : glibc : iops=34516 
------------------------------------- 

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
clat percentiles (usec): 
| 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
| 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
| 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
| 99.99th=[ 3984] 
bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 

Disk stats (read/write): 
vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 



qemu : no iothread : glibc : iops=33395 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
clat percentiles (usec): 
| 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
| 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
| 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
| 99.99th=[ 4832] 
bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 

Disk stats (read/write): 
vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 



qemu : iothread : jemalloc : iops=28023
---------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
clat percentiles (usec): 
| 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
| 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
| 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
| 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
| 99.99th=[ 3760] 
bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 

Disk stats (read/write): 
vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 



qemu : no-iothread : jemalloc : iops=42226
-------------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
clat percentiles (usec): 
| 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
| 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
| 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
| 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
| 99.99th=[ 2608] 
bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
lat (msec) : 2=10.30%, 4=0.07% 
cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 

Disk stats (read/write): 
vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
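
The allocator comparison can be summarized as a percent change against the glibc run in the same mode. A minimal Python sketch using the headline iops from the section headings of this message; the helper name is hypothetical:

```python
# Headline iops taken from the section headings of this message.
results = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}

def change_vs_glibc(mode: str, allocator: str) -> int:
    """Percent change in iops relative to the glibc run in the same mode."""
    baseline = results[mode]["glibc"]
    return round((results[mode][allocator] / baseline - 1) * 100)

for mode, by_alloc in results.items():
    for allocator in by_alloc:
        if allocator != "glibc":
            print(f"{mode:12s} {allocator:9s} {change_vs_glibc(mode, allocator):+d}%")
```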



----- Original Message -----
From: "Robert LeBlanc" <robert@leblancnet.us>
To: "aderumier" <aderumier@odiso.com>
Cc: "Mark Nelson" <mnelson@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Tuesday, June 9, 2015 18:00:29
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

-----BEGIN PGP SIGNED MESSAGE----- 
Hash: SHA256 

I also saw a similar performance increase by using alternative memory 
allocators. What I found was that Ceph OSDs performed well with either 
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
instead of tcmalloc, I'm still working to dig into why that might be 
the case). 

However, I found that tcmalloc with QEMU/KVM was very detrimental to 
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
better for QEMU/KVM in the tests that we ran. [1] 

I'm currently looking into I/O bottlenecks around the 16KB range and 
I'm seeing a lot of time in thread creation and destruction, the 
memory allocators are quite a bit down the list (both fio with 
ioengine rbd and on the OSDs). I wonder what the difference can be. 
I've tried using the async messenger but there wasn't a huge 
difference. [2] 

Further down the rabbit hole.... 

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
-----BEGIN PGP SIGNATURE----- 
Version: Mailvelope v0.13.1 
Comment: https://www.mailvelope.com 

wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
oSJX 
=k281 
-----END PGP SIGNATURE----- 
---------------- 
Robert LeBlanc 
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote: 
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>IOPS from 1 VM! 
> 
> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have extra overhead.
> (I'm planning to send qemu results soon)
>
>>>How fast are the SSDs in those 3 OSDs?
>
> These results are with the data in the buffer memory of the OSD nodes.
>
> When reading entirely from SSD (Intel S3500),
>
> for 1 client,
>
> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>
> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer memory.
>
> (server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon)
> 
> 
> 
> small tip : 
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
> 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
> 
> as a lot of time is spent in malloc/free 
> 
> 
> (qemu has also supported tcmalloc for a few months; I'll bench it too:
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html) 
> 
> 
> 
> I'll try to send full bench results soon, from 1 to 18 SSD OSDs.
> 
> 
> 
> 
> ----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "aderumier" <aderumier@odiso.com>, "pushpesh sharma" <pushpesh.eck@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday, June 9, 2015 13:36:31
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> 
> Hi All, 
> 
> In the past we've hit some performance issues with RBD cache that we've 
> fixed, but we've never really tried pushing a single VM beyond 40+K read 
> IOPS in testing (or at least I never have). I suspect there's a couple 
> of possibilities as to why it might be slower, but perhaps joshd can 
> chime in as he's more familiar with what that code looks like. 
> 
> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
> 
> Mark 
> 
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>> It seems that the limit mainly shows up at high queue depths (roughly > 16).
>>
>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>> With rbd_cache, the results are almost the same as without cache at queue depths < 16.
>> 
>> 
>> cache 
>> ----- 
>> qd1: 1651 
>> qd2: 3482 
>> qd4: 7958 
>> qd8: 17912 
>> qd16: 36020 
>> qd32: 42765 
>> qd64: 46169 
>> 
>> no cache 
>> -------- 
>> qd1: 1748 
>> qd2: 3570 
>> qd4: 8356 
>> qd8: 17732 
>> qd16: 41396 
>> qd32: 78633 
>> qd64: 79063 
>> qd128: 79550 
>> 
>> 
>> ----- Original Message -----
>> From: "aderumier" <aderumier@odiso.com>
>> To: "pushpesh sharma" <pushpesh.eck@gmail.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Tuesday, June 9, 2015 09:28:21
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>> 
>> Hi, 
>> 
>>>> We tried adding more RBDs to single VM, but no luck. 
>> 
>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign one iothread per disk (works with virtio-blk).
>> It's working for me; I can scale by adding more disks.
>>
>>
>> My benchmarks here are done with fio-rbd on the host.
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>
>>
>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD.
>>
>> I'm going to see if this tracker issue
>> http://tracker.ceph.com/issues/11056
>>
>> could be the cause.
>> 
>> (My master build was done some week ago) 
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "pushpesh sharma" <pushpesh.eck@gmail.com>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Tuesday, June 9, 2015 09:21:04
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>> 
>> Hi Alexandre, 
>> 
>> We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as its root disk and one RBD as additional storage. For some strange reason, 4K random-read iops on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
>>
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>> 
>> 
>> 
>> 
>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or KVM end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>
>> We tried the same experiment on bare metal. There, 4K random-read iops scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect.)
>>
>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>> 
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>> 
>> 
>> Hi, 
>> 
>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
>> and rbd_cache=true seem to limit the iops around 40k 
>> 
>> 
>> no cache 
>> -------- 
>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>> 
>> 
>> cache 
>> ----- 
>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>> 
>> 
>> 
>> Is it expected ? 
>> 
>> 
>> 
>> fio result rbd_cache=false 3 osd 
>> -------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>> clat percentiles (usec): 
>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>> | 99.99th=[ 1176] 
>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>> lat (msec) : 2=0.03%, 4=0.01% 
>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>> 
>> 
>> 
>> 
>> fio result rbd_cache=true 3osd 
>> ------------------------------ 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>> clat percentiles (usec): 
>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>> | 99.99th=[ 2192] 
>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>> 
> _______________________________________________ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


[parent not found: <284297771.2095666.1433909407567.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                           ` <284297771.2095666.1433909407567.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-10  5:21                             ` Irek Fasikhov
       [not found]                               ` <CAF-rypxjbsH3GdUG474OgSZVjdzKyf_0n8-zAkAuGhk83TXQhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Irek Fasikhov @ 2015-06-10  5:21 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, pushpesh sharma, ceph-users



Hi, Alexandre.

Very good work!
Do you have an RPM file?
Thanks.

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER <aderumier@odiso.com>:

> Hi,
>
> I have tested qemu with the latest tcmalloc (2.4), and the improvement
> with iothread is huge: 50k iops (+45%)!
>
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>
>
>
>
>
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> ------------------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
>   read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
>     slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
>     clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
>      lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
>     clat percentiles (usec):
>      |  1.00th=[  318],  5.00th=[  378], 10.00th=[  418], 20.00th=[  474],
>      | 30.00th=[  516], 40.00th=[  564], 50.00th=[  612], 60.00th=[  652],
>      | 70.00th=[  700], 80.00th=[  756], 90.00th=[  860], 95.00th=[  980],
>      | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
>      | 99.99th=[ 3760]
>     bw (KB  /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
>     lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
>     lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
>   cpu          : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>    READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec
>
> Disk stats (read/write):
>   vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
>
>
>
>
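> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
> ---------------------------------------------------------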
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
>   read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
>     slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
>     clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
>      lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
>     clat percentiles (usec):
>      |  1.00th=[  462],  5.00th=[  516], 10.00th=[  548], 20.00th=[  596],
>      | 30.00th=[  652], 40.00th=[  764], 50.00th=[  868], 60.00th=[  940],
>      | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
>      | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
>      | 99.99th=[ 3632]
>     bw (KB  /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
>     lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
>     lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
>   cpu          : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>    READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec
>
> Disk stats (read/write):
>   vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
>
>
> ----- Original Message -----
> From: "aderumier" <aderumier@odiso.com>
> To: "Robert LeBlanc" <robert@leblancnet.us>
> Cc: "Mark Nelson" <mnelson@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday, June 9, 2015 18:47:27
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi Robert,
>
> >>What I found was that Ceph OSDs performed well with either
> >>tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> >>instead of tcmalloc, I'm still working to dig into why that might be
> >>the case).
> Yes, in my tests tcmalloc is slightly faster than jemalloc for the OSDs
> (but only by a very small margin).
>
>
>
> >>However, I found that tcmalloc with QEMU/KVM was very detrimental to
> >>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> >>better for QEMU/KVM in the tests that we ran. [1]
>
>
> I have just run a qemu test (4k randread, rbd_cache=off), and I don't see
> a speed regression with tcmalloc.
> With qemu iothread, tcmalloc gives a speed increase over glibc.
> With qemu iothread, jemalloc gives a speed decrease.
>
> Without iothread, jemalloc gives a big speed increase.
>
> this is with
> -qemu 2.3
> -tcmalloc 2.2.1
> -jemalloc 3.6
> -libc6 2.19
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
>
>
> (The benefit of iothreads is that we can scale with more disks in 1vm)
>
>
> fio results:
> ------------
>
> qemu : iothread : tcmalloc : iops=38676
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
> clat percentiles (usec):
> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
> | 99.99th=[ 3888]
> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec
>
> Disk stats (read/write):
> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
>
>
>
> qemu : no-iothread : tcmalloc : iops=34516
> ---------------------------------------------
> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
> clat percentiles (usec):
> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
> | 99.99th=[ 4320]
> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec
>
> Disk stats (read/write):
> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%
>
>
>
> qemu : iothread : glibc : iops=34516
> -------------------------------------
>
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
> clat percentiles (usec):
> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
> | 99.99th=[ 3984]
> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78,
> stdev=15521.30
> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s,
> mint=38051msec, maxt=38051msec
>
> Disk stats (read/write):
> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972,
> util=99.85%
>
>
>
> qemu : no iothread : glibc : iops=33395
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops]
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9
> 18:27:18 2015
> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
> clat percentiles (usec):
> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
> | 99.99th=[ 4832]
> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64,
> stdev=19121.91
> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s,
> mint=39248msec, maxt=39248msec
>
> Disk stats (read/write):
> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536,
> util=99.84%
>
>
>
> qemu : iothread : jemalloc : iops=28023
> ---------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops]
> [eta 00m:01s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9
> 18:30:26 2015
> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
> clat percentiles (usec):
> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
> | 99.99th=[ 3760]
> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27,
> stdev=17381.70
> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s,
> mint=46772msec, maxt=46772msec
>
> Disk stats (read/write):
> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376,
> util=98.68%
>
>
>
> qemu : no-iothread : jemalloc : iops=42226
> ------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops]
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9
> 18:34:11 2015
> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
> clat percentiles (usec):
> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
> | 99.99th=[ 2608]
> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14,
> stdev=23440.79
> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
> lat (msec) : 2=10.30%, 4=0.07%
> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s,
> mint=29599msec, maxt=29599msec
>
> Disk stats (read/write):
> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%
>
>
>
> ----- Original message -----
> From: "Robert LeBlanc" <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> Cc: "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
> "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> Sent: Tuesday, June 9, 2015 18:00:29
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I also saw a similar performance increase by using alternative memory
> allocators. What I found was that Ceph OSDs performed well with either
> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> instead of tcmalloc, I'm still working to dig into why that might be
> the case).
>
> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> better for QEMU/KVM in the tests that we ran. [1]
>
> I'm currently looking into I/O bottlenecks around the 16KB range and
> I'm seeing a lot of time in thread creation and destruction, the
> memory allocators are quite a bit down the list (both fio with
> ioengine rbd and on the OSDs). I wonder what the difference can be.
> I've tried using the async messenger but there wasn't a huge
> difference. [2]
>
> Further down the rabbit hole....
>
> [1] https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg20197.html
> [2] https://www.mail-archive.com/ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg23982.html
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v0.13.1
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8
> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU
> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2
> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3
> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51
> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO
> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3
> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b
> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13
> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ
> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4
> oSJX
> =k281
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> wrote:
> >>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
> >>>IOPS from 1 VM!
> >
> > Note that these results are not from a VM (fio-rbd on the host), so in a VM
> > we'll have overhead.
> > (I'm planning to send qemu results soon)
> >
> >>>How fast are the SSDs in those 3 OSDs?
> >
> > These results are with the data in the buffer memory of the OSD nodes.
> >
> > When reading fully from SSD (Intel S3500),
> >
> > for 1 client,
> >
> > I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
> > I'm around 55k iops without cache and 38k iops with cache, with 3 osd.
> >
> > With multiple client jobs, I can reach around 70k iops per osd, and 250k
> > iops per osd when the data is in buffer.
> >
> > (server/client CPUs are 2x 10-core 3.1 GHz E5 Xeon)
> >
> >
> >
> > Small tip:
> > I'm using tcmalloc for fio-rbd or rados bench to improve latencies by
> > around 20%
> >
> > LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
> > LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
> >
> > as a lot of time is spent in malloc/free
> >
> >
> > (qemu has also supported tcmalloc for some months; I'll bench it too
> > https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html)
> >
> >
> >
> > I'll try to send full bench results soon, from 1 to 18 ssd osd.
> >
> >
> >
> >
> > ----- Original message -----
> > From: "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>, "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Cc: "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> > Sent: Tuesday, June 9, 2015 13:36:31
> > Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >
> > Hi All,
> >
> > In the past we've hit some performance issues with RBD cache that we've
> > fixed, but we've never really tried pushing a single VM beyond 40+K read
> > IOPS in testing (or at least I never have). I suspect there's a couple
> > of possibilities as to why it might be slower, but perhaps joshd can
> > chime in as he's more familiar with what that code looks like.
> >
> > Frankly, I'm a little impressed that without RBD cache we can hit 80K
> > IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
> >
> > Mark
> >
> > On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
> >> It seems that the limit mainly shows up at high queue depth (roughly > 16)
> >>
> >> Here are the results in iops with 1 client - 4k randread - 3 osd - at
> >> different queue depths.
> >> With rbd_cache, results are almost the same as without cache at queue depth < 16
> >>
> >>
> >> cache
> >> -----
> >> qd1: 1651
> >> qd2: 3482
> >> qd4: 7958
> >> qd8: 17912
> >> qd16: 36020
> >> qd32: 42765
> >> qd64: 46169
> >>
> >> no cache
> >> --------
> >> qd1: 1748
> >> qd2: 3570
> >> qd4: 8356
> >> qd8: 17732
> >> qd16: 41396
> >> qd32: 78633
> >> qd64: 79063
> >> qd128: 79550
> >>
> >>
> >> ----- Original message -----
> >> From: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> >> To: "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> Cc: "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >> Sent: Tuesday, June 9, 2015 09:28:21
> >> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>
> >> Hi,
> >>
> >>>> We tried adding more RBDs to single VM, but no luck.
> >>
> >> If you want to scale with more disks in a single qemu VM, you need to
> >> use the iothread feature of qemu and assign 1 iothread per disk (works with
> >> virtio-blk).
> >> It's working for me; I can scale by adding more disks.
> >>
> >>
> >> My benchmarks here are done with fio-rbd on the host.
> >> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single
> >> host, and around 250k iops with 10 clients and rbd_cache=on.
> >>
> >>
> >> I just wonder why I don't see a performance decrease around 30k iops
> >> with 1 osd.
> >>
> >> I'm going to see if this tracker
> >> http://tracker.ceph.com/issues/11056
> >>
> >> could be the cause.
> >>
> >> (My master build was done some weeks ago)
> >>
> >>
> >>
> >> ----- Original message -----
> >> From: "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> >> Cc: "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >> Sent: Tuesday, June 9, 2015 09:21:04
> >> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>
> >> Hi Alexandre,
> >>
> >> We have also seen something very similar on Hammer (0.94-1). We were
> >> doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM,
> >> openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as
> >> additional storage. For some strange reason we were not able to scale 4K
> >> random-read iops on each VM beyond 35-40k. We tried adding more RBDs to a
> >> single VM, but no luck. However, increasing the number of VMs to 4 on a
> >> single hypervisor did scale to some extent. After this there was not much
> >> benefit from adding more VMs.
> >>
> >> Here is the trend we have seen; the x-axis is the number of hypervisors, each
> >> hypervisor has 4 VMs, each VM has 1 RBD:
> >>
> >>
> >>
> >>
> >> VDbench was used as the benchmarking tool. We were not saturating the network
> >> or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the
> >> hypervisors either, and that is where we suspected some throttling effect.
> >> However, we have not set any such limits from the nova or kvm end. We tried
> >> some CPU pinning and other KVM-related tuning as well, but no luck.
> >>
> >> We tried the same experiment on bare metal. There, 4K random-read IOPS
> >> scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers
> >> actually degraded rather than scaling further. (Single pipe, more
> >> congestion effect)
> >>
> >> We never suspected that enabling the rbd cache could be detrimental to
> >> performance. It would be nice to root-cause the problem if that is the case.
> >>
> >> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER <
> aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>
> >>
> >> Hi,
> >>
> >> I'm running a benchmark (ceph master branch) with randread 4k, qdepth=32,
> >> and rbd_cache=true seems to limit the iops to around 40k
> >>
> >>
> >> no cache
> >> --------
> >> 1 client - rbd_cache=false - 1osd : 38300 iops
> >> 1 client - rbd_cache=false - 2osd : 69073 iops
> >> 1 client - rbd_cache=false - 3osd : 78292 iops
> >>
> >>
> >> cache
> >> -----
> >> 1 client - rbd_cache=true - 1osd : 38100 iops
> >> 1 client - rbd_cache=true - 2osd : 42457 iops
> >> 1 client - rbd_cache=true - 3osd : 45823 iops
> >>
> >>
> >>
> >> Is it expected ?
> >>
> >>
> >>
> >> fio result rbd_cache=false 3 osd
> >> --------------------------------
> >> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=rbd, iodepth=32
> >> fio-2.1.11
> >> Starting 1 process
> >> rbd engine: RBD version: 0.1.9
> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0
> iops] [eta 00m:00s]
> >> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9
> 07:48:42 2015
> >> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
> >> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
> >> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
> >> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
> >> clat percentiles (usec):
> >> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
> >> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
> >> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
> >> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
> >> | 99.99th=[ 1176]
> >> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34,
> stdev=25196.21
> >> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
> >> lat (msec) : 2=0.03%, 4=0.01%
> >> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%,
> >=64=0.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
> >> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >> latency : target=0, window=0, percentile=100.00%, depth=32
> >>
> >> Run status group 0 (all jobs):
> >> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s,
> mint=32698msec, maxt=32698msec
> >>
> >> Disk stats (read/write):
> >> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
> >>
> >>
> >>
> >>
> >> fio result rbd_cache=true 3osd
> >> ------------------------------
> >>
> >> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=rbd, iodepth=32
> >> fio-2.1.11
> >> Starting 1 process
> >> rbd engine: RBD version: 0.1.9
> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0
> iops] [eta 00m:00s]
> >> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9
> 07:47:30 2015
> >> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
> >> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
> >> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
> >> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
> >> clat percentiles (usec):
> >> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
> >> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
> >> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
> >> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
> >> | 99.99th=[ 2192]
> >> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10,
> stdev=15079.93
> >> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
> >> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
> >> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%,
> >=64=0.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
> >> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >> latency : target=0, window=0, percentile=100.00%, depth=32
> >>
> >> Run status group 0 (all jobs):
> >> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s,
> mint=55866msec, maxt=55866msec
> >>
> >> Disk stats (read/write):
> >> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%,
> aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
> >> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best regards, Fasikhov Irek Nurgayazovich
Mobile: +79229045757



[parent not found: <CAF-rypxjbsH3GdUG474OgSZVjdzKyf_0n8-zAkAuGhk83TXQhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                               ` <CAF-rypxjbsH3GdUG474OgSZVjdzKyf_0n8-zAkAuGhk83TXQhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-10  5:41                                 ` Alexandre DERUMIER
       [not found]                                   ` <2010200873.2102614.1433914918985.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-10  5:41 UTC (permalink / raw)
  To: Irek Fasikhov; +Cc: ceph-devel, pushpesh sharma, ceph-users

>>Very good work! 
>>Do you have a rpm-file? 
>>Thanks. 
No, sorry, I compiled it manually (and I'm using Debian jessie as the client).



----- Original message -----
From: "Irek Fasikhov" <malmyzh@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Wednesday, June 10, 2015 07:21:42
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi, Alexandre. 

Very good work! 
Do you have a rpm-file? 
Thanks. 

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 


Hi, 

I have tested qemu with the latest tcmalloc (2.4), and the improvement is huge with iothread: 50k iops (+45%)!



qemu : no-iothread : glibc : iops=33395 
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%) 
qemu : no-iothread : jemalloc : iops=42226 (+26%) 
qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 


qemu : iothread : glibc : iops=34516 
qemu : iothread : tcmalloc (2.2.1) : iops=38676 (+12%) 
qemu : iothread : jemalloc : iops=28023 (-19%) 
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
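
The percentages in the table above are each allocator's iops relative to the glibc run in the same mode. A minimal sketch of that arithmetic follows, using only the figures from the table; the labels and function names are illustrative, and the "tcmalloc 2.2.1" label for the iothread row is taken from the version list in the earlier message. Depending on rounding, the results can differ by about one point from the figures quoted in the table.

```python
# IOPS figures copied from the summary table above (4k randread, iodepth=32).
# Dictionary keys are illustrative labels, not names used by fio or qemu.
RESULTS = {
    "no-iothread": {
        "glibc": 33395,
        "tcmalloc 2.2.1": 34516,
        "jemalloc": 42226,
        "tcmalloc 2.4": 35974,
    },
    "iothread": {
        "glibc": 34516,
        "tcmalloc 2.2.1": 38676,
        "jemalloc": 28023,
        "tcmalloc 2.4": 50276,
    },
}


def gain_percent(iops, baseline):
    """Relative change of iops against baseline, in percent."""
    return (iops / baseline - 1.0) * 100.0


def gains_vs_glibc(results):
    """Map each (mode, allocator) pair to its gain over glibc in the same mode."""
    gains = {}
    for mode, by_allocator in results.items():
        baseline = by_allocator["glibc"]
        for allocator, iops in by_allocator.items():
            gains[(mode, allocator)] = gain_percent(iops, baseline)
    return gains


if __name__ == "__main__":
    for (mode, allocator), gain in gains_vs_glibc(RESULTS).items():
        print(f"{mode:12s} {allocator:15s} {gain:+6.1f}%")
```

Keeping the baseline per mode matters here: the iothread and no-iothread glibc runs differ, so a single global baseline would shift every iothread percentage.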





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
------------------------------------------------------ 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015 
read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
clat percentiles (usec): 
| 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
| 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
| 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
| 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
| 99.99th=[ 3760] 
bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 

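Disk stats (read/write): 
vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 



qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 
------------------------------------------------------- 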
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015 
read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
clat percentiles (usec): 
| 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
| 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
| 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
| 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
| 99.99th=[ 3632] 
bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 

Disk stats (read/write): 
vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
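
As a sanity check on the two tcmalloc 2.4 runs above, the reported iops can be compared with what Little's law predicts from the average total latency, given that fio reports the queue as full (IO depths 32=100.0%). This is only a sketch: the labels are illustrative, and the second run is assumed to be the no-iothread case because its iops=35974 matches that row of the summary table.

```python
def expected_iops(iodepth, avg_latency_usec):
    """Little's law: throughput = requests in flight / mean time per request."""
    return iodepth / (avg_latency_usec * 1e-6)


# (label, iodepth, avg "lat" in usec, reported iops), copied from the fio output above.
RUNS = [
    ("iothread, tcmalloc 2.4", 32, 635.27, 50276),
    ("no-iothread, tcmalloc 2.4 (assumed label)", 32, 888.31, 35974),
]

for label, iodepth, lat_usec, reported in RUNS:
    estimate = expected_iops(iodepth, lat_usec)
    deviation = (estimate - reported) / reported * 100.0
    print(f"{label}: estimated {estimate:.0f} iops, reported {reported} ({deviation:+.2f}%)")
```

Both estimates land within one percent of the reported values, which suggests that at a fixed queue depth of 32 the allocator differences show up entirely as per-request latency.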


----- Original message ----- 
From: "aderumier" < aderumier@odiso.com > 
To: "Robert LeBlanc" < robert@leblancnet.us > 
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
Sent: Tuesday, June 9, 2015 18:47:27 
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

Hi Robert, 

>>What I found was that Ceph OSDs performed well with either 
>>tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>instead of tcmalloc, I'm still working to dig into why that might be 
>>the case). 
Yes, from my tests, for OSDs tcmalloc is a little faster (but only slightly) than jemalloc. 



>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>better for QEMU/KVM in the tests that we ran. [1] 


I have just done a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc. 
With qemu iothread, tcmalloc gives a speed increase over glibc. 
With qemu iothread, jemalloc gives a speed decrease. 

Without iothread, jemalloc gives a big speed increase. 

This is with: 
- qemu 2.3 
- tcmalloc 2.2.1 
- jemalloc 3.6 
- libc6 2.19 


qemu : no-iothread : glibc : iops=33395 
qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
qemu : no-iothread : jemalloc : iops=42226 (+26%) 

qemu : iothread : glibc : iops=34516 
qemu : iothread : tcmalloc : iops=38676 (+12%) 
qemu : iothread : jemalloc : iops=28023 (-19%) 


(The benefit of iothreads is that we can scale with more disks in 1vm) 
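
For readers who have not used iothreads, the "one iothread per disk" layout can be sketched as a small helper that builds the relevant QEMU arguments. This is an illustration only, not the command line used for these tests: the pool name, image names and ids are hypothetical, and cache=none is assumed here because the runs were done with rbd_cache=off.

```python
def iothread_disk_args(index, pool, image):
    """Build QEMU arguments for one RBD-backed virtio-blk disk with its own iothread."""
    iothread_id = f"iothread{index}"
    drive_id = f"drive{index}"
    return [
        # One dedicated iothread object per disk.
        "-object", f"iothread,id={iothread_id}",
        # RBD image as the backing drive; cache=none is an assumption (see lead-in).
        "-drive", f"file=rbd:{pool}/{image},format=raw,if=none,id={drive_id},cache=none",
        # virtio-blk device bound to that drive and to its own iothread.
        "-device", f"virtio-blk-pci,drive={drive_id},iothread={iothread_id}",
    ]


def build_disk_args(pool, images):
    """Concatenate the per-disk arguments for every image in the list."""
    args = []
    for index, image in enumerate(images):
        args.extend(iothread_disk_args(index, pool, image))
    return args


if __name__ == "__main__":
    # "rbd", "vm-disk-0" and "vm-disk-1" are hypothetical pool and image names.
    print(" ".join(build_disk_args("rbd", ["vm-disk-0", "vm-disk-1"])))
```

Each disk gets its own event loop thread this way, so adding disks adds parallelism instead of queueing all I/O behind the main loop.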


fio results: 
------------ 

qemu : iothread : tcmalloc : iops=38676 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
clat percentiles (usec): 
| 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
| 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
| 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
| 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
| 99.99th=[ 3888] 
bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 

Disk stats (read/write): 
vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
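
When comparing many fio runs like these, it can help to pull the numbers out of the "read :" summary line and cross-check them against each other. The sketch below does that for the run above; the regular expression is written for the fio-2.1.11 output format shown in this thread and is an assumption for any other version, and the function names are illustrative.

```python
import re

# Summary line copied from the iothread / tcmalloc run above.
READ_LINE = "read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec"

PATTERN = re.compile(
    r"io=(?P<io_mb>[\d.]+)MB, bw=(?P<bw_kb>\d+)KB/s, "
    r"iops=(?P<iops>\d+), runt=\s*(?P<runt_ms>\d+)msec"
)


def parse_read_line(line):
    """Extract total I/O, bandwidth, iops and runtime from a fio read summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognised fio read line: {line!r}")
    return {
        "io_mb": float(match["io_mb"]),
        "bw_kb": int(match["bw_kb"]),
        "iops": int(match["iops"]),
        "runt_ms": int(match["runt_ms"]),
    }


def cross_check(stats, block_kb=4):
    """Recompute bandwidth from io/runtime and iops from bandwidth/block size."""
    total_kb = stats["io_mb"] * 1024
    seconds = stats["runt_ms"] / 1000.0
    return {
        "bw_from_runtime_kb": total_kb / seconds,
        "iops_from_bw": stats["bw_kb"] / block_kb,
    }


if __name__ == "__main__":
    stats = parse_read_line(READ_LINE)
    print(stats)
    print(cross_check(stats))
```

For this run the recomputed bandwidth and iops agree with the reported ones to within one unit, so the summary line is internally consistent.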



qemu : no-iothread : tcmalloc : iops=34516 
--------------------------------------------- 
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
clat percentiles (usec): 
| 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
| 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
| 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
| 99.99th=[ 4320] 
bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 

Disk stats (read/write): 
vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 



qemu : iothread : glibc : iops=34516 
------------------------------------- 

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
clat percentiles (usec): 
| 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
| 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
| 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
| 99.99th=[ 3984] 
bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 

Disk stats (read/write): 
vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 



qemu : no iothread : glibc : iops=33395 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
clat percentiles (usec): 
| 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
| 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
| 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
| 99.99th=[ 4832] 
bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 

Disk stats (read/write): 
vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 



qemu : iothread : jemalloc : iops=28023 
---------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
clat percentiles (usec): 
| 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
| 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
| 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
| 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
| 99.99th=[ 3760] 
bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 

Disk stats (read/write): 
vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 



qemu : no-iothread : jemalloc : iops=42226 
-------------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
clat percentiles (usec): 
| 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
| 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
| 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
| 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
| 99.99th=[ 2608] 
bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
lat (msec) : 2=10.30%, 4=0.07% 
cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 

Disk stats (read/write): 
vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 



----- Original message ----- 
From: "Robert LeBlanc" < robert@leblancnet.us > 
To: "aderumier" < aderumier@odiso.com > 
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
Sent: Tuesday, June 9, 2015 18:00:29 
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

-----BEGIN PGP SIGNED MESSAGE----- 
Hash: SHA256 

I also saw a similar performance increase by using alternative memory 
allocators. What I found was that Ceph OSDs performed well with either 
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
instead of tcmalloc, I'm still working to dig into why that might be 
the case). 

However, I found that tcmalloc with QEMU/KVM was very detrimental to 
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
better for QEMU/KVM in the tests that we ran. [1] 

I'm currently looking into I/O bottlenecks around the 16KB range and 
I'm seeing a lot of time in thread creation and destruction, the 
memory allocators are quite a bit down the list (both fio with 
ioengine rbd and on the OSDs). I wonder what the difference can be. 
I've tried using the async messenger but there wasn't a huge 
difference. [2] 

Further down the rabbit hole.... 

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
-----BEGIN PGP SIGNATURE----- 
Version: Mailvelope v0.13.1 
Comment: https://www.mailvelope.com 

wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
oSJX 
=k281 
-----END PGP SIGNATURE----- 
---------------- 
Robert LeBlanc 
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>IOPS from 1 VM! 
> 
> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have additional overhead. 
> (I'm planning to send qemu results soon.) 
> 
>>>How fast are the SSDs in those 3 OSDs? 
> 
> These results are with the data in the buffer memory of the OSD nodes. 
> 
> When reading fully from SSD (Intel S3500), 
> 
> for 1 client, 
> 
> I get around 33k iops without cache and 32k iops with cache, with 1 OSD. 
> I get around 55k iops without cache and 38k iops with cache, with 3 OSDs. 
> 
> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer. 
> 
> (server and client CPUs are 2x 10-core 3.1 GHz Xeon E5) 
> 
> 
> 
> small tip : 
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
> 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
> 
> as a lot of time is spent in malloc/free 
> 
> 
> (qemu has also supported tcmalloc for some months; I'll bench it too: 
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
> 
> 
> 
> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
> 
> 
> 
> 
> ----- Original message ----- 
> From: "Mark Nelson" < mnelson@redhat.com > 
> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 
> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
> Sent: Tuesday, June 9, 2015 13:36:31 
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
> 
> Hi All, 
> 
> In the past we've hit some performance issues with RBD cache that we've 
> fixed, but we've never really tried pushing a single VM beyond 40+K read 
> IOPS in testing (or at least I never have). I suspect there's a couple 
> of possibilities as to why it might be slower, but perhaps joshd can 
> chime in as he's more familiar with what that code looks like. 
> 
> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
> 
> Mark 
> 
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>> It seems that the limit mainly shows up at high queue depth (roughly > 16). 
>> 
>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths. 
>> With rbd_cache, the results are almost the same as without cache at queue depth < 16. 
>> 
>> 
>> cache 
>> ----- 
>> qd1: 1651 
>> qd2: 3482 
>> qd4: 7958 
>> qd8: 17912 
>> qd16: 36020 
>> qd32: 42765 
>> qd64: 46169 
>> 
>> no cache 
>> -------- 
>> qd1: 1748 
>> qd2: 3570 
>> qd4: 8356 
>> qd8: 17732 
>> qd16: 41396 
>> qd32: 78633 
>> qd64: 79063 
>> qd128: 79550 
>> 
>> 
>> ----- Original message ----- 
>> From: "aderumier" < aderumier@odiso.com > 
>> To: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>> Sent: Tuesday, June 9, 2015 09:28:21 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi, 
>> 
>>>> We tried adding more RBDs to single VM, but no luck. 
>> 
>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk). 
>> It's working for me; I can scale by adding more disks. 
>> 
>> 
>> My benchmarks here are done with fio-rbd on the host. 
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on. 
>> 
>> 
>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD. 
>> 
>> I'm going to see if this tracker 
>> http://tracker.ceph.com/issues/11056 
>> 
>> could be the cause. 
>> 
>> (My master build was done some weeks ago) 
>> 
>> 
>> 
>> ----- Original message ----- 
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>> To: "aderumier" < aderumier@odiso.com > 
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>> Sent: Tuesday, June 9, 2015 09:21:04 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi Alexandre, 
>> 
>> We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, we could not scale 4K random-read iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs. 
>> 
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD: 
>> 
>> 
>> 
>> 
>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 
>> 
>> We tried the same experiment on bare metal. There, 4K RR iops scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect.) 
>> 
>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>> 
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>> 
>> 
>> Hi, 
>> 
>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
>> and rbd_cache=true seem to limit the iops around 40k 
>> 
>> 
>> no cache 
>> -------- 
>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>> 
>> 
>> cache 
>> ----- 
>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>> 
>> 
>> 
>> Is it expected ? 
>> 
>> 
>> 
>> fio result rbd_cache=false 3 osd 
>> -------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>> clat percentiles (usec): 
>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>> | 99.99th=[ 1176] 
>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>> lat (msec) : 2=0.03%, 4=0.01% 
>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>> 
>> 
>> 
>> 
>> fio result rbd_cache=true 3osd 
>> ------------------------------ 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>> clat percentiles (usec): 
>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>> | 99.99th=[ 2192] 
>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>> 
> _______________________________________________ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






-- 
Best regards, Irek Nurgayazovich Fasikhov 
Mobile: +79229045757 

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <2010200873.2102614.1433914918985.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                                   ` <2010200873.2102614.1433914918985.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-10  7:06                                     ` Somnath Roy
  2015-06-10  7:29                                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2015-06-10  7:06 UTC (permalink / raw)
  To: Alexandre DERUMIER, Irek Fasikhov; +Cc: ceph-devel, pushpesh sharma, ceph-users

Hi Alexandre,
Thanks for sharing the data.
I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)

Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER
Sent: Tuesday, June 09, 2015 10:42 PM
To: Irek Fasikhov
Cc: ceph-devel; pushpesh sharma; ceph-users
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

>>Very good work!
>>Do you have a rpm-file?
>>Thanks.
No, sorry, I compiled it manually (and I'm using Debian jessie as the client).



----- Original message -----
From: "Irek Fasikhov" <malmyzh@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Wednesday, June 10, 2015 07:21:42
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi, Alexandre.

Very good work!
Do you have a rpm-file?
Thanks.

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > :


Hi,

I have tested qemu with the latest tcmalloc (2.4), and the improvement with iothread is huge: 50k iops (+45%)!



qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)


qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
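The percentages in the summary above can be recomputed from the raw iops figures. The following is a minimal sketch, assuming each gain is measured against the glibc result in the same iothread mode; the dictionary layout and function names are illustrative, and the "tcmalloc 2.2.1" label for the unversioned tcmalloc rows is taken from the version list in the earlier message.

```python
# IOPS figures copied from the summary; keys are (iothread mode, allocator).
RESULTS = {
    ("no-iothread", "glibc"): 33395,
    ("no-iothread", "tcmalloc 2.2.1"): 34516,
    ("no-iothread", "jemalloc"): 42226,
    ("no-iothread", "tcmalloc 2.4"): 35974,
    ("iothread", "glibc"): 34516,
    ("iothread", "tcmalloc 2.2.1"): 38676,
    ("iothread", "jemalloc"): 28023,
    ("iothread", "tcmalloc 2.4"): 50276,
}


def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def gains_vs_glibc(results):
    """Percent change of each non-glibc entry against glibc in the same mode."""
    gains = {}
    for (mode, allocator), iops in results.items():
        if allocator == "glibc":
            continue  # glibc is the baseline, so it has no gain of its own
        baseline = results[(mode, "glibc")]
        gains[(mode, allocator)] = pct_change(baseline, iops)
    return gains


if __name__ == "__main__":
    for (mode, allocator), gain in gains_vs_glibc(RESULTS).items():
        print(f"qemu : {mode} : {allocator} : {gain:+.1f}%")
```

Computed this way, the figures land within about one percentage point of the ones quoted in the summary; the small differences look like rounding.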





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
------------------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
clat percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
| 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
| 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
| 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
| 99.99th=[ 3760]
bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec

Disk stats (read/write):
vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
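As a sanity check on fio output like this, Little's law relates the three headline numbers: with a constant queue depth, iops is roughly the queue depth divided by the mean total latency. The sketch below applies it to the two tcmalloc 2.4 runs in this message (iodepth=32, mean "lat" of 635.27 and 888.31 usec). The function names are illustrative, and the estimate assumes the queue stays full, which the "IO depths : 32=100.0%" lines suggest.

```python
def iops_from_latency(iodepth, mean_lat_usec):
    """Little's law: throughput = requests in flight / mean time per request."""
    return iodepth / (mean_lat_usec * 1e-6)


def relative_error(estimate, reported):
    """Absolute difference between estimate and reported, as a fraction of reported."""
    return abs(estimate - reported) / reported


if __name__ == "__main__":
    # (reported iops, iodepth, mean "lat" in usec), taken from the fio output
    runs = [
        (50276, 32, 635.27),
        (35974, 32, 888.31),
    ]
    for reported, iodepth, lat in runs:
        estimate = iops_from_latency(iodepth, lat)
        err = relative_error(estimate, reported)
        print(f"reported={reported} estimated={estimate:.0f} error={err:.2%}")
```

Both estimates agree with the reported iops to well under 1%, which indicates that at this queue depth the throughput is bounded by per-request latency rather than by the submission path in the guest.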






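qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
------------------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32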
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
clat percentiles (usec):
| 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
| 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
| 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
| 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
| 99.99th=[ 3632]
bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec

Disk stats (read/write):
vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%


----- Original message -----
From: "aderumier" < aderumier@odiso.com >
To: "Robert LeBlanc" < robert@leblancnet.us >
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Tuesday, June 9, 2015 18:47:27
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi Robert,

>>What I found was that Ceph OSDs performed well with either tcmalloc or
>>jemalloc (except when RocksDB was built with jemalloc instead of
>>tcmalloc, I'm still working to dig into why that might be the case).
Yes, from my tests, for the OSD tcmalloc is slightly faster (but only very slightly) than jemalloc.



>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>better for QEMU/KVM in the tests that we ran. [1]


I have just done a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
With qemu iothread, tcmalloc gives a speed increase over glibc.
With qemu iothread, jemalloc gives a speed decrease.

Without iothread, jemalloc gives a big speed increase.

this is with
-qemu 2.3
-tcmalloc 2.2.1
-jemalloc 3.6
-libc6 2.19


qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)

qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)


(The benefit of iothreads is that we can scale with more disks in 1vm)
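To illustrate the "1 iothread per disk" setup mentioned in this thread, here is a minimal sketch that builds the corresponding qemu command-line arguments. The `-object iothread`, `-drive` and `-device virtio-blk-pci,...,iothread=` options are standard qemu syntax; the function name, the `rbd/vm-disk-N` image names and the choice of `cache=none` are assumptions for illustration, not the exact configuration used for the benchmarks above.

```python
def qemu_iothread_args(rbd_images):
    """Build qemu arguments that give each RBD image its own iothread.

    rbd_images: list of "pool/image" strings (hypothetical names here).
    """
    args = []
    for i, image in enumerate(rbd_images):
        args += [
            # one dedicated I/O thread per disk
            "-object", f"iothread,id=iothread{i}",
            # the RBD-backed drive, not attached to any device yet
            "-drive", f"file=rbd:{image},format=raw,if=none,id=drive{i},cache=none",
            # virtio-blk device bound to both the drive and its iothread
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


if __name__ == "__main__":
    # Hypothetical image names; replace with real pool/image pairs.
    print(" ".join(qemu_iothread_args(["rbd/vm-disk-0", "rbd/vm-disk-1"])))
```

Each disk gets its own id triple (iothreadN, driveN, and the device that references both), so adding a disk only means appending one more image to the list.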


fio results:
------------

qemu : iothread : tcmalloc : iops=38676
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
clat percentiles (usec):
| 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
| 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
| 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
| 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
| 99.99th=[ 3888]
bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec

Disk stats (read/write):
vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%



qemu : no-iothread : tcmalloc : iops=34516
---------------------------------------------
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
clat percentiles (usec):
| 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
| 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
| 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
| 99.99th=[ 4320]
bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec

Disk stats (read/write):
vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%



qemu : iothread : glibc : iops=34516
-------------------------------------

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
clat percentiles (usec):
| 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
| 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
| 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
| 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
| 99.99th=[ 3984]
bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec

Disk stats (read/write):
vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%



qemu : no iothread : glibc : iops=33395
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015
read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
clat percentiles (usec):
| 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
| 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
| 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
| 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
| 99.99th=[ 4832]
bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec

Disk stats (read/write):
vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%



qemu : iothread : jemalloc : iops=28023
----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015
read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
clat percentiles (usec):
| 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
| 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
| 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
| 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
| 99.99th=[ 3760]
bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec

Disk stats (read/write):
vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%



qemu : no-iothread : jemalloc : iops=42226
--------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015
read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
clat percentiles (usec):
| 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
| 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
| 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
| 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
| 99.99th=[ 2608]
bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
lat (msec) : 2=10.30%, 4=0.07%
cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec

Disk stats (read/write):
vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%



----- Original message -----
From: "Robert LeBlanc" < robert@leblancnet.us >
To: "aderumier" < aderumier@odiso.com >
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Tuesday, 9 June 2015 18:00:29
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

I also saw a similar performance increase by using alternative memory
allocators. What I found was that Ceph OSDs performed well with either
tcmalloc or jemalloc (except when RocksDB was built with jemalloc
instead of tcmalloc, I'm still working to dig into why that might be
the case).

However, I found that tcmalloc with QEMU/KVM was very detrimental to
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
better for QEMU/KVM in the tests that we ran. [1]

I'm currently looking into I/O bottlenecks around the 16KB range and
I'm seeing a lot of time in thread creation and destruction, the
memory allocators are quite a bit down the list (both fio with
ioengine rbd and on the OSDs). I wonder what the difference can be.
I've tried using the async messenger but there wasn't a huge
difference. [2]

Further down the rabbit hole....

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>IOPS from 1 VM!
>
> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
> (I'm planning to send qemu results soon.)
>
>>>How fast are the SSDs in those 3 OSDs?
>
> These results are with the data in the buffer memory of the OSD nodes.
>
> When reading entirely from SSD (Intel S3500),
>
> for 1 client,
>
> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>
> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer memory.
>
> (Server and client CPUs are 2x 10-core 3.1 GHz E5 Xeons.)
>
>
>
> small tip :
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%
>
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>
> as a lot of time is spent in malloc/free
>
>
> (qemu has also supported tcmalloc for some months; I'll bench it too:
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>
>
>
> I'll try to send full bench results soon, from 1 to 18 SSD OSDs.
>
>
>
>
> ----- Original message -----
> From: "Mark Nelson" < mnelson@redhat.com >
> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 13:36:31
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi All,
>
> In the past we've hit some performance issues with RBD cache that we've
> fixed, but we've never really tried pushing a single VM beyond 40+K read
> IOPS in testing (or at least I never have). I suspect there's a couple
> of possibilities as to why it might be slower, but perhaps joshd can
> chime in as he's more familiar with what that code looks like.
>
> Frankly, I'm a little impressed that without RBD cache we can hit 80K
> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>
> Mark
>
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>> It seems that the limit mainly shows up at high queue depths (roughly > 16).
>>
>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>> With rbd_cache the results are almost the same as without cache at queue depths below 16.
>>
>>
>> cache
>> -----
>> qd1: 1651
>> qd2: 3482
>> qd4: 7958
>> qd8: 17912
>> qd16: 36020
>> qd32: 42765
>> qd64: 46169
>>
>> no cache
>> --------
>> qd1: 1748
>> qd2: 3570
>> qd4: 8356
>> qd8: 17732
>> qd16: 41396
>> qd32: 78633
>> qd64: 79063
>> qd128: 79550
>>
>>
>> ----- Original message -----
>> From: "aderumier" < aderumier@odiso.com >
>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 09:28:21
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi,
>>
>>>> We tried adding more RBDs to single VM, but no luck.
>>
>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk).
>> It works for me; I can scale by adding more disks.
>>
>>
>> My benchmarks here are done with fio-rbd on the host.
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>
>>
>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD.
>>
>> I'm going to see if this tracker
>> http://tracker.ceph.com/issues/11056
>>
>> could be the cause.
>>
>> (My master build was done some weeks ago)
>>
>>
>>
>> ----- Original message -----
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> To: "aderumier" < aderumier@odiso.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 09:21:04
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Hi Alexandre,
>>
>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as its root disk and one RBD as additional storage. For some strange reason we could not scale 4K random-read (RR) iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
>>
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>
>>
>>
>>
>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>
>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect.)
>>
>> We never suspected that enabling the rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>
>>
>> Hi,
>>
>> I'm benchmarking (ceph master branch) with 4k randread, qdepth=32,
>> and rbd_cache=true seems to limit the iops to around 40k.
>>
>>
>> no cache
>> --------
>> 1 client - rbd_cache=false - 1osd : 38300 iops
>> 1 client - rbd_cache=false - 2osd : 69073 iops
>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>
>>
>> cache
>> -----
>> 1 client - rbd_cache=true - 1osd : 38100 iops
>> 1 client - rbd_cache=true - 2osd : 42457 iops
>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>
>>
>>
>> Is this expected?
>>
>>
>>
>> fio result rbd_cache=false 3 osd
>> --------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>> clat percentiles (usec):
>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>> | 99.99th=[ 1176]
>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>> lat (msec) : 2=0.03%, 4=0.01%
>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>
>> Disk stats (read/write):
>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>
>>
>>
>>
>> fio result rbd_cache=true 3osd
>> ------------------------------
>>
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>> clat percentiles (usec):
>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>> | 99.99th=[ 2192]
>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
>>
>> Disk stats (read/write):
>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






--
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-10  7:06                                     ` Somnath Roy
@ 2015-06-10  7:29                                       ` Alexandre DERUMIER
  2015-06-12  5:52                                         ` pushpesh sharma
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-10  7:29 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Irek Fasikhov, ceph-devel, pushpesh sharma, ceph-users

>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)

Sure, no problem.

(BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks, using 1 iothread per disk.)
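
For readers who want to try the one-iothread-per-disk layout mentioned above, here is a minimal sketch that builds the relevant QEMU command-line arguments. The pool/image names and the `cache=none` drive setting are assumptions for illustration, not taken from the thread; the helper name `iothread_args` is hypothetical.

```python
def iothread_args(images):
    """Build QEMU arguments giving each RBD-backed virtio-blk disk its own iothread.

    `images` is a list of "pool/image" strings (hypothetical names).
    """
    args = []
    for i, image in enumerate(images):
        args += [
            # One iothread object per disk.
            "-object", f"iothread,id=iothread{i}",
            # RBD-backed drive; cache=none is an assumption, not from the thread.
            "-drive", f"file=rbd:{image},if=none,id=drive{i},format=raw,cache=none",
            # Attach the drive as virtio-blk and bind it to its iothread.
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


# Five disks, as in the message above; image names are made up.
images = [f"rbd/vm-disk-{i}" for i in range(5)]
print(" ".join(iothread_args(images)))
```

The printed arguments would be appended to an existing qemu invocation; the rest of the VM definition is left out on purpose.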


----- Original message -----
From: "Somnath Roy" <Somnath.Roy@sandisk.com>
To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Wednesday, 10 June 2015 09:06:32
Subject: RE: rbd_cache, limiting read on high iops around 40k

Hi Alexandre, 
Thanks for sharing the data. 
I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 

Regards 
Somnath 

-----Original Message----- 
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER 
Sent: Tuesday, June 09, 2015 10:42 PM 
To: Irek Fasikhov 
Cc: ceph-devel; pushpesh sharma; ceph-users 
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

>>Very good work! 
>>Do you have a rpm-file? 
>>Thanks. 
No, sorry, I compiled it manually (and I'm using Debian Jessie as the client).



----- Original message -----
From: "Irek Fasikhov" <malmyzh@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Wednesday, 10 June 2015 07:21:42
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi, Alexandre. 

Very good work! 
Do you have a rpm-file? 
Thanks. 

2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 


Hi, 

I have tested qemu with the latest tcmalloc (2.4), and the improvement is huge with iothread: 50k iops (+45%)!



qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)


qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc (2.2.1) : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
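
The percentages in the two summaries above are relative to the glibc run in the same iothread mode. A small sketch to recompute them from the quoted iops figures; the function and dictionary names are illustrative, and the recomputed values can differ from the quoted ones by up to about a point because of rounding.

```python
def pct_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# iops figures copied from the summary above.
no_iothread = {
    "glibc": 33395,
    "tcmalloc 2.2.1": 34516,
    "jemalloc": 42226,
    "tcmalloc 2.4": 35974,
}
iothread = {
    "glibc": 34516,
    "tcmalloc 2.2.1": 38676,
    "jemalloc": 28023,
    "tcmalloc 2.4": 50276,
}

for label, results in (("no-iothread", no_iothread), ("iothread", iothread)):
    base = results["glibc"]  # glibc is the baseline within each mode
    for allocator, iops in results.items():
        change = pct_change(base, iops)
        print(f"{label:12s} {allocator:15s} {iops:6d} iops  {change:+6.1f}%")
```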





qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
------------------------------------------------------ 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
clat percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
| 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
| 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
| 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
| 99.99th=[ 3760] 
bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 

Disk stats (read/write): 
vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
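
As a sanity check on the fio output above, the reported iops can be cross-checked in two ways: bandwidth divided by block size, and Little's law (outstanding requests divided by mean total latency). This is only a rough sketch with numbers copied from the run above; the function names are illustrative.

```python
def iops_from_bandwidth(bw_kb_per_s, block_kb=4):
    """iops implied by the reported bandwidth at a fixed block size."""
    return bw_kb_per_s / block_kb


def iops_from_latency(iodepth, avg_lat_usec):
    """Little's law: throughput = requests in flight / mean latency."""
    return iodepth / (avg_lat_usec * 1e-6)


# Values from the tcmalloc 2.4 iothread run: bw=201108KB/s, iodepth=32,
# lat avg=635.27 usec, reported iops=50276.
reported = 50276
by_bw = iops_from_bandwidth(201108)
by_lat = iops_from_latency(32, 635.27)

print(f"reported:       {reported}")
print(f"from bandwidth: {by_bw:.0f}")
print(f"from latency:   {by_lat:.0f}")
```

Both estimates land within a fraction of a percent of the reported figure, which suggests the queue stayed full at depth 32 for the whole run and that per-request latency, not submission, bounds the throughput.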






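qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
---------------------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32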
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
clat percentiles (usec):
| 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
| 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
| 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
| 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
| 99.99th=[ 3632] 
bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 

Disk stats (read/write): 
vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 


----- Original message -----
From: "aderumier" < aderumier@odiso.com >
To: "Robert LeBlanc" < robert@leblancnet.us >
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Tuesday, 9 June 2015 18:47:27
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

Hi Robert, 

>>What I found was that Ceph OSDs performed well with either tcmalloc or 
>>jemalloc (except when RocksDB was built with jemalloc instead of 
>>tcmalloc, I'm still working to dig into why that might be the case). 
Yes, from my tests, tcmalloc is a little faster than jemalloc for the OSD (but only slightly).



>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>better for QEMU/KVM in the tests that we ran. [1] 


I have just done a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
With qemu iothread, tcmalloc gives a speed increase over glibc.
With qemu iothread, jemalloc gives a speed decrease.

Without iothread, jemalloc gives a big speed increase.

This is with:
-qemu 2.3
-tcmalloc 2.2.1
-jemalloc 3.6
-libc6 2.19


qemu : no-iothread : glibc : iops=33395
qemu : no-iothread : tcmalloc : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)

qemu : iothread : glibc : iops=34516
qemu : iothread : tcmalloc : iops=38676 (+12%)
qemu : iothread : jemalloc : iops=28023 (-19%)


(The benefit of iothreads is that we can scale with more disks in 1 VM.)


fio results: 
------------ 

qemu : iothread : tcmalloc : iops=38676 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
clat percentiles (usec): 
| 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
| 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
| 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
| 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
| 99.99th=[ 3888] 
bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 

Disk stats (read/write): 
vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 



qemu : no-iothread : tcmalloc : iops=34516 
--------------------------------------------- 
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
clat percentiles (usec): 
| 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
| 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
| 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
| 99.99th=[ 4320] 
bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 

Disk stats (read/write): 
vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 



qemu : iothread : glibc : iops=34516 
------------------------------------- 

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
clat percentiles (usec): 
| 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
| 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
| 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
| 99.99th=[ 3984] 
bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 

Disk stats (read/write): 
vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 



qemu : no iothread : glibc : iops=33395 
----------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
clat percentiles (usec): 
| 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
| 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
| 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
| 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
| 99.99th=[ 4832] 
bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 

Disk stats (read/write): 
vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 



qemu : iothread : jemalloc : iops=28023
---------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
clat percentiles (usec): 
| 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
| 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
| 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
| 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
| 99.99th=[ 3760] 
bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 

Disk stats (read/write): 
vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 



qemu : non-iothread : jemalloc : iops=42226
-------------------------------------------- 
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
fio-2.1.11 
Starting 1 process 
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
clat percentiles (usec): 
| 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
| 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
| 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
| 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
| 99.99th=[ 2608] 
bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
lat (msec) : 2=10.30%, 4=0.07% 
cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=32 

Run status group 0 (all jobs): 
READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 

Disk stats (read/write): 
vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
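
The fio summaries above can be cross-checked with a little arithmetic. With a fixed queue depth, Little's law says that throughput is roughly the number of in-flight requests divided by the mean latency, and bandwidth is IOPS times the block size. The snippet below is only a sketch under those assumptions; the function names are made up for illustration, and the numbers are copied from the "read :" and "lat (usec)" lines of the four complete fio runs above.

```python
# Cross-check of the fio summaries: IOPS ~= iodepth / mean latency (Little's law).
IODEPTH = 32    # iodepth=32 in every run
BLOCK_KB = 4    # bs=4K

# (label, reported iops, reported mean "lat" in usec), copied from the fio output
runs = [
    ("iothread / glibc", 34446, 927.58),
    ("no iothread / glibc", 33395, 957.01),
    ("iothread / jemalloc", 28023, 1140.39),
    ("no iothread / jemalloc", 44282, 721.23),
]


def estimated_iops(iodepth, mean_lat_usec):
    """Estimate IOPS from queue depth and mean latency in microseconds."""
    return iodepth / (mean_lat_usec * 1e-6)


def bandwidth_kb_s(iops, block_kb=BLOCK_KB):
    """Bandwidth in KB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_kb


for label, iops, lat in runs:
    est = estimated_iops(IODEPTH, lat)
    print(f"{label:24s} reported={iops} estimated={est:.0f} "
          f"bw={bandwidth_kb_s(iops)}KB/s")
```

If the estimate and the reported value agree to within about one percent, the run kept the queue full for its whole duration, so the latency figure is what limits throughput in that configuration.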



----- Original Message -----
From: "Robert LeBlanc" < robert@leblancnet.us >
To: "aderumier" < aderumier@odiso.com >
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Tuesday, 9 June 2015 18:00:29
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

I also saw a similar performance increase by using alternative memory
allocators. What I found was that Ceph OSDs performed well with either
tcmalloc or jemalloc (except when RocksDB was built with jemalloc
instead of tcmalloc, I'm still working to dig into why that might be
the case).

However, I found that tcmalloc with QEMU/KVM was very detrimental to
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
better for QEMU/KVM in the tests that we ran. [1]

I'm currently looking into I/O bottlenecks around the 16KB range and
I'm seeing a lot of time in thread creation and destruction, the
memory allocators are quite a bit down the list (both fio with
ioengine rbd and on the OSDs). I wonder what the difference can be.
I've tried using the async messenger but there wasn't a huge
difference. [2]

Further down the rabbit hole....

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
---------------- 
Robert LeBlanc 
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 


On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>IOPS from 1 VM! 
> 
> Note that these results are not from a VM (fio-rbd on the host), so in a VM there will be additional overhead.
> (I'm planning to send qemu results soon.)
> 
>>>How fast are the SSDs in those 3 OSDs? 
> 
> These results are with the data in the buffer memory of the OSD nodes.
>
> When reading entirely from SSD (Intel S3500),
>
> For 1 client,
>
> I get around 33k iops without cache and 32k iops with cache, with 1 osd.
> I get around 55k iops without cache and 38k iops with cache, with 3 osd.
>
> With multiple client jobs, I can reach around 70k iops per osd, and 250k iops per osd when the data is in buffer.
>
> (server and client CPUs are 2x 10-core 3.1 GHz Xeon E5)
> 
> 
> 
> small tip : 
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
> 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
> 
> as a lot of time is spent in malloc/free 
> 
> 
> (qemu has also supported tcmalloc for a few months, I'll bench it too
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
> 
> 
> 
> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
> 
> 
> 
> 
> ----- Original Message -----
> From: "Mark Nelson" < mnelson@redhat.com >
> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 13:36:31
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> 
> Hi All, 
> 
> In the past we've hit some performance issues with RBD cache that we've 
> fixed, but we've never really tried pushing a single VM beyond 40+K read 
> IOPS in testing (or at least I never have). I suspect there's a couple 
> of possibilities as to why it might be slower, but perhaps joshd can 
> chime in as he's more familiar with what that code looks like. 
> 
> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
> 
> Mark 
> 
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>> It seems that the limit mainly shows up at high queue depths (roughly > 16).
>>
>> Here are the results in iops with 1 client, 4k randread, 3 osd, at different queue depths.
>> With rbd_cache, results are almost the same as without cache at queue depths < 16.
>> 
>> 
>> cache 
>> ----- 
>> qd1: 1651 
>> qd2: 3482 
>> qd4: 7958 
>> qd8: 17912 
>> qd16: 36020 
>> qd32: 42765 
>> qd64: 46169 
>> 
>> no cache 
>> -------- 
>> qd1: 1748 
>> qd2: 3570 
>> qd4: 8356 
>> qd8: 17732 
>> qd16: 41396 
>> qd32: 78633 
>> qd64: 79063 
>> qd128: 79550 
>> 
>> 
>> ----- Original Message -----
>> From: "aderumier" < aderumier@odiso.com >
>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 09:28:21
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>> 
>> Hi, 
>> 
>>>> We tried adding more RBDs to single VM, but no luck. 
>> 
>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk).
>> It's working for me; I can scale by adding more disks.
>>
>>
>> My benchmarks here are done with fio-rbd on the host.
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>
>>
>> I just wonder why I don't see a performance decrease around 30k iops with 1 osd.
>>
>> I'm going to see if this tracker issue
>> http://tracker.ceph.com/issues/11056
>>
>> could be the cause.
>>
>> (My master build was done some weeks ago)
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> To: "aderumier" < aderumier@odiso.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 09:21:04
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>> 
>> Hi Alexandre, 
>> 
>> We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, 4K random-read (RR) IOPS on each VM did not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, adding more VMs gave little benefit.
>>
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>
>>
>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>
>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect.)
>>
>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>> 
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>> 
>> 
>> Hi, 
>> 
>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
>> and rbd_cache=true seem to limit the iops around 40k 
>> 
>> 
>> no cache 
>> -------- 
>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>> 
>> 
>> cache 
>> ----- 
>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>> 
>> 
>> 
>> Is it expected ? 
>> 
>> 
>> 
>> fio result rbd_cache=false 3 osd 
>> -------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>> clat percentiles (usec): 
>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>> | 99.99th=[ 1176] 
>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>> lat (msec) : 2=0.03%, 4=0.01% 
>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>> 
>> 
>> 
>> 
>> fio result rbd_cache=true 3osd 
>> ------------------------------ 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> rbd engine: RBD version: 0.1.9 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>> clat percentiles (usec): 
>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>> | 99.99th=[ 2192] 
>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>> 
>> Disk stats (read/write): 
>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>> 
> _______________________________________________ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






-- 
Best regards, Fasikhov Irek Nurgayazovich
Mobile: +79229045757


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-10  7:29                                       ` Alexandre DERUMIER
@ 2015-06-12  5:52                                         ` pushpesh sharma
  2015-06-12  6:03                                           ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: pushpesh sharma @ 2015-06-12  5:52 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

Hi Alexandre,

I agree with your rationale of one iothread per disk. CPU consumed in
IOwait is pretty high in each VM. But I am not finding a way to set
this on a nova instance. I am using OpenStack Juno with QEMU+KVM.
As per the libvirt documentation for setting iothreads, I can edit
domain.xml directly and achieve the same effect. However, in an
OpenStack environment the domain XML is created by nova with some
additional metadata, so editing the domain XML using 'virsh edit' does
not seem to work (I agree, it is not a very cloud way of doing things,
but a hack). Changes made there vanish after saving them, because
libvirt validation fails on the resulting XML.

#virsh dumpxml instance-000000c5 > vm.xml
#virt-xml-validate vm.xml
Relax-NG validity error : Extra element cpu in interleave
vm.xml:1: element domain: Relax-NG validity error : Element domain
failed to validate content
vm.xml fails to validate
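
For reference, the kind of domain XML change being discussed can be sketched as follows. This is only an illustration, assuming the libvirt domain format in which an <iothreads> element under <domain> declares the number of iothreads and an iothread attribute on a disk's <driver> element assigns that disk to one of them. The sample XML, the device names and the function name add_iothreads are hypothetical, and the sketch does not address nova regenerating or rejecting the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed domain XML with two virtio disks.
SAMPLE_DOMAIN = """
<domain type='kvm'>
  <name>example-instance</name>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def add_iothreads(domain_xml: str) -> str:
    """Return domain XML with one iothread declared and assigned per virtio disk."""
    root = ET.fromstring(domain_xml)

    # Collect the disks attached on the virtio bus.
    virtio_disks = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None and target.get("bus") == "virtio":
            virtio_disks.append(disk)

    # Declare as many iothreads as there are virtio disks.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.SubElement(root, "iothreads")
    iothreads.text = str(len(virtio_disks))

    # Assign iothread 1..N to the disks, in document order.
    for index, disk in enumerate(virtio_disks, start=1):
        driver = disk.find("driver")
        if driver is None:
            driver = ET.SubElement(disk, "driver", {"name": "qemu"})
        driver.set("iothread", str(index))

    return ET.tostring(root, encoding="unicode")


print(add_iothreads(SAMPLE_DOMAIN))
```

The output would still have to be accepted by libvirt and survive nova rewriting the domain definition, which is the open problem described in this message.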

The second approach I took was setting QoS in volume types. But there
is no option to set iothreads per volume; there are only parameters
related to max_read/write ops/bytes.

Thirdly, editing the Nova flavor and providing extra specs like
hw:cpu_socket/thread/core can change the guest CPU topology, but again
there is no way to set iothreads. It does accept hw_disk_iothreads (no
type check in place, I believe), but does not pass it through to
domain.xml.

Could you suggest a way to set this?

-Pushpesh

On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>
> Sure no problem.
>
> (BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks and 1 iothread per disk)
>
>
> ----- Original Message -----
> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 09:06:32
> Subject: RE: rbd_cache, limiting read on high iops around 40k
>
> Hi Alexandre,
> Thanks for sharing the data.
> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>
> Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER
> Sent: Tuesday, June 09, 2015 10:42 PM
> To: Irek Fasikhov
> Cc: ceph-devel; pushpesh sharma; ceph-users
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
>>>Very good work!
>>>Do you have a rpm-file?
>>>Thanks.
> No, sorry, I compiled it manually (and I'm using Debian jessie as the client).
>
>
>
> ----- Original Message -----
> From: "Irek Fasikhov" <malmyzh@gmail.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 07:21:42
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi, Alexandre.
>
> Very good work!
> Do you have a rpm-file?
> Thanks.
>
> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > :
>
>
> Hi,
>
> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%)!
>
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>
>
>
>
>
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> ------------------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
> clat percentiles (usec):
> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
> | 99.99th=[ 3760]
> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec
>
> Disk stats (read/write):
> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
>
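>
>
> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
> ------------------------------------------------------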
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
> clat percentiles (usec):
> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
> | 99.99th=[ 3632]
> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec
>
> Disk stats (read/write):
> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
>
>
> ----- Original Message -----
> From: "aderumier" < aderumier@odiso.com >
> To: "Robert LeBlanc" < robert@leblancnet.us >
> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 18:47:27
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi Robert,
>
>>>What I found was that Ceph OSDs performed well with either tcmalloc or
>>>jemalloc (except when RocksDB was built with jemalloc instead of
>>>tcmalloc, I'm still working to dig into why that might be the case).
> Yes, from my tests, tcmalloc is a little faster (but only slightly) than jemalloc for the OSD.
>
>
>
>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>better for QEMU/KVM in the tests that we ran. [1]
>
>
> I have just done a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
> With qemu iothread, tcmalloc gives a speed increase over glibc.
> With qemu iothread, jemalloc gives a speed decrease.
>
> Without iothread, jemalloc gives a big speed increase.
>
> this is with
> -qemu 2.3
> -tcmalloc 2.2.1
> -jemalloc 3.6
> -libc6 2.19
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
>
>
> (The benefit of iothreads is that we can scale with more disks in 1vm)
>
>
> fio results:
> ------------
>
> qemu : iothread : tcmalloc : iops=38676
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
> clat percentiles (usec):
> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
> | 99.99th=[ 3888]
> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec
>
> Disk stats (read/write):
> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
>
>
>
> qemu : no-iothread : tcmalloc : iops=34516
> ---------------------------------------------
> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
> clat percentiles (usec):
> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
> | 99.99th=[ 4320]
> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec
>
> Disk stats (read/write):
> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%
>
>
>
> qemu : iothread : glibc : iops=34516
> -------------------------------------
>
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
> clat percentiles (usec):
> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
> | 99.99th=[ 3984]
> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec
>
> Disk stats (read/write):
> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%
>
>
>
> qemu : no iothread : glibc : iops=33395
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015
> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
> clat percentiles (usec):
> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
> | 99.99th=[ 4832]
> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec
>
> Disk stats (read/write):
> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%
>
>
>
> qemu : iothread : jemalloc : iops=28023
> ----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015
> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
> clat percentiles (usec):
> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
> | 99.99th=[ 3760]
> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec
>
> Disk stats (read/write):
> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%
>
>
>
> qemu : no-iothread : jemalloc : iops=42226
> --------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015
> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
> clat percentiles (usec):
> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
> | 99.99th=[ 2608]
> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
> lat (msec) : 2=10.30%, 4=0.07%
> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec
>
> Disk stats (read/write):
> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%
>
>
>
> ----- Original message -----
> From: "Robert LeBlanc" < robert@leblancnet.us >
> To: "aderumier" < aderumier@odiso.com >
> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 18:00:29
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> I also saw a similar performance increase by using alternative memory
> allocators. What I found was that Ceph OSDs performed well with either
> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> instead of tcmalloc, I'm still working to dig into why that might be
> the case).
>
> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> better for QEMU/KVM in the tests that we ran. [1]
>
> I'm currently looking into I/O bottlenecks around the 16KB range and
> I'm seeing a lot of time in thread creation and destruction, the
> memory allocators are quite a bit down the list (both fio with
> ioengine rbd and on the OSDs). I wonder what the difference can be.
> I've tried using the async messenger but there wasn't a huge
> difference. [2]
>
> Further down the rabbit hole....
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
> ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>>IOPS from 1 VM!
>>
>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
>> (I'm planning to send qemu results soon)
>>
>>>>How fast are the SSDs in those 3 OSDs?
>>
>> These results are with the data in the buffer memory of the OSD nodes.
>>
>> When reading fully from SSD (Intel S3500),
>>
>> For 1 client,
>>
>> I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
>> I'm around 55k iops without cache and 38k iops with cache, with 3 osd.
>>
>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer.
>>
>> (server and client CPUs are 2x 10-core 3.1 GHz Xeon E5)
>>
>>
>>
>> small tip :
>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%
>>
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>
>> as a lot of time is spent in malloc/free
>>
>>
>> (qemu has also supported tcmalloc for some months; I'll bench it too
>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>>
>>
>>
>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
>>
>>
>>
>>
>> ----- Original message -----
>> From: "Mark Nelson" < mnelson@redhat.com >
>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 13:36:31
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi All,
>>
>> In the past we've hit some performance issues with RBD cache that we've
>> fixed, but we've never really tried pushing a single VM beyond 40+K read
>> IOPS in testing (or at least I never have). I suspect there's a couple
>> of possibilities as to why it might be slower, but perhaps joshd can
>> chime in as he's more familiar with what that code looks like.
>>
>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>>
>> Mark
>>
>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>>> It seems that the limit mainly appears at high queue depth (roughly > 16).
>>>
>>> Here are the results in iops with 1 client, 4k randread, 3 osd, at different queue depths.
>>> With rbd_cache, the results are almost the same as without cache for queue depth < 16.
>>>
>>>
>>> cache
>>> -----
>>> qd1: 1651
>>> qd2: 3482
>>> qd4: 7958
>>> qd8: 17912
>>> qd16: 36020
>>> qd32: 42765
>>> qd64: 46169
>>>
>>> no cache
>>> --------
>>> qd1: 1748
>>> qd2: 3570
>>> qd4: 8356
>>> qd8: 17732
>>> qd16: 41396
>>> qd32: 78633
>>> qd64: 79063
>>> qd128: 79550
>>>
>>>
>>> ----- Original message -----
>>> From: "aderumier" < aderumier@odiso.com >
>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 09:28:21
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi,
>>>
>>>>> We tried adding more RBDs to single VM, but no luck.
>>>
>>> If you want to scale with more disks in a single qemu VM, you need to use the iothread feature of qemu and assign 1 iothread per disk (works with virtio-blk).
>>> It's working for me; I can scale by adding more disks.
>>>
>>>
>>> My benchmarks here are done with fio-rbd on the host.
>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and to around 250k iops with 10 clients and rbd_cache=on.
>>>
>>>
>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 osd.
>>>
>>> I'm going to see if this tracker
>>> http://tracker.ceph.com/issues/11056
>>>
>>> could be the cause.
>>>
>>> (My master build was done some weeks ago)
>>>
>>>
>>>
>>> ----- Original message -----
>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> To: "aderumier" < aderumier@odiso.com >
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 09:21:04
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>>
>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and 1 RBD as additional storage. For some strange reason, 4K RR IOPS on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. After this, we got little benefit from adding more VMs.
>>>
>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>>
>>>
>>>
>>>
>>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>
>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect)
>>>
>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>>
>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>
>>>
>>> Hi,
>>>
>>> I'm running a benchmark (ceph master branch) with 4k randread, qdepth=32,
>>> and rbd_cache=true seems to limit the iops to around 40k
>>>
>>>
>>> no cache
>>> --------
>>> 1 client - rbd_cache=false - 1osd : 38300 iops
>>> 1 client - rbd_cache=false - 2osd : 69073 iops
>>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>>
>>>
>>> cache
>>> -----
>>> 1 client - rbd_cache=true - 1osd : 38100 iops
>>> 1 client - rbd_cache=true - 2osd : 42457 iops
>>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>>
>>>
>>>
>>> Is it expected ?
>>>
>>>
>>>
>>> fio result rbd_cache=false 3 osd
>>> --------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> rbd engine: RBD version: 0.1.9
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>> clat percentiles (usec):
>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>> | 99.99th=[ 1176]
>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>> lat (msec) : 2=0.03%, 4=0.01%
>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>>
>>> Disk stats (read/write):
>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>>
>>>
>>>
>>>
>>> fio result rbd_cache=true 3osd
>>> ------------------------------
>>>
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> rbd engine: RBD version: 0.1.9
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>> clat percentiles (usec):
>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>> | 99.99th=[ 2192]
>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
>>>
>>> Disk stats (read/write):
>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>
> --
> Best regards, Фасихов Ирек Нургаязович
> Mobile: +79229045757



-- 
-Pushpesh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-12  5:52                                         ` pushpesh sharma
@ 2015-06-12  6:03                                           ` Alexandre DERUMIER
  2015-06-12  6:58                                             ` pushpesh sharma
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-12  6:03 UTC (permalink / raw)
  To: pushpesh sharma; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

Hi,

Here is a libvirt XML sample from the libvirt source

(you need to define the <iothreads> count, then assign the threads to the disks).

I don't use OpenStack, so I really don't know how it works there.


<domain type='qemu'>
  <name>QEMUGuest1</name>
  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
  <memory unit='KiB'>219136</memory>
  <currentMemory unit='KiB'>219136</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <iothreads>2</iothreads>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='1'/>
      <source file='/var/lib/libvirt/images/iothrtest1.img'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='2'/>
      <source file='/var/lib/libvirt/images/iothrtest2.img'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
    <controller type='usb' index='0'/>
    <controller type='ide' index='0'/>
    <controller type='pci' index='0' model='pci-root'/>
    <memballoon model='none'/>
  </devices>
</domain>
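
As a rough illustration (not part of libvirt), the rule above, declaring an <iothreads> count and then referencing one thread from each disk's <driver>, can be checked with a few lines of Python. The function name check_iothreads is hypothetical, and the assumption that valid thread IDs run from 1 to the declared count reflects libvirt's default numbering; treat it as a sketch, not a validator.

```python
import xml.etree.ElementTree as ET


def check_iothreads(domain_xml):
    """Return a list of problems in the iothread setup of a libvirt domain XML.

    Assumption: iothread IDs are numbered 1..N, where N is the value of
    the <iothreads> element (libvirt's default numbering).
    """
    root = ET.fromstring(domain_xml)
    declared = int(root.findtext("iothreads", default="0"))
    problems = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        # Only virtio disks are considered here; other buses are skipped.
        if target is None or target.get("bus") != "virtio":
            continue
        dev = target.get("dev")
        thread = driver.get("iothread") if driver is not None else None
        if thread is None:
            problems.append(f"{dev}: no iothread assigned")
        elif not 1 <= int(thread) <= declared:
            problems.append(f"{dev}: iothread {thread} outside 1..{declared}")
    return problems
```

Called on the sample above, it should return an empty list, since vdb and vdc use threads 1 and 2 and two threads are declared.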


----- Original message -----
From: "pushpesh sharma" <pushpesh.eck@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Friday, 12 June 2015 07:52:41
Subject: Re: rbd_cache, limiting read on high iops around 40k

Hi Alexandre, 

I agree with your rationale of one iothread per disk. CPU consumed in
IOwait is pretty high in each VM. But I cannot find a way to set
this on a nova instance. I am using OpenStack Juno with QEMU+KVM.
As per the libvirt documentation for setting iothreads, I can edit
domain.xml directly and achieve the same effect. However, in an
OpenStack environment the domain XML is created by nova with some
additional metadata, so editing the domain XML using 'virsh edit' does
not seem to work (I agree, it is not a very cloud way of doing things,
but a hack). Changes made there vanish after saving them, because
libvirt validation fails on the resulting XML.

#virsh dumpxml instance-000000c5 > vm.xml 
#virt-xml-validate vm.xml 
Relax-NG validity error : Extra element cpu in interleave 
vm.xml:1: element domain: Relax-NG validity error : Element domain 
failed to validate content 
vm.xml fails to validate 

The second approach I took was setting QoS in volume types. But there
is no option to set iothreads per volume; there are only parameters
related to max_read/write ops/bytes.

Thirdly, editing the Nova flavor and providing extra specs like
hw:cpu_socket/thread/core can change the guest CPU topology, but again
there is no way to set iothreads. It does accept hw_disk_iothreads (no
type check in place, I believe), but cannot pass it through to domain.xml.

Could you suggest a way to set this?
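
For reference, the XML change being asked for (one iothread per virtio disk) can be sketched in Python on a dumped domain XML. This is only an illustration under stated assumptions: assign_iothreads is a hypothetical helper, it numbers threads 1..N in disk order, and it does not address the nova problem described above, since nova regenerates the domain XML.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml):
    """Give every virtio disk its own iothread and return the updated XML."""
    root = ET.fromstring(domain_xml)

    # Collect the <driver> element of each virtio disk.
    drivers = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if driver is not None and target is not None and target.get("bus") == "virtio":
            drivers.append(driver)
    if not drivers:
        return domain_xml  # nothing to change

    # Create <iothreads> if missing, placed right after <vcpu> when present.
    count = root.find("iothreads")
    if count is None:
        count = ET.Element("iothreads")
        vcpu = root.find("vcpu")
        position = list(root).index(vcpu) + 1 if vcpu is not None else len(root)
        root.insert(position, count)
    count.text = str(len(drivers))

    # Thread IDs are assigned 1..N in the order the disks appear.
    for thread_id, driver in enumerate(drivers, start=1):
        driver.set("iothread", str(thread_id))
    return ET.tostring(root, encoding="unicode")
```

The output of 'virsh dumpxml' could be passed through this function and the result compared against the libvirt sample; whether the hypervisor accepts it still depends on the libvirt and qemu versions in use.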

-Pushpesh 

On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
> 
> Sure no problem. 
> 
> (BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks, with 1 iothread per disk)
> 
> 
> ----- Original message -----
> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 09:06:32
> Subject: RE: rbd_cache, limiting read on high iops around 40k
> 
> Hi Alexandre, 
> Thanks for sharing the data. 
> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
> 
> Regards 
> Somnath 
> 
> -----Original Message----- 
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER 
> Sent: Tuesday, June 09, 2015 10:42 PM 
> To: Irek Fasikhov 
> Cc: ceph-devel; pushpesh sharma; ceph-users 
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
> 
>>>Very good work! 
>>>Do you have a rpm-file? 
>>>Thanks. 
> No, sorry, I compiled it manually (and I'm using Debian jessie as client)
> 
> 
> 
> ----- Original message -----
> From: "Irek Fasikhov" <malmyzh@gmail.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 07:21:42
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> 
> Hi, Alexandre. 
> 
> Very good work! 
> Do you have a rpm-file? 
> Thanks. 
> 
> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
> 
> 
> Hi, 
> 
> I have tested qemu with the latest tcmalloc (2.4), and the improvement with iothread is huge: 50k iops (+45%)!
> 
> 
> 
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> 
> 
> 
> 
> 
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
> ------------------------------------------------------ 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
> clat percentiles (usec):
> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
> | 99.99th=[ 3760] 
> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
> 
> Disk stats (read/write): 
> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
> 
> 
> 
> 
> 
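> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
> ------------------------------------------------------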
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
> clat percentiles (usec):
> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
> | 99.99th=[ 3632] 
> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
> 
> Disk stats (read/write): 
> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
> 
> 
> ----- Original message -----
> From: "aderumier" < aderumier@odiso.com >
> To: "Robert LeBlanc" < robert@leblancnet.us >
> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 18:47:27
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> 
> Hi Robert, 
> 
>>>What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>jemalloc (except when RocksDB was built with jemalloc instead of 
>>>tcmalloc, I'm still working to dig into why that might be the case). 
> Yes, from my tests, tcmalloc is slightly faster than jemalloc for the OSDs (but only marginally).
> 
> 
> 
>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>better for QEMU/KVM in the tests that we ran. [1] 
> 
> 
> I have just done a QEMU test (4k randread, rbd_cache=off); I don't see a speed regression with tcmalloc.
> With qemu iothread, tcmalloc gives a speed increase over glibc.
> With qemu iothread, jemalloc gives a speed decrease.
>
> Without iothread, jemalloc gives a big speed increase.
>
> This is with:
> - qemu 2.3
> - tcmalloc 2.2.1
> - jemalloc 3.6
> - libc6 2.19
> 
> 
> qemu : no-iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
> 
> 
> (The benefit of iothreads is that we can scale with more disks in 1vm) 
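
The percentages in the summary table above are each allocator's change relative to the glibc run in the same iothread mode. A minimal Python sketch of that arithmetic follows, using only the IOPS figures quoted in the table; the dictionary layout and the function names are illustrative and not part of the thread. The thread quotes whole percentages, so the unrounded values printed here can differ from them by a fraction of a percent.

```python
# IOPS figures copied from the summary table above (4k randread, rbd_cache=off).
RESULTS = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def relative_change(iops: int, baseline: int) -> float:
    """Return the change of `iops` relative to `baseline`, in percent."""
    return (iops - baseline) / baseline * 100.0


def summarize(results: dict) -> dict:
    """Map (mode, allocator) to the percent change against glibc in the same mode."""
    summary = {}
    for mode, by_allocator in results.items():
        baseline = by_allocator["glibc"]
        for allocator, iops in by_allocator.items():
            summary[(mode, allocator)] = relative_change(iops, baseline)
    return summary


if __name__ == "__main__":
    for (mode, allocator), pct in summarize(RESULTS).items():
        print(f"qemu : {mode} : {allocator} : {pct:+.1f}%")
```

Keeping the baseline per mode matters here: the iothread rows are compared against the iothread glibc run, not against the no-iothread one.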
> 
> 
> fio results: 
> ------------ 
> 
> qemu : iothread : tcmalloc : iops=38676 
> ----------------------------------------- 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
> clat percentiles (usec): 
> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
> | 99.99th=[ 3888] 
> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
> 
> Disk stats (read/write): 
> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
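
As a sanity check on the run above: fio keeps 32 requests in flight here (the "IO depths" line reports 32=100.0%), so by Little's law the throughput should be roughly the queue depth divided by the mean total latency. The sketch below applies that to the "iothread : tcmalloc" numbers; the function name is illustrative, and the closed-loop assumption is mine rather than something stated in the thread.

```python
def expected_iops(iodepth: int, mean_latency_usec: float) -> float:
    """Little's law for a closed loop: throughput = requests in flight / mean latency."""
    return iodepth / (mean_latency_usec * 1e-6)


if __name__ == "__main__":
    # Values from the "iothread : tcmalloc" run above: iodepth=32, lat avg=826.10 usec.
    predicted = expected_iops(32, 826.10)
    measured = 38676
    error = (predicted - measured) / measured
    print(f"predicted={predicted:.0f} measured={measured} error={error:+.2%}")
```

When the prediction and the measured value agree this closely, the guest really is sustaining the full queue depth, and any IOPS difference between allocators comes from per-request latency.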
> 
> 
> 
> qemu : no-iothread : tcmalloc : iops=34516 
> --------------------------------------------- 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
> clat percentiles (usec): 
> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
> | 99.99th=[ 4320] 
> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
> 
> Disk stats (read/write): 
> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
> 
> 
> 
> qemu : iothread : glibc : iops=34516 
> ------------------------------------- 
> 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
> clat percentiles (usec): 
> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
> | 99.99th=[ 3984] 
> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
> 
> Disk stats (read/write): 
> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
> 
> 
> 
> qemu : no iothread : glibc : iops=33395 
> ----------------------------------------- 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
> clat percentiles (usec): 
> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
> | 99.99th=[ 4832] 
> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
> 
> Disk stats (read/write): 
> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
> 
> 
> 
> qemu : iothread : jemalloc : iops=28023
> ---------------------------------------- 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
> clat percentiles (usec): 
> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
> | 99.99th=[ 3760] 
> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
> 
> Disk stats (read/write): 
> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
> 
> 
> 
> qemu : no-iothread : jemalloc : iops=42226
> -------------------------------------------- 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
> fio-2.1.11 
> Starting 1 process 
> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
> clat percentiles (usec): 
> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
> | 99.99th=[ 2608] 
> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
> lat (msec) : 2=10.30%, 4=0.07% 
> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
> latency : target=0, window=0, percentile=100.00%, depth=32 
> 
> Run status group 0 (all jobs): 
> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
> 
> Disk stats (read/write): 
> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
> 
> 
> 
> ----- Original Message -----
> From: "Robert LeBlanc" < robert@leblancnet.us >
> To: "aderumier" < aderumier@odiso.com >
> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Tuesday, 9 June 2015 18:00:29
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> 
> -----BEGIN PGP SIGNED MESSAGE----- 
> Hash: SHA256 
> 
> I also saw a similar performance increase by using alternative memory 
> allocators. What I found was that Ceph OSDs performed well with either 
> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
> instead of tcmalloc, I'm still working to dig into why that might be 
> the case). 
> 
> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
> better for QEMU/KVM in the tests that we ran. [1] 
> 
> I'm currently looking into I/O bottlenecks around the 16KB range and 
> I'm seeing a lot of time in thread creation and destruction, the 
> memory allocators are quite a bit down the list (both fio with 
> ioengine rbd and on the OSDs). I wonder what the difference can be. 
> I've tried using the async messenger but there wasn't a huge 
> difference. [2] 
> 
> Further down the rabbit hole.... 
> 
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
> -----BEGIN PGP SIGNATURE----- 
> Version: Mailvelope v0.13.1 
> Comment: https://www.mailvelope.com 
> 
> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
> oSJX 
> =k281 
> -----END PGP SIGNATURE----- 
> ---------------- 
> Robert LeBlanc 
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
> 
> 
> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>IOPS from 1 VM! 
>> 
>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have additional overhead.
>> (I'm planning to send QEMU results soon.)
>> 
>>>>How fast are the SSDs in those 3 OSDs? 
>> 
>> These results are with the data held in buffer memory on the OSD nodes.
>>
>> When reading entirely from SSD (Intel S3500), for 1 client:
>>
>> I get around 33k iops without cache and 32k iops with cache, with 1 OSD.
>> I get around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>>
>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer memory.
>>
>> (Server and client CPUs are 2x 10-core 3.1 GHz Xeon E5.)
>> 
>> 
>> 
>> Small tip:
>> I'm using tcmalloc for fio-rbd and rados bench to improve latencies by around 20%:
>>
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>
>> as a lot of time is spent in malloc/free.
>>
>>
>> (QEMU has also supported tcmalloc for a few months; I'll benchmark it too:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>> 
>> 
>> 
>> I'll try to send full benchmark results soon, from 1 to 18 SSD OSDs.
>> 
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "Mark Nelson" < mnelson@redhat.com >
>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, 9 June 2015 13:36:31
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>> 
>> Hi All, 
>> 
>> In the past we've hit some performance issues with RBD cache that we've 
>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>> IOPS in testing (or at least I never have). I suspect there's a couple 
>> of possibilities as to why it might be slower, but perhaps joshd can 
>> chime in as he's more familiar with what that code looks like. 
>> 
>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>> 
>> Mark 
>> 
>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>> It seems that the limit shows up mainly at high queue depths (roughly > 16).
>>>
>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>>> With rbd_cache, the results are almost the same as without cache for queue depths < 16.
>>> 
>>> 
>>> cache 
>>> ----- 
>>> qd1: 1651 
>>> qd2: 3482 
>>> qd4: 7958 
>>> qd8: 17912 
>>> qd16: 36020 
>>> qd32: 42765 
>>> qd64: 46169 
>>> 
>>> no cache 
>>> -------- 
>>> qd1: 1748 
>>> qd2: 3570 
>>> qd4: 8356 
>>> qd8: 17732 
>>> qd16: 41396 
>>> qd32: 78633 
>>> qd64: 79063 
>>> qd128: 79550 
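
To make the comparison between the two tables above easier to read, the sketch below computes the cache/no-cache IOPS ratio for each queue depth that appears in both. The figures are copied from the tables; the variable and function names are illustrative. With these numbers the ratio stays within a few percent of 1 up to qd8, drops to about 0.87 at qd16, and to about 0.54 at qd32.

```python
# IOPS per queue depth, copied from the two tables above
# (1 client, 4k randread, 3 OSDs).
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396,
            32: 78633, 64: 79063, 128: 79550}


def cache_ratio(cache: dict, no_cache: dict) -> dict:
    """Return the cache/no-cache IOPS ratio for every queue depth in both tables."""
    common = sorted(cache.keys() & no_cache.keys())
    return {qd: cache[qd] / no_cache[qd] for qd in common}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(CACHE, NO_CACHE).items():
        print(f"qd{qd}: {ratio:.2f}")
```

qd128 is skipped because the cache table has no entry for it.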
>>> 
>>> 
>>> ----- Original Message -----
>>> From: "aderumier" < aderumier@odiso.com >
>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 09:28:21
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>> 
>>> Hi, 
>>> 
>>>>> We tried adding more RBDs to single VM, but no luck. 
>>> 
>>> If you want to scale with more disks in a single QEMU VM, you need to use the iothread feature of QEMU and assign 1 iothread per disk (works with virtio-blk).
>>> It's working for me; I can scale by adding more disks.
>>>
>>>
>>> My benchmarks here are done with fio-rbd on the host.
>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>>
>>>
>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD.
>>>
>>> I'm going to check whether this tracker issue
>>> http://tracker.ceph.com/issues/11056
>>>
>>> could be the cause.
>>>
>>> (My master build was done some weeks ago)
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> To: "aderumier" < aderumier@odiso.com >
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 09:21:04
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>> 
>>> Hi Alexandre, 
>>> 
>>> We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as its root disk and one RBD as additional storage. For some reason, 4K random-read (RR) IOPS on each VM did not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, adding more VMs brought little benefit.
>>>
>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>> 
>>> 
>>> 
>>> 
>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, and that is where we suspected some throttling effect. However, we haven't set any such limits on the nova or KVM side. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>
>>> We tried the same experiment on bare metal. 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect.)
>>>
>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>> 
>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>> 
>>> 
>>> Hi, 
>>> 
>>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
>>> and rbd_cache=true seem to limit the iops around 40k 
>>> 
>>> 
>>> no cache 
>>> -------- 
>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>> 
>>> 
>>> cache 
>>> ----- 
>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>> 
>>> 
>>> 
>>> Is it expected ? 
>>> 
>>> 
>>> 
>>> fio result rbd_cache=false 3 osd 
>>> -------------------------------- 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> rbd engine: RBD version: 0.1.9 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>> | 99.99th=[ 1176] 
>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>> lat (msec) : 2=0.03%, 4=0.01% 
>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>> 
>>> Disk stats (read/write): 
>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>> 
>>> 
>>> 
>>> 
>>> fio result rbd_cache=true 3osd 
>>> ------------------------------ 
>>> 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> rbd engine: RBD version: 0.1.9 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>> | 99.99th=[ 2192] 
>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>> 
>>> Disk stats (read/write): 
>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>>> 
>> _______________________________________________ 
>> ceph-users mailing list 
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> 
> 
> 
> -- 
> Best regards, Фасихов Ирек Нургаязович
> Mobile: +79229045757
> 
> 



-- 
-Pushpesh 



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-12  6:03                                           ` Alexandre DERUMIER
@ 2015-06-12  6:58                                             ` pushpesh sharma
  2015-06-16 16:38                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: pushpesh sharma @ 2015-06-12  6:58 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

Thanks, I posted the question on the OpenStack list. Hopefully we will get some
expert opinions.

On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Hi,
>
> here is a libvirt XML sample from the libvirt source
>
> (you need to define the <iothreads> number, then assign them to the disks).
>
> I don't use OpenStack, so I really don't know how it works there.
>
>
> <domain type='qemu'>
>   <name>QEMUGuest1</name>
>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>   <memory unit='KiB'>219136</memory>
>   <currentMemory unit='KiB'>219136</currentMemory>
>   <vcpu placement='static'>2</vcpu>
>   <iothreads>2</iothreads>
>   <os>
>     <type arch='i686' machine='pc'>hvm</type>
>     <boot dev='hd'/>
>   </os>
>   <clock offset='utc'/>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>destroy</on_crash>
>   <devices>
>     <emulator>/usr/bin/qemu</emulator>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='1'/>
>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>       <target dev='vdb' bus='virtio'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>     </disk>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='2'/>
>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>       <target dev='vdc' bus='virtio'/>
>     </disk>
>     <controller type='usb' index='0'/>
>     <controller type='ide' index='0'/>
>     <controller type='pci' index='0' model='pci-root'/>
>     <memballoon model='none'/>
>   </devices>
> </domain>
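
The sample above declares the number of iothreads once and then points each disk's driver element at one of them. A minimal Python sketch that reads such a domain XML and checks the assignment follows. It uses only the standard library; the trimmed SAMPLE string and the function name are illustrative, and the check assumes iothread IDs run from 1 to the <iothreads> count, as they do in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above, keeping only the elements the check needs.
SAMPLE = """
<domain type='qemu'>
  <iothreads>2</iothreads>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='1'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='2'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_iothreads(domain_xml: str) -> dict:
    """Return {target dev: iothread id}; raise if an id is outside 1..<iothreads>."""
    root = ET.fromstring(domain_xml)
    count = int(root.findtext("iothreads", default="0"))
    mapping = {}
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        if driver is None or target is None or "iothread" not in driver.attrib:
            continue  # this disk is not assigned to a dedicated iothread
        thread_id = int(driver.attrib["iothread"])
        if not 1 <= thread_id <= count:
            raise ValueError(
                f"{target.get('dev')}: iothread {thread_id} not in 1..{count}"
            )
        mapping[target.get("dev")] = thread_id
    return mapping


if __name__ == "__main__":
    print(disk_iothreads(SAMPLE))
```

The same function could be pointed at the output of virsh dumpxml to see which disks of a running guest actually got an iothread.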
>
>
> ----- Original Message -----
> From: "pushpesh sharma" <pushpesh.eck@gmail.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Friday, 12 June 2015 07:52:41
> Subject: Re: rbd_cache, limiting read on high iops around 40k
>
> Hi Alexandre,
>
> I agree with your rationale of one iothread per disk. CPU time spent in
> iowait is pretty high in each VM. But I cannot find a way to set
> this on a nova instance. I am using OpenStack Juno with QEMU+KVM.
> As per the libvirt documentation for setting iothreads, I can edit
> domain.xml directly and achieve the same effect. However, in an
> OpenStack environment the domain XML is created by nova with some additional
> metadata, so editing the domain XML with 'virsh edit' does not seem
> to work (I agree it is not a very cloud way of doing things, but a
> hack). Changes made there vanish after saving, because
> libvirt validation fails on the result.
>
> #virsh dumpxml instance-000000c5 > vm.xml
> #virt-xml-validate vm.xml
> Relax-NG validity error : Extra element cpu in interleave
> vm.xml:1: element domain: Relax-NG validity error : Element domain
> failed to validate content
> vm.xml fails to validate
>
> The second approach I took was setting QoS in volume types. But there
> is no option to set iothreads per volume; there are parameters related
> to max_read/write ops/bytes.
>
> Thirdly, editing the nova flavor and providing extra specs like
> hw:cpu_socket/thread/core can change the guest CPU topology, but again
> there is no way to set an iothread. It does accept hw_disk_iothreads (no type check
> in place, I believe), but does not pass it into domain.xml.
>
> Could you suggest a way to set this?
>
> -Pushpesh
>
> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>
>> Sure no problem.
>>
>> (BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks, with 1 iothread per disk)
>>
>>
>> ----- Original Message -----
>> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
>> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Wednesday, 10 June 2015 09:06:32
>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>
>> Hi Alexandre,
>> Thanks for sharing the data.
>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>
>> Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER
>> Sent: Tuesday, June 09, 2015 10:42 PM
>> To: Irek Fasikhov
>> Cc: ceph-devel; pushpesh sharma; ceph-users
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>>>>Very good work!
>>>>Do you have a rpm-file?
>>>>Thanks.
>> No, sorry, I compiled it manually (and I'm using Debian Jessie as the client).
>>
>>
>>
>> ----- Original Message -----
>> From: "Irek Fasikhov" <malmyzh@gmail.com>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Wednesday, 10 June 2015 07:21:42
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi, Alexandre.
>>
>> Very good work!
>> Do you have a rpm-file?
>> Thanks.
>>
>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > :
>>
>>
>> Hi,
>>
>> I have tested qemu with the latest tcmalloc (2.4), and the improvement with iothread is huge: 50k iops (+45%)!
>>
>>
>>
>> qemu : no-iothread : glibc : iops=33395
>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>
>>
>> qemu : iothread : glibc : iops=34516
>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>> qemu : iothread : jemalloc : iops=28023 (-19%)
>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>
>>
>>
>>
>>
>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>> ------------------------------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
>> clat percentiles (usec):
>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
>> | 99.99th=[ 3760]
>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec
>>
>> Disk stats (read/write):
>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
>>
>>
>>
>>
>>
>>
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
>> clat percentiles (usec):
>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
>> | 99.99th=[ 3632]
>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec
>>
>> Disk stats (read/write):
>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
>>
>>
>> ----- Original Message -----
>> From: "aderumier" < aderumier@odiso.com >
>> To: "Robert LeBlanc" < robert@leblancnet.us >
>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, June 9, 2015 18:47:27
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi Robert,
>>
>>>>What I found was that Ceph OSDs performed well with either tcmalloc or
>>>>jemalloc (except when RocksDB was built with jemalloc instead of
>>>>tcmalloc, I'm still working to dig into why that might be the case).
>> Yes, from my tests, tcmalloc is slightly faster than jemalloc for OSDs (but only by a little).
>>
>>
>>
>>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>>better for QEMU/KVM in the tests that we ran. [1]
>>
>>
>> I have just run a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
>> With qemu iothread, tcmalloc gives a speed increase over glibc.
>> With qemu iothread, jemalloc gives a speed decrease.
>>
>> Without iothread, jemalloc gives a big speed increase.
>>
>> This is with:
>> -qemu 2.3
>> -tcmalloc 2.2.1
>> -jemalloc 3.6
>> -libc6 2.19
>>
>>
>> qemu : no-iothread : glibc : iops=33395
>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>
>> qemu : iothread : glibc : iops=34516
>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>
>>
>> (The benefit of iothreads is that we can scale with more disks in 1 VM.)
>>
>>
>> fio results:
>> ------------
>>
>> qemu : iothread : tcmalloc : iops=38676
>> -----------------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
>> clat percentiles (usec):
>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
>> | 99.99th=[ 3888]
>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec
>>
>> Disk stats (read/write):
>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
>>
>>
>>
>> qemu : no-iothread : tcmalloc : iops=34516
>> ---------------------------------------------
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
>> clat percentiles (usec):
>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
>> | 99.99th=[ 4320]
>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec
>>
>> Disk stats (read/write):
>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%
>>
>>
>>
>> qemu : iothread : glibc : iops=34516
>> -------------------------------------
>>
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
>> clat percentiles (usec):
>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
>> | 99.99th=[ 3984]
>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec
>>
>> Disk stats (read/write):
>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%
>>
>>
>>
>> qemu : no iothread : glibc : iops=33395
>> -----------------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015
>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
>> clat percentiles (usec):
>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
>> | 99.99th=[ 4832]
>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec
>>
>> Disk stats (read/write):
>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%
>>
>>
>>
>> qemu : iothread : jemalloc : iops=28023
>> ----------------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015
>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
>> clat percentiles (usec):
>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
>> | 99.99th=[ 3760]
>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec
>>
>> Disk stats (read/write):
>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%
>>
>>
>>
>> qemu : no-iothread : jemalloc : iops=42226
>> --------------------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015
>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
>> clat percentiles (usec):
>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
>> | 99.99th=[ 2608]
>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
>> lat (msec) : 2=10.30%, 4=0.07%
>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec
>>
>> Disk stats (read/write):
>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%
>>
>>
>>
>> ----- Original Message -----
>> From: "Robert LeBlanc" < robert@leblancnet.us >
>> To: "aderumier" < aderumier@odiso.com >
>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Tuesday, June 9, 2015 18:00:29
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I also saw a similar performance increase by using alternative memory
>> allocators. What I found was that Ceph OSDs performed well with either
>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
>> instead of tcmalloc, I'm still working to dig into why that might be
>> the case).
>>
>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>> better for QEMU/KVM in the tests that we ran. [1]
>>
>> I'm currently looking into I/O bottlenecks around the 16KB range and
>> I'm seeing a lot of time in thread creation and destruction, the
>> memory allocators are quite a bit down the list (both fio with
>> ioengine rbd and on the OSDs). I wonder what the difference can be.
>> I've tried using the async messenger but there wasn't a huge
>> difference. [2]
>>
>> Further down the rabbit hole....
>>
>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v0.13.1
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8
>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU
>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2
>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3
>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51
>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO
>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3
>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b
>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13
>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ
>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4
>> oSJX
>> =k281
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>>>IOPS from 1 VM!
>>>
>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
>>> (I'm planning to send qemu results soon.)
>>>
>>>>>How fast are the SSDs in those 3 OSDs?
>>>
>>> These results are with the data in the buffer memory of the OSD nodes.
>>>
>>> When reading fully from SSD (Intel S3500),
>>>
>>> for 1 client,
>>>
>>> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
>>> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>>>
>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer.
>>>
>>> (Server/client CPUs are 2x 10-core 3.1 GHz E5 Xeon.)
>>>
>>>
>>>
>>> Small tip:
>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%:
>>>
>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>>
>>> as a lot of time is spent in malloc/free.
>>>
>>>
>>> (qemu has also supported tcmalloc for some months; I'll bench it too:
>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>>>
>>>
>>>
>>> I'll try to send full bench results soon, from 1 to 18 SSD OSDs.
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Mark Nelson" < mnelson@redhat.com >
>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, June 9, 2015 13:36:31
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi All,
>>>
>>> In the past we've hit some performance issues with RBD cache that we've
>>> fixed, but we've never really tried pushing a single VM beyond 40+K read
>>> IOPS in testing (or at least I never have). I suspect there's a couple
>>> of possibilities as to why it might be slower, but perhaps joshd can
>>> chime in as he's more familiar with what that code looks like.
>>>
>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>>>
>>> Mark
>>>
>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>>>> It seems that the limit mainly appears at high queue depth (roughly > 16).
>>>>
>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>>>> With queue depth < 16, rbd_cache gives almost the same results as no cache.
>>>>
>>>>
>>>> cache
>>>> -----
>>>> qd1: 1651
>>>> qd2: 3482
>>>> qd4: 7958
>>>> qd8: 17912
>>>> qd16: 36020
>>>> qd32: 42765
>>>> qd64: 46169
>>>>
>>>> no cache
>>>> --------
>>>> qd1: 1748
>>>> qd2: 3570
>>>> qd4: 8356
>>>> qd8: 17732
>>>> qd16: 41396
>>>> qd32: 78633
>>>> qd64: 79063
>>>> qd128: 79550
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "aderumier" < aderumier@odiso.com >
>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Tuesday, June 9, 2015 09:28:21
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi,
>>>>
>>>>>> We tried adding more RBDs to single VM, but no luck.
>>>>
>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk).
>>>> It's working for me; I can scale by adding more disks.
>>>>
>>>>
>>>> My benchmarks here are done with fio-rbd on the host.
>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>>>
>>>>
>>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD.
>>>>
>>>> I'm going to see if this tracker issue
>>>> http://tracker.ceph.com/issues/11056
>>>>
>>>> could be the cause.
>>>>
>>>> (My master build was done some weeks ago.)
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>> To: "aderumier" < aderumier@odiso.com >
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Tuesday, June 9, 2015 09:21:04
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi Alexandre,
>>>>
>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, we were not able to scale 4K random-read (RR) iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
>>>>
>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>>>
>>>>
>>>>
>>>>
>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the Nova or KVM end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>>
>>>> We tried the same experiment on bare metal. There, 4K RR iops scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, however, the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect.)
>>>>
>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>>>
>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32,
>>>> and rbd_cache=true seem to limit the iops around 40k
>>>>
>>>>
>>>> no cache
>>>> --------
>>>> 1 client - rbd_cache=false - 1osd : 38300 iops
>>>> 1 client - rbd_cache=false - 2osd : 69073 iops
>>>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>>>
>>>>
>>>> cache
>>>> -----
>>>> 1 client - rbd_cache=true - 1osd : 38100 iops
>>>> 1 client - rbd_cache=true - 2osd : 42457 iops
>>>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>>>
>>>>
>>>>
>>>> Is it expected ?
>>>>
>>>>
>>>>
>>>> fio result rbd_cache=false 3 osd
>>>> --------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.9
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>>> | 99.99th=[ 1176]
>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>>> lat (msec) : 2=0.03%, 4=0.01%
>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>>>
>>>> Disk stats (read/write):
>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>>>
>>>>
>>>>
>>>>
>>>> fio result rbd_cache=true 3osd
>>>> ------------------------------
>>>>
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.9
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>>> | 99.99th=[ 2192]
>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
>>>>
>>>> Disk stats (read/write):
>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>>
>>
>> --
>> Best regards, Фасихов Ирек Нургаязович
>> Mobile: +79229045757
>>
>
>
>
> --
> -Pushpesh
>
>
>



-- 
-Pushpesh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-12  6:58                                             ` pushpesh sharma
@ 2015-06-16 16:38                                               ` Alexandre DERUMIER
  2015-06-22  5:58                                                 ` pushpesh sharma
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-16 16:38 UTC (permalink / raw)
  To: pushpesh sharma; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

Hi,

Some news about qemu with tcmalloc vs jemalloc.

I'm testing with multiple disks (with iothreads) in 1 qemu guest.

Although tcmalloc is a little faster than jemalloc,

I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.

Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.


With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc.

The problem is that when I hit the malloc bug, I drop to around 4000-10000 iops, and the only way to fix it is to restart qemu ...
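
For anyone who wants to repeat the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES experiment, here is a minimal Python sketch of preparing the environment for a child process. The helper name `tcmalloc_env`, the 256 MB value and the placeholder command are assumptions for illustration only, not the settings used in the tests above.

```python
import os
import subprocess
import sys

def tcmalloc_env(thread_cache_bytes, base_env=None):
    """Return a copy of the environment with tcmalloc's thread cache limit set."""
    env = dict(os.environ if base_env is None else base_env)
    # tcmalloc (gperftools) reads this variable at process startup; value is in bytes.
    env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(thread_cache_bytes)
    return env

if __name__ == "__main__":
    env = tcmalloc_env(256 * 1024 * 1024)  # 256 MB, an arbitrary example value
    # Placeholder command (hypothetical): replace with the real qemu or fio command line.
    cmd = [sys.executable, "-c",
           "import os; print(os.environ['TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES'])"]
    subprocess.run(cmd, env=env, check=True)
```

As noted above, raising this limit did not avoid the bug in my tests, so treat it as a knob to experiment with rather than a fix.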



----- Original Message -----
From: "pushpesh sharma" <pushpesh.eck@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Friday, 12 June 2015 08:58:21
Subject: Re: rbd_cache, limiting read on high iops around 40k

Thanks, posted the question in openstack list. Hopefully will get some 
expert opinion. 

On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
> Hi, 
> 
> here is a libvirt xml sample from the libvirt source 
> 
> (you need to define the <iothreads> number, then assign them to disks). 
> 
> I don't use openstack, so I really don't know how it works there. 
> 
> 
> <domain type='qemu'> 
> <name>QEMUGuest1</name> 
> <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> 
> <memory unit='KiB'>219136</memory> 
> <currentMemory unit='KiB'>219136</currentMemory> 
> <vcpu placement='static'>2</vcpu> 
> <iothreads>2</iothreads> 
> <os> 
> <type arch='i686' machine='pc'>hvm</type> 
> <boot dev='hd'/> 
> </os> 
> <clock offset='utc'/> 
> <on_poweroff>destroy</on_poweroff> 
> <on_reboot>restart</on_reboot> 
> <on_crash>destroy</on_crash> 
> <devices> 
> <emulator>/usr/bin/qemu</emulator> 
> <disk type='file' device='disk'> 
> <driver name='qemu' type='raw' iothread='1'/> 
> <source file='/var/lib/libvirt/images/iothrtest1.img'/> 
> <target dev='vdb' bus='virtio'/> 
> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> 
> </disk> 
> <disk type='file' device='disk'> 
> <driver name='qemu' type='raw' iothread='2'/> 
> <source file='/var/lib/libvirt/images/iothrtest2.img'/> 
> <target dev='vdc' bus='virtio'/> 
> </disk> 
> <controller type='usb' index='0'/> 
> <controller type='ide' index='0'/> 
> <controller type='pci' index='0' model='pci-root'/> 
> <memballoon model='none'/> 
> </devices> 
> </domain> 
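
A minimal Python sketch of the same idea, one iothread per virtio disk, applied to a domain XML string with the standard library. The function name `assign_iothreads` and the trimmed sample domain are hypothetical. It only rewrites XML text and does not address the nova regeneration problem described below.

```python
import xml.etree.ElementTree as ET

def assign_iothreads(domain_xml):
    """Give every virtio disk its own iothread and set <iothreads> to match."""
    root = ET.fromstring(domain_xml)
    count = 0
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is None or driver is None or target.get("bus") != "virtio":
            continue  # only virtio disks get an iothread in this sketch
        count += 1
        driver.set("iothread", str(count))  # iothread ids start at 1, as in the sample
    iothreads = root.find("iothreads")
    if iothreads is None:
        # Place <iothreads> right after <vcpu> when present, otherwise append it.
        iothreads = ET.Element("iothreads")
        children = list(root)
        vcpu = root.find("vcpu")
        index = children.index(vcpu) + 1 if vcpu is not None else len(children)
        root.insert(index, iothreads)
    iothreads.text = str(count)
    return ET.tostring(root, encoding="unicode")

# Trimmed, hypothetical domain used only to exercise the function.
SAMPLE = """<domain type='qemu'>
  <name>demo</name>
  <vcpu placement='static'>2</vcpu>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

print(assign_iothreads(SAMPLE))
```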
> 
> 
> ----- Original Message ----- 
> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
> Sent: Friday, 12 June 2015 07:52:41 
> Subject: Re: rbd_cache, limiting read on high iops around 40k 
> 
> Hi Alexandre, 
> 
> I agree with your rationale of one iothread per disk. CPU consumed in 
> IOwait is pretty high in each VM. But I cannot find a way to set 
> this on a nova instance. I am using openstack Juno with QEMU+KVM. 
> As per the libvirt documentation for setting iothreads, I can edit 
> domain.xml directly and achieve the same effect. However, in an 
> openstack env the domain xml is created by nova with some additional 
> metadata, so editing the domain xml using 'virsh edit' does not seem 
> to work (I agree, it is not a very cloud way of doing things, but a 
> hack). Changes made there vanish after saving them, because 
> libvirt validation fails on the edited xml. 
> 
> #virsh dumpxml instance-000000c5 > vm.xml 
> #virt-xml-validate vm.xml 
> Relax-NG validity error : Extra element cpu in interleave 
> vm.xml:1: element domain: Relax-NG validity error : Element domain 
> failed to validate content 
> vm.xml fails to validate 
> 
> The second approach I took was setting QoS in volume types. But there 
> is no option to set iothreads per volume; there are only parameters related 
> to max_read/write ops/bytes. 
> 
> Thirdly, editing the Nova flavor and providing extra specs like 
> hw:cpu_socket/thread/core can change the guest CPU topology, however again 
> there is no way to set iothreads. It does accept hw_disk_iothreads (no type check 
> in place, I believe), but it is not passed on to domain.xml. 
> 
> Could you suggest a way to set this? 
> 
> -Pushpesh 
> 
> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
> <aderumier@odiso.com> wrote: 
>>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>> 
>> Sure no problem. 
>> 
>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks and 1 iothread per disk) 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Somnath Roy" <Somnath.Roy@sandisk.com> 
>> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com> 
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Wednesday, 10 June 2015 09:06:32 
>> Subject: RE: rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi Alexandre, 
>> Thanks for sharing the data. 
>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>> 
>> Regards 
>> Somnath 
>> 
>> -----Original Message----- 
>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER 
>> Sent: Tuesday, June 09, 2015 10:42 PM 
>> To: Irek Fasikhov 
>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>>>>Very good work! 
>>>>Do you have a rpm-file? 
>>>>Thanks. 
>> No sorry, I have compiled it manually (and I'm using debian jessie as client) 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Irek Fasikhov" <malmyzh@gmail.com> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Wednesday, 10 June 2015 07:21:42 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi, Alexandre. 
>> 
>> Very good work! 
>> Do you have a rpm-file? 
>> Thanks. 
>> 
>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
>> 
>> 
>> Hi, 
>> 
>> I have tested qemu with the latest tcmalloc (2.4), and the improvement is huge with iothread: 50k iops (+45%) ! 
>> 
>> 
>> 
>> qemu : no-iothread : glibc : iops=33395 
>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%) 
>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 
>> 
>> 
>> qemu : iothread : glibc : iops=34516 
>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>> qemu : iothread : jemalloc : iops=28023 (-19%) 
>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
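
The percentages in this summary are relative to the glibc run of the same mode. A small Python sketch of that arithmetic, with the iops values copied from the lines above (the dictionary labels are mine, and the plain "tcmalloc" iothread entry is assumed to be the 2.2.1 build):

```python
def percent_change(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# iops figures copied from the summary above
no_iothread = {"glibc": 33395, "tcmalloc 2.2.1": 34516,
               "jemalloc": 42226, "tcmalloc 2.4": 35974}
iothread = {"glibc": 34516, "tcmalloc 2.2.1": 38676,
            "jemalloc": 28023, "tcmalloc 2.4": 50276}

for label, results in (("no-iothread", no_iothread), ("iothread", iothread)):
    base = results["glibc"]  # glibc is the baseline for each mode
    for allocator, iops in results.items():
        change = percent_change(base, iops)
        print(f"{label:12s} {allocator:15s} {iops:6d} iops  {change:+6.1f}%")
```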
>> 
>> 
>> 
>> 
>> 
>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>> ------------------------------------------------------ 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015 
>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
>> clat percentiles (usec): 
>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
>> | 99.99th=[ 3760] 
>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
>> 
>> 
>> 
>> 
>> 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015 
>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
>> clat percentiles (usec): 
>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
>> | 99.99th=[ 3632] 
>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
>> 
>> 
>> ----- Original Message ----- 
>> From: "aderumier" < aderumier@odiso.com > 
>> To: "Robert LeBlanc" < robert@leblancnet.us > 
>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>> Sent: Tuesday, 9 June 2015 18:47:27 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi Robert, 
>> 
>>>>What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>tcmalloc, I'm still working to dig into why that might be the case). 
>> Yes, from my tests, tcmalloc is a little faster (but only slightly) than jemalloc for the osd. 
>> 
>> 
>> 
>>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>better for QEMU/KVM in the tests that we ran. [1] 
>> 
>> 
>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc. 
>> with qemu iothread, tcmalloc has a speed increase over glibc 
>> with qemu iothread, jemalloc has a speed decrease 
>> 
>> without iothread, jemalloc has a big speed increase 
>> 
>> this is with 
>> -qemu 2.3 
>> -tcmalloc 2.2.1 
>> -jemalloc 3.6 
>> -libc6 2.19 
>> 
>> 
>> qemu : no-iothread : glibc : iops=33395 
>> qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>> 
>> qemu : iothread : glibc : iops=34516 
>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>> qemu : iothread : jemalloc : iops=28023 (-19%) 
>> 
>> 
>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
>> 
>> 
>> fio results: 
>> ------------ 
>> 
>> qemu : iothread : tcmalloc : iops=38676 
>> ----------------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>> clat percentiles (usec): 
>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>> | 99.99th=[ 3888] 
>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
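
For comparing many of these runs, the read summary line can be parsed rather than read by eye. A minimal Python sketch, assuming the fio 2.1.x line layout shown above (the regex and the function name `parse_read_line` are mine, not part of fio). With 4K blocks, bandwidth in KB/s divided by 4 should match the reported iops.

```python
import re

# Matches lines such as:
#   read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
READ_LINE = re.compile(
    r"read\s*:\s*io=(?P<io>[\d.]+)MB,\s*bw=(?P<bw>[\d.]+)KB/s,\s*"
    r"iops=(?P<iops>\d+),\s*runt=\s*(?P<runt>\d+)msec"
)

def parse_read_line(line):
    """Extract io (MB), bandwidth (KB/s), iops and runtime (ms) from a fio read line."""
    m = READ_LINE.search(line)
    if m is None:
        raise ValueError("not a fio read summary line")
    return {
        "io_mb": float(m["io"]),
        "bw_kb_s": float(m["bw"]),
        "iops": int(m["iops"]),
        "runt_ms": int(m["runt"]),
    }

line = "read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec"
stats = parse_read_line(line)
# Sanity check for a 4K block size: KB/s divided by 4 should equal the iops figure.
print(stats["iops"], int(stats["bw_kb_s"] // 4))
```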
>> 
>> 
>> 
>> qemu : no-iothread : tcmalloc : iops=34516 
>> --------------------------------------------- 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>> clat percentiles (usec): 
>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>> | 99.99th=[ 4320] 
>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>> 
>> 
>> 
>> qemu : iothread : glibc : iops=34516 
>> ------------------------------------- 
>> 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>> clat percentiles (usec): 
>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>> | 99.99th=[ 3984] 
>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>> 
>> 
>> 
>> qemu : no iothread : glibc : iops=33395 
>> ----------------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>> clat percentiles (usec): 
>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>> | 99.99th=[ 4832] 
>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>> 
>> 
>> 
>> qemu : iothread : jemalloc : iops=28023 
>> ---------------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>> clat percentiles (usec): 
>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>> | 99.99th=[ 3760] 
>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>> 
>> 
>> 
>> qemu : no-iothread : jemalloc : iops=42226 
>> -------------------------------------------- 
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>> fio-2.1.11 
>> Starting 1 process 
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>> clat percentiles (usec): 
>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>> | 99.99th=[ 2608] 
>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>> lat (msec) : 2=10.30%, 4=0.07% 
>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>> latency : target=0, window=0, percentile=100.00%, depth=32 
>> 
>> Run status group 0 (all jobs): 
>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>> 
>> Disk stats (read/write): 
>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Robert LeBlanc" < robert@leblancnet.us > 
>> To: "aderumier" < aderumier@odiso.com > 
>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>> Sent: Tuesday, 9 June 2015 18:00:29 
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>> 
>> -----BEGIN PGP SIGNED MESSAGE----- 
>> Hash: SHA256 
>> 
>> I also saw a similar performance increase by using alternative memory 
>> allocators. What I found was that Ceph OSDs performed well with either 
>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>> instead of tcmalloc, I'm still working to dig into why that might be 
>> the case). 
>> 
>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>> better for QEMU/KVM in the tests that we ran. [1] 
>> 
>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>> I'm seeing a lot of time in thread creation and destruction, the 
>> memory allocators are quite a bit down the list (both fio with 
>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>> I've tried using the async messenger but there wasn't a huge 
>> difference. [2] 
>> 
>> Further down the rabbit hole.... 
>> 
>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
>> -----BEGIN PGP SIGNATURE----- 
>> Version: Mailvelope v0.13.1 
>> Comment: https://www.mailvelope.com 
>> 
>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>> oSJX 
>> =k281 
>> -----END PGP SIGNATURE----- 
>> ---------------- 
>> Robert LeBlanc 
>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>> 
>> 
>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>IOPS from 1 VM! 
>>> 
>>> Note that these results are not in a vm (fio-rbd on host), so in a vm we'll have overhead. 
>>> (I'm planning to send results in qemu soon) 
>>> 
>>>>>How fast are the SSDs in those 3 OSDs? 
>>> 
>>> These results are with data in the buffer memory of the osd nodes. 
>>> 
>>> When reading fully from ssd (intel s3500), 
>>> 
>>> For 1 client, 
>>> 
>>> I'm around 33k iops without cache and 32k iops with cache, with 1 osd. 
>>> I'm around 55k iops without cache and 38k iops with cache, with 3 osd. 
>>> 
>>> with multiple client jobs, I can reach around 70k iops per osd, and 250k iops per osd when data is in buffer. 
>>> 
>>> (cpus of servers/clients are 2x 10 cores 3.1ghz e5 xeon) 
>>> 
>>> 
>>> 
>>> small tip : 
>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>> 
>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>> 
>>> as a lot of time is spent in malloc/free 
>>> 
>>> 
>>> (qemu has also supported tcmalloc for some months, I'll bench it too 
>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
>>> 
>>> 
>>> 
>>> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Mark Nelson" < mnelson@redhat.com > 
>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>> Sent: Tuesday, 9 June 2015 13:36:31 
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi All, 
>>> 
>>> In the past we've hit some performance issues with RBD cache that we've 
>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>> chime in as he's more familiar with what that code looks like. 
>>> 
>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>> 
>>> Mark 
>>> 
>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>> It seems that the limit mainly shows up at high queue depth (roughly > 16) 
>>>> 
>>>> Here are the results in iops with 1 client - 4k randread - 3 osd - with different queue depths. 
>>>> With rbd_cache the results are almost the same as without cache at queue depth <16 
>>>> 
>>>> 
>>>> cache 
>>>> ----- 
>>>> qd1: 1651 
>>>> qd2: 3482 
>>>> qd4: 7958 
>>>> qd8: 17912 
>>>> qd16: 36020 
>>>> qd32: 42765 
>>>> qd64: 46169 
>>>> 
>>>> no cache 
>>>> -------- 
>>>> qd1: 1748 
>>>> qd2: 3570 
>>>> qd4: 8356 
>>>> qd8: 17732 
>>>> qd16: 41396 
>>>> qd32: 78633 
>>>> qd64: 79063 
>>>> qd128: 79550 
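
To make the gap easier to see, here is a short Python sketch that computes the cache / no-cache iops ratio per queue depth from the two lists above. The numbers are copied from the message; qd128 has no cached measurement, so it is left out of the comparison. The function name `cache_ratio` is mine.

```python
# iops per queue depth, copied from the two lists above
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396,
            32: 78633, 64: 79063, 128: 79550}

def cache_ratio(qd):
    """iops with rbd_cache divided by iops without it, at the same queue depth."""
    return cache[qd] / no_cache[qd]

# Only queue depths measured in both configurations can be compared.
for qd in sorted(cache):
    print(f"qd{qd:<3d} cache={cache[qd]:6d} no_cache={no_cache[qd]:6d} "
          f"ratio={cache_ratio(qd):.2f}")
```

The ratio stays close to 1 up to qd8 and falls to roughly half at qd32 and above.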
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "aderumier" < aderumier@odiso.com > 
>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday, 9 June 2015 09:28:21 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi, 
>>>> 
>>>>>> We tried adding more RBDs to single VM, but no luck. 
>>>> 
>>>> If you want to scale with more disks in a single qemu vm, you need to use the iothread feature from qemu and assign 1 iothread per disk (works with virtio-blk). 
>>>> It's working for me, I can scale by adding more disks. 
>>>> 
>>>> 
>>>> My benchmarks here are done with fio-rbd on the host. 
>>>> I can scale up to 400k iops with 10clients-rbd_cache=off on a single host and around 250k iops with 10clients-rbd_cache=on. 
>>>> 
>>>> 
>>>> I just wonder why I don't see a performance decrease around 30k iops with 1osd. 
>>>> 
>>>> I'm going to see if this tracker 
>>>> http://tracker.ceph.com/issues/11056 
>>>> 
>>>> could be the cause. 
>>>> 
>>>> (My master build was done some weeks ago) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>> To: "aderumier" < aderumier@odiso.com > 
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday, 9 June 2015 09:21:04 
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has an RBD as root disk, and 1 RBD as additional storage. For some strange reason we were not able to scale 4K RR iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. After this we did not get much benefit from adding more VMs. 
>>>> 
>>>> Here is the trend we have seen, x-axis is number of hypervisor, each hypervisor has 4 VM, each VM has 1 RBD:- 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM related tuning as well, but no luck. 
>>>> 
>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect) 
>>>> 
>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>>>> 
>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>> 
>>>> 
>>>> Hi, 
>>>> 
>>>> I'm running benchmarks (ceph master branch) with randread 4k qdepth=32,
>>>> and rbd_cache=true seems to limit iops to around 40k
>>>> 
>>>> 
>>>> no cache 
>>>> -------- 
>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>> 
>>>> 
>>>> cache 
>>>> ----- 
>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>> 
>>>> 
>>>> 
>>>> Is it expected ? 
>>>> 
>>>> 
>>>> 
>>>> fio result rbd_cache=false 3 osd 
>>>> -------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> rbd engine: RBD version: 0.1.9 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>> | 99.99th=[ 1176] 
>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> fio result rbd_cache=true 3osd 
>>>> ------------------------------ 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> rbd engine: RBD version: 0.1.9 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>> | 99.99th=[ 2192] 
>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Best regards, Fasikhov Irek Nurgayazovich
>> Mobile: +79229045757
>> 
> 
> 
> 
> -- 
> -Pushpesh 
> 
> 
> 



-- 
-Pushpesh 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-16 16:38                                               ` Alexandre DERUMIER
@ 2015-06-22  5:58                                                 ` pushpesh sharma
  2015-06-22  7:08                                                   ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: pushpesh sharma @ 2015-06-22  5:58 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

Just an update: there seems to be no proper way to pass the iothread
parameter from openstack-nova (at least not in the Juno release). So the
default, a single iothread per VM, is all we have. In conclusion, a
nova instance's max iops on ceph rbd will be limited to 30-40K.

On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Hi,
>
> some news about qemu with tcmalloc vs jemalloc.
>
> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
>
> And although tcmalloc is a little faster than jemalloc,
>
> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.
>
> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
>
>
> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc.
>
> The problem is that when I hit the malloc bug, I drop to around 4000-10000 iops, and the only way to fix it is to restart qemu ...
>
>
>
> ----- Original message -----
> From: "pushpesh sharma" <pushpesh.eck@gmail.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Friday, 12 June 2015 08:58:21
> Subject: Re: rbd_cache, limiting read on high iops around 40k
>
> Thanks, posted the question in openstack list. Hopefully will get some
> expert opinion.
>
> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> Hi,
>>
>> here is a libvirt xml sample from the libvirt source
>>
>> (you need to define the <iothreads> number, then assign them in the disks).
>>
>> I don't use openstack, so I really don't know how it works there.
>>
>>
>> <domain type='qemu'>
>>   <name>QEMUGuest1</name>
>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>   <memory unit='KiB'>219136</memory>
>>   <currentMemory unit='KiB'>219136</currentMemory>
>>   <vcpu placement='static'>2</vcpu>
>>   <iothreads>2</iothreads>
>>   <os>
>>     <type arch='i686' machine='pc'>hvm</type>
>>     <boot dev='hd'/>
>>   </os>
>>   <clock offset='utc'/>
>>   <on_poweroff>destroy</on_poweroff>
>>   <on_reboot>restart</on_reboot>
>>   <on_crash>destroy</on_crash>
>>   <devices>
>>     <emulator>/usr/bin/qemu</emulator>
>>     <disk type='file' device='disk'>
>>       <driver name='qemu' type='raw' iothread='1'/>
>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>       <target dev='vdb' bus='virtio'/>
>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>     </disk>
>>     <disk type='file' device='disk'>
>>       <driver name='qemu' type='raw' iothread='2'/>
>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>       <target dev='vdc' bus='virtio'/>
>>     </disk>
>>     <controller type='usb' index='0'/>
>>     <controller type='ide' index='0'/>
>>     <controller type='pci' index='0' model='pci-root'/>
>>     <memballoon model='none'/>
>>   </devices>
>> </domain>
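
The two steps described above (declare an <iothreads> count, then point each disk's <driver> at one of them) can be sketched in Python. This is a hypothetical helper, not part of libvirt or nova: the name assign_iothreads and the choice to touch only virtio disks that already have a <driver> element are assumptions made for illustration. It only rewrites an XML string; feeding the result back to libvirt is left out.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml: str) -> str:
    """Return domain XML where each virtio disk gets its own iothread.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(domain_xml)
    devices = root.find("devices")
    if devices is None:
        return domain_xml

    # Assumption: only virtio disks that already carry a <driver> element
    # are modified; every other device is left untouched.
    drivers = []
    for disk in devices.findall("disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is not None and target.get("bus") == "virtio" and driver is not None:
            drivers.append(driver)
    if not drivers:
        return domain_xml

    # Step 1: declare how many iothreads the domain has.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.Element("iothreads")
        vcpu = root.find("vcpu")
        # Place the new element right after <vcpu>, as in the sample.
        position = list(root).index(vcpu) + 1 if vcpu is not None else 0
        root.insert(position, iothreads)
    iothreads.text = str(len(drivers))

    # Step 2: assign iothread 1..N to the disks, one per disk.
    for number, driver in enumerate(drivers, start=1):
        driver.set("iothread", str(number))

    return ET.tostring(root, encoding="unicode")
```

Applied to the sample above, the result should be equivalent in substance, since its two virtio disks already use iothread='1' and iothread='2'.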
>>
>>
>> ----- Original message -----
>> From: "pushpesh sharma" <pushpesh.eck@gmail.com>
>> To: "aderumier" <aderumier@odiso.com>
>> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Friday, 12 June 2015 07:52:41
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Hi Alexandre,
>>
>> I agree with your rationale of one iothread per disk. CPU consumed in
>> IOwait is pretty high in each VM. But I can't find a way to set
>> this on a nova instance. I am using openstack Juno with QEMU+KVM.
>> As per the libvirt documentation for setting iothreads, I can edit
>> domain.xml directly and achieve the same effect. However, in an
>> openstack environment the domain xml is created by nova with some additional
>> metadata, so editing the domain xml using 'virsh edit' does not seem
>> to work (I agree, it is not a very cloud way of doing things, but a
>> hack). Changes made there vanish after saving them, because
>> libvirt validation fails on the result.
>>
>> #virsh dumpxml instance-000000c5 > vm.xml
>> #virt-xml-validate vm.xml
>> Relax-NG validity error : Extra element cpu in interleave
>> vm.xml:1: element domain: Relax-NG validity error : Element domain
>> failed to validate content
>> vm.xml fails to validate
>>
>> The second approach I took was setting QoS in volume types. But there
>> is no option to set iothreads per volume; there are only parameters related
>> to max_read/write ops/bytes.
>>
>> Thirdly, editing the Nova flavor and providing extra specs like
>> hw:cpu_socket/thread/core can change the guest CPU topology, but again
>> there is no way to set iothread. It does accept hw_disk_iothreads (no type check
>> in place, I believe), but it cannot pass that through to domain.xml.
>>
>> Could you suggest a way to set this?
>>
>> -Pushpesh
>>
>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>
>>> Sure no problem.
>>>
>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks and 1 iothread per disk)
>>>
>>>
>>> ----- Original message -----
>>> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
>>> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
>>> Sent: Wednesday, 10 June 2015 09:06:32
>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>> Thanks for sharing the data.
>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>
>>> Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER
>>> Sent: Tuesday, June 09, 2015 10:42 PM
>>> To: Irek Fasikhov
>>> Cc: ceph-devel; pushpesh sharma; ceph-users
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>>>>Very good work!
>>>>>Do you have a rpm-file?
>>>>>Thanks.
>>> No, sorry, I compiled it manually (and I'm using debian jessie as the client).
>>>
>>>
>>>
>>> ----- Original message -----
>>> From: "Irek Fasikhov" <malmyzh@gmail.com>
>>> To: "aderumier" <aderumier@odiso.com>
>>> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
>>> Sent: Wednesday, 10 June 2015 07:21:42
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi, Alexandre.
>>>
>>> Very good work!
>>> Do you have a rpm-file?
>>> Thanks.
>>>
>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > :
>>>
>>>
>>> Hi,
>>>
>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%)!
>>>
>>>
>>>
>>> qemu : no-iothread : glibc : iops=33395
>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>
>>>
>>> qemu : iothread : glibc : iops=34516
>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>>
>>>
>>>
>>>
>>>
>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>> ------------------------------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
>>> clat percentiles (usec):
>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
>>> | 99.99th=[ 3760]
>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
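
As a rough cross-check of these fio numbers, Little's law relates queue depth, latency, and throughput: IOPS is approximately iodepth divided by average latency. The sketch below assumes the queue stays full for the whole run, which the "IO depths : 32=100.0%" figure suggests for this libaio job; estimated_iops is an illustrative name, not a fio feature.

```python
def estimated_iops(iodepth: int, avg_lat_usec: float) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    avg_lat_sec = avg_lat_usec * 1e-6
    return iodepth / avg_lat_sec


# Values taken from the run above: iodepth=32, "lat (usec): ... avg=635.27".
estimate = estimated_iops(32, 635.27)
reported = 50276
print(f"estimated {estimate:.0f} iops vs reported {reported} iops")
```

With these inputs the estimate lands within about 1% of the reported iops=50276. For a run where the queue is not always full, such as one whose IO depths line shows most requests at depth 16, the same formula would likely overestimate.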
>>>
>>>
>>>
>>>
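>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>> ------------------------------------------------------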
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
>>> clat percentiles (usec):
>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
>>> | 99.99th=[ 3632]
>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
>>>
>>>
>>> ----- Original message -----
>>> From: "aderumier" < aderumier@odiso.com >
>>> To: "Robert LeBlanc" < robert@leblancnet.us >
>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 18:47:27
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Robert,
>>>
>>>>>What I found was that Ceph OSDs performed well with either tcmalloc or
>>>>>jemalloc (except when RocksDB was built with jemalloc instead of
>>>>>tcmalloc, I'm still working to dig into why that might be the case).
>>> Yes, from my tests, tcmalloc is a little faster (but only slightly) than jemalloc for the osd.
>>>
>>>
>>>
>>>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>>>better for QEMU/KVM in the tests that we ran. [1]
>>>
>>>
>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc.
>>> with qemu iothread, tcmalloc has a speed increase over glibc
>>> with qemu iothread, jemalloc has a speed decrease
>>>
>>> without iothread, jemalloc has a big speed increase
>>>
>>> this is with
>>> -qemu 2.3
>>> -tcmalloc 2.2.1
>>> -jemalloc 3.6
>>> -libc6 2.19
>>>
>>>
>>> qemu : no-iothread : glibc : iops=33395
>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>
>>> qemu : iothread : glibc : iops=34516
>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>> qemu : iothread : jemalloc : iops=28023 (-19%)
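
The percentages in the table above appear to be relative to the glibc run in the same mode (iothread or no-iothread). A small sketch of that arithmetic, assuming that baseline; the dictionary layout and the name relative_change are illustrative only:

```python
# iops figures copied from the summary table above
results = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def relative_change(iops: float, baseline: float) -> float:
    """Percentage change of iops against a baseline value."""
    return (iops - baseline) / baseline * 100.0


for mode, by_allocator in results.items():
    baseline = by_allocator["glibc"]  # assumed baseline: glibc in the same mode
    for allocator, iops in by_allocator.items():
        if allocator == "glibc":
            continue
        change = relative_change(iops, baseline)
        print(f"qemu : {mode} : {allocator} : iops={iops} ({change:+.1f}%)")
```

Comparing against a per-mode baseline keeps the allocator effect separate from the iothread effect, which matters here because jemalloc moves in opposite directions in the two modes.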
>>>
>>>
>>> (The benefit of iothreads is that we can scale with more disks in 1vm)
>>>
>>>
>>> fio results:
>>> ------------
>>>
>>> qemu : iothread : tcmalloc : iops=38676
>>> -----------------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
>>> clat percentiles (usec):
>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
>>> | 99.99th=[ 3888]
>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
>>>
>>>
>>>
>>> qemu : no-iothread : tcmalloc : iops=34516
>>> ---------------------------------------------
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
>>> clat percentiles (usec):
>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
>>> | 99.99th=[ 4320]
>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%
>>>
>>>
>>>
>>> qemu : iothread : glibc : iops=34516
>>> -------------------------------------
>>>
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
>>> clat percentiles (usec):
>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
>>> | 99.99th=[ 3984]
>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%
>>>
>>>
>>>
>>> qemu : no iothread : glibc : iops=33395
>>> -----------------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015
>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
>>> clat percentiles (usec):
>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
>>> | 99.99th=[ 4832]
>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%
>>>
>>>
>>>
>>> qemu : iothread : jemalloc : iops=28023
>>> ----------------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015
>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
>>> clat percentiles (usec):
>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
>>> | 99.99th=[ 3760]
>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%
>>>
>>>
>>>
>>> qemu : no-iothread : jemalloc : iops=42226
>>> --------------------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015
>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
>>> clat percentiles (usec):
>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
>>> | 99.99th=[ 2608]
>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
>>> lat (msec) : 2=10.30%, 4=0.07%
>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec
>>>
>>> Disk stats (read/write):
>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%
>>>
>>>
>>>
>>> ----- Original message -----
>>> From: "Robert LeBlanc" < robert@leblancnet.us >
>>> To: "aderumier" < aderumier@odiso.com >
>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Tuesday, 9 June 2015 18:00:29
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> I also saw a similar performance increase by using alternative memory
>>> allocators. What I found was that Ceph OSDs performed well with either
>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
>>> instead of tcmalloc, I'm still working to dig into why that might be
>>> the case).
>>>
>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>> better for QEMU/KVM in the tests that we ran. [1]
>>>
>>> I'm currently looking into I/O bottlenecks around the 16KB range and
>>> I'm seeing a lot of time in thread creation and destruction, the
>>> memory allocators are quite a bit down the list (both fio with
>>> ioengine rbd and on the OSDs). I wonder what the difference can be.
>>> I've tried using the async messenger but there wasn't a huge
>>> difference. [2]
>>>
>>> Further down the rabbit hole....
>>>
>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
>>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
>>> ----------------
>>> Robert LeBlanc
>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>>>>IOPS from 1 VM!
>>>>
>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
>>>> (I'm planning to send qemu results soon.)
>>>>
>>>>>>How fast are the SSDs in those 3 OSDs?
>>>>
>>>> These results are with the data in the buffer cache of the OSD nodes.
>>>>
>>>> When reading fully from SSD (Intel S3500),
>>>>
>>>> for 1 client,
>>>>
>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>>>>
>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in the buffer cache.
>>>>
>>>> (Server and client CPUs are 2x 10-core 3.1 GHz Xeon E5.)
>>>>
>>>>
>>>>
>>>> small tip :
>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%
>>>>
>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>>>
>>>> as a lot of time is spent in malloc/free
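The LD_PRELOAD tip above can also be scripted. This is only a minimal Python sketch: the library path is the one quoted in the commands above and will differ between distributions, and the helper names `preload_env` and `run_preloaded` are made up for illustration.

```python
import os
import subprocess

# Path copied from the commands above; adjust it for your distribution.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_preloaded(argv, library=TCMALLOC):
    """Run argv (for example ["fio", "job.fio"]) with the allocator preloaded."""
    return subprocess.run(argv, env=preload_env(library), check=True)
```

Calling `run_preloaded(["fio", "job.fio"])` is then equivalent to the first command above, with `job.fio` standing in for whatever job file is used.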
>>>>
>>>>
>>>> (qemu has also supported tcmalloc for some months; I'll bench it too:
>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>>>>
>>>>
>>>>
>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Mark Nelson" < mnelson@redhat.com >
>>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Tuesday 9 June 2015 13:36:31
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi All,
>>>>
>>>> In the past we've hit some performance issues with RBD cache that we've
>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read
>>>> IOPS in testing (or at least I never have). I suspect there's a couple
>>>> of possibilities as to why it might be slower, but perhaps joshd can
>>>> chime in as he's more familiar with what that code looks like.
>>>>
>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>>>>
>>>> Mark
>>>>
>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>>>>> It seems that the limit mainly appears at high queue depths (roughly > 16).
>>>>>
>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>>>>> With rbd_cache, results are almost the same as without cache at queue depths < 16.
>>>>>
>>>>>
>>>>> cache
>>>>> -----
>>>>> qd1: 1651
>>>>> qd2: 3482
>>>>> qd4: 7958
>>>>> qd8: 17912
>>>>> qd16: 36020
>>>>> qd32: 42765
>>>>> qd64: 46169
>>>>>
>>>>> no cache
>>>>> --------
>>>>> qd1: 1748
>>>>> qd2: 3570
>>>>> qd4: 8356
>>>>> qd8: 17732
>>>>> qd16: 41396
>>>>> qd32: 78633
>>>>> qd64: 79063
>>>>> qd128: 79550
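To make the gap between the two tables above easier to see, here is a small Python sketch that computes the cached IOPS as a fraction of the uncached IOPS at each queue depth. The numbers are copied from the tables; the function name `cache_ratio` is only illustrative.

```python
# IOPS per queue depth, copied from the two tables above
# (1 client, 4k randread, 3 OSDs).
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396,
            32: 78633, 64: 79063, 128: 79550}


def cache_ratio(cached, uncached):
    """Return {queue_depth: cached_iops / uncached_iops} for depths in both tables."""
    return {qd: cached[qd] / uncached[qd] for qd in sorted(cached) if qd in uncached}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(cache, no_cache).items():
        print(f"qd{qd}: {ratio:.2f}")
```

A ratio close to 1 means rbd_cache costs nothing at that depth; the ratio drops well below 1 at qd32 and qd64, which is where the limit shows up.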
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "aderumier" < aderumier@odiso.com >
>>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>>> Sent: Tuesday 9 June 2015 09:28:21
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>>
>>>>> Hi,
>>>>>
>>>>>>> We tried adding more RBDs to single VM, but no luck.
>>>>>
>>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk).
>>>>> It's working for me; I can scale by adding more disks.
>>>>>
>>>>>
>>>>> My benchmarks here are done with fio-rbd on the host.
>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>>>>
>>>>>
>>>>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD.
>>>>>
>>>>> I'm going to see if this tracker
>>>>> http://tracker.ceph.com/issues/11056
>>>>>
>>>>> could be the cause.
>>>>>
>>>>> (My master build was done some week ago)
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>>> To: "aderumier" < aderumier@odiso.com >
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>>> Sent: Tuesday 9 June 2015 09:21:04
>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>
>>>>> Hi Alexandre,
>>>>>
>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, 4K random-read (RR) iops on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
>>>>>
>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>>>
>>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect.)
>>>>>
>>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>>>>
>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running benchmarks (ceph master branch) with randread 4k qdepth=32,
>>>>> and rbd_cache=true seems to limit the iops to around 40k
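The job file behind this benchmark is not shown in the thread. As a rough sketch only, a fio job matching the parameters in the output header (rw=randread, bs=4k, ioengine=rbd, iodepth=32) could be generated as below. The pool, image and client names are placeholders, not values from this thread, and rbd_cache itself is assumed to be toggled in ceph.conf and not in the job file.

```python
def rbd_randread_job(pool, image, client="admin", iodepth=32, block_size="4k"):
    """Build the text of a fio job file for a random-read test through the rbd engine.

    pool, image and client are placeholders; replace them with real names.
    """
    lines = [
        "[global]",
        "ioengine=rbd",
        f"clientname={client}",
        f"pool={pool}",
        f"rbdname={image}",
        "rw=randread",
        f"bs={block_size}",
        "",
        f"[rbd_iodepth{iodepth}-test]",
        f"iodepth={iodepth}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "rbd" and "test" are hypothetical pool and image names.
    print(rbd_randread_job("rbd", "test"), end="")
```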
>>>>>
>>>>>
>>>>> no cache
>>>>> --------
>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops
>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops
>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>>>>
>>>>>
>>>>> cache
>>>>> -----
>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops
>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops
>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>>>>
>>>>>
>>>>>
>>>>> Is it expected ?
>>>>>
>>>>>
>>>>>
>>>>> fio result rbd_cache=false 3 osd
>>>>> --------------------------------
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>> fio-2.1.11
>>>>> Starting 1 process
>>>>> rbd engine: RBD version: 0.1.9
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>>>> clat percentiles (usec):
>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>>>> | 99.99th=[ 1176]
>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>>>> lat (msec) : 2=0.03%, 4=0.01%
>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>
>>>>> Run status group 0 (all jobs):
>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>>>>
>>>>> Disk stats (read/write):
>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> fio result rbd_cache=true 3osd
>>>>> ------------------------------
>>>>>
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>> fio-2.1.11
>>>>> Starting 1 process
>>>>> rbd engine: RBD version: 0.1.9
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>>>> clat percentiles (usec):
>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>>>> | 99.99th=[ 2192]
>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>
>>>>> Run status group 0 (all jobs):
>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
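The fio figures in the two runs above can be cross-checked against each other: with 4K blocks, IOPS is the bandwidth in KB/s divided by 4, and the runtime is the amount of data divided by the bandwidth. A short Python sketch of that arithmetic follows; the function names are illustrative only.

```python
def iops_from_bandwidth(bw_kb_per_s, block_kb=4):
    """IOPS implied by a bandwidth in KB/s at a fixed block size in KB."""
    return bw_kb_per_s / block_kb


def runtime_ms(io_mb, bw_kb_per_s):
    """Runtime in milliseconds needed to move io_mb megabytes at bw_kb_per_s."""
    return io_mb * 1024 / bw_kb_per_s * 1000


if __name__ == "__main__":
    # (label, io in MB, bw in KB/s) copied from the two fio reports above.
    for label, io_mb, bw in [("rbd_cache=false", 10000, 313169),
                             ("rbd_cache=true", 10000, 183296)]:
        print(label,
              f"iops~{iops_from_bandwidth(bw):.0f}",
              f"runt~{runtime_ms(io_mb, bw):.0f}ms")
```

Both runs come out within rounding of the iops and runt values that fio printed, so the gap between the two runs is a genuine throughput difference.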
>>>>>
>>>>> Disk stats (read/write):
>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Best regards, Фасихов Ирек Нургаязович
>>> Mobile: +79229045757
>>>
>>>
>>
>>
>>
>> --
>> -Pushpesh
>>
>>
>>
>
>
>
> --
> -Pushpesh
>
>



-- 
-Pushpesh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-22  5:58                                                 ` pushpesh sharma
@ 2015-06-22  7:08                                                   ` Alexandre DERUMIER
  2015-06-22  7:12                                                     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-22  7:08 UTC (permalink / raw)
  To: pushpesh sharma; +Cc: Somnath Roy, Irek Fasikhov, ceph-devel, ceph-users

>>Just an update: there seems to be no proper way to pass the iothread 
>>parameter from openstack-nova (at least not in the Juno release). So a 
>>default single iothread per VM is all we have. In conclusion, the max 
>>iops of a nova instance on ceph rbd will be limited to 30-40K. 

Thanks for the update.

For Proxmox users:

I have added an iothread option to the GUI for Proxmox 4.0,
and made jemalloc the default memory allocator.


I have also sent a jemalloc patch to the qemu-devel mailing list:
https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html

(Help is welcome to get it into qemu upstream!)



----- Original Message -----
From: "pushpesh sharma" <pushpesh.eck@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Monday 22 June 2015 07:58:47
Subject: Re: rbd_cache, limiting read on high iops around 40k

Just an update: there seems to be no proper way to pass the iothread 
parameter from openstack-nova (at least not in the Juno release). So a 
default single iothread per VM is all we have. In conclusion, the max 
iops of a nova instance on ceph rbd will be limited to 30-40K. 

On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
> Hi, 
> 
> some news about qemu with tcmalloc vs jemalloc. 
> 
> I'm testing with multiple disks (with iothreads) in 1 qemu guest. 
> 
> And while tcmalloc is a little faster than jemalloc, 
> 
> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times. 
> 
> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help. 
> 
> 
> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc. 
> 
> The problem is that when I hit the malloc bug, I drop to around 4000-10000 iops, and the only way to fix it is to restart qemu ... 
> 
> 
> 
> ----- Original Message ----- 
> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
> To: "aderumier" <aderumier@odiso.com> 
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
> Sent: Friday 12 June 2015 08:58:21 
> Subject: Re: rbd_cache, limiting read on high iops around 40k 
> 
> Thanks, posted the question in openstack list. Hopefully will get some 
> expert opinion. 
> 
> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
> <aderumier@odiso.com> wrote: 
>> Hi, 
>> 
>> Here is a libvirt xml sample from the libvirt source 
>> 
>> (you need to define the <iothreads> number, then assign them to disks). 
>> 
>> I don't use openstack, so I really don't know how it works with it. 
>> 
>> 
>> <domain type='qemu'>
>>   <name>QEMUGuest1</name>
>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>   <memory unit='KiB'>219136</memory>
>>   <currentMemory unit='KiB'>219136</currentMemory>
>>   <vcpu placement='static'>2</vcpu>
>>   <iothreads>2</iothreads>
>>   <os>
>>     <type arch='i686' machine='pc'>hvm</type>
>>     <boot dev='hd'/>
>>   </os>
>>   <clock offset='utc'/>
>>   <on_poweroff>destroy</on_poweroff>
>>   <on_reboot>restart</on_reboot>
>>   <on_crash>destroy</on_crash>
>>   <devices>
>>     <emulator>/usr/bin/qemu</emulator>
>>     <disk type='file' device='disk'>
>>       <driver name='qemu' type='raw' iothread='1'/>
>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>       <target dev='vdb' bus='virtio'/>
>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>     </disk>
>>     <disk type='file' device='disk'>
>>       <driver name='qemu' type='raw' iothread='2'/>
>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>       <target dev='vdc' bus='virtio'/>
>>     </disk>
>>     <controller type='usb' index='0'/>
>>     <controller type='ide' index='0'/>
>>     <controller type='pci' index='0' model='pci-root'/>
>>     <memballoon model='none'/>
>>   </devices>
>> </domain>
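For domains managed by hand (not regenerated by nova), the same change can be applied to an existing domain XML with a script. This is a minimal Python sketch that follows the sample above: it gives every virtio disk its own iothread and sets `<iothreads>` to match. The function name `assign_iothreads` is hypothetical, and the result should still be checked against your libvirt version before use.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml):
    """Give each virtio disk in a libvirt domain XML string its own iothread.

    Mirrors the sample above: <iothreads>N</iothreads> under <domain>, and
    iothread='1'..'N' on the <driver> element of each virtio disk.
    """
    root = ET.fromstring(domain_xml)
    virtio_disks = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None and target.get("bus") == "virtio":
            virtio_disks.append(disk)
    if not virtio_disks:
        return domain_xml  # nothing to do

    for number, disk in enumerate(virtio_disks, start=1):
        driver = disk.find("driver")
        if driver is None:
            # Assumption: a missing <driver> is created with name='qemu'.
            driver = ET.SubElement(disk, "driver", {"name": "qemu"})
        driver.set("iothread", str(number))

    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.SubElement(root, "iothreads")
    iothreads.text = str(len(virtio_disks))
    return ET.tostring(root, encoding="unicode")
```

The output can be written to a file and loaded with `virsh define`; this does not solve the nova case discussed in this thread, where the domain XML is regenerated.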
>> 
>> 
>> ----- Original Message ----- 
>> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Friday 12 June 2015 07:52:41 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Hi Alexandre, 
>> 
>> I agree with your rationale of one iothread per disk. CPU consumed in 
>> IOwait is pretty high in each VM. But I cannot find a way to set 
>> this on a nova instance. I am using openstack Juno with QEMU+KVM. 
>> As per the libvirt documentation for setting iothreads, I can edit 
>> domain.xml directly and achieve the same effect. However, in an 
>> openstack env the domain xml is created by nova with some additional 
>> metadata, so editing the domain xml using 'virsh edit' does not seem 
>> to work (I agree, it is not a very cloud way of doing things, but a 
>> hack). Changes made there vanish after saving them, because 
>> libvirt validation fails on the result. 
>> 
>> #virsh dumpxml instance-000000c5 > vm.xml 
>> #virt-xml-validate vm.xml 
>> Relax-NG validity error : Extra element cpu in interleave 
>> vm.xml:1: element domain: Relax-NG validity error : Element domain 
>> failed to validate content 
>> vm.xml fails to validate 
>> 
>> The second approach I took was setting QoS in volume types. But there 
>> is no option to set iothreads per volume; there are only parameters related 
>> to max read/write ops/bytes. 
>> 
>> Thirdly, editing the Nova flavor and providing extra specs like 
>> hw:cpu_socket/thread/core can change the guest CPU topology, but again 
>> there is no way to set iothreads. It does accept hw_disk_iothreads (no type check 
>> in place, I believe), but cannot pass it on to domain.xml. 
>> 
>> Could you suggest a way to set this? 
>> 
>> -Pushpesh 
>> 
>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
>> <aderumier@odiso.com> wrote: 
>>>>>I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>> 
>>> Sure no problem. 
>>> 
>>> (BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks, with 1 iothread per disk) 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Somnath Roy" <Somnath.Roy@sandisk.com> 
>>> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com> 
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>>> Sent: Wednesday 10 June 2015 09:06:32 
>>> Subject: RE: rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi Alexandre, 
>>> Thanks for sharing the data. 
>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>> 
>>> Regards 
>>> Somnath 
>>> 
>>> -----Original Message----- 
>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER 
>>> Sent: Tuesday, June 09, 2015 10:42 PM 
>>> To: Irek Fasikhov 
>>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>> 
>>>>>Very good work! 
>>>>>Do you have a rpm-file? 
>>>>>Thanks. 
>>> No, sorry, I compiled it manually (and I'm using Debian jessie as client). 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Irek Fasikhov" <malmyzh@gmail.com> 
>>> To: "aderumier" <aderumier@odiso.com> 
>>> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>>> Sent: Wednesday 10 June 2015 07:21:42 
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi, Alexandre. 
>>> 
>>> Very good work! 
>>> Do you have a rpm-file? 
>>> Thanks. 
>>> 
>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
>>> 
>>> 
>>> Hi, 
>>> 
>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%)! 
>>> 
>>> 
>>> 
>>> qemu : no iothread : glibc : iops=33395 
>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%) 
>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 
>>> 
>>> 
>>> qemu : iothread : glibc : iops=34446 
>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>> ------------------------------------------------------ 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015 
>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
>>> | 99.99th=[ 3760] 
>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015 
>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
>>> | 99.99th=[ 3632] 
>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "aderumier" < aderumier@odiso.com > 
>>> To: "Robert LeBlanc" < robert@leblancnet.us > 
>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>> Sent: Tuesday 9 June 2015 18:47:27 
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi Robert, 
>>> 
>>>>>What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>>jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>>tcmalloc, I'm still working to dig into why that might be the case). 
>>> Yes, from my tests, tcmalloc is a little faster (but only very little) than jemalloc for the OSDs. 
>>> 
>>> 
>>> 
>>>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>>better for QEMU/KVM in the tests that we ran. [1] 
>>> 
>>> 
>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc. 
>>> With qemu iothread, tcmalloc gives a speed increase over glibc. 
>>> With qemu iothread, jemalloc gives a speed decrease. 
>>> 
>>> Without iothread, jemalloc gives a big speed increase. 
>>> 
>>> this is with 
>>> -qemu 2.3 
>>> -tcmalloc 2.2.1 
>>> -jemalloc 3.6 
>>> -libc6 2.19 
>>> 
>>> 
>>> qemu : no iothread : glibc : iops=33395 
>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>> 
>>> qemu : iothread : glibc : iops=34446 
>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
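The percentages in the table above are relative to the glibc row of the same group. As a quick check, here is a Python sketch that recomputes them for the no-iothread group, using the IOPS figures copied from the table; the helper name `percent_change` is only illustrative.

```python
def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# IOPS for the no-iothread runs, copied from the table above.
baseline_glibc = 33395
no_iothread = {"tcmalloc": 34516, "jemalloc": 42226}

if __name__ == "__main__":
    for allocator, iops in no_iothread.items():
        print(f"{allocator}: {percent_change(iops, baseline_glibc):+.1f}%")
```

Rounded to whole percent, this gives the +3% and +26% quoted for tcmalloc and jemalloc without iothread.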
>>> 
>>> 
>>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
>>> 
>>> 
>>> fio results: 
>>> ------------ 
>>> 
>>> qemu : iothread : tcmalloc : iops=38676 
>>> ----------------------------------------- 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>>> | 99.99th=[ 3888] 
>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
>>> 
>>> 
>>> 
>>> qemu : no-iothread : tcmalloc : iops=34516 
>>> --------------------------------------------- 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>>> | 99.99th=[ 4320] 
>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>>> 
>>> 
>>> 
>>> qemu : iothread : glibc : iops=34446 
>>> ------------------------------------- 
>>> 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>>> | 99.99th=[ 3984] 
>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>>> 
>>> 
>>> 
>>> qemu : no iothread : glibc : iops=33395 
>>> ----------------------------------------- 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>>> | 99.99th=[ 4832] 
>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>>> 
>>> 
>>> 
>>> qemu : iothread : jemalloc : iops=28023 
>>> ---------------------------------------- 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>>> | 99.99th=[ 3760] 
>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>>> 
>>> 
>>> 
>>> qemu : non-iothread : jemalloc : iops=42226 
>>> -------------------------------------------- 
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>> fio-2.1.11 
>>> Starting 1 process 
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>>> clat percentiles (usec): 
>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>>> | 99.99th=[ 2608] 
>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>>> lat (msec) : 2=10.30%, 4=0.07% 
>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>>> 
>>> Disk stats (read/write): 
>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
>>> 
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "Robert LeBlanc" < robert@leblancnet.us > 
>>> To: "aderumier" < aderumier@odiso.com > 
>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>> Sent: Tuesday, 9 June 2015 18:00:29 
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>> 
>>> -----BEGIN PGP SIGNED MESSAGE----- 
>>> Hash: SHA256 
>>> 
>>> I also saw a similar performance increase by using alternative memory 
>>> allocators. What I found was that Ceph OSDs performed well with either 
>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>> instead of tcmalloc, I'm still working to dig into why that might be 
>>> the case). 
>>> 
>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>> better for QEMU/KVM in the tests that we ran. [1] 
>>> 
>>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>>> I'm seeing a lot of time in thread creation and destruction, the 
>>> memory allocators are quite a bit down the list (both fio with 
>>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>>> I've tried using the async messenger but there wasn't a huge 
>>> difference. [2] 
>>> 
>>> Further down the rabbit hole.... 
>>> 
>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
>>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
>>> -----BEGIN PGP SIGNATURE----- 
>>> Version: Mailvelope v0.13.1 
>>> Comment: https://www.mailvelope.com 
>>> 
>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>>> oSJX 
>>> =k281 
>>> -----END PGP SIGNATURE----- 
>>> ---------------- 
>>> Robert LeBlanc 
>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>>> 
>>> 
>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>>IOPS from 1 VM! 
>>>> 
>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead. 
>>>> (I'm planning to send qemu results soon) 
>>>> 
>>>>>>How fast are the SSDs in those 3 OSDs? 
>>>> 
>>>> These results are with the data held in the buffer memory of the OSD nodes. 
>>>> 
>>>> When reading fully from SSD (Intel S3500), 
>>>> 
>>>> for 1 client, 
>>>> 
>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD. 
>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs. 
>>>> 
>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer. 
>>>> 
>>>> (server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon) 
>>>> 
>>>> 
>>>> 
>>>> small tip : 
>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>>> 
>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>>> 
>>>> as a lot of time is spent in malloc/free 
>>>> 
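The LD_PRELOAD tip above can also be scripted when many benchmark runs are launched from a harness. The following is a minimal sketch, not anything used in this thread: the helper names are hypothetical, and the library path is the one quoted in the message, which differs between distributions.

```python
import os
import subprocess
from typing import Dict, List, Optional

# Path taken from the message above; adjust for your distribution.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with `library` prepended to LD_PRELOAD."""
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_preloaded(cmd: List[str], library: str = TCMALLOC) -> int:
    """Run cmd with the allocator preloaded; refuse to run if the library is missing."""
    if not os.path.exists(library):
        raise FileNotFoundError(library)
    return subprocess.run(cmd, env=preload_env(library)).returncode
```

Something like `run_preloaded(["fio", "job.fio"])` would then be the scripted equivalent of the first command above, where `job.fio` stands for whatever job file you use.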
>>>> 
>>>> (qemu has also supported tcmalloc for some months, I'll bench it too 
>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
>>>> 
>>>> 
>>>> 
>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Mark Nelson" < mnelson@redhat.com > 
>>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday, 9 June 2015 13:36:31 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi All, 
>>>> 
>>>> In the past we've hit some performance issues with RBD cache that we've 
>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>>> chime in as he's more familiar with what that code looks like. 
>>>> 
>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>>> 
>>>> Mark 
>>>> 
>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>>> It seems that the limit mainly shows up at high queue depth (roughly > 16) 
>>>>> 
>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths. 
>>>>> With rbd_cache the results are almost the same as without cache at queue depth < 16 
>>>>> 
>>>>> 
>>>>> cache 
>>>>> ----- 
>>>>> qd1: 1651 
>>>>> qd2: 3482 
>>>>> qd4: 7958 
>>>>> qd8: 17912 
>>>>> qd16: 36020 
>>>>> qd32: 42765 
>>>>> qd64: 46169 
>>>>> 
>>>>> no cache 
>>>>> -------- 
>>>>> qd1: 1748 
>>>>> qd2: 3570 
>>>>> qd4: 8356 
>>>>> qd8: 17732 
>>>>> qd16: 41396 
>>>>> qd32: 78633 
>>>>> qd64: 79063 
>>>>> qd128: 79550 
>>>>> 
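To make the divergence in the two tables above easier to see, the per-queue-depth ratio of cached to uncached iops can be computed directly. This is just an illustrative sketch over the numbers quoted above; the function name is made up, and qd128 is skipped because only the no-cache run reported it.

```python
# iops per queue depth, copied from the two tables above
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(with_cache, without_cache):
    """Return {queue depth: cached iops / uncached iops} for depths present in both."""
    return {qd: with_cache[qd] / without_cache[qd]
            for qd in sorted(with_cache) if qd in without_cache}


ratios = cache_ratio(cache, no_cache)
for qd, ratio in ratios.items():
    print(f"qd{qd}: {ratio:.2f}")
```

A ratio near 1 means rbd_cache costs nothing at that depth; the ratio falls well below 1 once the queue depth goes past 16.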
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "aderumier" < aderumier@odiso.com > 
>>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>> Sent: Tuesday, 9 June 2015 09:28:21 
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>>>> We tried adding more RBDs to single VM, but no luck. 
>>>>> 
>>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk). 
>>>>> It works for me; I can scale by adding more disks. 
>>>>> 
>>>>> 
>>>>> My benchmarks here are done with fio-rbd on the host. 
>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on. 
>>>>> 
>>>>> 
>>>>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD. 
>>>>> 
>>>>> I'm going to see if this tracker 
>>>>> http://tracker.ceph.com/issues/11056 
>>>>> 
>>>>> could be the cause. 
>>>>> 
>>>>> (My master build was done some weeks ago) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>> To: "aderumier" < aderumier@odiso.com > 
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>> Sent: Tuesday, 9 June 2015 09:21:04 
>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>>>> 
>>>>> Hi Alexandre, 
>>>>> 
>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and 1 RBD as additional storage. For some strange reason it was not able to scale 4K random-read (RR) iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. After that we got little benefit from adding more VMs. 
>>>>> 
>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, each VM has 1 RBD: 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 
>>>>> 
>>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect) 
>>>>> 
>>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>>>>> 
>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>> 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> I'm benchmarking (ceph master branch) with randread 4k qdepth=32, 
>>>>> and rbd_cache=true seems to limit the iops to around 40k 
>>>>> 
>>>>> 
>>>>> no cache 
>>>>> -------- 
>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>>> 
>>>>> 
>>>>> cache 
>>>>> ----- 
>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>>> 
>>>>> 
>>>>> 
>>>>> Is it expected ? 
>>>>> 
>>>>> 
>>>>> 
>>>>> fio result rbd_cache=false 3 osd 
>>>>> -------------------------------- 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> rbd engine: RBD version: 0.1.9 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>>> | 99.99th=[ 1176] 
>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> fio result rbd_cache=true 3osd 
>>>>> ------------------------------ 
>>>>> 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> rbd engine: RBD version: 0.1.9 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>>> | 99.99th=[ 2192] 
>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>>>>> 
>>>> _______________________________________________ 
>>>> ceph-users mailing list 
>>>> ceph-users@lists.ceph.com 
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best regards, Irek Nurgayazovich Fasikhov 
>>> Mobile: +79229045757 
>>> 



-- 
-Pushpesh 



* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-22  7:08                                                   ` Alexandre DERUMIER
@ 2015-06-22  7:12                                                     ` Stefan Priebe - Profihost AG
       [not found]                                                       ` <942E436A-5668-4F76-91E7-FAA08CC0F48A-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-06-22  7:12 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: pushpesh sharma, Somnath Roy, Irek Fasikhov, ceph-devel,
	ceph-users


On 22.06.2015 at 09:08, Alexandre DERUMIER <aderumier@odiso.com> wrote:

>>> Just an update: there seems to be no proper way to pass the iothread 
>>> parameter from openstack-nova (at least not in the Juno release). So a 
>>> default single iothread per VM is all we have. So in conclusion, a 
>>> nova instance's max iops on ceph rbd will be limited to 30-40K.
> 
> Thanks for the update.
> 
> For proxmox users, 
> 
> I have added an iothread option to the GUI for proxmox 4.0

Can we make iothread the default? Does it also help for single disks or only multiple disks?

> and added jemalloc as default memory allocator
> 
> 
> I have also sent a jemalloc patch to the qemu-devel mailing list
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
> 
> (Help is welcome to get it into qemu upstream!)
> 
> 
> 
> ----- Original Message -----
> From: "pushpesh sharma" <pushpesh.eck@gmail.com>
> To: "aderumier" <aderumier@odiso.com>
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Monday, 22 June 2015 07:58:47
> Subject: Re: rbd_cache, limiting read on high iops around 40k
> 
> Just an update: there seems to be no proper way to pass the iothread 
> parameter from openstack-nova (at least not in the Juno release). So a 
> default single iothread per VM is all we have. So in conclusion, a 
> nova instance's max iops on ceph rbd will be limited to 30-40K. 
> 
> On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER 
> <aderumier@odiso.com> wrote: 
>> Hi, 
>> 
>> some news about qemu with tcmalloc vs jemalloc. 
>> 
>> I'm testing with multiple disks (with iothreads) in 1 qemu guest. 
>> 
>> And although tcmalloc is a little faster than jemalloc, 
>> 
>> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times. 
>> 
>> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help. 
>> 
>> 
>> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc. 
>> 
>> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to restart qemu ... 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
>> To: "aderumier" <aderumier@odiso.com> 
>> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Friday, 12 June 2015 08:58:21 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Thanks, posted the question in openstack list. Hopefully will get some 
>> expert opinion. 
>> 
>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
>> <aderumier@odiso.com> wrote: 
>>> Hi, 
>>> 
>>> Here is a libvirt XML sample from the libvirt source 
>>> 
>>> (you need to define the <iothreads> number, then assign them in the disks). 
>>> 
>>> I don't use openstack, so I really don't know how it works with it. 
>>> 
>>> 
>>> <domain type='qemu'> 
>>>   <name>QEMUGuest1</name> 
>>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> 
>>>   <memory unit='KiB'>219136</memory> 
>>>   <currentMemory unit='KiB'>219136</currentMemory> 
>>>   <vcpu placement='static'>2</vcpu> 
>>>   <iothreads>2</iothreads> 
>>>   <os> 
>>>     <type arch='i686' machine='pc'>hvm</type> 
>>>     <boot dev='hd'/> 
>>>   </os> 
>>>   <clock offset='utc'/> 
>>>   <on_poweroff>destroy</on_poweroff> 
>>>   <on_reboot>restart</on_reboot> 
>>>   <on_crash>destroy</on_crash> 
>>>   <devices> 
>>>     <emulator>/usr/bin/qemu</emulator> 
>>>     <disk type='file' device='disk'> 
>>>       <driver name='qemu' type='raw' iothread='1'/> 
>>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/> 
>>>       <target dev='vdb' bus='virtio'/> 
>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> 
>>>     </disk> 
>>>     <disk type='file' device='disk'> 
>>>       <driver name='qemu' type='raw' iothread='2'/> 
>>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/> 
>>>       <target dev='vdc' bus='virtio'/> 
>>>     </disk> 
>>>     <controller type='usb' index='0'/> 
>>>     <controller type='ide' index='0'/> 
>>>     <controller type='pci' index='0' model='pci-root'/> 
>>>     <memballoon model='none'/> 
>>>   </devices> 
>>> </domain> 
>>> 
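For hosts where the domain XML is under your own control (so not the nova case discussed in this thread, where nova regenerates it), the "one iothread per virtio disk" layout from the sample above can be applied mechanically. The sketch below uses only Python's standard xml.etree.ElementTree; the function name is hypothetical and it assumes every virtio disk already has a <driver> element, as in the sample.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml: str) -> str:
    """Give each virtio disk its own iothread and set <iothreads> to match."""
    root = ET.fromstring(domain_xml)

    # Collect the <driver> element of every disk attached via the virtio bus.
    drivers = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is not None and target.get("bus") == "virtio" and driver is not None:
            drivers.append(driver)
    if not drivers:
        return domain_xml  # nothing to do

    # Create <iothreads> right after <vcpu> if the domain does not have one yet.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.Element("iothreads")
        vcpu = root.find("vcpu")
        index = list(root).index(vcpu) + 1 if vcpu is not None else 0
        root.insert(index, iothreads)
    iothreads.text = str(len(drivers))

    # iothread ids are numbered from 1, one per virtio disk.
    for number, driver in enumerate(drivers, start=1):
        driver.set("iothread", str(number))
    return ET.tostring(root, encoding="unicode")
```

The result would still need to be loaded through libvirt (for example with virsh define) to take effect; that step is not shown here.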
>>> 
>>> ----- Original Message ----- 
>>> From: "pushpesh sharma" <pushpesh.eck@gmail.com> 
>>> To: "aderumier" <aderumier@odiso.com> 
>>> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Irek Fasikhov" <malmyzh@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com> 
>>> Sent: Friday, 12 June 2015 07:52:41 
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi Alexandre, 
>>> 
>>> I agree with your rationale of one iothread per disk. CPU consumed in 
>>> IOwait is pretty high in each VM. But I can't find a way to set 
>>> this on a nova instance. I am using openstack Juno with QEMU+KVM. 
>>> As per the libvirt documentation for setting iothreads, I can edit 
>>> domain.xml directly and achieve the same effect. However, in an 
>>> openstack env the domain xml is created by nova with some additional 
>>> metadata, so editing the domain xml using 'virsh edit' does not seem 
>>> to work (I agree, it is not a very cloud way of doing things, but a 
>>> hack). Changes made there vanish after saving them, because 
>>> libvirt validation fails on the result. 
>>> 
>>> #virsh dumpxml instance-000000c5 > vm.xml 
>>> #virt-xml-validate vm.xml 
>>> Relax-NG validity error : Extra element cpu in interleave 
>>> vm.xml:1: element domain: Relax-NG validity error : Element domain 
>>> failed to validate content 
>>> vm.xml fails to validate 
>>> 
>>> The second approach I took was setting QoS in volume types. But there 
>>> is no option to set iothreads per volume; there are only parameters related 
>>> to max_read/write ops/bytes. 
>>> 
>>> Thirdly, editing the Nova flavor and providing extra specs like 
>>> hw:cpu_socket/thread/core can change the guest CPU topology, but again 
>>> there is no way to set iothread. It does accept hw_disk_iothreads (no type check 
>>> in place, I believe), but cannot pass it on to domain.xml. 
>>> 
>>> Could you suggest a way to set this? 
>>> 
>>> -Pushpesh 
>>> 
>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
>>> <aderumier@odiso.com> wrote: 
>>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>> 
>>>> Sure no problem. 
>>>> 
>>>> (BTW, I can reach around 200k iops in 1 qemu VM with 5 virtio disks, with 1 iothread per disk) 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Somnath Roy" <Somnath.Roy@sandisk.com> 
>>>> To: "aderumier" <aderumier@odiso.com>, "Irek Fasikhov" <malmyzh@gmail.com> 
>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>>>> Sent: Wednesday, 10 June 2015 09:06:32 
>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Alexandre, 
>>>> Thanks for sharing the data. 
>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>> 
>>>> Regards 
>>>> Somnath 
>>>> 
>>>> -----Original Message----- 
>>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Alexandre DERUMIER 
>>>> Sent: Tuesday, June 09, 2015 10:42 PM 
>>>> To: Irek Fasikhov 
>>>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>>>> Very good work! 
>>>>>> Do you have a rpm-file? 
>>>>>> Thanks.
>>>> No, sorry, I compiled it manually (and I'm using debian jessie as client) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Irek Fasikhov" <malmyzh@gmail.com> 
>>>> To: "aderumier" <aderumier@odiso.com> 
>>>> Cc: "Robert LeBlanc" <robert@leblancnet.us>, "ceph-devel" <ceph-devel@vger.kernel.org>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com> 
>>>> Sent: Wednesday, 10 June 2015 07:21:42 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi, Alexandre. 
>>>> 
>>>> Very good work! 
>>>> Do you have a rpm-file? 
>>>> Thanks. 
>>>> 
>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
>>>> 
>>>> 
>>>> Hi, 
>>>> 
>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%) ! 
>>>> 
>>>> 
>>>> 
>>>> qemu : no iothread : glibc : iops=33395 
>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%) 
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 
>>>> 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>> 
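The percentages in the summary above are relative to the glibc run of the same iothread setting. A small sketch of that arithmetic for the iothread rows follows; the helper name is made up, and the values come out as fractions here, so they may differ by a point from the whole-number figures quoted above.

```python
def gain_percent(iops: float, baseline: float) -> float:
    """Relative change of iops versus the baseline, in percent."""
    return (iops - baseline) / baseline * 100


# iothread results copied from the summary above; glibc is the baseline
iothread = {
    "glibc": 34516,
    "tcmalloc": 38676,
    "jemalloc": 28023,
    "tcmalloc (2.4)": 50276,
}

for name, iops in iothread.items():
    change = gain_percent(iops, iothread["glibc"])
    print(f"qemu : iothread : {name} : iops={iops} ({change:+.1f}%)")
```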
>>>> 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>> ------------------------------------------------------ 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015 
>>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
>>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
>>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
>>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
>>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
>>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
>>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
>>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
>>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
>>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015 
>>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
>>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
>>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
>>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
>>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
>>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
>>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3632] 
>>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
>>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
>>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
>>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
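
A quick sanity check on the two fio runs above: inside the guest the queue stays full (IO depths: 32=100.0%), so by Little's law the throughput should be roughly iodepth divided by mean latency. The sketch below reuses only numbers from those outputs. The function name is made up for illustration, and the second run is labelled "no-iothread, tcmalloc 2.4" only because its iops value matches that row of the summary table.

```python
# Little's law: throughput ~= outstanding I/Os / mean latency.
# Each entry: (iodepth, mean total latency in usec, IOPS reported by fio).
runs = {
    "iothread, tcmalloc 2.4": (32, 635.27, 50276),
    "no-iothread, tcmalloc 2.4": (32, 888.31, 35974),
}


def estimated_iops(iodepth, lat_usec):
    """IOPS implied by a constantly full queue and a mean latency in usec."""
    return iodepth / (lat_usec * 1e-6)


for name, (depth, lat, measured) in runs.items():
    est = estimated_iops(depth, lat)
    print(f"{name}: estimated {est:.0f} iops, fio reported {measured}")
```

Both estimates land within about 1% of what fio reports, which suggests the difference between the two runs is entirely per-request latency rather than queueing in the guest.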
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "aderumier" < aderumier@odiso.com > 
>>>> To: "Robert LeBlanc" < robert@leblancnet.us > 
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday, June 9, 2015 18:47:27 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Robert, 
>>>> 
>>>>>> What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>>> jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>>> tcmalloc, I'm still working to dig into why that might be the case).
>>>> Yes, in my tests tcmalloc is slightly faster than jemalloc for the OSDs (but only very slightly). 
>>>> 
>>>> 
>>>> 
>>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>>> better for QEMU/KVM in the tests that we ran. [1]
>>>> 
>>>> 
>>>> I have just run a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc. 
>>>> With qemu iothread, tcmalloc gives a speed increase over glibc. 
>>>> With qemu iothread, jemalloc gives a speed decrease. 
>>>> 
>>>> Without iothread, jemalloc gives a big speed increase. 
>>>> 
>>>> this is with 
>>>> -qemu 2.3 
>>>> -tcmalloc 2.2.1 
>>>> -jemalloc 3.6 
>>>> -libc6 2.19 
>>>> 
>>>> 
>>>> qemu : no-iothread : glibc : iops=33395 
>>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
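
For anyone reproducing the percentages in the table above: each figure is the change relative to the glibc run in the same iothread mode. A minimal sketch, using only the iops values from the table (the helper names are illustrative):

```python
# IOPS from the table above, grouped by iothread mode.
results = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def percent_change(iops, baseline):
    """Relative change versus the baseline, rounded to a whole percent."""
    return round(100.0 * (iops - baseline) / baseline)


def summarize(table):
    """Compare every allocator with glibc within the same iothread mode."""
    summary = {}
    for mode, by_alloc in table.items():
        base = by_alloc["glibc"]
        summary[mode] = {
            alloc: percent_change(iops, base)
            for alloc, iops in by_alloc.items()
            if alloc != "glibc"
        }
    return summary


for mode, changes in summarize(results).items():
    for alloc, pct in changes.items():
        print(f"qemu : {mode} : {alloc} : {pct:+d}%")
```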
>>>> 
>>>> 
>>>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
>>>> 
>>>> 
>>>> fio results: 
>>>> ------------ 
>>>> 
>>>> qemu : iothread : tcmalloc : iops=38676 
>>>> ----------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>>>> | 99.99th=[ 3888] 
>>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : tcmalloc : iops=34516 
>>>> --------------------------------------------- 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>>>> | 99.99th=[ 4320] 
>>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> ------------------------------------- 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3984] 
>>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : glibc : iops=33395 
>>>> ----------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>>>> | 99.99th=[ 4832] 
>>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : jemalloc : iops=28023 
>>>> ---------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : jemalloc : iops=42226 
>>>> -------------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>>>> | 99.99th=[ 2608] 
>>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>>>> lat (msec) : 2=10.30%, 4=0.07% 
>>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Robert LeBlanc" < robert@leblancnet.us > 
>>>> To: "aderumier" < aderumier@odiso.com > 
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday, June 9, 2015 18:00:29 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE----- 
>>>> Hash: SHA256 
>>>> 
>>>> I also saw a similar performance increase by using alternative memory 
>>>> allocators. What I found was that Ceph OSDs performed well with either 
>>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>>> instead of tcmalloc, I'm still working to dig into why that might be 
>>>> the case). 
>>>> 
>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>> 
>>>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>>>> I'm seeing a lot of time in thread creation and destruction, the 
>>>> memory allocators are quite a bit down the list (both fio with 
>>>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>>>> I've tried using the async messenger but there wasn't a huge 
>>>> difference. [2] 
>>>> 
>>>> Further down the rabbit hole.... 
>>>> 
>>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
>>>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
>>>> -----BEGIN PGP SIGNATURE----- 
>>>> Version: Mailvelope v0.13.1 
>>>> Comment: https://www.mailvelope.com 
>>>> 
>>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
>>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>>>> oSJX 
>>>> =k281 
>>>> -----END PGP SIGNATURE----- 
>>>> ---------------- 
>>>> Robert LeBlanc 
>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>>>> 
>>>> 
>>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>>> IOPS from 1 VM!
>>>>> 
>>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead. 
>>>>> (I'm planning to send qemu results soon) 
>>>>> 
>>>>>>> How fast are the SSDs in those 3 OSDs?
>>>>> 
>>>>> These results are with the data in the buffer memory of the OSD nodes. 
>>>>> 
>>>>> When reading fully from SSD (Intel S3500), 
>>>>> 
>>>>> for 1 client, 
>>>>> 
>>>>> I'm at around 33k iops without cache and 32k iops with cache, with 1 OSD. 
>>>>> I'm at around 55k iops without cache and 38k iops with cache, with 3 OSDs. 
>>>>> 
>>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in the buffer. 
>>>>> 
>>>>> (The server/client CPUs are 2x 10-core 3.1 GHz E5 Xeons) 
>>>>> 
>>>>> 
>>>>> 
>>>>> small tip : 
>>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>>>> 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>>>> 
>>>>> as a lot of time is spent in malloc/free 
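
If the benchmark is launched from a script rather than a shell, the same LD_PRELOAD trick can be applied through the child's environment. A minimal sketch, assuming the library path from the commands above; the helper name and the `job.fio` file are hypothetical:

```python
import os

# Library path as used in the LD_PRELOAD commands above; adjust per distro.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` put first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic loader accepts a colon-separated list of libraries.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


# Usage (not executed here; "job.fio" is a hypothetical job file):
#   import subprocess
#   subprocess.run(["fio", "job.fio"], env=preload_env(TCMALLOC), check=True)
```

It is worth checking that the library file actually exists before trusting the numbers, since a wrong path means the run quietly uses the default allocator.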
>>>>> 
>>>>> 
>>>>> (qemu has also supported tcmalloc for some months; I'll bench it too 
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
>>>>> 
>>>>> 
>>>>> 
>>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> From: "Mark Nelson" < mnelson@redhat.com > 
>>>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>> Sent: Tuesday, June 9, 2015 13:36:31 
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>> 
>>>>> Hi All, 
>>>>> 
>>>>> In the past we've hit some performance issues with RBD cache that we've 
>>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>>>> chime in as he's more familiar with what that code looks like. 
>>>>> 
>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>>>> 
>>>>> Mark 
>>>>> 
>>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>>>> It seems that the limit mainly shows up at high queue depth (roughly > 16). 
>>>>>> 
>>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths. 
>>>>>> With rbd_cache, performance is almost the same as without cache at queue depth < 16. 
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> qd1: 1651 
>>>>>> qd2: 3482 
>>>>>> qd4: 7958 
>>>>>> qd8: 17912 
>>>>>> qd16: 36020 
>>>>>> qd32: 42765 
>>>>>> qd64: 46169 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> qd1: 1748 
>>>>>> qd2: 3570 
>>>>>> qd4: 8356 
>>>>>> qd8: 17732 
>>>>>> qd16: 41396 
>>>>>> qd32: 78633 
>>>>>> qd64: 79063 
>>>>>> qd128: 79550 
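
To make the divergence in the two tables above easier to see, the sketch below computes the cache / no-cache ratio per queue depth. It uses only the iops values listed above; the function name is illustrative.

```python
# IOPS per queue depth, copied from the two tables above.
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396,
            32: 78633, 64: 79063, 128: 79550}


def cache_ratio(with_cache, without_cache):
    """Cached IOPS as a fraction of uncached IOPS, for depths present in both."""
    return {
        qd: with_cache[qd] / without_cache[qd]
        for qd in sorted(with_cache)
        if qd in without_cache
    }


for qd, ratio in cache_ratio(cache, no_cache).items():
    print(f"qd{qd}: {ratio:.2f}")
```

Up to qd8 the ratio stays within a few percent of 1.0, at qd16 it is still above 0.85, and at qd32 and qd64 it falls below 0.6.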
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "aderumier" < aderumier@odiso.com > 
>>>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>>> Sent: Tuesday, June 9, 2015 09:28:21 
>>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>>>> We tried adding more RBDs to single VM, but no luck.
>>>>>> 
>>>>>> If you want to scale with more disks in a single qemu VM, you need to use the iothread feature of qemu and assign 1 iothread per disk (works with virtio-blk). 
>>>>>> It's working for me; I can scale by adding more disks. 
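
As an illustration of "1 iothread per disk", the sketch below builds a QEMU argument list that pairs every virtio-blk disk with its own iothread object. The pool/image names are hypothetical, and `cache=none` is an assumption chosen to mirror the rbd_cache=off runs in this thread; it is not taken from the setup described above.

```python
def iothread_disk_args(images):
    """Build QEMU arguments giving each RBD image a dedicated iothread."""
    args = []
    for i, image in enumerate(images):
        args += [
            # One iothread object per disk.
            "-object", f"iothread,id=iothread{i}",
            # The RBD-backed drive; cache=none is an assumption (see above).
            "-drive", f"file=rbd:{image},format=raw,if=none,id=drive{i},cache=none",
            # Attach the drive through virtio-blk and bind it to its iothread.
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


# Hypothetical pool/image names, for illustration only.
for arg in iothread_disk_args(["rbd/vm-disk-0", "rbd/vm-disk-1"]):
    print(arg)
```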
>>>>>> 
>>>>>> 
>>>>>> My benchmarks here are done with fio-rbd on the host. 
>>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on. 
>>>>>> 
>>>>>> 
>>>>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD. 
>>>>>> 
>>>>>> I'm going to see if this tracker 
>>>>>> http://tracker.ceph.com/issues/11056 
>>>>>> 
>>>>>> could be the cause. 
>>>>>> 
>>>>>> (My master build was done some weeks ago) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>>> To: "aderumier" < aderumier@odiso.com > 
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>>> Sent: Tuesday, June 9, 2015 09:21:04 
>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason we were not able to scale 4K random-read (RR) iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs. 
>>>>>> 
>>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD: 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors either, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 
>>>>>> 
>>>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect) 
>>>>>> 
>>>>>> We never suspected that enabling the rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>>>>>> 
>>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>> 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> I'm running a benchmark (ceph master branch) with 4k randread, qdepth=32, 
>>>>>> and rbd_cache=true seems to limit the iops to around 40k. 
>>>>>> 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Is it expected ? 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=false 3 osd 
>>>>>> -------------------------------- 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>>>> | 99.99th=[ 1176] 
>>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=true 3osd 
>>>>>> ------------------------------ 
>>>>>> 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>>>> | 99.99th=[ 2192] 
>>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>>> _______________________________________________ 
>>>>> ceph-users mailing list 
>>>>> ceph-users@lists.ceph.com 
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best regards, Fasikhov Irek Nurgayazovich 
>>>> Mobile: +79229045757 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> -Pushpesh
>> 
>> 
>> 
>> -- 
>> -Pushpesh
> 
> 
> 
> -- 
> -Pushpesh 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in


[parent not found: <942E436A-5668-4F76-91E7-FAA08CC0F48A-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                                                       ` <942E436A-5668-4F76-91E7-FAA08CC0F48A-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
@ 2015-06-22  7:22                                                         ` Irek Fasikhov
  2015-06-22  8:54                                                           ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: Irek Fasikhov @ 2015-06-22  7:22 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel, pushpesh sharma, ceph-users


This is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x
updates), but you have to set iothread:1 in the VM's conf file. For a
single drive, the effect on performance is ambiguous.
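
For readers not on Proxmox: at the QEMU level, giving a virtio-blk disk its own iothread is expressed as an `-object iothread` plus an `iothread=` property on the device. The following is only a sketch that builds such an argument list; the pool/image names are hypothetical, and how Proxmox translates its own conf option into these arguments is not shown in this thread.

```python
def qemu_iothread_args(rbd_images):
    """Build QEMU arguments that give each RBD image its own iothread.

    rbd_images: list of "pool/image" strings (hypothetical names).
    """
    args = []
    for i, image in enumerate(rbd_images):
        # One iothread object per disk.
        args += ["-object", f"iothread,id=iothread{i}"]
        # The backing drive, not attached to any device yet (if=none).
        args += ["-drive", f"file=rbd:{image},if=none,id=drive{i},format=raw"]
        # The virtio-blk device, bound to its drive and to its iothread.
        args += ["-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}"]
    return args


if __name__ == "__main__":
    print(" ".join(qemu_iothread_args(["rbd/vm-disk-0", "rbd/vm-disk-1"])))
```

With a single disk this yields exactly one iothread, which matches the single-drive case discussed above.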

2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG <
s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>:

>
> On 22.06.2015 at 09:08, Alexandre DERUMIER <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org> wrote:
>
> >>> Just an update: there seems to be no proper way to pass the iothread
> >>> parameter from openstack-nova (at least not in the Juno release), so a
> >>> single default iothread per VM is all we have. In conclusion, the max
> >>> IOPS of a nova instance on Ceph RBD will be limited to 30-40K.
> >
> > Thanks for the update.
> >
> > For proxmox users,
> >
> > I have added an iothread option to the GUI for Proxmox 4.0
>
> Can we make iothread the default? Does it also help for single disks or
> only multiple disks?
>
> > and added jemalloc as the default memory allocator.
> >
> >
> > I have also sent a jemalloc patch to the qemu-devel mailing list:
> > https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
> >
> > (Help is welcome to get it into qemu upstream!)
> >
> >
> >
> > ----- Original Message -----
> > From: "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> > Cc: "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
> "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> > Sent: Monday, 22 June 2015 07:58:47
> > Subject: Re: rbd_cache, limiting read on high iops around 40k
> >
> > Just an update: there seems to be no proper way to pass the iothread
> > parameter from openstack-nova (at least not in the Juno release), so a
> > single default iothread per VM is all we have. In conclusion, the max
> > IOPS of a nova instance on Ceph RBD will be limited to 30-40K.
> >
> > On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER
> > <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org> wrote:
> >> Hi,
> >>
> >> some news about qemu with tcmalloc vs jemalloc.
> >>
> >> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
> >>
> >> Although tcmalloc is a little faster than jemalloc, I have hit the
> >> tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.
> >>
> >> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
> >>
> >>
> >> With multiple disks, I'm around 200k iops with tcmalloc (before hitting
> >> the bug) and 350k iops with jemalloc.
> >>
> >> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops,
> >> and the only way to fix it is to restart qemu ...
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> >> Cc: "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
> "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >> Sent: Friday, 12 June 2015 08:58:21
> >> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>
> >> Thanks, I posted the question on the openstack list. Hopefully I will get
> >> some expert opinion.
> >>
> >> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
> >> <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org> wrote:
> >>> Hi,
> >>>
> >>> Here is a libvirt XML sample from the libvirt source
> >>>
> >>> (you need to define the number of <iothreads>, then assign them to the disks).
> >>>
> >>> I don't use openstack, so I really don't know how it works with it.
> >>>
> >>>
> >>> <domain type='qemu'>
> >>> <name>QEMUGuest1</name>
> >>> <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> >>> <memory unit='KiB'>219136</memory>
> >>> <currentMemory unit='KiB'>219136</currentMemory>
> >>> <vcpu placement='static'>2</vcpu>
> >>> <iothreads>2</iothreads>
> >>> <os>
> >>> <type arch='i686' machine='pc'>hvm</type>
> >>> <boot dev='hd'/>
> >>> </os>
> >>> <clock offset='utc'/>
> >>> <on_poweroff>destroy</on_poweroff>
> >>> <on_reboot>restart</on_reboot>
> >>> <on_crash>destroy</on_crash>
> >>> <devices>
> >>> <emulator>/usr/bin/qemu</emulator>
> >>> <disk type='file' device='disk'>
> >>> <driver name='qemu' type='raw' iothread='1'/>
> >>> <source file='/var/lib/libvirt/images/iothrtest1.img'/>
> >>> <target dev='vdb' bus='virtio'/>
> >>> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
> >>> </disk>
> >>> <disk type='file' device='disk'>
> >>> <driver name='qemu' type='raw' iothread='2'/>
> >>> <source file='/var/lib/libvirt/images/iothrtest2.img'/>
> >>> <target dev='vdc' bus='virtio'/>
> >>> </disk>
> >>> <controller type='usb' index='0'/>
> >>> <controller type='ide' index='0'/>
> >>> <controller type='pci' index='0' model='pci-root'/>
> >>> <memballoon model='none'/>
> >>> </devices>
> >>> </domain>
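
As an illustration of the two steps in that sample (declare `<iothreads>`, then set `iothread=` on each disk's `<driver>`), here is a minimal sketch that applies them to a domain XML string with Python's standard `xml.etree.ElementTree`. The helper name `assign_iothreads` and the trimmed-down `SAMPLE` document are hypothetical, and as noted elsewhere in this thread, an XML edited this way may not survive under nova, which regenerates the domain XML.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml: str) -> str:
    """Give every virtio disk in a libvirt domain XML its own iothread."""
    root = ET.fromstring(domain_xml)
    virtio_disks = [
        disk for disk in root.findall("./devices/disk")
        if disk.find("target") is not None
        and disk.find("target").get("bus") == "virtio"
    ]
    if not virtio_disks:
        return domain_xml

    # <iothreads>N</iothreads> declares how many iothreads the domain has.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.SubElement(root, "iothreads")
    iothreads.text = str(len(virtio_disks))

    # iothread ids are 1-based, as in the libvirt sample quoted in this thread.
    for index, disk in enumerate(virtio_disks, start=1):
        driver = disk.find("driver")
        if driver is None:
            driver = ET.SubElement(disk, "driver", {"name": "qemu"})
        driver.set("iothread", str(index))

    return ET.tostring(root, encoding="unicode")


# Hypothetical, trimmed-down domain used only to exercise the helper.
SAMPLE = """<domain type='qemu'>
  <name>QEMUGuest1</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

if __name__ == "__main__":
    print(assign_iothreads(SAMPLE))
```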
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> >>> Cc: "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
> "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >>> Sent: Friday, 12 June 2015 07:52:41
> >>> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>>
> >>> Hi Alexandre,
> >>>
> >>> I agree with your rationale of one iothread per disk. CPU consumed in
> >>> IOwait is pretty high in each VM. But I cannot find a way to set
> >>> this on a nova instance. I am using openstack Juno with QEMU+KVM.
> >>> As per the libvirt documentation for setting iothreads, I can edit
> >>> domain.xml directly and achieve the same effect. However, in an
> >>> openstack env the domain xml is created by nova with some additional
> >>> metadata, so editing the domain xml using 'virsh edit' does not seem
> >>> to work (I agree, it is not a very cloud way of doing things, but a
> >>> hack). Changes made there vanish after saving them, because
> >>> libvirt validation fails on the result.
> >>>
> >>> #virsh dumpxml instance-000000c5 > vm.xml
> >>> #virt-xml-validate vm.xml
> >>> Relax-NG validity error : Extra element cpu in interleave
> >>> vm.xml:1: element domain: Relax-NG validity error : Element domain
> >>> failed to validate content
> >>> vm.xml fails to validate
> >>>
> >>> The second approach I took was setting QoS in volume types. But there
> >>> is no option to set iothreads per volume; there are only parameters related
> >>> to max_read/write ops/bytes.
> >>>
> >>> Thirdly, editing the Nova flavor and providing extra specs like
> >>> hw:cpu_socket/thread/core can change the guest CPU topology, but again
> >>> there is no way to set iothread. It does accept hw_disk_iothreads (no type
> >>> check in place, I believe), but cannot pass it on to domain.xml.
> >>>
> >>> Could you suggest a way to set this?
> >>>
> >>> -Pushpesh
> >>>
> >>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
> >>> <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> I need to try out the performance on qemu soon and may come back to
> you if I need some qemu setting trick :-)
> >>>>
> >>>> Sure no problem.
> >>>>
> >>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks
> >>>> with 1 iothread per disk)
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >>>> Cc: "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "pushpesh sharma" <
> pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >>>> Sent: Wednesday, 10 June 2015 09:06:32
> >>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
> >>>>
> >>>> Hi Alexandre,
> >>>> Thanks for sharing the data.
> >>>> I need to try out the performance on qemu soon and may come back to
> you if I need some qemu setting trick :-)
> >>>>
> >>>> Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On
> Behalf Of Alexandre DERUMIER
> >>>> Sent: Tuesday, June 09, 2015 10:42 PM
> >>>> To: Irek Fasikhov
> >>>> Cc: ceph-devel; pushpesh sharma; ceph-users
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops
> around 40k
> >>>>
> >>>>>> Very good work!
> >>>>>> Do you have a rpm-file?
> >>>>>> Thanks.
> >>>> No, sorry, I compiled it manually (and I'm using Debian jessie
> >>>> as the client)
> >>>>
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Irek Fasikhov" <malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >>>> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> >>>> Cc: "Robert LeBlanc" <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
> "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> >>>> Sent: Wednesday, 10 June 2015 07:21:42
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>>>
> >>>> Hi, Alexandre.
> >>>>
> >>>> Very good work!
> >>>> Do you have a rpm-file?
> >>>> Thanks.
> >>>>
> >>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> :
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is
> >>>> huge with iothread: 50k iops (+45%)!
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : glibc : iops=33395
> >>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> >>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> >>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
> >>>>
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
> >>>> qemu : iothread : jemalloc : iops=28023 (-19%)
> >>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> >>>>
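
The percentages in this summary are changes relative to the glibc run of the same mode (iothread or no-iothread). A small sketch that recomputes them from the summary IOPS figures follows; flooring the result is an assumption on my part, chosen because it happens to reproduce every figure quoted here. Note that the fio logs further down report slightly different IOPS for two of the runs (34446 and 44282); the sketch uses the summary values.

```python
# IOPS figures taken from the summary lines in this thread.
RESULTS = {
    "no-iothread": {
        "glibc": 33395,
        "tcmalloc 2.2.1": 34516,
        "jemalloc": 42226,
        "tcmalloc 2.4": 35974,
    },
    "iothread": {
        "glibc": 34516,
        "tcmalloc 2.2.1": 38676,
        "jemalloc": 28023,
        "tcmalloc 2.4": 50276,
    },
}


def percent_change(iops, baseline):
    """Relative change in percent, floored to an integer."""
    return (iops - baseline) * 100 // baseline


def relative_to_glibc(results):
    """Compare every allocator against the glibc run of the same mode."""
    table = {}
    for mode, by_allocator in results.items():
        baseline = by_allocator["glibc"]
        table[mode] = {
            allocator: percent_change(iops, baseline)
            for allocator, iops in by_allocator.items()
            if allocator != "glibc"
        }
    return table


if __name__ == "__main__":
    for mode, changes in relative_to_glibc(RESULTS).items():
        for allocator, pct in changes.items():
            print(f"qemu : {mode} : {allocator} : {pct:+d}%")
```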
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> >>>> ------------------------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
> >>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
> >>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
> >>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
> >>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
> >>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
> >>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
> >>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
> >>>> | 99.99th=[ 3760]
> >>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
> >>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
> >>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
> >>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s,
> maxb=201107KB/s, mint=26070msec, maxt=26070msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840,
> util=99.73%
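
The figures in this fio run can be cross-checked against each other. With a 4K block size, IOPS is bandwidth divided by block size, and by Little's law a queue depth of 32 with an average completion latency of about 635 usec should also land near 50k IOPS. The sketch below only does that arithmetic with the numbers from the log above; the function names are made up for illustration.

```python
def iops_from_bandwidth(bw_kb_per_s, block_size_kb):
    """IOPS implied by the reported bandwidth at a fixed block size."""
    return bw_kb_per_s / block_size_kb


def iops_from_latency(iodepth, avg_lat_usec):
    """IOPS implied by Little's law: in-flight requests / average latency."""
    return iodepth / (avg_lat_usec * 1e-6)


if __name__ == "__main__":
    # Values from the tcmalloc 2.4 + iothread run (fio reports iops=50276).
    print(round(iops_from_bandwidth(201108, 4)))
    print(round(iops_from_latency(32, 635.27)))
```

Both estimates agree with the reported iops value to within about 1%, which suggests the run kept the queue full for its whole duration.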
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
> >>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
> >>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
> >>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
> >>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
> >>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
> >>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
> >>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
> >>>> | 99.99th=[ 3632]
> >>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
> >>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
> >>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
> >>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s,
> maxb=143896KB/s, mint=36435msec, maxt=36435msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716,
> util=99.85%
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>> To: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
> >>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Tuesday, 9 June 2015 18:47:27
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>>>
> >>>> Hi Robert,
> >>>>
> >>>>>> What I found was that Ceph OSDs performed well with either tcmalloc
> or
> >>>>>> jemalloc (except when RocksDB was built with jemalloc instead of
> >>>>>> tcmalloc, I'm still working to dig into why that might be the case).
> >>>> Yes, from my tests, for OSDs tcmalloc is a little faster (but only
> >>>> very slightly) than jemalloc.
> >>>>
> >>>>
> >>>>
> >>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> >>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> >>>>>> better for QEMU/KVM in the tests that we ran. [1]
> >>>>
> >>>>
> >>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see
> >>>> a speed regression with tcmalloc.
> >>>> With qemu iothread, tcmalloc gives a speed increase over glibc.
> >>>> With qemu iothread, jemalloc gives a speed decrease.
> >>>>
> >>>> Without iothread, jemalloc gives a big speed increase.
> >>>>
> >>>> this is with
> >>>> -qemu 2.3
> >>>> -tcmalloc 2.2.1
> >>>> -jemalloc 3.6
> >>>> -libc6 2.19
> >>>>
> >>>>
> >>>> qemu : no-iothread : glibc : iops=33395
> >>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> >>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
> >>>> qemu : iothread : jemalloc : iops=28023 (-19%)
> >>>>
> >>>>
> >>>> (The benefit of iothreads is that we can scale with more disks in 1vm)
> >>>>
> >>>>
> >>>> fio results:
> >>>> ------------
> >>>>
> >>>> qemu : iothread : tcmalloc : iops=38676
> >>>> -----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9
> 18:16:53 2015
> >>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
> >>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
> >>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
> >>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
> >>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
> >>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
> >>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
> >>>> | 99.99th=[ 3888]
> >>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40,
> stdev=16978.03
> >>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
> >>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
> >>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s,
> maxb=154707KB/s, mint=33889msec, maxt=33889msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096,
> util=99.77%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : tcmalloc : iops=34516
> >>>> ---------------------------------------------
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9
> 18:19:08 2015
> >>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
> >>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
> >>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
> >>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
> >>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
> >>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
> >>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
> >>>> | 99.99th=[ 4320]
> >>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88,
> stdev=16883.77
> >>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
> >>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
> >>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s,
> maxb=138064KB/s, mint=37974msec, maxt=37974msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396,
> util=99.86%
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> -------------------------------------
> >>>>
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9
> 18:24:01 2015
> >>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
> >>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
> >>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
> >>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> >>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
> >>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
> >>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
> >>>> | 99.99th=[ 3984]
> >>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78,
> stdev=15521.30
> >>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
> >>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
> >>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s,
> maxb=137785KB/s, mint=38051msec, maxt=38051msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972,
> util=99.85%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no iothread : glibc : iops=33395
> >>>> -----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9
> 18:27:18 2015
> >>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
> >>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
> >>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
> >>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> >>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
> >>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
> >>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
> >>>> | 99.99th=[ 4832]
> >>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64,
> stdev=19121.91
> >>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
> >>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
> >>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s,
> maxb=133583KB/s, mint=39248msec, maxt=39248msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536,
> util=99.84%
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : jemalloc : iops=28023
> >>>> ----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0
> iops] [eta 00m:01s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9
> 18:30:26 2015
> >>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
> >>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
> >>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
> >>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
> >>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
> >>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
> >>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
> >>>> | 99.99th=[ 3760]
> >>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27,
> stdev=17381.70
> >>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
> >>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
> >>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s,
> maxb=112094KB/s, mint=46772msec, maxt=46772msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376,
> util=98.68%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : jemalloc : iops=42226
> >>>> --------------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9
> 18:34:11 2015
> >>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
> >>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
> >>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
> >>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
> >>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
> >>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
> >>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
> >>>> | 99.99th=[ 2608]
> >>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14,
> stdev=23440.79
> >>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
> >>>> lat (msec) : 2=10.30%, 4=0.07%
> >>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s,
> maxb=177130KB/s, mint=29599msec, maxt=29599msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636,
> util=99.80%
> >>>>
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
> >>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Tuesday, 9 June 2015 18:00:29
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> I also saw a similar performance increase by using alternative memory
> >>>> allocators. What I found was that Ceph OSDs performed well with either
> >>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> >>>> instead of tcmalloc, I'm still working to dig into why that might be
> >>>> the case).
> >>>>
> >>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> >>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> >>>> better for QEMU/KVM in the tests that we ran. [1]
> >>>>
> >>>> I'm currently looking into I/O bottlenecks around the 16KB range and
> >>>> I'm seeing a lot of time in thread creation and destruction, the
> >>>> memory allocators are quite a bit down the list (both fio with
> >>>> ioengine rbd and on the OSDs). I wonder what the difference can be.
> >>>> I've tried using the async messenger but there wasn't a huge
> >>>> difference. [2]
> >>>>
> >>>> Further down the rabbit hole....
> >>>>
> >>>> [1]
> https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg20197.html
> >>>> [2]
> https://www.mail-archive.com/ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg23982.html
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
> >>>>>>> IOPS from 1 VM!
> >>>>>
> >>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
> >>>>> (I'm planning to send qemu results soon.)
> >>>>>
> >>>>>>> How fast are the SSDs in those 3 OSDs?
> >>>>>
> >>>>> These results are with the data in the buffer memory of the OSD nodes.
> >>>>>
> >>>>> When reading fully from SSD (Intel S3500), for 1 client:
> >>>>>
> >>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
> >>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
> >>>>>
> >>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer.
> >>>>>
> >>>>> (Server and client CPUs are 2x 10-core 3.1 GHz E5 Xeons)
> >>>>>
> >>>>>
> >>>>>
> >>>>> Small tip:
> >>>>> I'm using tcmalloc for fio-rbd and rados bench to improve latencies by around 20%:
> >>>>>
> >>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
> >>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
> >>>>>
> >>>>> as a lot of time is spent in malloc/free.
> >>>>>
> >>>>>
> >>>>> (qemu has also supported tcmalloc for some months; I'll bench it too:
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
> >>>>>
> >>>>>
> >>>>>
> >>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- Original message -----
> >>>>> From: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >
> >>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>> Sent: Tuesday 9 June 2015 13:36:31
> >>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> In the past we've hit some performance issues with RBD cache that we've
> >>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read
> >>>>> IOPS in testing (or at least I never have). I suspect there's a couple
> >>>>> of possibilities as to why it might be slower, but perhaps joshd can
> >>>>> chime in as he's more familiar with what that code looks like.
> >>>>>
> >>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
> >>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
> >>>>>> It seems that the limit mainly appears at high queue depth (roughly > 16).
> >>>>>>
> >>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
> >>>>>> With queue depth < 16, rbd_cache gives almost the same results as no cache.
> >>>>>>
> >>>>>>
> >>>>>> cache
> >>>>>> -----
> >>>>>> qd1: 1651
> >>>>>> qd2: 3482
> >>>>>> qd4: 7958
> >>>>>> qd8: 17912
> >>>>>> qd16: 36020
> >>>>>> qd32: 42765
> >>>>>> qd64: 46169
> >>>>>>
> >>>>>> no cache
> >>>>>> --------
> >>>>>> qd1: 1748
> >>>>>> qd2: 3570
> >>>>>> qd4: 8356
> >>>>>> qd8: 17732
> >>>>>> qd16: 41396
> >>>>>> qd32: 78633
> >>>>>> qd64: 79063
> >>>>>> qd128: 79550
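To make the queue-depth effect easier to see, here is a minimal sketch that divides the cached IOPS by the uncached IOPS at each depth, using only the numbers from the two lists above. The dictionary and function names are made up for this illustration.

```python
# IOPS copied from the two lists above (1 client, 4k randread, 3 OSDs).
cache = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
no_cache = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(queue_depth):
    """Return cached IOPS divided by uncached IOPS for one queue depth."""
    return cache[queue_depth] / no_cache[queue_depth]


# Only depths measured in both runs can be compared (qd128 has no cached run).
for qd in sorted(set(cache) & set(no_cache)):
    print(f"qd{qd}: cache/no-cache = {cache_ratio(qd):.2f}")
```

A ratio close to 1 means the cache costs nothing; the ratio stays near 1 up to qd8, dips at qd16, and falls to roughly half at qd32 and qd64.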
> >>>>>>
> >>>>>>
> >>>>>> ----- Original message -----
> >>>>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>>>> To: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>>> Sent: Tuesday 9 June 2015 09:28:21
> >>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>>> We tried adding more RBDs to single VM, but no luck.
> >>>>>>
> >>>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign one iothread per disk (works with virtio-blk).
> >>>>>> It's working for me; I can scale by adding more disks.
> >>>>>>
> >>>>>>
> >>>>>> My benchmarks here are done with fio-rbd on the host.
> >>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
> >>>>>>
> >>>>>>
> >>>>>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD.
> >>>>>>
> >>>>>> I'm going to see if this tracker
> >>>>>> http://tracker.ceph.com/issues/11056
> >>>>>>
> >>>>>> could be the cause.
> >>>>>>
> >>>>>> (My master build was done some week ago)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----- Original message -----
> >>>>>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>>> Sent: Tuesday 9 June 2015 09:21:04
> >>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>>>>>
> >>>>>> Hi Alexandre,
> >>>>>>
> >>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, 4K random-read IOPS on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
> >>>>>>
> >>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or KVM end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
> >>>>>>
> >>>>>> We tried the same experiment on bare metal. There, 4K random-read IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded. (Single pipe, more congestion effect.)
> >>>>>>
> >>>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
> >>>>>>
> >>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER <
> aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm running a benchmark (ceph master branch) with 4k randread, qdepth=32,
> >>>>>> and rbd_cache=true seems to limit the iops to around 40k
> >>>>>>
> >>>>>>
> >>>>>> no cache
> >>>>>> --------
> >>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops
> >>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops
> >>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops
> >>>>>>
> >>>>>>
> >>>>>> cache
> >>>>>> -----
> >>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops
> >>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops
> >>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Is it expected ?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> fio result rbd_cache=false 3 osd
> >>>>>> --------------------------------
> >>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=rbd, iodepth=32
> >>>>>> fio-2.1.11
> >>>>>> Starting 1 process
> >>>>>> rbd engine: RBD version: 0.1.9
> >>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0
> iops] [eta 00m:00s]
> >>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue
> Jun 9 07:48:42 2015
> >>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
> >>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
> >>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
> >>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
> >>>>>> clat percentiles (usec):
> >>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
> >>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
> >>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
> >>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
> >>>>>> | 99.99th=[ 1176]
> >>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34,
> stdev=25196.21
> >>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%,
> 1000=0.23%
> >>>>>> lat (msec) : 2=0.03%, 4=0.01%
> >>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
> >>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%,
> >=64=0.0%
> >>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%,
> >=64=0.0%
> >>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>>>
> >>>>>> Run status group 0 (all jobs):
> >>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s,
> maxb=313169KB/s, mint=32698msec, maxt=32698msec
> >>>>>>
> >>>>>> Disk stats (read/write):
> >>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> fio result rbd_cache=true 3osd
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=rbd, iodepth=32
> >>>>>> fio-2.1.11
> >>>>>> Starting 1 process
> >>>>>> rbd engine: RBD version: 0.1.9
> >>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0
> iops] [eta 00m:00s]
> >>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue
> Jun 9 07:47:30 2015
> >>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
> >>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
> >>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
> >>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
> >>>>>> clat percentiles (usec):
> >>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
> >>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
> >>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
> >>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
> >>>>>> | 99.99th=[ 2192]
> >>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10,
> stdev=15079.93
> >>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
> >>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
> >>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
> >>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%,
> >=64=0.0%
> >>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%,
> >=64=0.0%
> >>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>>>
> >>>>>> Run status group 0 (all jobs):
> >>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s,
> maxb=183295KB/s, mint=55866msec, maxt=55866msec
> >>>>>>
> >>>>>> Disk stats (read/write):
> >>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%,
> aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
> >>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Best regards, Фасихов Ирек Нургаязович
> >>>> Mobile: +79229045757
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> -Pushpesh
> >>
> >>
> >>
> >> --
> >> -Pushpesh
> >
> >
> >
> > --
> > -Pushpesh
> >
> >



-- 
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-22  7:22                                                         ` Irek Fasikhov
@ 2015-06-22  8:54                                                           ` Alexandre DERUMIER
       [not found]                                                             ` <1581092206.1667776.1434963299884.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-22  8:54 UTC (permalink / raw)
  To: Irek Fasikhov
  Cc: Stefan Priebe, pushpesh sharma, Somnath Roy, ceph-devel,
	ceph-users

>>It is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you need to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous.

Yes and no ;)

Currently in Proxmox 3.4, iothread:1 generates only one iothread for all disks.

So you'll get a small extra boost, but it won't scale with multiple disks.

Proxmox 4.0 will allow enabling or disabling one iothread per disk.


>>Does it also help for single disks or only multiple disks? 

An iothread can also help with a single disk, because by default qemu uses one main thread for disk I/O as well as other things (I don't remember exactly what).
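
To illustrate the "one iothread per disk" layout, here is a minimal Python sketch that builds a libvirt-style domain fragment. It assumes only the `<iothreads>` element and the `iothread` attribute on `<driver>` that appear in the libvirt sample quoted further down in this thread. The function name, guest name and image paths are made up for the example, and this is not how Proxmox generates its configuration.

```python
import xml.etree.ElementTree as ET


def domain_with_iothreads(name, disk_paths):
    """Build a minimal libvirt-style fragment with one iothread per disk."""
    domain = ET.Element("domain", type="qemu")
    ET.SubElement(domain, "name").text = name
    # Declare as many iothreads as there are disks (numbered from 1).
    ET.SubElement(domain, "iothreads").text = str(len(disk_paths))
    devices = ET.SubElement(domain, "devices")
    for index, path in enumerate(disk_paths, start=1):
        disk = ET.SubElement(devices, "disk", type="file", device="disk")
        # Pin this virtio disk to its own iothread.
        ET.SubElement(disk, "driver", name="qemu", type="raw",
                      iothread=str(index))
        ET.SubElement(disk, "source", file=path)
        # vdb, vdc, ...: vda is left free for a hypothetical root disk.
        ET.SubElement(disk, "target", dev="vd" + chr(ord("a") + index),
                      bus="virtio")
    return domain


# Hypothetical guest name and image paths, for illustration only.
fragment = domain_with_iothreads("guest1", ["/tmp/disk1.img", "/tmp/disk2.img"])
print(ET.tostring(fragment, encoding="unicode"))
```

With a single shared iothread, every disk would carry the same `iothread` value, which is why that setup gives a small boost but does not scale with the number of disks.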




----- Original message -----
From: "Irek Fasikhov" <malmyzh@gmail.com>
To: "Stefan Priebe" <s.priebe@profihost.ag>
Cc: "aderumier" <aderumier@odiso.com>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "Somnath Roy" <Somnath.Roy@sandisk.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Monday 22 June 2015 09:22:13
Subject: Re: rbd_cache, limiting read on high iops around 40k

It is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you need to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous.

2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG < s.priebe@profihost.ag > : 



On 22.06.2015 at 09:08, Alexandre DERUMIER < aderumier@odiso.com > wrote:

>>> Just an update: there seems to be no proper way to pass the iothread
>>> parameter from openstack-nova (at least not in the Juno release). So a
>>> default single iothread per VM is all we have. So, in conclusion, a
>>> nova instance's max iops on ceph rbd will be limited to 30-40K.
> 
> Thanks for the update. 
> 
> For proxmox users, 
> 
> I have added iothread option to gui for proxmox 4.0 

Can we make iothread the default? Does it also help for single disks or only multiple disks? 

> and added jemalloc as default memory allocator 
> 
> 
> I have also sent a jemalloc patch to the qemu-devel mailing list:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html 
> 
> (Help is welcome to push it in qemu upstream ! ) 
> 
> 
> 
> ----- Original message -----
> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
> To: "aderumier" < aderumier@odiso.com >
> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Monday 22 June 2015 07:58:47
> Subject: Re: rbd_cache, limiting read on high iops around 40k
> 
> Just an update: there seems to be no proper way to pass the iothread
> parameter from openstack-nova (at least not in the Juno release). So a
> default single iothread per VM is all we have. So, in conclusion, a
> nova instance's max iops on ceph rbd will be limited to 30-40K.
> 
> On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER 
> < aderumier@odiso.com > wrote: 
>> Hi, 
>> 
>> some news about qemu with tcmalloc vs jemalloc.
>>
>> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
>>
>> While tcmalloc is a little faster than jemalloc,
>>
>> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.
>>
>> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
>>
>>
>> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc.
>>
>> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to restart qemu ...
>> 
>> 
>> 
>> ----- Original message -----
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> To: "aderumier" < aderumier@odiso.com >
>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Friday 12 June 2015 08:58:21
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>> 
>> Thanks, posted the question in openstack list. Hopefully will get some 
>> expert opinion. 
>> 
>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
>> < aderumier@odiso.com > wrote: 
>>> Hi, 
>>> 
>>> here is a libvirt xml sample from the libvirt source
>>>
>>> (you need to define the <iothreads> number, then assign them to the disks).
>>>
>>> I don't use openstack, so I really don't know how it works there.
>>> 
>>> 
>>> <domain type='qemu'>
>>>   <name>QEMUGuest1</name>
>>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>>   <memory unit='KiB'>219136</memory>
>>>   <currentMemory unit='KiB'>219136</currentMemory>
>>>   <vcpu placement='static'>2</vcpu>
>>>   <iothreads>2</iothreads>
>>>   <os>
>>>     <type arch='i686' machine='pc'>hvm</type>
>>>     <boot dev='hd'/>
>>>   </os>
>>>   <clock offset='utc'/>
>>>   <on_poweroff>destroy</on_poweroff>
>>>   <on_reboot>restart</on_reboot>
>>>   <on_crash>destroy</on_crash>
>>>   <devices>
>>>     <emulator>/usr/bin/qemu</emulator>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='1'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>>       <target dev='vdb' bus='virtio'/>
>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>>     </disk>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='2'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>>       <target dev='vdc' bus='virtio'/>
>>>     </disk>
>>>     <controller type='usb' index='0'/>
>>>     <controller type='ide' index='0'/>
>>>     <controller type='pci' index='0' model='pci-root'/>
>>>     <memballoon model='none'/>
>>>   </devices>
>>> </domain>
>>> 
>>> 
>>> ----- Original message -----
>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> To: "aderumier" < aderumier@odiso.com >
>>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Friday 12 June 2015 07:52:41
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>> 
>>> Hi Alexandre, 
>>> 
>>> I agree with your rationale of one iothread per disk. CPU consumed in
>>> IOwait is pretty high in each VM. But I am not finding a way to set
>>> this on a nova instance. I am using openstack Juno with QEMU+KVM.
>>> As per the libvirt documentation for setting iothreads, I can edit
>>> domain.xml directly and achieve the same effect. However, in an
>>> openstack environment the domain xml is created by nova with some additional
>>> metadata, so editing the domain xml using 'virsh edit' does not seem
>>> to work (I agree, it is not a very cloud way of doing things, but a
>>> hack). Changes made there vanish after saving them, because
>>> libvirt validation fails on the result.
>>> 
>>> #virsh dumpxml instance-000000c5 > vm.xml 
>>> #virt-xml-validate vm.xml 
>>> Relax-NG validity error : Extra element cpu in interleave 
>>> vm.xml:1: element domain: Relax-NG validity error : Element domain 
>>> failed to validate content 
>>> vm.xml fails to validate 
>>> 
>>> The second approach I took was setting QoS in volume types. But there
>>> is no option to set iothreads per volume; there are only parameters related
>>> to max read/write ops/bytes.
>>>
>>> Thirdly, editing the Nova flavor and providing extra specs like
>>> hw:cpu_socket/thread/core can change the guest CPU topology, but again
>>> there is no way to set iothreads. It does accept hw_disk_iothreads (no type check
>>> in place, I believe), but cannot pass it on to domain.xml.
>>>
>>> Could you suggest a way to set this?
>>> 
>>> -Pushpesh 
>>> 
>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
>>> < aderumier@odiso.com > wrote: 
>>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>> 
>>>> Sure no problem. 
>>>> 
>>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks with 1 iothread by disk) 
>>>> 
>>>> 
>>>> ----- Original message -----
>>>> From: "Somnath Roy" < Somnath.Roy@sandisk.com >
>>>> To: "aderumier" < aderumier@odiso.com >, "Irek Fasikhov" < malmyzh@gmail.com >
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Wednesday 10 June 2015 09:06:32
>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>> 
>>>> Hi Alexandre, 
>>>> Thanks for sharing the data. 
>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>> 
>>>> Regards 
>>>> Somnath 
>>>> 
>>>> -----Original Message----- 
>>>> From: ceph-users [mailto: ceph-users-bounces@lists.ceph.com ] On Behalf Of Alexandre DERUMIER 
>>>> Sent: Tuesday, June 09, 2015 10:42 PM 
>>>> To: Irek Fasikhov 
>>>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>>>> Very good work! 
>>>>>> Do you have a rpm-file? 
>>>>>> Thanks. 
>>>> No, sorry, I compiled it manually (and I'm using Debian Jessie as the client).
>>>> 
>>>> 
>>>> 
>>>> ----- Original message -----
>>>> From: "Irek Fasikhov" < malmyzh@gmail.com >
>>>> To: "aderumier" < aderumier@odiso.com >
>>>> Cc: "Robert LeBlanc" < robert@leblancnet.us >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Wednesday 10 June 2015 07:21:42
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>> 
>>>> Hi, Alexandre. 
>>>> 
>>>> Very good work! 
>>>> Do you have a rpm-file? 
>>>> Thanks. 
>>>> 
>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
>>>> 
>>>> 
>>>> Hi, 
>>>> 
>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%)!
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : glibc : iops=33395
>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>>
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
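
The percentages above appear to be relative to the glibc run in the same group (no-iothread or iothread). A minimal sketch of that arithmetic follows, assuming the results are rounded down to a whole percent, which reproduces every figure quoted. The helper name is made up for this illustration.

```python
import math


def percent_change(baseline_iops, iops):
    """Relative change versus the glibc baseline, rounded down to a whole percent."""
    return math.floor((iops - baseline_iops) / baseline_iops * 100)


# IOPS copied from the results above; glibc is the baseline of each group.
no_iothread = {"glibc": 33395, "tcmalloc (2.2.1)": 34516,
               "jemalloc": 42226, "tcmalloc (2.4)": 35974}
iothread = {"glibc": 34516, "tcmalloc": 38676,
            "jemalloc": 28023, "tcmalloc (2.4)": 50276}

for group_name, group in (("no-iothread", no_iothread), ("iothread", iothread)):
    for allocator, iops in group.items():
        change = percent_change(group["glibc"], iops)
        print(f"{group_name} : {allocator} : iops={iops} ({change:+d}%)")
```

Note that the two groups use different baselines (33395 and 34516 iops), so percentages from different groups are not directly comparable.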
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>> ------------------------------------------------------ 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
>>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
>>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
>>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
>>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
>>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
>>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
>>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
>>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
>>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
>>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
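
As a sanity check on fio output like this, Little's law says that with a constant number of requests in flight, throughput is roughly the queue depth divided by the mean completion latency. A minimal sketch, using the iodepth, mean "lat" and iops values reported in the tcmalloc 2.4 runs in this message; the function name is made up for the illustration.

```python
def expected_iops(iodepth, mean_latency_usec):
    """Little's law: throughput = requests in flight / mean time per request."""
    return iodepth / (mean_latency_usec * 1e-6)


# (label, iodepth, mean "lat" in usec, iops reported by fio), from this message.
runs = [
    ("run reporting iops=50276", 32, 635.27, 50276),
    ("run reporting iops=35974", 32, 888.31, 35974),
]

for label, depth, lat_usec, reported in runs:
    estimate = expected_iops(depth, lat_usec)
    gap = (estimate - reported) / reported * 100
    print(f"{label}: estimated {estimate:.0f} iops ({gap:+.1f}% vs reported)")
```

The estimate only holds when fio actually keeps the queue full ("IO depths : 32=100.0%" in these runs). It also shows why the single-client ceiling is latency-bound: at iodepth=32, more IOPS requires lower per-request latency.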
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
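>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>> ------------------------------------------------------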
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
>>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
>>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
>>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
>>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
>>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
>>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
>>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3632] 
>>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
>>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
>>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
>>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
>>>> 
>>>> 
>>>> ----- Original message -----
>>>> From: "aderumier" < aderumier@odiso.com >
>>>> To: "Robert LeBlanc" < robert@leblancnet.us >
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Tuesday 9 June 2015 18:47:27
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>> 
>>>> Hi Robert, 
>>>> 
>>>>>> What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>>> jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>>> tcmalloc, I'm still working to dig into why that might be the case). 
>>>> Yes, from my tests, tcmalloc is a little faster (but only slightly) than jemalloc for the OSD.
>>>> 
>>>> 
>>>> 
>>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>> 
>>>> 
>>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc.
>>>> With qemu iothread, tcmalloc gives a speed increase over glibc.
>>>> With qemu iothread, jemalloc gives a speed decrease.
>>>>
>>>> Without iothread, jemalloc gives a big speed increase.
>>>> 
>>>> this is with 
>>>> -qemu 2.3 
>>>> -tcmalloc 2.2.1 
>>>> -jemalloc 3.6
>>>> -libc6 2.19 
>>>> 
>>>> 
>>>> qemu : no-iothread : glibc : iops=33395
>>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>> 
>>>> 
>>>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
>>>> 
>>>> 
>>>> fio results: 
>>>> ------------ 
>>>> 
>>>> qemu : iothread : tcmalloc : iops=38676 
>>>> ----------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>>>> | 99.99th=[ 3888] 
>>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : tcmalloc : iops=34516 
>>>> --------------------------------------------- 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>>>> | 99.99th=[ 4320] 
>>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> ------------------------------------- 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3984] 
>>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : glibc : iops=33395
>>>> ----------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>>>> | 99.99th=[ 4832] 
>>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : jemalloc : iops=28023
>>>> ----------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : jemalloc : iops=42226
>>>> -------------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>>>> | 99.99th=[ 2608] 
>>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>>>> lat (msec) : 2=10.30%, 4=0.07% 
>>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
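As a sanity check on the fio reports above: with a fixed queue depth, Little's law says IOPS should be roughly iodepth divided by mean latency. The sketch below applies that to the "lat (usec)" average and the "read :" iops value of each report. The run labels and the 1% tolerance are my own choices, not something fio prints.

```python
IODEPTH = 32

# (label, mean total latency in usec, iops from the "read :" line),
# copied from the six fio reports above.
RUNS = [
    ("iothread/tcmalloc", 826.10, 38676),
    ("no-iothread/tcmalloc", 925.77, 34516),
    ("iothread/glibc", 927.58, 34446),
    ("no-iothread/glibc", 957.01, 33395),
    ("iothread/jemalloc", 1140.39, 28023),
    ("no-iothread/jemalloc", 721.23, 44282),
]


def estimated_iops(iodepth, mean_lat_usec):
    """Little's law: throughput = requests in flight / mean latency."""
    return iodepth / (mean_lat_usec * 1e-6)


def relative_error(estimate, reported):
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - reported) / reported


if __name__ == "__main__":
    for label, lat, reported in RUNS:
        est = estimated_iops(IODEPTH, lat)
        err = relative_error(est, reported)
        print(f"{label}: estimated {est:.0f}, reported {reported}, error {err:.2%}")
```

If the estimate and the reported value agree closely, the guest really kept 32 requests in flight and the IOPS ceiling comes from per-request latency rather than from the submission side.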
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: "Robert LeBlanc" < robert@leblancnet.us >
>>>> To: "aderumier" < aderumier@odiso.com >
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Tuesday, June 9, 2015 18:00:29
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE----- 
>>>> Hash: SHA256 
>>>> 
>>>> I also saw a similar performance increase by using alternative memory 
>>>> allocators. What I found was that Ceph OSDs performed well with either 
>>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>>> instead of tcmalloc, I'm still working to dig into why that might be 
>>>> the case). 
>>>> 
>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>> 
>>>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>>>> I'm seeing a lot of time in thread creation and destruction, the 
>>>> memory allocators are quite a bit down the list (both fio with 
>>>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>>>> I've tried using the async messenger but there wasn't a huge 
>>>> difference. [2] 
>>>> 
>>>> Further down the rabbit hole.... 
>>>> 
>>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
>>>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
>>>> -----BEGIN PGP SIGNATURE----- 
>>>> Version: Mailvelope v0.13.1 
>>>> Comment: https://www.mailvelope.com 
>>>> 
>>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
>>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>>>> oSJX 
>>>> =k281 
>>>> -----END PGP SIGNATURE----- 
>>>> ---------------- 
>>>> Robert LeBlanc 
>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>>>> 
>>>> 
>>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>>> IOPS from 1 VM! 
>>>>> 
>>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have some overhead.
>>>>> (I'm planning to send qemu results soon.)
>>>>>
>>>>>>> How fast are the SSDs in those 3 OSDs?
>>>>>
>>>>> These results are with the data in the buffer memory of the OSD nodes.
>>>>>
>>>>> When reading fully from SSD (Intel S3500),
>>>>>
>>>>> for 1 client,
>>>>>
>>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
>>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>>>>>
>>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer memory.
>>>>>
>>>>> (Server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon.)
>>>>> 
>>>>> 
>>>>> 
>>>>> small tip : 
>>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>>>> 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>>>> 
>>>>> as a lot of time is spent in malloc/free 
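For anyone scripting these benchmarks, the LD_PRELOAD tip above can be wrapped in a small Python helper. This is a sketch only: the library path is the one quoted in the tip and differs between distributions, and the helper names are hypothetical.

```python
import os
import subprocess

# Library path as quoted in the tip above; adjust it for your distribution.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library, base_env=None):
    """Return a copy of the environment with library prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_preloaded(argv, library=TCMALLOC):
    """Run argv with the allocator preloaded; raises if the command fails."""
    return subprocess.run(argv, env=preload_env(library), check=True)
```

A call such as run_preloaded(["fio", "job.fio"]) would then be equivalent to the first shell line above, where job.fio stands in for whatever job file you use.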
>>>>> 
>>>>> 
>>>>> (qemu has also supported tcmalloc for some months; I'll bench it too:
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
>>>>> 
>>>>> 
>>>>> 
>>>>> I'll try to send full bench results soon, from 1 to 18 SSD OSDs.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Mark Nelson" < mnelson@redhat.com >
>>>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>>> Sent: Tuesday, June 9, 2015 13:36:31
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>> 
>>>>> Hi All, 
>>>>> 
>>>>> In the past we've hit some performance issues with RBD cache that we've 
>>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>>>> chime in as he's more familiar with what that code looks like. 
>>>>> 
>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>>>> 
>>>>> Mark 
>>>>> 
>>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>>>> It seems that the limit mainly shows up at high queue depth (roughly > 16).
>>>>>>
>>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>>>>>> With rbd_cache the results are almost the same as without cache at queue depth < 16.
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> qd1: 1651 
>>>>>> qd2: 3482 
>>>>>> qd4: 7958 
>>>>>> qd8: 17912 
>>>>>> qd16: 36020 
>>>>>> qd32: 42765 
>>>>>> qd64: 46169 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> qd1: 1748 
>>>>>> qd2: 3570 
>>>>>> qd4: 8356 
>>>>>> qd8: 17732 
>>>>>> qd16: 41396 
>>>>>> qd32: 78633 
>>>>>> qd64: 79063 
>>>>>> qd128: 79550 
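To make the divergence easier to see, here is a small sketch that computes the cached/uncached IOPS ratio per queue depth from the two lists above. The numbers are copied from this message; the function name is only illustrative.

```python
# IOPS per queue depth, copied from the two lists above.
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(cache, no_cache):
    """Map queue depth -> cached IOPS / uncached IOPS, for depths in both lists."""
    return {qd: cache[qd] / no_cache[qd] for qd in sorted(cache) if qd in no_cache}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(CACHE, NO_CACHE).items():
        print(f"qd{qd}: {ratio:.2f}")
```

A ratio near 1.0 means the cache costs nothing at that depth; the value falls well below 1.0 once the queue depth goes past 16.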
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: "aderumier" < aderumier@odiso.com >
>>>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>>>> Sent: Tuesday, June 9, 2015 09:28:21
>>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>>>> We tried adding more RBDs to single VM, but no luck. 
>>>>>> 
>>>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign one iothread per disk (works with virtio-blk).
>>>>>> It works for me; I can scale by adding more disks.
>>>>>>
>>>>>>
>>>>>> My benchmarks here are done with fio-rbd on the host.
>>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and to around 250k iops with 10 clients and rbd_cache=on.
>>>>>>
>>>>>>
>>>>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD.
>>>>>>
>>>>>> I'm going to check whether this tracker issue
>>>>>> http://tracker.ceph.com/issues/11056
>>>>>>
>>>>>> could be the cause.
>>>>>>
>>>>>> (My master build was done some weeks ago.)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>>>>> To: "aderumier" < aderumier@odiso.com >
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>>>>> Sent: Tuesday, June 9, 2015 09:21:04
>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, 4K RR iops on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, adding more VMs brought little benefit.
>>>>>>
>>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD:
>>>>>>
>>>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits on the nova or kvm side. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>>>>
>>>>>> We tried the same experiment on bare metal. There, 4K RR iops scaled from 40K (1 RBD) to 180K (4 RBDs), but beyond that point the numbers actually degraded instead of scaling further. (Single pipe, more congestion effect.)
>>>>>>
>>>>>> We never suspected that enabling the rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>>>>> 
>>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>> 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> I'm running a benchmark (ceph master branch) with randread 4k qdepth=32,
>>>>>> and rbd_cache=true seems to limit the iops to around 40k
>>>>>> 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Is it expected ? 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=false 3 osd 
>>>>>> -------------------------------- 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>>>> | 99.99th=[ 1176] 
>>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=true 3osd 
>>>>>> ------------------------------ 
>>>>>> 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>>>> | 99.99th=[ 2192] 
>>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>>>>> _______________________________________________ 
>>>>> ceph-users mailing list 
>>>>> ceph-users@lists.ceph.com 
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best regards, Irek Nurgayazovich Fasikhov
>>>> Mobile: +79229045757
>>> 
>>> 
>>> 
>>> -- 
>>> -Pushpesh 
>> 
>> 
>> 
>> -- 
>> -Pushpesh 
> 
> 
> 
> -- 
> -Pushpesh 
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 






-- 
Best regards, Irek Nurgayazovich Fasikhov
Mobile: +79229045757


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <1581092206.1667776.1434963299884.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                                                             ` <1581092206.1667776.1434963299884.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-22  9:04                                                               ` Irek Fasikhov
  2015-06-22  9:26                                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: Irek Fasikhov @ 2015-06-22  9:04 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, pushpesh sharma, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 47968 bytes --]

| Proxmox 4.0 will allow enabling or disabling one iothread per disk.

Alexandre, a useful option!
Will it be possible in Proxmox 3.4 to add it, at least in the configuration
file? Or does it require a change to the KVM source code?
Thanks.

2015-06-22 11:54 GMT+03:00 Alexandre DERUMIER <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>:

> >>It is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x
> updates), but you need to set iothread:1 in the conf file. For single
> drives the effect on performance is ambiguous.
>
> Yes and no ;)
>
> Currently in Proxmox 3.4, iothread:1 generates only 1 iothread for all
> disks.
>
> So you'll get a small extra boost, but it will not scale with multiple
> disks.
>
> Proxmox 4.0 will allow enabling or disabling one iothread per disk.
>
>
> >>Does it also help for single disks or only multiple disks?
>
> An iothread can also help with a single disk, because by default qemu uses
> its main thread for disk I/O as well as other things (I don't remember what exactly).
>
>
>
>
> ----- Original Message -----
> From: "Irek Fasikhov" <malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> To: "Stefan Priebe" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
> Cc: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>, "pushpesh sharma" <
> pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>,
> "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "ceph-users" <
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> Sent: Monday, June 22, 2015 09:22:13
> Subject: Re: rbd_cache, limiting read on high iops around 40k
>
> It is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x
> updates), but you need to set iothread:1 in the conf file. For single
> drives the effect on performance is ambiguous.
>
> 2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG <
> s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org > :
>
>
>
> On 22.06.2015 at 09:08, Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
>
> >>> Just an update: there seems to be no proper way to pass the iothread
> >>> parameter from openstack-nova (at least not in the Juno release). So a
> >>> single default iothread per VM is all we have. In conclusion, the max
> >>> iops of a nova instance on ceph rbd will be limited to 30-40K.
> >
> > Thanks for the update.
> >
> > For proxmox users,
> >
> > I have added an iothread option to the GUI for Proxmox 4.0
>
> Can we make iothread the default? Does it also help for single disks or
> only multiple disks?
>
> > and added jemalloc as default memory allocator
> >
> >
> > I have also sent a jemalloc patch to the qemu-devel mailing list:
> > https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
> >
> > (Help is welcome to push it into qemu upstream!)
> >
> >
> >
> > ----- Original Message -----
> > From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> > To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> > Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >,
> "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> > Sent: Monday, June 22, 2015 07:58:47
> > Subject: Re: rbd_cache, limiting read on high iops around 40k
> >
> > Just an update: there seems to be no proper way to pass the iothread
> > parameter from openstack-nova (at least not in the Juno release). So a
> > single default iothread per VM is all we have. In conclusion, the max
> > iops of a nova instance on ceph rbd will be limited to 30-40K.
> >
> > On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER
> > < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >> Hi,
> >>
> >> some news about qemu with tcmalloc vs jemalloc.
> >>
> >> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
> >>
> >> Although tcmalloc is a little faster than jemalloc,
> >>
> >> I have often hit the
> >> tcmalloc::ThreadCache::ReleaseToCentralCache bug.
> >>
> >> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
> >>
> >>
> >> With multiple disks, I'm around 200k iops with tcmalloc (before hitting
> >> the bug) and 350k iops with jemalloc.
> >>
> >> The problem is that when I hit the malloc bug, I drop to around 4000-10000 iops,
> >> and the only way to fix it is to restart qemu ...
> >>
> >>
> >> ----- Original Message -----
> >> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >> Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >,
> "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >> Sent: Friday, June 12, 2015 08:58:21
> >> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>
> >> Thanks, I posted the question on the openstack list. Hopefully we will get some
> >> expert opinion.
> >>
> >> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
> >> < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>> Hi,
> >>>
> >>> Here is a libvirt XML sample from the libvirt source
> >>>
> >>> (you need to define the <iothreads> number, then assign them to the disks).
> >>>
> >>> I don't use openstack, so I really don't know how it works with it.
> >>>
> >>>
> >>> <domain type='qemu'>
> >>> <name>QEMUGuest1</name>
> >>> <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> >>> <memory unit='KiB'>219136</memory>
> >>> <currentMemory unit='KiB'>219136</currentMemory>
> >>> <vcpu placement='static'>2</vcpu>
> >>> <iothreads>2</iothreads>
> >>> <os>
> >>> <type arch='i686' machine='pc'>hvm</type>
> >>> <boot dev='hd'/>
> >>> </os>
> >>> <clock offset='utc'/>
> >>> <on_poweroff>destroy</on_poweroff>
> >>> <on_reboot>restart</on_reboot>
> >>> <on_crash>destroy</on_crash>
> >>> <devices>
> >>> <emulator>/usr/bin/qemu</emulator>
> >>> <disk type='file' device='disk'>
> >>> <driver name='qemu' type='raw' iothread='1'/>
> >>> <source file='/var/lib/libvirt/images/iothrtest1.img'/>
> >>> <target dev='vdb' bus='virtio'/>
> >>> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
> >>> </disk>
> >>> <disk type='file' device='disk'>
> >>> <driver name='qemu' type='raw' iothread='2'/>
> >>> <source file='/var/lib/libvirt/images/iothrtest2.img'/>
> >>> <target dev='vdc' bus='virtio'/>
> >>> </disk>
> >>> <controller type='usb' index='0'/>
> >>> <controller type='ide' index='0'/>
> >>> <controller type='pci' index='0' model='pci-root'/>
> >>> <memballoon model='none'/>
> >>> </devices>
> >>> </domain>
> >>>
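The iothread assignment described above (declare an <iothreads> count on the domain, then reference one thread from each disk's <driver> element) can be sketched in Python. This is only an illustration under assumptions: the trimmed-down DOMAIN_XML and the helper name assign_iothreads are hypothetical, and the sketch rewrites an XML string rather than talking to libvirt or nova.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down domain with two virtio disks and no iothreads yet.
DOMAIN_XML = """
<domain type='qemu'>
  <name>QEMUGuest1</name>
  <vcpu placement='static'>2</vcpu>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/iothrtest1.img'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/iothrtest2.img'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def assign_iothreads(domain_xml: str) -> str:
    """Give every virtio disk its own iothread and declare the total count."""
    root = ET.fromstring(domain_xml)

    # Collect the disks attached through the virtio bus.
    virtio_disks = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None and target.get("bus") == "virtio":
            virtio_disks.append(disk)

    # iothread ids start at 1, one per disk.
    for index, disk in enumerate(virtio_disks, start=1):
        driver = disk.find("driver")
        if driver is None:
            driver = ET.SubElement(disk, "driver", {"name": "qemu"})
        driver.set("iothread", str(index))

    # Declare the number of iothreads on the domain, right after <vcpu> if present.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.Element("iothreads")
        vcpu = root.find("vcpu")
        position = list(root).index(vcpu) + 1 if vcpu is not None else 0
        root.insert(position, iothreads)
    iothreads.text = str(len(virtio_disks))

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(assign_iothreads(DOMAIN_XML))
```

The output is meant to be compared by eye with the sample above; how such a definition could be fed to a nova-managed instance is exactly the open question in this thread, and the sketch does not address it.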
> >>>
> >>> ----- Original Message -----
> >>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>> Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >,
> "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>> Sent: Friday 12 June 2015 07:52:41
> >>> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>>
> >>> Hi Alexandre,
> >>>
> >>> I agree with your rationale of one iothread per disk. CPU consumed in
> >>> IOwait is pretty high in each VM. But I cannot find a way to set
> >>> this on a nova instance. I am using openstack Juno with QEMU+KVM.
> >>> As per the libvirt documentation for setting iothreads, I can edit
> >>> domain.xml directly and achieve the same effect. However, in an
> >>> openstack environment the domain XML is created by nova with some
> >>> additional metadata, so editing the domain XML using 'virsh edit' does
> >>> not seem to work (I agree, it is not a very cloud way of doing things,
> >>> but a hack). Changes made there vanish after saving them, because
> >>> libvirt validation fails on them.
> >>>
> >>> #virsh dumpxml instance-000000c5 > vm.xml
> >>> #virt-xml-validate vm.xml
> >>> Relax-NG validity error : Extra element cpu in interleave
> >>> vm.xml:1: element domain: Relax-NG validity error : Element domain
> >>> failed to validate content
> >>> vm.xml fails to validate
> >>>
> >>> The second approach I took was setting QoS in volume types. But there
> >>> is no option to set iothreads per volume; there are only parameters
> >>> related to max_read/write ops/bytes.
> >>>
> >>> Thirdly, editing the Nova flavor and providing extra specs like
> >>> hw:cpu_socket/thread/core can change the guest CPU topology, however
> >>> again there is no way to set iothreads. It does accept
> >>> hw_disk_iothreads (no type check in place, I believe), but cannot pass
> >>> it into domain.xml.
> >>>
> >>> Could you suggest a way to set this?
> >>>
> >>> -Pushpesh
> >>>
> >>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
> >>> < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>>>>> I need to try out the performance on qemu soon and may come back to
> you if I need some qemu setting trick :-)
> >>>>
> >>>> Sure no problem.
> >>>>
> >>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks,
> >>>> with 1 iothread per disk)
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >
> >>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "Irek Fasikhov" <
> malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" <
> pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Wednesday 10 June 2015 09:06:32
> >>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
> >>>>
> >>>> Hi Alexandre,
> >>>> Thanks for sharing the data.
> >>>> I need to try out the performance on qemu soon and may come back to
> you if I need some qemu setting trick :-)
> >>>>
> >>>> Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto: ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org ] On
> Behalf Of Alexandre DERUMIER
> >>>> Sent: Tuesday, June 09, 2015 10:42 PM
> >>>> To: Irek Fasikhov
> >>>> Cc: ceph-devel; pushpesh sharma; ceph-users
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops
> around 40k
> >>>>
> >>>>>> Very good work!
> >>>>>> Do you have a rpm-file?
> >>>>>> Thanks.
> >>>> No, sorry, I have compiled it manually (and I'm using Debian jessie
> >>>> as client)
> >>>>
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Irek Fasikhov" < malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>> Cc: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Wednesday 10 June 2015 07:21:42
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around
> 40k
> >>>>
> >>>> Hi, Alexandre.
> >>>>
> >>>> Very good work!
> >>>> Do you have a rpm-file?
> >>>> Thanks.
> >>>>
> >>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> :
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is
> >>>> huge with iothread: 50k iops (+45%) !
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : glibc : iops=33395
> >>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> >>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> >>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
> >>>>
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
> >>>> qemu : iothread : jemalloc : iops=28023 (-19%)
> >>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> >>>> ------------------------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
> >>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
> >>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
> >>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
> >>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
> >>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
> >>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
> >>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
> >>>> | 99.99th=[ 3760]
> >>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
> >>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
> >>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
> >>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s,
> maxb=201107KB/s, mint=26070msec, maxt=26070msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840,
> util=99.73%
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
> >>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
> >>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
> >>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
> >>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
> >>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
> >>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
> >>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
> >>>> | 99.99th=[ 3632]
> >>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
> >>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
> >>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
> >>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s,
> maxb=143896KB/s, mint=36435msec, maxt=36435msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716,
> util=99.85%
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>> To: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
> >>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Tuesday 9 June 2015 18:47:27
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around
> 40k
> >>>>
> >>>> Hi Robert,
> >>>>
> >>>>>> What I found was that Ceph OSDs performed well with either tcmalloc
> or
> >>>>>> jemalloc (except when RocksDB was built with jemalloc instead of
> >>>>>> tcmalloc, I'm still working to dig into why that might be the case).
> >>>> Yes, from my tests, for OSDs tcmalloc is a little faster (but only
> >>>> slightly) than jemalloc.
> >>>>
> >>>>
> >>>>
> >>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> >>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> >>>>>> better for QEMU/KVM in the tests that we ran. [1]
> >>>>
> >>>>
> >>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't
> >>>> see a speed regression with tcmalloc.
> >>>> With qemu iothread, tcmalloc gives a speed increase over glibc.
> >>>> With qemu iothread, jemalloc gives a speed decrease.
> >>>>
> >>>> Without iothread, jemalloc gives a big speed increase.
> >>>>
> >>>> This is with:
> >>>> -qemu 2.3
> >>>> -tcmalloc 2.2.1
> >>>> -jemalloc 3.6
> >>>> -libc6 2.19
> >>>>
> >>>>
> >>>> qemu : no-iothread : glibc : iops=33395
> >>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> >>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
> >>>> qemu : iothread : jemalloc : iops=28023 (-19%)
> >>>>
> >>>>
> >>>> (The benefit of iothreads is that we can scale with more disks in 1vm)
> >>>>
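The percentages in the summary above appear to be relative to the glibc run of the same mode (with or without iothread). A minimal Python sketch of that arithmetic, assuming this interpretation is right; the names RESULTS, percent_change and relative_to_glibc are illustrative only, and the IOPS figures are copied from the summary lines.

```python
# IOPS figures copied from the summary lines of this message (4k randread, qemu 2.3).
RESULTS = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def relative_to_glibc(results):
    """For each mode, compare every allocator with the glibc run of that mode."""
    table = {}
    for mode, by_allocator in results.items():
        baseline = by_allocator["glibc"]
        table[mode] = {
            allocator: percent_change(iops, baseline)
            for allocator, iops in by_allocator.items()
            if allocator != "glibc"
        }
    return table


if __name__ == "__main__":
    for mode, changes in relative_to_glibc(RESULTS).items():
        for allocator, change in changes.items():
            print(f"{mode:12s} {allocator:9s} {change:+.1f}%")
```

Comparing within a mode keeps the allocator effect separate from the iothread effect, which matters here because jemalloc moves in opposite directions in the two modes.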
> >>>>
> >>>> fio results:
> >>>> ------------
> >>>>
> >>>> qemu : iothread : tcmalloc : iops=38676
> >>>> -----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9
> 18:16:53 2015
> >>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
> >>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
> >>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
> >>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
> >>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
> >>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
> >>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
> >>>> | 99.99th=[ 3888]
> >>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40,
> stdev=16978.03
> >>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
> >>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
> >>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s,
> maxb=154707KB/s, mint=33889msec, maxt=33889msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096,
> util=99.77%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : tcmalloc : iops=34516
> >>>> ---------------------------------------------
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9
> 18:19:08 2015
> >>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
> >>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
> >>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
> >>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
> >>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
> >>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
> >>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
> >>>> | 99.99th=[ 4320]
> >>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88,
> stdev=16883.77
> >>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
> >>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
> >>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s,
> maxb=138064KB/s, mint=37974msec, maxt=37974msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396,
> util=99.86%
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : glibc : iops=34516
> >>>> -------------------------------------
> >>>>
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9
> 18:24:01 2015
> >>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
> >>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
> >>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
> >>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> >>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
> >>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
> >>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
> >>>> | 99.99th=[ 3984]
> >>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78,
> stdev=15521.30
> >>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
> >>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
> >>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s,
> maxb=137785KB/s, mint=38051msec, maxt=38051msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972,
> util=99.85%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no iothread : glibc : iops=33395
> >>>> -----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9
> 18:27:18 2015
> >>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
> >>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
> >>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
> >>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> >>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
> >>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
> >>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
> >>>> | 99.99th=[ 4832]
> >>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64,
> stdev=19121.91
> >>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
> >>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
> >>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s,
> maxb=133583KB/s, mint=39248msec, maxt=39248msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536,
> util=99.84%
> >>>>
> >>>>
> >>>>
> >>>> qemu : iothread : jemalloc : iops=28023
> >>>> ----------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0
> iops] [eta 00m:01s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9
> 18:30:26 2015
> >>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
> >>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
> >>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
> >>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
> >>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
> >>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
> >>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
> >>>> | 99.99th=[ 3760]
> >>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27,
> stdev=17381.70
> >>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
> >>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
> >>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s,
> maxb=112094KB/s, mint=46772msec, maxt=46772msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376,
> util=98.68%
> >>>>
> >>>>
> >>>>
> >>>> qemu : no-iothread : jemalloc : iops=42226
> >>>> --------------------------------------------
> >>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> >>>> fio-2.1.11
> >>>> Starting 1 process
> >>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0
> iops] [eta 00m:00s]
> >>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9
> 18:34:11 2015
> >>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
> >>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
> >>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
> >>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
> >>>> clat percentiles (usec):
> >>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
> >>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
> >>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
> >>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
> >>>> | 99.99th=[ 2608]
> >>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14,
> stdev=23440.79
> >>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
> >>>> lat (msec) : 2=10.30%, 4=0.07%
> >>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
> >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
> >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
> >>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> >>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>
> >>>> Run status group 0 (all jobs):
> >>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s,
> maxb=177130KB/s, mint=29599msec, maxt=29599msec
> >>>>
> >>>> Disk stats (read/write):
> >>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636,
> util=99.80%
> >>>>
> >>>>
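As a rough consistency check on the fio runs in this message, the reported IOPS can be cross-checked in two ways: bandwidth divided by the 4 KB block size, and queue depth divided by mean total latency (Little's law, assuming the queue stays full at iodepth=32). The sketch below is an illustration only; the names RUNS, iops_from_latency and iops_from_bandwidth are made up here, and the figures are copied from the `read :` and `lat (usec)` lines of each run.

```python
IODEPTH = 32  # iodepth=32 in every run of this message

# label: (reported iops, mean total latency in usec, bandwidth in KB/s)
RUNS = {
    "iothread tcmalloc": (38676, 826.10, 154707),
    "no-iothread tcmalloc": (34516, 925.77, 138065),
    "iothread glibc": (34446, 927.58, 137786),
    "no-iothread glibc": (33395, 957.01, 133583),
    "iothread jemalloc": (28023, 1140.39, 112094),
    "no-iothread jemalloc": (44282, 721.23, 177130),
}


def iops_from_latency(iodepth, mean_lat_usec):
    """Little's law: throughput = requests in flight / mean time per request."""
    return iodepth / (mean_lat_usec * 1e-6)


def iops_from_bandwidth(bw_kb_s, block_kb=4):
    """Each request moves one 4 KB block, so IOPS = bandwidth / block size."""
    return bw_kb_s / block_kb


if __name__ == "__main__":
    for label, (iops, lat_usec, bw) in RUNS.items():
        print(
            f"{label:22s} reported={iops:6d} "
            f"from_latency={iops_from_latency(IODEPTH, lat_usec):8.0f} "
            f"from_bandwidth={iops_from_bandwidth(bw):8.0f}"
        )
```

Both estimates land very close to the reported IOPS for every run, which suggests that at this queue depth the throughput differences between allocators are driven by per-request latency.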
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
> >>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" <
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>> Sent: Tuesday 9 June 2015 18:00:29
> >>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around
> 40k
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> I also saw a similar performance increase by using alternative memory
> >>>> allocators. What I found was that Ceph OSDs performed well with either
> >>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> >>>> instead of tcmalloc, I'm still working to dig into why that might be
> >>>> the case).
> >>>>
> >>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> >>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> >>>> better for QEMU/KVM in the tests that we ran. [1]
> >>>>
> >>>> I'm currently looking into I/O bottlenecks around the 16KB range and
> >>>> I'm seeing a lot of time in thread creation and destruction, the
> >>>> memory allocators are quite a bit down the list (both fio with
> >>>> ioengine rbd and on the OSDs). I wonder what the difference can be.
> >>>> I've tried using the async messenger but there wasn't a huge
> >>>> difference. [2]
> >>>>
> >>>> Further down the rabbit hole....
> >>>>
> >>>> [1]
> https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg20197.html
> >>>> [2]
> https://www.mail-archive.com/ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg23982.html
> >>>> -----BEGIN PGP SIGNATURE-----
> >>>> Version: Mailvelope v0.13.1
> >>>> Comment: https://www.mailvelope.com
> >>>>
> >>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8
> >>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU
> >>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
> >>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2
> >>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3
> >>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51
> >>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO
> >>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3
> >>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b
> >>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13
> >>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ
> >>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4
> >>>> oSJX
> >>>> =k281
> >>>> -----END PGP SIGNATURE-----
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER <
> aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit
> 80K
> >>>>>>> IOPS from 1 VM!
> >>>>>
> >>>>> Note that these results are not in a VM (fio-rbd on host), so in a
> >>>>> VM we'll have overhead.
> >>>>> (I'm planning to send results in qemu soon)
> >>>>>
> >>>>>>> How fast are the SSDs in those 3 OSDs?
> >>>>>
> >>>>> These results are with data in the buffer memory of the OSD nodes.
> >>>>>
> >>>>> When reading fully from SSD (Intel S3500),
> >>>>>
> >>>>> for 1 client,
> >>>>>
> >>>>> I'm around 33k iops without cache and 32k iops with cache, with 1
> >>>>> osd.
> >>>>> I'm around 55k iops without cache and 38k iops with cache, with 3
> >>>>> osd.
> >>>>>
> >>>>> With multiple client jobs, I can reach around 70k iops per osd, and
> >>>>> 250k iops per osd when data is in buffer.
> >>>>>
> >>>>> (server/client CPUs are 2x 10-core 3.1 GHz E5 Xeon)
> >>>>>
> >>>>>
> >>>>>
> >>>>> small tip :
> >>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies
> >>>>> by around 20%:
> >>>>>
> >>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
> >>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
> >>>>>
> >>>>> as a lot of time is spent in malloc/free
> >>>>>
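When the benchmark is launched from a script rather than a shell, the same LD_PRELOAD tip can be applied by building the child environment explicitly. A small Python sketch under assumptions: preload_env is a hypothetical helper, and the library path is the one quoted above, which may differ on other distributions.

```python
import os


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path added to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a space- or colon-separated list; keep existing entries.
    env["LD_PRELOAD"] = f"{library_path} {existing}".strip()
    return env


if __name__ == "__main__":
    env = preload_env("/usr/lib/libtcmalloc_minimal.so.4")
    print(env["LD_PRELOAD"])
    # Hypothetical usage with a job file of your own:
    #   import subprocess
    #   subprocess.run(["fio", "randread.fio"], env=env, check=True)
```

Copying the environment instead of mutating os.environ keeps the preload limited to the benchmarked process.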
> >>>>>
> >>>>> (qemu has also supported tcmalloc for some months, I'll bench it too:
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
> >>>>>
> >>>>>
> >>>>>
> >>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >
> >>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "pushpesh sharma" <
> pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" <
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>> Sent: Tuesday 9 June 2015 13:36:31
> >>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around
> 40k
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> In the past we've hit some performance issues with RBD cache that
> we've
> >>>>> fixed, but we've never really tried pushing a single VM beyond 40+K
> read
> >>>>> IOPS in testing (or at least I never have). I suspect there's a
> couple
> >>>>> of possibilities as to why it might be slower, but perhaps joshd can
> >>>>> chime in as he's more familiar with what that code looks like.
> >>>>>
> >>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
> >>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
> >>>>>> It seems that the limit mainly appears at high queue depth (above
> >>>>>> roughly 16).
> >>>>>>
> >>>>>> Here are the results in iops with 1 client - 4k randread - 3 osd - at
> >>>>>> different queue depths.
> >>>>>> With rbd_cache the result is almost the same as without cache at queue depth < 16.
> >>>>>>
> >>>>>>
> >>>>>> cache
> >>>>>> -----
> >>>>>> qd1: 1651
> >>>>>> qd2: 3482
> >>>>>> qd4: 7958
> >>>>>> qd8: 17912
> >>>>>> qd16: 36020
> >>>>>> qd32: 42765
> >>>>>> qd64: 46169
> >>>>>>
> >>>>>> no cache
> >>>>>> --------
> >>>>>> qd1: 1748
> >>>>>> qd2: 3570
> >>>>>> qd4: 8356
> >>>>>> qd8: 17732
> >>>>>> qd16: 41396
> >>>>>> qd32: 78633
> >>>>>> qd64: 79063
> >>>>>> qd128: 79550
> >>>>>>
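To make the queue-depth effect in the two tables above easier to see, the cached result can be expressed as a fraction of the uncached result at each queue depth. A minimal Python sketch; the names CACHE, NO_CACHE and cache_ratio are illustrative, and the figures are copied from the tables (qd128 is skipped because only the no-cache run reports it).

```python
# IOPS per queue depth, copied from the tables above (1 client, 4k randread, 3 osd).
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(cache, no_cache):
    """IOPS with rbd_cache divided by IOPS without, for shared queue depths."""
    return {qd: cache[qd] / no_cache[qd] for qd in sorted(cache) if qd in no_cache}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(CACHE, NO_CACHE).items():
        print(f"qd{qd:<3d} cache/no-cache = {ratio:.2f}")
```

The ratio stays near 1 up to qd8, drops somewhat at qd16, and falls to a little over half at qd32 and qd64, which is the same pattern described in the message.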
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>>>> To: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" <
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>>> Sent: Tuesday 9 June 2015 09:28:21
> >>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops
> around 40k
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>>> We tried adding more RBDs to single VM, but no luck.
> >>>>>>
> >>>>>> If you want to scale with more disks in a single qemu vm, you need
> >>>>>> to use the iothread feature from qemu and assign 1 iothread per disk
> >>>>>> (works with virtio-blk).
> >>>>>> It's working for me, I can scale by adding more disks.
> >>>>>>
> >>>>>>
> >>>>>> My benchmarks here are done with fio-rbd on the host.
> >>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a
> >>>>>> single host, and around 250k iops with 10 clients and rbd_cache=on.
> >>>>>>
> >>>>>>
> >>>>>> I just wonder why I don't have a performance decrease around 30k iops
> >>>>>> with 1 osd.
> >>>>>>
> >>>>>> I'm going to see if this tracker
> >>>>>> http://tracker.ceph.com/issues/11056
> >>>>>>
> >>>>>> could be the cause.
> >>>>>>
> >>>>>> (My master build was done a few weeks ago)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
> >>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
> >>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
> >>>>>> Sent: Tuesday 9 June 2015 09:21:04
> >>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
> >>>>>>
> >>>>>> Hi Alexandre,
> >>>>>>
> >>>>>> We have also seen something very similar on Hammer (0.94-1). We were
> >>>>>> benchmarking VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each
> >>>>>> Ubuntu VM has one RBD as root disk and one RBD as additional storage. For
> >>>>>> some strange reason, 4K random-read iops on each VM did not scale beyond
> >>>>>> 35-40k. We tried adding more RBDs to a single VM, but no luck. However,
> >>>>>> increasing the number of VMs to 4 on a single hypervisor did scale to some
> >>>>>> extent. Beyond that we got little benefit from adding more VMs.
> >>>>>>
> >>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors,
> >>>>>> each hypervisor has 4 VMs, and each VM has 1 RBD:
> >>>>>>
> >>>>>> [chart not preserved in the archive]
> >>>>>>
> >>>>>> VDbench is used as the benchmarking tool. We were not saturating the
> >>>>>> network or the CPUs at the OSD nodes. We were not able to saturate the
> >>>>>> CPUs at the hypervisors either, which is where we suspected some
> >>>>>> throttling effect. However, we haven't set any such limits from the nova
> >>>>>> or kvm end. We tried some CPU pinning and other KVM-related tuning as
> >>>>>> well, but no luck.
> >>>>>>
> >>>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled
> >>>>>> from 40K (1 RBD) to 180K (4 RBDs). Beyond that point the numbers actually
> >>>>>> degraded rather than scaling further. (Single pipe, more congestion effect.)
> >>>>>>
> >>>>>> We never suspected that enabling rbd cache could be detrimental to
> >>>>>> performance. It would be nice to root-cause the problem if that is the case.
> >>>>>>
> >>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER <
> aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm benchmarking (ceph master branch) with randread 4k, qdepth=32,
> >>>>>> and rbd_cache=true seems to limit the iops to around 40k
> >>>>>>
> >>>>>>
> >>>>>> no cache
> >>>>>> --------
> >>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops
> >>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops
> >>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops
> >>>>>>
> >>>>>>
> >>>>>> cache
> >>>>>> -----
> >>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops
> >>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops
> >>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Is it expected ?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> fio result rbd_cache=false 3 osd
> >>>>>> --------------------------------
> >>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> >>>>>> fio-2.1.11
> >>>>>> Starting 1 process
> >>>>>> rbd engine: RBD version: 0.1.9
> >>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
> >>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
> >>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
> >>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
> >>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
> >>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
> >>>>>> clat percentiles (usec):
> >>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
> >>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
> >>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
> >>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
> >>>>>> | 99.99th=[ 1176]
> >>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
> >>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
> >>>>>> lat (msec) : 2=0.03%, 4=0.01%
> >>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
> >>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
> >>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
> >>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>>>
> >>>>>> Run status group 0 (all jobs):
> >>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
> >>>>>>
> >>>>>> Disk stats (read/write):
> >>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> fio result rbd_cache=true 3osd
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> >>>>>> fio-2.1.11
> >>>>>> Starting 1 process
> >>>>>> rbd engine: RBD version: 0.1.9
> >>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
> >>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
> >>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
> >>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
> >>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
> >>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
> >>>>>> clat percentiles (usec):
> >>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
> >>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
> >>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
> >>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
> >>>>>> | 99.99th=[ 2192]
> >>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
> >>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
> >>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
> >>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
> >>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
> >>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
> >>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
> >>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
> >>>>>>
> >>>>>> Run status group 0 (all jobs):
> >>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
> >>>>>>
> >>>>>> Disk stats (read/write):
> >>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
> >>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Best regards, Fasikhov Irek Nurgayazovich
Mobile: +79229045757

[-- Attachment #1.2: Type: text/html, Size: 68204 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: rbd_cache, limiting read on high iops around 40k
  2015-06-22  9:04                                                               ` Irek Fasikhov
@ 2015-06-22  9:26                                                                 ` Alexandre DERUMIER
       [not found]                                                                   ` <43279853.1688973.1434965164602.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-22  9:26 UTC (permalink / raw)
  To: Irek Fasikhov
  Cc: Stefan Priebe, pushpesh sharma, Somnath Roy, ceph-devel,
	ceph-users

>>In proxmox 3.4, will it be possible to add it at least in the configuration file? Or does it entail a change in the KVM source code? 
>>Thanks. 

This small patch on top of qemu-server should be enough (I think it should apply to the 3.4 sources without problems):

https://git.proxmox.com/?p=qemu-server.git;a=commit;h=51f492cd6da0228129aaab1393b5c5844d75a53c

No need to hack qemu-kvm.



----- Original Message -----
From: "Irek Fasikhov" <malmyzh@gmail.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Stefan Priebe" <s.priebe@profihost.ag>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "Somnath Roy" <Somnath.Roy@sandisk.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Monday 22 June 2015 11:04:42
Subject: Re: rbd_cache, limiting read on high iops around 40k

| Proxmox 4.0 will allow enabling or disabling 1 iothread per disk. 
Alexandre, useful option! 
In proxmox 3.4, will it be possible to add it at least in the configuration file? Or does it entail a change in the KVM source code? 
Thanks. 

2015-06-22 11:54 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 


>>It is already possible in proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you have to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous. 

Yes and no ;) 

Currently in proxmox 3.4, iothread:1 generates only 1 iothread for all disks. 

So you'll get a small extra boost, but it will not scale with multiple disks. 

Proxmox 4.0 will allow enabling or disabling 1 iothread per disk. 


>>Does it also help for single disks or only multiple disks? 

An iothread can also help for a single disk, because by default qemu uses one main thread for disk I/O but also for other things (I don't remember what exactly). 
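
As an illustration of "1 iothread per disk" at the QEMU level, here is a minimal Python sketch that builds the relevant command-line arguments. The -object iothread, -drive and -device virtio-blk-pci options are standard QEMU options; the function name, the id naming scheme, the pool/image names and the choice of cache=none are assumptions made for the example, not what Proxmox generates.

```python
def iothread_args(rbd_images):
    """Build QEMU arguments giving each virtio-blk disk its own iothread.

    rbd_images: list of "pool/image" strings (hypothetical names).
    """
    args = []
    for i, image in enumerate(rbd_images):
        args += [
            # One iothread object per disk.
            "-object", f"iothread,id=iothread{i}",
            # Backing drive; cache=none is an assumption for this sketch.
            "-drive", f"file=rbd:{image},if=none,id=drive{i},format=raw,cache=none",
            # Bind the virtio-blk device to its own iothread.
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


if __name__ == "__main__":
    print(" ".join(iothread_args(["rbd/vm-disk-0", "rbd/vm-disk-1"])))
```

A single shared iothread would correspond to every -device line pointing at the same iothread id, which matches the "small extra boost, but no scaling" behaviour described above.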




----- Original Message ----- 
From: "Irek Fasikhov" < malmyzh@gmail.com > 
To: "Stefan Priebe" < s.priebe@profihost.ag > 
Cc: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "Somnath Roy" < Somnath.Roy@sandisk.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
Sent: Monday 22 June 2015 09:22:13 
Subject: Re: rbd_cache, limiting read on high iops around 40k 

It is already possible in proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you have to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous. 

2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG < s.priebe@profihost.ag > : 



Am 22.06.2015 um 09:08 schrieb Alexandre DERUMIER < aderumier@odiso.com >: 

>>> Just an update: there seems to be no proper way to pass the iothread 
>>> parameter from openstack-nova (at least not in the Juno release). So a 
>>> default single iothread per VM is all we have. In conclusion, a 
>>> nova instance's max iops on ceph rbd will be limited to 30-40K. 
> 
> Thanks for the update. 
> 
> For proxmox users, 
> 
> I have added iothread option to gui for proxmox 4.0 

Can we make iothread the default? Does it also help for single disks or only multiple disks? 

> and added jemalloc as the default memory allocator 
> 
> 
> I have also sent a jemalloc patch to the qemu-devel mailing list 
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html 
> 
> (Help is welcome to push it into qemu upstream!) 
> 
> 
> 
> ----- Original Message ----- 
> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
> To: "aderumier" < aderumier@odiso.com > 
> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
> Sent: Monday 22 June 2015 07:58:47 
> Subject: Re: rbd_cache, limiting read on high iops around 40k 
> 
> Just an update: there seems to be no proper way to pass the iothread 
> parameter from openstack-nova (at least not in the Juno release). So a 
> default single iothread per VM is all we have. In conclusion, a 
> nova instance's max iops on ceph rbd will be limited to 30-40K. 
> 
> On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER 
> < aderumier@odiso.com > wrote: 
>> Hi, 
>> 
>> some news about qemu with tcmalloc vs jemalloc. 
>> 
>> I'm testing with multiple disks (with iothreads) in 1 qemu guest. 
>> 
>> Although tcmalloc is a little faster than jemalloc, 
>> 
>> I have often hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug. 
>> 
>> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help. 
>> 
>> 
>> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc. 
>> 
>> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to restart qemu ... 
>> 
>> 
>> 
>> ----- Original Message ----- 
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>> To: "aderumier" < aderumier@odiso.com > 
>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>> Sent: Friday 12 June 2015 08:58:21 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Thanks, posted the question in openstack list. Hopefully will get some 
>> expert opinion. 
>> 
>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
>> < aderumier@odiso.com > wrote: 
>>> Hi, 
>>> 
>>> here is a libvirt xml sample from the libvirt sources 
>>> 
>>> (you need to define the <iothreads> number, then assign them in the disks). 
>>> 
>>> I don't use openstack, so I really don't know how it works with it. 
>>> 
>>> 
>>> <domain type='qemu'>
>>>   <name>QEMUGuest1</name>
>>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>>   <memory unit='KiB'>219136</memory>
>>>   <currentMemory unit='KiB'>219136</currentMemory>
>>>   <vcpu placement='static'>2</vcpu>
>>>   <iothreads>2</iothreads>
>>>   <os>
>>>     <type arch='i686' machine='pc'>hvm</type>
>>>     <boot dev='hd'/>
>>>   </os>
>>>   <clock offset='utc'/>
>>>   <on_poweroff>destroy</on_poweroff>
>>>   <on_reboot>restart</on_reboot>
>>>   <on_crash>destroy</on_crash>
>>>   <devices>
>>>     <emulator>/usr/bin/qemu</emulator>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='1'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>>       <target dev='vdb' bus='virtio'/>
>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>>     </disk>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='2'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>>       <target dev='vdc' bus='virtio'/>
>>>     </disk>
>>>     <controller type='usb' index='0'/>
>>>     <controller type='ide' index='0'/>
>>>     <controller type='pci' index='0' model='pci-root'/>
>>>     <memballoon model='none'/>
>>>   </devices>
>>> </domain>
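
The pattern in the sample above (declare an <iothreads> count, then reference an iothread from each virtio disk's <driver>) can be applied to an existing domain XML programmatically. This is a minimal sketch using only Python's standard library; the function name assign_iothreads is hypothetical, and appending a missing <iothreads> element at the end of <domain> assumes libvirt does not care about element order. It only rewrites the XML text and does not address nova regenerating the domain definition.

```python
import xml.etree.ElementTree as ET


def assign_iothreads(domain_xml):
    """Give every virtio disk in a libvirt domain XML its own iothread."""
    root = ET.fromstring(domain_xml)

    # Collect the <driver> element of each disk attached to the virtio bus.
    drivers = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is not None and driver is not None and target.get("bus") == "virtio":
            drivers.append(driver)

    # <iothreads> holds the number of iothreads; create it if missing.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.SubElement(root, "iothreads")
    iothreads.text = str(len(drivers))

    # iothread ids are 1-based, as in the sample above.
    for n, driver in enumerate(drivers, start=1):
        driver.set("iothread", str(n))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<domain type='qemu'><devices>"
        "<disk type='file' device='disk'><driver name='qemu' type='raw'/>"
        "<target dev='vdb' bus='virtio'/></disk>"
        "</devices></domain>"
    )
    print(assign_iothreads(sample))
```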
>>> 
>>> 
>>> ----- Original Message ----- 
>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>> To: "aderumier" < aderumier@odiso.com > 
>>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>> Sent: Friday 12 June 2015 07:52:41 
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Hi Alexandre, 
>>> 
>>> I agree with your rationale of one iothread per disk. CPU consumed in 
>>> IOwait is pretty high in each VM. But I am not finding a way to set 
>>> this on a nova instance. I am using openstack Juno with QEMU+KVM. 
>>> As per the libvirt documentation for setting iothreads, I can edit 
>>> domain.xml directly and achieve the same effect. However, in an 
>>> openstack env the domain xml is created by nova with some additional 
>>> metadata, so editing the domain xml using 'virsh edit' does not seem 
>>> to work (I agree, it is not a very cloud way of doing things, but a 
>>> hack). Changes made there vanish after saving them, because 
>>> libvirt validation fails on them. 
>>> 
>>> #virsh dumpxml instance-000000c5 > vm.xml 
>>> #virt-xml-validate vm.xml 
>>> Relax-NG validity error : Extra element cpu in interleave 
>>> vm.xml:1: element domain: Relax-NG validity error : Element domain 
>>> failed to validate content 
>>> vm.xml fails to validate 
>>> 
>>> The second approach I took was setting QoS in volume types. But there 
>>> is no option to set iothreads per volume; there are only parameters related 
>>> to max_read/write ops/bytes. 
>>> 
>>> Thirdly, editing the Nova flavor and providing extra specs like 
>>> hw:cpu_socket/thread/core can change the guest CPU topology, but again 
>>> there is no way to set iothread. It does accept hw_disk_iothreads (no type 
>>> check in place, I believe), but cannot pass it into domain.xml. 
>>> 
>>> Could you suggest a way to set this? 
>>> 
>>> -Pushpesh 
>>> 
>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
>>> < aderumier@odiso.com > wrote: 
>>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>> 
>>>> Sure no problem. 
>>>> 
>>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks, with 1 iothread per disk) 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Somnath Roy" < Somnath.Roy@sandisk.com > 
>>>> To: "aderumier" < aderumier@odiso.com >, "Irek Fasikhov" < malmyzh@gmail.com > 
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Wednesday 10 June 2015 09:06:32 
>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Alexandre, 
>>>> Thanks for sharing the data. 
>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>> 
>>>> Regards 
>>>> Somnath 
>>>> 
>>>> -----Original Message----- 
>>>> From: ceph-users [mailto: ceph-users-bounces@lists.ceph.com ] On Behalf Of Alexandre DERUMIER 
>>>> Sent: Tuesday, June 09, 2015 10:42 PM 
>>>> To: Irek Fasikhov 
>>>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>>>> Very good work! 
>>>>>> Do you have a rpm-file? 
>>>>>> Thanks. 
>>>> No, sorry, I compiled it manually (and I'm using debian jessie as client) 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "Irek Fasikhov" < malmyzh@gmail.com > 
>>>> To: "aderumier" < aderumier@odiso.com > 
>>>> Cc: "Robert LeBlanc" < robert@leblancnet.us >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Wednesday 10 June 2015 07:21:42 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi, Alexandre. 
>>>> 
>>>> Very good work! 
>>>> Do you have a rpm-file? 
>>>> Thanks. 
>>>> 
>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 
>>>> 
>>>> 
>>>> Hi, 
>>>> 
>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%) ! 
>>>> 
>>>> 
>>>> 
>>>> qemu : no iothread : glibc : iops=33395 
>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%) 
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%) 
>>>> 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>> ------------------------------------------------------ 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015 
>>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec 
>>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 
>>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71 
>>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 
>>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 
>>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 
>>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87 
>>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63% 
>>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% 
>>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015 
>>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec 
>>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 
>>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65 
>>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 
>>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 
>>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 
>>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3632] 
>>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39 
>>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01% 
>>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% 
>>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
>>>> 
>>>> 
>>>> ----- Original Message ----- 
>>>> From: "aderumier" < aderumier@odiso.com > 
>>>> To: "Robert LeBlanc" < robert@leblancnet.us > 
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday 9 June 2015 18:47:27 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Robert, 
>>>> 
>>>>>> What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>>> jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>>> tcmalloc, I'm still working to dig into why that might be the case). 
>>>> Yes, from my tests, for the osd tcmalloc is a little faster (but only very little) than jemalloc. 
>>>> 
>>>> 
>>>> 
>>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>> 
>>>> 
>>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc. 
>>>> with qemu iothread, tcmalloc has a speed increase over glibc 
>>>> with qemu iothread, jemalloc has a speed decrease 
>>>> 
>>>> without iothread, jemalloc has a big speed increase 
>>>> 
>>>> this is with 
>>>> -qemu 2.3 
>>>> -tcmalloc 2.2.1 
>>>> -jemalloc 3.6 
>>>> -libc6 2.19 
>>>> 
>>>> 
>>>> qemu : no iothread : glibc : iops=33395 
>>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%) 
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%) 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%) 
>>>> qemu : iothread : jemalloc : iops=28023 (-19%) 
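
The percentages in the table above are each allocator's IOPS relative to the glibc run in the same iothread mode. A minimal Python sketch of that arithmetic, using only the figures as posted (RESULTS and percent_vs_glibc are illustrative names):

```python
# IOPS as posted in the table above (qemu 2.3, 4k randread, rbd_cache=off).
RESULTS = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def percent_vs_glibc(results):
    """Return the percent change of each allocator against glibc, per mode."""
    out = {}
    for mode, by_alloc in results.items():
        base = by_alloc["glibc"]
        out[mode] = {
            alloc: 100.0 * (iops - base) / base
            for alloc, iops in by_alloc.items()
            if alloc != "glibc"
        }
    return out


if __name__ == "__main__":
    for mode, changes in percent_vs_glibc(RESULTS).items():
        for alloc, pct in changes.items():
            print(f"{mode:12s} {alloc:9s} {pct:+.1f}%")
```

Note that the fio output for the iothread/glibc run further down reports iops=34446 rather than the 34516 shown in the table; using that value as the baseline would shift the iothread percentages only slightly.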
>>>> 
>>>> 
>>>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
>>>> 
>>>> 
>>>> fio results: 
>>>> ------------ 
>>>> 
>>>> qemu : iothread : tcmalloc : iops=38676 
>>>> ----------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>>>> | 99.99th=[ 3888] 
>>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
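
As a sanity check on these numbers, IOPS and average latency in the libaio runs are tied together by Little's law: with the queue kept full at iodepth=32 (fio reports "32=100.0%" under IO depths), throughput is roughly the queue depth divided by the mean total latency. A minimal sketch in Python, with an illustrative function name:

```python
def expected_iops(queue_depth, avg_latency_usec):
    """Little's law: throughput = requests in flight / mean latency."""
    return queue_depth / (avg_latency_usec * 1e-6)


if __name__ == "__main__":
    # "lat (usec): avg=826.10" and "iops=38676" come from the run above.
    estimate = expected_iops(32, 826.10)
    print(f"estimated {estimate:.0f} iops vs reported 38676")
```

The estimate lands within about 1% of the reported figure. In other words, at a fixed depth of 32 the allocator and iothread choices change IOPS only by changing per-request latency. The same check does not apply directly to the fio-rbd runs earlier in the thread, where the IO depths line shows most requests in flight at depth 16 rather than 32.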
>>>> 
>>>> 
>>>> 
>>>> qemu : no-iothread : tcmalloc : iops=34516 
>>>> --------------------------------------------- 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>>>> | 99.99th=[ 4320] 
>>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : glibc : iops=34516 
>>>> ------------------------------------- 
>>>> 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>>>> | 99.99th=[ 3984] 
>>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>>>> 
>>>> 
>>>> 
>>>> qemu : no iothread : glibc : iops=33395 
>>>> ----------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>>>> | 99.99th=[ 4832] 
>>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>>>> 
>>>> 
>>>> 
>>>> qemu : iothread : jemalloc : iops=28023 
>>>> --------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>>>> | 99.99th=[ 3760] 
>>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>>>> 
>>>> 
>>>> 
>>>> qemu : non-iothread : jemalloc : iops=42226 
>>>> ------------------------------------------- 
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>> fio-2.1.11 
>>>> Starting 1 process 
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>>>> clat percentiles (usec): 
>>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>>>> | 99.99th=[ 2608] 
>>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>>>> lat (msec) : 2=10.30%, 4=0.07% 
>>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>> 
>>>> Run status group 0 (all jobs): 
>>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>>>> 
>>>> Disk stats (read/write): 
>>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
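
A quick way to sanity-check the four qemu runs quoted above is to recompute iops from the "issued" read count and the runtime in each fio summary. The sketch below only uses numbers copied from those summaries; the labels and the helper name are illustrative, and the truncating division is an assumption that happens to match the figures fio printed here. Note that the last heading quotes 42226 while fio's own summary line for that run reports 44282.

```python
# Cross-check fio's reported iops against issued reads / runtime.
# The (issued, runtime in msec) pairs are copied from the four qemu runs above.
RUNS = {
    "iothread, glibc": (1310720, 38051),
    "no iothread, glibc": (1310720, 39248),
    "iothread, jemalloc": (1310720, 46772),
    "no iothread, jemalloc": (1310720, 29599),
}


def iops(issued: int, runtime_ms: int) -> int:
    """Completed I/Os per second, truncated to an integer."""
    return issued * 1000 // runtime_ms


for label, (issued, runtime_ms) in RUNS.items():
    print(f"{label:24s} {iops(issued, runtime_ms):6d} iops")
```

Because every run issues the same 1310720 reads, the iops ranking is simply the inverse of the runtime ranking.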
>>>> 
>>>> 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "Robert LeBlanc" < robert@leblancnet.us > 
>>>> To: "aderumier" < aderumier@odiso.com > 
>>>> Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>> Sent: Tuesday 9 June 2015 18:00:29 
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE----- 
>>>> Hash: SHA256 
>>>> 
>>>> I also saw a similar performance increase by using alternative memory 
>>>> allocators. What I found was that Ceph OSDs performed well with either 
>>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>>> instead of tcmalloc, I'm still working to dig into why that might be 
>>>> the case). 
>>>> 
>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>> 
>>>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>>>> I'm seeing a lot of time in thread creation and destruction, the 
>>>> memory allocators are quite a bit down the list (both fio with 
>>>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>>>> I've tried using the async messenger but there wasn't a huge 
>>>> difference. [2] 
>>>> 
>>>> Further down the rabbit hole.... 
>>>> 
>>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 
>>>> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 
>>>> -----BEGIN PGP SIGNATURE----- 
>>>> Version: Mailvelope v0.13.1 
>>>> Comment: https://www.mailvelope.com 
>>>> 
>>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87 
>>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>>>> oSJX 
>>>> =k281 
>>>> -----END PGP SIGNATURE----- 
>>>> ---------------- 
>>>> Robert LeBlanc 
>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>>>> 
>>>> 
>>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>>> IOPS from 1 VM! 
>>>>> 
>>>>> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have extra overhead. 
>>>>> (I'm planning to send qemu results soon.) 
>>>>> 
>>>>>>> How fast are the SSDs in those 3 OSDs? 
>>>>> 
>>>>> These results are with the data in the buffer memory of the OSD nodes. 
>>>>> 
>>>>> When reading entirely from SSD (Intel S3500), 
>>>>> 
>>>>> for 1 client, 
>>>>> 
>>>>> I get around 33k iops without cache and 32k iops with cache, with 1 OSD. 
>>>>> I get around 55k iops without cache and 38k iops with cache, with 3 OSDs. 
>>>>> 
>>>>> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer memory. 
>>>>> 
>>>>> (The servers and clients each have 2x 10-core 3.1 GHz Xeon E5 CPUs.) 
>>>>> 
>>>>> 
>>>>> 
>>>>> small tip : 
>>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>>>> 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>>>> 
>>>>> as a lot of time is spent in malloc/free 
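
The LD_PRELOAD tip above can also be scripted when a benchmark is driven from Python. This is a minimal sketch, not part of the original tip: the helper names are made up for illustration, the library path is the one quoted above and varies by distribution, and the job file name in the closing comment is hypothetical.

```python
import os
import subprocess

# Path taken from the tip above; it varies by distribution.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_preloaded(argv, library=TCMALLOC):
    """Run argv with the allocator preloaded; raises if the command fails."""
    return subprocess.run(argv, env=preload_env(library), check=True)


# Example (hypothetical job file name): run_preloaded(["fio", "randread.fio"])
```

Building the environment in a separate function keeps the current process untouched, so only the benchmarked child process sees the alternative allocator.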
>>>>> 
>>>>> 
>>>>> (qemu has also supported tcmalloc for a few months; I'll bench it too: 
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
>>>>> 
>>>>> 
>>>>> 
>>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message ----- 
>>>>> From: "Mark Nelson" < mnelson@redhat.com > 
>>>>> To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>> Sent: Tuesday 9 June 2015 13:36:31 
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>> 
>>>>> Hi All, 
>>>>> 
>>>>> In the past we've hit some performance issues with RBD cache that we've 
>>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>>>> chime in as he's more familiar with what that code looks like. 
>>>>> 
>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>>>> 
>>>>> Mark 
>>>>> 
>>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>>>> It seems that the limit mainly shows up at high queue depths (roughly > 16). 
>>>>>> 
>>>>>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths. 
>>>>>> With rbd_cache, the results are almost the same as without cache for queue depths < 16. 
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> qd1: 1651 
>>>>>> qd2: 3482 
>>>>>> qd4: 7958 
>>>>>> qd8: 17912 
>>>>>> qd16: 36020 
>>>>>> qd32: 42765 
>>>>>> qd64: 46169 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> qd1: 1748 
>>>>>> qd2: 3570 
>>>>>> qd4: 8356 
>>>>>> qd8: 17732 
>>>>>> qd16: 41396 
>>>>>> qd32: 78633 
>>>>>> qd64: 79063 
>>>>>> qd128: 79550 
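
To make the two tables above easier to compare, the sketch below divides the cached iops by the uncached iops at each queue depth. The figures are copied from the tables; the function name is illustrative. The ratio stays at or above roughly 0.94 up to qd8, then falls to about 0.87 at qd16 and about 0.54 at qd32.

```python
# IOPS figures copied from the two tables above (1 client, 4k randread, 3 OSDs).
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(qd):
    """IOPS with rbd_cache divided by IOPS without it, at queue depth qd."""
    return CACHE[qd] / NO_CACHE[qd]


for qd in sorted(CACHE):
    print(f"qd{qd:<3d} cache/no-cache = {cache_ratio(qd):.2f}")
```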
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "aderumier" < aderumier@odiso.com > 
>>>>>> To: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>>> Sent: Tuesday 9 June 2015 09:28:21 
>>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>>>> We tried adding more RBDs to single VM, but no luck. 
>>>>>> 
>>>>>> If you want to scale with more disks in a single qemu VM, you need to use qemu's iothread feature and assign 1 iothread per disk (works with virtio-blk). 
>>>>>> It's working for me; I can scale by adding more disks. 
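
As a rough illustration of "1 iothread per disk", the sketch below builds the qemu command-line arguments for a list of images, pairing each virtio-blk device with its own iothread object. The helper name, the ids and the pool/image names are hypothetical, and the drive options (raw format, cache=none) are assumptions rather than settings taken from this thread.

```python
def iothread_args(images):
    """Build qemu arguments giving each image its own iothread (virtio-blk)."""
    args = []
    for i, image in enumerate(images):
        args += [
            "-object", f"iothread,id=iothread{i}",
            "-drive", f"file={image},if=none,id=drive{i},format=raw,cache=none",
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


# Hypothetical pool/image names, for illustration only.
print(" ".join(iothread_args(["rbd:rbd/vm-disk-0", "rbd:rbd/vm-disk-1"])))
```

Each disk gets a distinct iothread id, so requests for different disks are not funnelled through a single thread.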
>>>>>> 
>>>>>> 
>>>>>> My benchmarks here are done with fio-rbd on the host. 
>>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and to around 250k iops with 10 clients and rbd_cache=on. 
>>>>>> 
>>>>>> 
>>>>>> I just wonder why I don't see a performance decrease at around 30k iops with 1 OSD. 
>>>>>> 
>>>>>> I'm going to see if this tracker 
>>>>>> http://tracker.ceph.com/issues/11056 
>>>>>> 
>>>>>> could be the cause. 
>>>>>> 
>>>>>> (My master build was done some week ago) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original message ----- 
>>>>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com > 
>>>>>> To: "aderumier" < aderumier@odiso.com > 
>>>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 
>>>>>> Sent: Tuesday 9 June 2015 09:21:04 
>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>>>>> 
>>>>>> Hi Alexandre, 
>>>>>> 
>>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason, 4K random-read (RR) iops on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs. 
>>>>>> 
>>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD: 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits from the nova or KVM end. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 
>>>>>> 
>>>>>> We tried the same experiment on bare metal. There, 4K RR iops scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect.) 
>>>>>> 
>>>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 
>>>>>> 
>>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>> 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> I'm benchmarking (ceph master branch) with 4k randread at qdepth=32, 
>>>>>> and rbd_cache=true seems to limit the iops to around 40k. 
>>>>>> 
>>>>>> 
>>>>>> no cache 
>>>>>> -------- 
>>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>>>> 
>>>>>> 
>>>>>> cache 
>>>>>> ----- 
>>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Is it expected ? 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=false 3 osd 
>>>>>> -------------------------------- 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>>>> | 99.99th=[ 1176] 
>>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fio result rbd_cache=true 3osd 
>>>>>> ------------------------------ 
>>>>>> 
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>> fio-2.1.11 
>>>>>> Starting 1 process 
>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>>>> clat percentiles (usec): 
>>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>>>> | 99.99th=[ 2192] 
>>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 
>>>>> _______________________________________________ 
>>>>> ceph-users mailing list 
>>>>> ceph-users@lists.ceph.com 
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best regards, Fasikhov Irek Nurgayazovich 
>>>> Mobile: +79229045757 
>>> 
>>> 
>>> 
>>> -- 
>>> -Pushpesh 
>> 
>> 
>> 
>> -- 
>> -Pushpesh 
> 
> 
> 
> -- 
> -Pushpesh 
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 






-- 
Best regards, Fasikhov Irek Nurgayazovich 
Mobile: +79229045757 








--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <43279853.1688973.1434965164602.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                                                                   ` <43279853.1688973.1434965164602.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
@ 2015-06-22  9:28                                                                     ` Stefan Priebe - Profihost AG
       [not found]                                                                       ` <B7D8B5F0-4AB9-449A-895D-CF87AE49BCF6-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Priebe - Profihost AG @ 2015-06-22  9:28 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, pushpesh sharma, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 48536 bytes --]

Oh so it only works for virtio disks? I'm using scsi with the virtio PCI controller.

Stefan

Excuse my typo sent from my mobile phone.

On 22.06.2015 at 11:26, Alexandre DERUMIER <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org> wrote:

>>> In Proxmox 3.4, will it be possible to add it at least via the configuration file? Or does that require a change to the KVM source code? 
>>> Thanks.
> 
> This small patch on top of qemu-server should be enough (I think it should apply on 3.4 sources without problem)
> 
> https://git.proxmox.com/?p=qemu-server.git;a=commit;h=51f492cd6da0228129aaab1393b5c5844d75a53c
> 
> No need to hack qemu-kvm
> 
> 
> 
> ----- Original message -----
> From: "Irek Fasikhov" <malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> To: "aderumier" <aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org>
> Cc: "Stefan Priebe" <s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>, "pushpesh sharma" <pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "Somnath Roy" <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "ceph-devel" <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "ceph-users" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
> Sent: Monday 22 June 2015 11:04:42
> Subject: Re: rbd_cache, limiting read on high iops around 40k
> 
> | Proxmox 4.0 will allow to enable|disable 1 iothread by disk. 
> Alexandre, useful option! 
> In Proxmox 3.4, will it be possible to add it at least via the configuration file? Or does that require a change to the KVM source code? 
> Thanks. 
> 
> 2015-06-22 11:54 GMT+03:00 Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > : 
> 
> 
>>> This is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you need to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous.
> 
> Yes and no ;) 
> 
> Currently in Proxmox 3.4, iothread:1 generates only 1 iothread for all disks. 
> 
> So you'll get a small extra boost, but it won't scale with multiple disks. 
> 
> Proxmox 4.0 will allow enabling or disabling 1 iothread per disk. 
> 
> 
>>> Does it also help for single disks or only multiple disks?
> 
> An iothread can also help with a single disk, because by default qemu uses its main thread for disk I/O but also for other things (I don't remember exactly what). 
> 
> 
> 
> 
> ----- Original message ----- 
> From: "Irek Fasikhov" < malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > 
> To: "Stefan Priebe" < s.priebe-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org > 
> Cc: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > 
> Sent: Monday 22 June 2015 09:22:13 
> Subject: Re: rbd_cache, limiting read on high iops around 40k 
> 
> This is already possible in Proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you need to set iothread:1 in the conf file. For single drives, the effect on performance is ambiguous. 
> 
> 2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG < s.priebe@profihost.ag > : 
> 
> 
> 
> On 22.06.2015 at 09:08, Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote: 
> 
>>>> Just an update, there seems to be no proper way to pass iothread 
>>>> parameter from openstack-nova (not at least in Juno release). So a 
>>>> default single iothread per VM is what all we have. So in conclusion a 
>>>> nova instance max iops on ceph rbd will be limited to 30-40K.
>> 
>> Thanks for the update. 
>> 
>> For proxmox users, 
>> 
>> I have added iothread option to gui for proxmox 4.0
> 
> Can we make iothread the default? Does it also help for single disks or only multiple disks? 
> 
>> and added jemalloc as default memory allocator 
>> 
>> 
>> I have also sent a jemalloc patch to the qemu-devel mailing list: 
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html 
>> 
>> (Help is welcome to push it in qemu upstream ! ) 
>> 
>> 
>> 
>> ----- Original message ----- 
>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > 
>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > 
>> Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > 
>> Sent: Monday 22 June 2015 07:58:47 
>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>> 
>> Just an update: there seems to be no proper way to pass the iothread 
>> parameter from openstack-nova (at least not in the Juno release). So a 
>> default single iothread per VM is all we have. In conclusion, the 
>> max iops of a nova instance on ceph rbd will be limited to 30-40K. 
>> 
>> On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER 
>> < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote: 
>>> Hi, 
>>> 
>>> some news about qemu with tcmalloc vs jemalloc. 
>>> 
>>> I'm testing with multiple disks (with iothreads) in 1 qemu guest. 
>>> 
>>> And while tcmalloc is a little faster than jemalloc, 
>>> 
>>> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times. 
>>> 
>>> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help. 
>>> 
>>> 
>>> With multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc. 
>>> 
>>> The problem is that when I hit the malloc bug, I drop to around 4000-10000 iops, and the only way to fix it is to restart qemu ... 
>>> 
>>> 
>>> 
>>> ----- Original message ----- 
>>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > 
>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > 
>>> Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > 
>>> Sent: Friday 12 June 2015 08:58:21 
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>> 
>>> Thanks, posted the question in openstack list. Hopefully will get some 
>>> expert opinion. 
>>> 
>>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER 
>>> < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote: 
>>>> Hi, 
>>>> 
>>>> here is a libvirt XML sample from the libvirt sources 
>>>> 
>>>> (you need to define the number of <iothreads>, then assign them to the disks). 
>>>> 
>>>> I don't use OpenStack, so I really don't know how this works with it. 
>>>> 
>>>> 
>>>> <domain type='qemu'> 
>>>>   <name>QEMUGuest1</name> 
>>>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> 
>>>>   <memory unit='KiB'>219136</memory> 
>>>>   <currentMemory unit='KiB'>219136</currentMemory> 
>>>>   <vcpu placement='static'>2</vcpu> 
>>>>   <iothreads>2</iothreads> 
>>>>   <os> 
>>>>     <type arch='i686' machine='pc'>hvm</type> 
>>>>     <boot dev='hd'/> 
>>>>   </os> 
>>>>   <clock offset='utc'/> 
>>>>   <on_poweroff>destroy</on_poweroff> 
>>>>   <on_reboot>restart</on_reboot> 
>>>>   <on_crash>destroy</on_crash> 
>>>>   <devices> 
>>>>     <emulator>/usr/bin/qemu</emulator> 
>>>>     <disk type='file' device='disk'> 
>>>>       <driver name='qemu' type='raw' iothread='1'/> 
>>>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/> 
>>>>       <target dev='vdb' bus='virtio'/> 
>>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> 
>>>>     </disk> 
>>>>     <disk type='file' device='disk'> 
>>>>       <driver name='qemu' type='raw' iothread='2'/> 
>>>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/> 
>>>>       <target dev='vdc' bus='virtio'/> 
>>>>     </disk> 
>>>>     <controller type='usb' index='0'/> 
>>>>     <controller type='ide' index='0'/> 
>>>>     <controller type='pci' index='0' model='pci-root'/> 
>>>>     <memballoon model='none'/> 
>>>>   </devices> 
>>>> </domain> 
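
To check which disk ends up on which iothread in a domain definition like the sample above, the XML can be inspected with Python's standard library. This is only a sketch: the embedded XML is a trimmed copy of the sample, the function name is illustrative, and the range check assumes iothread ids run from 1 to the declared <iothreads> count.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above: only the elements relevant to iothreads.
DOMAIN_XML = """
<domain type='qemu'>
  <iothreads>2</iothreads>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='1'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='2'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_iothreads(xml_text):
    """Map each disk target to its iothread id, checking it is within range."""
    root = ET.fromstring(xml_text)
    declared = int(root.findtext("iothreads", default="0"))
    mapping = {}
    for disk in root.iter("disk"):
        driver, target = disk.find("driver"), disk.find("target")
        if driver is None or target is None or "iothread" not in driver.attrib:
            continue  # disk without an explicit iothread assignment
        thread_id = int(driver.get("iothread"))
        if not 1 <= thread_id <= declared:
            raise ValueError(
                f"{target.get('dev')}: iothread {thread_id} not in 1..{declared}")
        mapping[target.get("dev")] = thread_id
    return mapping


print(disk_iothreads(DOMAIN_XML))
```

The same function could be fed the output of 'virsh dumpxml' to confirm that a running guest really has one iothread per disk.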
>>>> 
>>>> 
>>>> ----- Original message ----- 
>>>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > 
>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > 
>>>> Cc: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > 
>>>> Sent: Friday 12 June 2015 07:52:41 
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k 
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> I agree with your rationale of one iothread per disk. The CPU time spent in
>>>> IOwait is pretty high in each VM. But I cannot find a way to set
>>>> this on a nova instance. I am using openstack Juno with QEMU+KVM.
>>>> As per the libvirt documentation for setting iothreads, I can edit
>>>> domain.xml directly and achieve the same effect. However, in an
>>>> openstack env the domain xml is created by nova with some additional
>>>> metadata, so editing the domain xml using 'virsh edit' does not seem
>>>> to work (I agree, it is not a very cloud way of doing things, but a
>>>> hack). Changes made there vanish after saving them, because
>>>> libvirt validation fails on the result.
>>>> 
>>>> #virsh dumpxml instance-000000c5 > vm.xml 
>>>> #virt-xml-validate vm.xml 
>>>> Relax-NG validity error : Extra element cpu in interleave 
>>>> vm.xml:1: element domain: Relax-NG validity error : Element domain 
>>>> failed to validate content 
>>>> vm.xml fails to validate 
>>>> 
>>>> The second approach I took was setting QoS in volume types. But there
>>>> is no option to set iothreads per volume; there are only parameters related
>>>> to max_read/write ops/bytes.
>>>> 
>>>> Thirdly, editing the Nova flavor and providing extra specs like
>>>> hw:cpu_socket/thread/core can change the guest CPU topology, but again
>>>> there is no way to set iothreads. It does accept hw_disk_iothreads (no type check
>>>> in place, I believe), but it is not passed through to domain.xml.
>>>> 
>>>> Could you suggest a way to set this?
>>>> 
>>>> -Pushpesh 
>>>> 
>>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER 
>>>> < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > wrote: 
>>>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>>> 
>>>>> Sure no problem. 
>>>>> 
>>>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks, with 1 iothread per disk)
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Somnath Roy" < Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org >
>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "Irek Fasikhov" < malmyzh@gmail.com >
>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>> Sent: Wednesday 10 June 2015 09:06:32
>>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>>> 
>>>>> Hi Alexandre, 
>>>>> Thanks for sharing the data. 
>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-) 
>>>>> 
>>>>> Regards 
>>>>> Somnath 
>>>>> 
>>>>> -----Original Message----- 
>>>>> From: ceph-users [mailto: ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org ] On Behalf Of Alexandre DERUMIER 
>>>>> Sent: Tuesday, June 09, 2015 10:42 PM 
>>>>> To: Irek Fasikhov 
>>>>> Cc: ceph-devel; pushpesh sharma; ceph-users 
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 
>>>>> 
>>>>>>> Very good work! 
>>>>>>> Do you have a rpm-file? 
>>>>>>> Thanks.
>>>>> no sorry, I have compiled it manually (and I'm using debian jessie as client)
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Irek Fasikhov" < malmyzh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
>>>>> Cc: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >, "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>> Sent: Wednesday 10 June 2015 07:21:42
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>> 
>>>>> Hi, Alexandre. 
>>>>> 
>>>>> Very good work! 
>>>>> Do you have a rpm-file? 
>>>>> Thanks. 
>>>>> 
>>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org > : 
>>>>> 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%) !
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : no-iothread : glibc : iops=33395
>>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>>> 
>>>>> 
>>>>> qemu : iothread : glibc : iops=34516
>>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 
>>>>> ------------------------------------------------------ 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
>>>>> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
>>>>> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
>>>>> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
>>>>> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
>>>>> clat percentiles (usec):
>>>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
>>>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
>>>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
>>>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
>>>>> | 99.99th=[ 3760]
>>>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
>>>>> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
>>>>> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
>>>>> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 
>>>>> 
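>>>>> 
>>>>> 
>>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>>> ------------------------------------------------------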
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
>>>>> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
>>>>> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
>>>>> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
>>>>> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
>>>>> clat percentiles (usec):
>>>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
>>>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
>>>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
>>>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
>>>>> | 99.99th=[ 3632]
>>>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
>>>>> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
>>>>> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
>>>>> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
>>>>> To: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
>>>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>> Sent: Tuesday 9 June 2015 18:47:27
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>> 
>>>>> Hi Robert, 
>>>>> 
>>>>>>> What I found was that Ceph OSDs performed well with either tcmalloc or 
>>>>>>> jemalloc (except when RocksDB was built with jemalloc instead of 
>>>>>>> tcmalloc, I'm still working to dig into why that might be the case).
>>>>> yes, from my tests, for the osd, tcmalloc is a little faster (but very little) than jemalloc.
>>>>> 
>>>>> 
>>>>> 
>>>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>>>> better for QEMU/KVM in the tests that we ran. [1]
>>>>> 
>>>>> 
>>>>> I have just done a qemu test (4k randread - rbd_cache=off), and I don't see a speed regression with tcmalloc.
>>>>> with qemu iothread, tcmalloc has a speed increase over glibc
>>>>> with qemu iothread, jemalloc has a speed decrease
>>>>> 
>>>>> without iothread, jemalloc has a big speed increase
>>>>> 
>>>>> this is with 
>>>>> -qemu 2.3 
>>>>> -tcmalloc 2.2.1 
>>>>> -jemalloc 3.6
>>>>> -libc6 2.19 
>>>>> 
>>>>> 
>>>>> qemu : no-iothread : glibc : iops=33395
>>>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
>>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>>> 
>>>>> qemu : iothread : glibc : iops=34516
>>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>>> 
>>>>> 
>>>>> (The benefit of iothreads is that we can scale with more disks in 1vm) 
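The percentages in the table above are relative to the glibc run in the same iothread mode. A small Python sketch recomputes them from the iops figures; the dictionary layout and function names are illustrative, and rounding down (floor) is an assumption chosen because it reproduces the quoted figures.

```python
# iops figures copied from the table above; the structure is illustrative.
RESULTS = {
    "no-iothread": {"glibc": 33395, "tcmalloc": 34516, "jemalloc": 42226},
    "iothread": {"glibc": 34516, "tcmalloc": 38676, "jemalloc": 28023},
}


def percent_change(baseline: int, value: int) -> int:
    """Relative change versus the baseline in percent, rounded down."""
    return (value - baseline) * 100 // baseline


def deltas(results: dict) -> dict:
    """Compare every allocator against glibc within the same iothread mode."""
    out = {}
    for mode, by_allocator in results.items():
        baseline = by_allocator["glibc"]
        out[mode] = {
            allocator: percent_change(baseline, iops)
            for allocator, iops in by_allocator.items()
            if allocator != "glibc"
        }
    return out


if __name__ == "__main__":
    for mode, row in deltas(RESULTS).items():
        for allocator, pct in row.items():
            print(f"qemu : {mode} : {allocator} : {pct:+d}%")
```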
>>>>> 
>>>>> 
>>>>> fio results: 
>>>>> ------------ 
>>>>> 
>>>>> qemu : iothread : tcmalloc : iops=38676 
>>>>> ----------------------------------------- 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 
>>>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 
>>>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 
>>>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 
>>>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 
>>>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 
>>>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 
>>>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 
>>>>> | 99.99th=[ 3888] 
>>>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 
>>>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 
>>>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 
>>>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 
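As a sanity check on the fio output above: with a constant queue depth, Little's law ties throughput to latency (iops is roughly iodepth divided by average latency), and bandwidth is iops times the block size. The sketch below applies this to the figures of this run (iodepth=32, avg lat 826.10 usec, 4K blocks). The function names are illustrative, and the estimate assumes the queue stays full for the whole run.

```python
def iops_from_latency(iodepth: int, avg_lat_usec: float) -> float:
    """Little's law: requests in flight divided by the time each one takes."""
    return iodepth / (avg_lat_usec / 1_000_000)


def bandwidth_kb_per_s(iops: float, block_kb: int = 4) -> float:
    """Bandwidth in KB/s for a fixed block size."""
    return iops * block_kb


if __name__ == "__main__":
    # Values taken from the "iothread : tcmalloc" run above.
    estimated = iops_from_latency(iodepth=32, avg_lat_usec=826.10)
    print(f"estimated iops: {estimated:.0f} (fio reported 38676)")
    print(f"bandwidth at 38676 iops: {bandwidth_kb_per_s(38676):.0f} KB/s (fio reported 154707)")
```

The same check against the host-side ioengine=rbd run gives a looser match, so it is best read as a rough plausibility test rather than an exact identity.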
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : no-iothread : tcmalloc : iops=34516 
>>>>> --------------------------------------------- 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 
>>>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 
>>>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 
>>>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 
>>>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 
>>>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 
>>>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 
>>>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 
>>>>> | 99.99th=[ 4320] 
>>>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 
>>>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 
>>>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 
>>>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : iothread : glibc : iops=34516 
>>>>> ------------------------------------- 
>>>>> 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 
>>>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 
>>>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 
>>>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 
>>>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 
>>>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 
>>>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 
>>>>> | 99.99th=[ 3984] 
>>>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 
>>>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 
>>>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 
>>>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : no iothread : glibc : iops=33395 
>>>>> ----------------------------------------- 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 
>>>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 
>>>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 
>>>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 
>>>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 
>>>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 
>>>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 
>>>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 
>>>>> | 99.99th=[ 4832] 
>>>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 
>>>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 
>>>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 
>>>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : iothread : jemalloc : iops=28023
>>>>> ---------------------------------------- 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 
>>>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 
>>>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 
>>>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 
>>>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 
>>>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 
>>>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 
>>>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 
>>>>> | 99.99th=[ 3760] 
>>>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 
>>>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 
>>>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 
>>>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 
>>>>> 
>>>>> 
>>>>> 
>>>>> qemu : no-iothread : jemalloc : iops=42226
>>>>> -------------------------------------------- 
>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 
>>>>> fio-2.1.11 
>>>>> Starting 1 process 
>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 
>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 
>>>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 
>>>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 
>>>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 
>>>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 
>>>>> clat percentiles (usec): 
>>>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 
>>>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 
>>>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 
>>>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 
>>>>> | 99.99th=[ 2608] 
>>>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 
>>>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 
>>>>> lat (msec) : 2=10.30%, 4=0.07% 
>>>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 
>>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 
>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>> 
>>>>> Run status group 0 (all jobs): 
>>>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 
>>>>> 
>>>>> Disk stats (read/write): 
>>>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Robert LeBlanc" < robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org >
>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
>>>>> Cc: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>> Sent: Tuesday 9 June 2015 18:00:29
>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>> 
>>>>> -----BEGIN PGP SIGNED MESSAGE----- 
>>>>> Hash: SHA256 
>>>>> 
>>>>> I also saw a similar performance increase by using alternative memory 
>>>>> allocators. What I found was that Ceph OSDs performed well with either 
>>>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc 
>>>>> instead of tcmalloc, I'm still working to dig into why that might be 
>>>>> the case). 
>>>>> 
>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to 
>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 
>>>>> better for QEMU/KVM in the tests that we ran. [1] 
>>>>> 
>>>>> I'm currently looking into I/O bottlenecks around the 16KB range and 
>>>>> I'm seeing a lot of time in thread creation and destruction, the 
>>>>> memory allocators are quite a bit down the list (both fio with 
>>>>> ioengine rbd and on the OSDs). I wonder what the difference can be. 
>>>>> I've tried using the async messenger but there wasn't a huge 
>>>>> difference. [2] 
>>>>> 
>>>>> Further down the rabbit hole.... 
>>>>> 
>>>>> [1] https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg20197.html 
>>>>> [2] https://www.mail-archive.com/ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg23982.html 
>>>>> -----BEGIN PGP SIGNATURE----- 
>>>>> Version: Mailvelope v0.13.1 
>>>>> Comment: https://www.mailvelope.com 
>>>>> 
>>>>> wsFcBAEBCAAQBQJVdw2ZCRDmVDuy+mK58QAA4MwP/1vt65cvTyyVGGSGRrE8 
>>>>> unuWjafMHzl486XH+EaVrDVTXFVFOoncJ6kugSpD7yavtCpZNdhsIaTRZguU 
>>>>> YpfAppNAJU5biSwNv9QPI7kPP2q2+I7Z8ZkvhcVnkjIythoeNnSjV7zJrw87
>>>>> afq46GhPHqEXdjp3rOB4RRPniOMnub5oU6QRnKn3HPW8Dx9ZqTeCofRDnCY2 
>>>>> S695Dt1gzt0ERUOgrUUkt0FQJdkkV6EURcUschngjtEd5727VTLp02HivVl3 
>>>>> vDYWxQHPK8oS6Xe8GOW0JjulwiqlYotSlrqSU5FMU5gozbk9zMFPIUW1e+51 
>>>>> 9ART8Ta2ItMhPWtAhRwwvxgy51exCy9kBc+m+ptKW5XRUXOImGcOQxszPGOO 
>>>>> qIIOG1vVG/GBmo/0i6tliqBFYdXmw1qFV7tFiIbisZRH7Q/1NahjYTHqHhu3 
>>>>> Dv61T6WrerD+9N6S1Lrz1QYe2Fqa56BHhHSXM82NE86SVxEvUkoGegQU+c7b 
>>>>> 6rY1JvuJHJzva7+M2XHApYCchCs4a1Yyd1qWB7yThJD57RIyX1TOg0+siV13 
>>>>> R+v6wxhQU0vBovH+5oAWmCZaPNT+F0Uvs3xWAxxaIR9r83wMj9qQeBZTKVzQ 
>>>>> 1aFIi15KqAwOp12yWCmrqKTeXhjwYQNd8viCQCGN7AQyPglmzfbuEHalVjz4 
>>>>> oSJX 
>>>>> =k281 
>>>>> -----END PGP SIGNATURE----- 
>>>>> ---------------- 
>>>>> Robert LeBlanc 
>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>>>> IOPS from 1 VM!
>>>>>> 
>>>>>> Note that these results are not in a vm (fio-rbd on host), so in a vm we'll have overhead.
>>>>>> (I'm planning to send results in qemu soon)
>>>>>> 
>>>>>>>> How fast are the SSDs in those 3 OSDs?
>>>>>> 
>>>>>> These results are with the data in the buffer memory of the osd nodes.
>>>>>> 
>>>>>> When reading fully from ssd (intel s3500),
>>>>>> 
>>>>>> For 1 client,
>>>>>> 
>>>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
>>>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 osd.
>>>>>> 
>>>>>> with multiple client jobs, I can reach around 70k iops per osd, and 250k iops per osd when the data is in buffer.
>>>>>> 
>>>>>> (server/client cpus are 2x 10-core 3.1ghz e5 xeon)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> small tip : 
>>>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 
>>>>>> 
>>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 
>>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 
>>>>>> 
>>>>>> as a lot of time is spent in malloc/free 
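The LD_PRELOAD tip above can also be scripted, for example when a benchmark harness launches fio or rados bench. This is a minimal Python sketch: the library path is the one quoted in the tip and will differ between distributions, and the helper names `preload_env` and `run_with_allocator` are hypothetical.

```python
import os
import subprocess

# Path quoted in the tip above; it varies between distributions.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(library: str, base_env=None) -> dict:
    """Return a copy of the environment with the library added to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # The dynamic linker accepts a colon-separated list; keep earlier entries.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_with_allocator(command: list, library: str = TCMALLOC):
    """Run a benchmark command with the given allocator preloaded."""
    if not os.path.exists(library):
        raise FileNotFoundError(f"allocator library not found: {library}")
    return subprocess.run(command, env=preload_env(library), check=True)
```

A call such as `run_with_allocator(["fio", "job.fio"])` would then be equivalent to the first command line above, with `job.fio` standing in for whatever job file is used.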
>>>>>> 
>>>>>> 
>>>>>> (qemu has also supported tcmalloc for some months, I'll bench it too
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: "Mark Nelson" < mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >
>>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >, "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
>>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>>> Sent: Tuesday 9 June 2015 13:36:31
>>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>>> 
>>>>>> Hi All, 
>>>>>> 
>>>>>> In the past we've hit some performance issues with RBD cache that we've 
>>>>>> fixed, but we've never really tried pushing a single VM beyond 40+K read 
>>>>>> IOPS in testing (or at least I never have). I suspect there's a couple 
>>>>>> of possibilities as to why it might be slower, but perhaps joshd can 
>>>>>> chime in as he's more familiar with what that code looks like. 
>>>>>> 
>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K 
>>>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 
>>>>>> 
>>>>>> Mark 
>>>>>> 
>>>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 
>>>>>>> It seems that the limit mainly appears at high queue depth (roughly > 16)
>>>>>>> 
>>>>>>> Here are the results in iops with 1 client - 4k randread - 3 osd - at different queue depths.
>>>>>>> With rbd_cache the result is almost the same as without cache at queue depth <16
>>>>>>> 
>>>>>>> 
>>>>>>> cache 
>>>>>>> ----- 
>>>>>>> qd1: 1651 
>>>>>>> qd2: 3482 
>>>>>>> qd4: 7958 
>>>>>>> qd8: 17912 
>>>>>>> qd16: 36020 
>>>>>>> qd32: 42765 
>>>>>>> qd64: 46169 
>>>>>>> 
>>>>>>> no cache 
>>>>>>> -------- 
>>>>>>> qd1: 1748 
>>>>>>> qd2: 3570 
>>>>>>> qd4: 8356 
>>>>>>> qd8: 17732 
>>>>>>> qd16: 41396 
>>>>>>> qd32: 78633 
>>>>>>> qd64: 79063 
>>>>>>> qd128: 79550 
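To make the divergence between the two tables above easier to see, the following Python sketch computes the cache/no-cache iops ratio per queue depth. The numbers are copied from the tables; the names `CACHE`, `NO_CACHE` and `cache_ratio` are illustrative, and qd128 is skipped because it was only measured without cache.

```python
# iops per queue depth, copied from the two tables above.
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396, 32: 78633,
            64: 79063, 128: 79550}


def cache_ratio(cache: dict, no_cache: dict) -> dict:
    """iops with rbd_cache divided by iops without, for depths present in both."""
    return {qd: cache[qd] / no_cache[qd] for qd in sorted(cache) if qd in no_cache}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(CACHE, NO_CACHE).items():
        print(f"qd{qd}: cache/no-cache = {ratio:.2f}")
```

Ratios close to 1 mean the cache costs nothing; a drop toward 0.5 at qd32 and qd64 corresponds to the roughly 40k iops ceiling discussed in this thread.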
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
>>>>>>> To: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
>>>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>>>> Sent: Tuesday 9 June 2015 09:28:21
>>>>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>>>> We tried adding more RBDs to single VM, but no luck.
>>>>>>> 
>>>>>>> If you want to scale with more disks in a single qemu vm, you need to use the iothread feature of qemu and assign 1 iothread per disk (works with virtio-blk).
>>>>>>> It's working for me, I can scale by adding more disks.
>>>>>>> 
>>>>>>> 
>>>>>>> My benchmarks here are done with fio-rbd on the host.
>>>>>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and to around 250k iops with 10 clients and rbd_cache=on.
>>>>>>> 
>>>>>>> 
>>>>>>> I just wonder why I don't see a performance decrease around 30k iops with 1 osd.
>>>>>>> 
>>>>>>> I'm going to see if this tracker
>>>>>>> http://tracker.ceph.com/issues/11056
>>>>>>> 
>>>>>>> could be the cause.
>>>>>>> 
>>>>>>> (My master build was done some weeks ago)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "pushpesh sharma" < pushpesh.eck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org >
>>>>>>> To: "aderumier" < aderumier-U/x3PoR4x10AvxtiuMwx3w@public.gmane.org >
>>>>>>> Cc: "ceph-devel" < ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >, "ceph-users" < ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >
>>>>>>> Sent: Tuesday 9 June 2015 09:21:04
>>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>>> 
>>>>>>> Hi Alexandre, 
>>>>>>> 
>>>>>>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as root disk and one RBD as additional storage. For some strange reason we were not able to scale 4K RR iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs.
>>>>>>> 
>>>>>>> Here is the trend we have seen, x-axis is number of hypervisor, each hypervisor has 4 VM, each VM has 1 RBD:- 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors either, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>>>>>> 
>>>>>>> We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point the numbers actually degraded rather than scaling further. (Single pipe, more congestion effect)
>>>>>>> 
>>>>>>> We never suspected that enabling rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case.
>>>>>>> 
>>>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> I'm doing benchmark (ceph master branch), with randread 4k qdepth=32, 
>>>>>>> and rbd_cache=true seem to limit the iops around 40k 
>>>>>>> 
>>>>>>> 
>>>>>>> no cache 
>>>>>>> -------- 
>>>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops 
>>>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops 
>>>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops 
>>>>>>> 
>>>>>>> 
>>>>>>> cache 
>>>>>>> ----- 
>>>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops 
>>>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops 
>>>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Is it expected ? 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> fio result rbd_cache=false 3 osd 
>>>>>>> -------------------------------- 
>>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>>> fio-2.1.11 
>>>>>>> Starting 1 process 
>>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 
>>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 
>>>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 
>>>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 
>>>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 
>>>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 
>>>>>>> clat percentiles (usec): 
>>>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 
>>>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 
>>>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 
>>>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 
>>>>>>> | 99.99th=[ 1176] 
>>>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 
>>>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 
>>>>>>> lat (msec) : 2=0.03%, 4=0.01% 
>>>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 
>>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 
>>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 
>>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>>> 
>>>>>>> Run status group 0 (all jobs): 
>>>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 
>>>>>>> 
>>>>>>> Disk stats (read/write): 
>>>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 
>>>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> fio result rbd_cache=true 3osd 
>>>>>>> ------------------------------ 
>>>>>>> 
>>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 
>>>>>>> fio-2.1.11 
>>>>>>> Starting 1 process 
>>>>>>> rbd engine: RBD version: 0.1.9 
>>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 
>>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 
>>>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 
>>>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84 
>>>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 
>>>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 
>>>>>>> clat percentiles (usec): 
>>>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 
>>>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 
>>>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 
>>>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 
>>>>>>> | 99.99th=[ 2192] 
>>>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 
>>>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 
>>>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 
>>>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 
>>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 
>>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
>>>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 
>>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 
>>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32 
>>>>>>> 
>>>>>>> Run status group 0 (all jobs): 
>>>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 
>>>>>>> 
>>>>>>> Disk stats (read/write): 
>>>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 
>>>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>>>> _______________________________________________ 
>>>>>> ceph-users mailing list 
>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[-- Attachment #1.2: Type: text/html, Size: 199304 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <B7D8B5F0-4AB9-449A-895D-CF87AE49BCF6-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                                                                       ` <B7D8B5F0-4AB9-449A-895D-CF87AE49BCF6-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
@ 2015-06-22  9:57                                                                         ` Alexandre DERUMIER
  0 siblings, 0 replies; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-22  9:57 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, pushpesh sharma, ceph-users

>>Oh so it only works for virtio disks? I'm using scsi with the virtio PCI controller.

It also works with virtio-scsi, but it is not thread-safe yet.
Also, hot-unplugging a virtio-scsi disk crashes qemu when an iothread is used.
Paolo from qemu said that it should be ready in coming releases (qemu 2.6 - 2.7).

I have also added virtio-scsi support in proxmox, but it is not exposed in the GUI yet.
(1 virtio-scsi controller per scsi disk, with 1 iothread per controller)
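
To make that layout concrete, here is a minimal Python sketch that only builds a qemu argument list with one iothread object and one virtio-scsi controller per disk. It is not the proxmox implementation: the helper name, the ids and the rbd: paths are made up for illustration, and the option spellings (-object iothread, virtio-scsi-pci with iothread=, scsi-hd) should be checked against the qemu version in use.

```python
def build_scsi_iothread_args(drive_files):
    """Return qemu arguments: 1 iothread + 1 virtio-scsi controller per disk.

    Hypothetical helper for illustration only; all ids are arbitrary.
    """
    args = []
    for i, path in enumerate(drive_files):
        # One dedicated iothread object per disk.
        args += ["-object", f"iothread,id=iothread{i}"]
        # One virtio-scsi controller bound to that iothread.
        args += ["-device", f"virtio-scsi-pci,id=scsi{i},iothread=iothread{i}"]
        # The backing drive, attached to no bus yet (if=none).
        args += ["-drive", f"file={path},if=none,id=drive{i},format=raw,cache=none"]
        # The scsi disk, plugged into its own controller.
        args += ["-device", f"scsi-hd,bus=scsi{i}.0,drive=drive{i}"]
    return args


if __name__ == "__main__":
    # Placeholder image names, not taken from the thread.
    for arg in build_scsi_iothread_args(["rbd:pool/vm-disk-0", "rbd:pool/vm-disk-1"]):
        print(arg)
```

With two disks this yields two iothread objects and two controllers, so no disk shares an iothread with another one.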

Here are the patches:
https://git.proxmox.com/?p=qemu-server.git;a=commit;h=6731a4cfa93a62c66ff42b6214bd34745feda088
https://git.proxmox.com/?p=qemu-server.git;a=commit;h=2733141ce318fd6670620b4a92f70ae0dc653f5f
https://git.proxmox.com/?p=qemu-server.git;a=commit;h=fc8b40fd5fba79110b34720c1e48e1785740fe28
https://git.proxmox.com/?p=qemu-server.git;a=commit;h=8bcf3068eb2d1da79231e6684800b958e7e3dcd7


----- Original message -----
From: "Stefan Priebe" <s.priebe@profihost.ag>
To: "aderumier" <aderumier@odiso.com>
Cc: "Irek Fasikhov" <malmyzh@gmail.com>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "Somnath Roy" <Somnath.Roy@sandisk.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Monday 22 June 2015 11:28:03
Subject: Re: rbd_cache, limiting read on high iops around 40k

Oh so it only works for virtio disks? I'm using scsi with the virtio PCI controller.

Stefan
Excuse my typos, sent from my mobile phone.

On 22.06.2015 at 11:26, Alexandre DERUMIER < aderumier@odiso.com > wrote:





BQ_BEGIN

>> In proxmox 3.4 will it be possible to add at least in the configuration file? Or it entails a change in the source code KVM?
>>
>> Thanks.

This small patch on top of qemu-server should be enough (I think it should apply on 3.4 sources without problem) 

https://git.proxmox.com/?p=qemu-server.git;a=commit;h=51f492cd6da0228129aaab1393b5c5844d75a53c 

No need to hack qemu-kvm 



----- Original message -----
From: "Irek Fasikhov" < malmyzh@gmail.com >
To: "aderumier" < aderumier@odiso.com >
Cc: "Stefan Priebe" < s.priebe@profihost.ag >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "Somnath Roy" < Somnath.Roy@sandisk.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Monday 22 June 2015 11:04:42
Subject: Re: rbd_cache, limiting read on high iops around 40k

| Proxmox 4.0 will allow to enable|disable 1 iothread by disk.
Alexandre, useful option!
In proxmox 3.4, will it be possible to add it at least in the configuration file? Or does it entail a change in the KVM source code?
Thanks.

2015-06-22 11:54 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > : 



>> It is already possible to do in proxmox 3.4 (with the latest updates qemu-kvm 2.2.x). But it is necessary to register in the conf file iothread:1. For single drives the ambiguous behavior of productivity.

Yes and no ;) 

Currently in proxmox 3.4, iothread:1 generates only 1 iothread, shared by all disks.

So you'll get a small extra boost, but it will not scale with multiple disks.

Proxmox 4.0 will allow enabling or disabling 1 iothread per disk.



>> Does it also help for single disks or only multiple disks?

An iothread can also help with a single disk, because by default qemu uses its main thread for disk I/O as well as for other things (I don't remember exactly which).




----- Original message -----
From: "Irek Fasikhov" < malmyzh@gmail.com >
To: "Stefan Priebe" < s.priebe@profihost.ag >
Cc: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "Somnath Roy" < Somnath.Roy@sandisk.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
Sent: Monday 22 June 2015 09:22:13
Subject: Re: rbd_cache, limiting read on high iops around 40k

It is already possible in proxmox 3.4 (with the latest qemu-kvm 2.2.x updates), but you need to set iothread:1 in the conf file. For single drives the effect on performance is ambiguous.

2015-06-22 10:12 GMT+03:00 Stefan Priebe - Profihost AG < s.priebe@profihost.ag > : 



On 22.06.2015 at 09:08, Alexandre DERUMIER < aderumier@odiso.com > wrote:


>>> Just an update, there seems to be no proper way to pass iothread
>>> parameter from openstack-nova (not at least in Juno release). So a
>>> default single iothread per VM is what all we have. So in conclusion a
>>> nova instance max iops on ceph rbd will be limited to 30-40K.
>
> Thanks for the update.
>
> For proxmox users,
>
> I have added iothread option to gui for proxmox 4.0

Can we make iothread the default? Does it also help for single disks or only multiple disks? 


> and added jemalloc as default memory allocator
>
> I have also sent a jemalloc patch to the qemu dev mailing list
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
>
> (Help is welcome to push it in qemu upstream ! )
>
> ----- Original message -----
> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
> To: "aderumier" < aderumier@odiso.com >
> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Monday 22 June 2015 07:58:47
> Subject: Re: rbd_cache, limiting read on high iops around 40k
>
> Just an update, there seems to be no proper way to pass iothread
> parameter from openstack-nova (not at least in Juno release). So a
> default single iothread per VM is what all we have. So in conclusion a
> nova instance max iops on ceph rbd will be limited to 30-40K.
>
> On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER
> < aderumier@odiso.com > wrote:

>> Hi,
>>
>> some news about qemu with tcmalloc vs jemalloc.
>>
>> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
>>
>> And even if tcmalloc is a little faster than jemalloc,
>>
>> I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.
>>
>> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
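
For readers who want to repeat the experiment: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is an environment variable that tcmalloc (gperftools) reads when the process starts. Below is a minimal Python sketch of preparing such an environment; the helper name and the 256 MiB value are arbitrary examples, not the settings used in the test above, and the qemu command line is deliberately left out.

```python
import os


def tcmalloc_env(cache_bytes, base_env=None):
    """Return a copy of the environment with a larger tcmalloc thread cache.

    Hypothetical helper; it only prepares the environment, it starts nothing.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(cache_bytes)
    return env


# 256 MiB is an arbitrary example value.
env = tcmalloc_env(256 * 1024 * 1024)
print(env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"])
# The process under test would then be launched with env=env,
# for example through subprocess.run(command, env=env).
```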

>>
>> with multiple disks, I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc.
>>
>> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to restart qemu ...
>>
>> ----- Original message -----
>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>> To: "aderumier" < aderumier@odiso.com >
>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>> Sent: Friday 12 June 2015 08:58:21
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Thanks, posted the question in openstack list. Hopefully will get some
>> expert opinion.
>>
>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
>> < aderumier@odiso.com > wrote:

>>> Hi,
>>>
>>> here is a libvirt xml sample from the libvirt source
>>>
>>> (you need to define the <iothreads> number, then assign them to the disks).
>>>
>>> I don't use openstack, so I really don't know how it works there.
>>>
>>> <domain type='qemu'>
>>>   <name>QEMUGuest1</name>
>>>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>>   <memory unit='KiB'>219136</memory>
>>>   <currentMemory unit='KiB'>219136</currentMemory>
>>>   <vcpu placement='static'>2</vcpu>
>>>   <iothreads>2</iothreads>
>>>   <os>
>>>     <type arch='i686' machine='pc'>hvm</type>
>>>     <boot dev='hd'/>
>>>   </os>
>>>   <clock offset='utc'/>
>>>   <on_poweroff>destroy</on_poweroff>
>>>   <on_reboot>restart</on_reboot>
>>>   <on_crash>destroy</on_crash>
>>>   <devices>
>>>     <emulator>/usr/bin/qemu</emulator>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='1'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>>       <target dev='vdb' bus='virtio'/>
>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>>     </disk>
>>>     <disk type='file' device='disk'>
>>>       <driver name='qemu' type='raw' iothread='2'/>
>>>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>>       <target dev='vdc' bus='virtio'/>
>>>     </disk>
>>>     <controller type='usb' index='0'/>
>>>     <controller type='ide' index='0'/>
>>>     <controller type='pci' index='0' model='pci-root'/>
>>>     <memballoon model='none'/>
>>>   </devices>
>>> </domain>
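
The two steps in that sample (declare the <iothreads> count, then give each disk driver an iothread attribute) can be sketched with Python's standard xml.etree module. This is only an illustration on a trimmed-down domain definition: the function name and the shortened XML are assumptions, <iothreads> is simply appended as the last child of <domain>, and the result is not checked against the libvirt schema.

```python
import xml.etree.ElementTree as ET

# Trimmed-down domain definition, modelled on the sample above.
DOMAIN_XML = """
<domain type='qemu'>
  <name>QEMUGuest1</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdc' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def assign_iothreads(domain_xml):
    """Give every virtio disk its own iothread and return the new XML."""
    root = ET.fromstring(domain_xml)
    drivers = []
    for disk in root.iter("disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is not None and driver is not None and target.get("bus") == "virtio":
            drivers.append(driver)
    # Step 1: declare how many iothreads the domain owns.
    iothreads = root.find("iothreads")
    if iothreads is None:
        iothreads = ET.SubElement(root, "iothreads")
    iothreads.text = str(len(drivers))
    # Step 2: pin each virtio disk to one iothread (numbering starts at 1).
    for number, driver in enumerate(drivers, start=1):
        driver.set("iothread", str(number))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(assign_iothreads(DOMAIN_XML))
```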

>>>
>>> ----- Original message -----
>>> From: "pushpesh sharma" < pushpesh.eck@gmail.com >
>>> To: "aderumier" < aderumier@odiso.com >
>>> Cc: "Somnath Roy" < Somnath.Roy@sandisk.com >, "Irek Fasikhov" < malmyzh@gmail.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com >
>>> Sent: Friday 12 June 2015 07:52:41
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>>
>>> I agree with your rationale of one iothread per disk. CPU consumed in
>>> IOwait is pretty high in each VM. But I cannot find a way to set
>>> the same on a nova instance. I am using openstack Juno with QEMU+KVM.
>>> As per the libvirt documentation for setting iothreads, I can edit
>>> domain.xml directly and achieve the same effect. However, in an
>>> openstack env the domain xml is created by nova with some additional
>>> metadata, so editing the domain xml using 'virsh edit' does not seem
>>> to work (I agree, it is not a very cloud way of doing things, but a
>>> hack). Changes made there vanish after saving them, because
>>> libvirt validation fails on them.
>>>
>>> #virsh dumpxml instance-000000c5 > vm.xml
>>> #virt-xml-validate vm.xml
>>> Relax-NG validity error : Extra element cpu in interleave
>>> vm.xml:1: element domain: Relax-NG validity error : Element domain
>>> failed to validate content
>>> vm.xml fails to validate
>>>
>>> The second approach I took was setting QoS in volume types. But there
>>> is no option to set iothreads per volume; there are only parameters related
>>> to max_read/write ops/bytes.
>>>
>>> Thirdly, editing the Nova flavor and providing extra specs like
>>> hw:cpu_socket/thread/core can change the guest CPU topology, however again
>>> there is no way to set iothread. It does accept hw_disk_iothreads (no type check
>>> in place, I believe), but cannot pass the same into domain.xml.
>>>
>>> Could you suggest a way to set the same?
>>>
>>> -Pushpesh
>>>
>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
>>> < aderumier@odiso.com > wrote:

>>>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>>
>>>> Sure no problem.
>>>>
>>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks with 1 iothread per disk)
>>>>
>>>> ----- Original message -----
>>>> From: "Somnath Roy" < Somnath.Roy@sandisk.com >
>>>> To: "aderumier" < aderumier@odiso.com >, "Irek Fasikhov" < malmyzh@gmail.com >
>>>> Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Wednesday 10 June 2015 09:06:32
>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi Alexandre,
>>>> Thanks for sharing the data.
>>>> I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
>>>>
>>>> Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto: ceph-users-bounces@lists.ceph.com ] On Behalf Of Alexandre DERUMIER
>>>> Sent: Tuesday, June 09, 2015 10:42 PM
>>>> To: Irek Fasikhov
>>>> Cc: ceph-devel; pushpesh sharma; ceph-users
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>

>>>>>> Very good work!
>>>>>> Do you have a rpm-file?
>>>>>> Thanks.
>>>>
>>>> no sorry, I have compiled it manually (and I'm using debian jessie as client)
>>>>
>>>> ----- Original message -----
>>>> From: "Irek Fasikhov" < malmyzh@gmail.com >
>>>> To: "aderumier" < aderumier@odiso.com >
>>>> Cc: "Robert LeBlanc" < robert@leblancnet.us >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com >
>>>> Sent: Wednesday 10 June 2015 07:21:42
>>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi, Alexandre.
>>>>
>>>> Very good work!
>>>> Do you have a rpm-file?
>>>> Thanks.
>>>>
>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@odiso.com > :
>>>>
>>>> Hi,
>>>>
>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with iothread: 50k iops (+45%) !
>>>>
>>>> qemu : no iothread : glibc : iops=33395
>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
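
The percentages in these results can be re-derived from the iops figures, taking the glibc run of each group as the baseline. The short Python sketch below does that; rounding down with math.floor is an assumption that happens to reproduce every posted figure, not something stated in the mail, and the dictionary simply restates the numbers quoted above.

```python
import math

# iops figures restated from the results above, grouped by iothread setting.
RESULTS = {
    "no-iothread": {
        "glibc": 33395,
        "tcmalloc (2.2.1)": 34516,
        "jemalloc": 42226,
        "tcmalloc (2.4)": 35974,
    },
    "iothread": {
        "glibc": 34516,
        "tcmalloc": 38676,
        "jemalloc": 28023,
        "tcmalloc (2.4)": 50276,
    },
}


def percent_vs_baseline(iops, baseline):
    """Relative change against the baseline, in percent, rounded down."""
    return math.floor((iops / baseline - 1) * 100)


if __name__ == "__main__":
    for mode, runs in RESULTS.items():
        baseline = runs["glibc"]
        for allocator, iops in runs.items():
            if allocator == "glibc":
                continue
            change = percent_vs_baseline(iops, baseline)
            print(f"qemu : {mode} : {allocator} : iops={iops} ({change:+d}%)")
```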

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
------------------------------------------------------ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
clat percentiles (usec):

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 3760] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, mint=26070msec, maxt=26070msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73% 

BQ_END

BQ_END

BQ_END

BQ_END
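As a sanity check on the fio report above, the reported iops can be cross-checked in two ways: bandwidth divided by the 4K block size, and queue depth divided by average latency (Little's law, assuming fio keeps all 32 requests in flight, which the "IO depths : 32=100.0%" line suggests). The Python sketch below uses only numbers copied from the report; the function name `parse_fio_summary` and the regular expressions are illustrative assumptions, not part of fio.

```python
import re

# Two lines copied from the fio report above (tcmalloc 2.4, iothread)
FIO_SUMMARY = (
    "read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec\n"
    "lat (usec): min=149, max=6265, avg=635.27, stdev=197.40\n"
)

BLOCK_SIZE_KB = 4   # bs=4K in the job description
IODEPTH = 32        # iodepth=32 in the job description


def parse_fio_summary(text):
    """Extract bandwidth (KB/s), iops and average total latency (usec)."""
    bw_kb = int(re.search(r"bw=(\d+)KB/s", text).group(1))
    iops = int(re.search(r"iops=(\d+)", text).group(1))
    # \b keeps "slat"/"clat" from matching; only the total "lat" line is used
    lat_us = float(re.search(r"\blat \(usec\):.*?avg=\s*([\d.]+)", text).group(1))
    return bw_kb, iops, lat_us


bw_kb, iops, lat_us = parse_fio_summary(FIO_SUMMARY)

# Check 1: throughput divided by block size should give the iops
iops_from_bw = bw_kb / BLOCK_SIZE_KB

# Check 2: Little's law, requests in flight divided by average latency in seconds
iops_from_latency = IODEPTH / (lat_us * 1e-6)

print(f"reported iops:     {iops}")
print(f"from bandwidth:    {iops_from_bw:.0f}")
print(f"from Little's law: {iops_from_latency:.0f}")
```

Both estimates land within about 1% of the reported figure, which suggests that at this queue depth the throughput is bounded by per-request latency (about 0.63 ms on average) rather than by bandwidth.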

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
clat percentiles (usec):

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 3632] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, mint=36435msec, maxt=36435msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- Original Message -----

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
From: "aderumier" < aderumier@odiso.com >

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
To: "Robert LeBlanc" < robert@leblancnet.us >

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Sent: Tuesday, 9 June 2015 18:47:27

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Hi Robert, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
What I found was that Ceph OSDs performed well with either tcmalloc or 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
jemalloc (except when RocksDB was built with jemalloc instead of 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
tcmalloc, I'm still working to dig into why that might be the case). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Yes, in my tests tcmalloc is slightly faster than jemalloc for the OSD (but only very slightly).

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
However, I found that tcmalloc with QEMU/KVM was very detrimental to 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
better for QEMU/KVM in the tests that we ran. [1] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I have just run a qemu test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
With qemu iothread, tcmalloc gives a speed increase over glibc.

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
With qemu iothread, jemalloc shows a speed decrease.

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Without iothread, jemalloc shows a big speed increase.

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
This is with:

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-qemu 2.3 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-tcmalloc 2.2.1 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-jemalloc 3.6

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-libc6 2.19 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : no iothread : glibc : iops=33395 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : no-iothread : tcmalloc : iops=34516 (+3%) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : no-iothread : jemalloc : iops=42226 (+26%)

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : glibc : iops=34516 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : tcmalloc : iops=38676 (+12%) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : jemalloc : iops=28023 (-19%)

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
(The benefit of iothreads is that we can scale with more disks in one VM.)

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio results: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
------------ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : tcmalloc : iops=38676 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=152, max=5736, avg=822.12, stdev=289.34 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=231, max=5740, avg=826.10, stdev=289.08 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 3888] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=21.65%, 4=0.37%, 10=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : no-iothread : tcmalloc : iops=34516 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
--------------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=208, max=11858, avg=921.43, stdev=333.61 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=266, max=11862, avg=925.77, stdev=333.40 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 4320] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : glibc : iops=34516 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=283, max=7515, avg=923.34, stdev=300.28 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=286, max=7519, avg=927.58, stdev=300.02 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 3984] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=34.65%, 4=0.71%, 10=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : no iothread : glibc : iops=33395 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=139, max=12635, avg=952.85, stdev=335.51 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=303, max=12638, avg=957.01, stdev=335.29 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 4832] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : iothread : jemalloc : iops=28023

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
---------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 3760] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qemu : non-iothread : jemalloc : iops=42226 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-------------------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=174, max=3841, avg=717.08, stdev=237.53 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=210, max=3844, avg=721.23, stdev=237.22 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 2608] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=10.30%, 4=0.07% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80% 

BQ_END

BQ_END

BQ_END

BQ_END
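As a quick cross-check of the fio summary above, here is a minimal sketch (not part of the original report) that derives IOPS in two ways: from the reported bandwidth and block size, and from the queue depth and average latency (Little's law). The function names are illustrative assumptions; the inputs are taken from the libaio run above (bw=177130KB/s, bs=4K, iodepth=32, avg lat=721.23 usec).

```python
def iops_from_bandwidth(bw_kb_per_s: float, block_kb: float = 4.0) -> float:
    """IOPS implied by a bandwidth figure at a fixed block size."""
    return bw_kb_per_s / block_kb


def iops_from_latency(iodepth: int, avg_lat_usec: float) -> float:
    """Little's law: throughput = requests in flight / average latency."""
    return iodepth / (avg_lat_usec * 1e-6)


if __name__ == "__main__":
    # Values copied from the fio output above.
    print("from bandwidth:", round(iops_from_bandwidth(177130)))
    print("from latency:  ", round(iops_from_latency(32, 721.23)))
```

Both estimates should land close to the iops=44282 that fio reports, which suggests the run kept the full queue depth of 32 in flight and that per-request latency, rather than submission rate, sets the ceiling.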

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- Original message ----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
From: "Robert LeBlanc" < robert@leblancnet.us > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
To: "aderumier" < aderumier@odiso.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Cc: "Mark Nelson" < mnelson@redhat.com >, "ceph-devel" < ceph-devel@vger.kernel.org >, "pushpesh sharma" < pushpesh.eck@gmail.com >, "ceph-users" < ceph-users@lists.ceph.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Sent: Tuesday, June 9, 2015 18:00:29 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I also saw a similar performance increase by using alternative memory 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
allocators. What I found was that Ceph OSDs performed well with either 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
tcmalloc or jemalloc (except when RocksDB was built with jemalloc 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
instead of tcmalloc, I'm still working to dig into why that might be 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
the case). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
However, I found that tcmalloc with QEMU/KVM was very detrimental to 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
better for QEMU/KVM in the tests that we ran. [1] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'm currently looking into I/O bottlenecks around the 16KB range and 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'm seeing a lot of time in thread creation and destruction, the 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
memory allocators are quite a bit down the list (both fio with 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ioengine rbd and on the OSDs). I wonder what the difference can be. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I've tried using the async messenger but there wasn't a huge 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
difference. [2] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Further down the rabbit hole.... 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
---------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Robert LeBlanc 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Frankly, I'm a little impressed that without RBD cache we can hit 80K 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IOPS from 1 VM! 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have additional overhead. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
(I'm planning to send results from QEMU soon.) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
How fast are the SSDs in those 3 OSDs? 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
These results are with the data in the buffer memory of the OSD nodes. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
When reading fully from SSD (Intel S3500), 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
For 1 client, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I get around 33k IOPS without cache and 32k IOPS with cache, with 1 OSD. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I get around 55k IOPS without cache and 38k IOPS with cache, with 3 OSDs. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
With multiple client jobs, I can reach around 70k IOPS per OSD, and 250k IOPS per OSD when the data is in the buffer. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
(Server and client CPUs are 2x 10-core 3.1 GHz Xeon E5.) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Small tip: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ... 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ... 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
as a lot of time is spent in malloc/free 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END
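If the benchmark is driven from a script, the LD_PRELOAD tip above could be wrapped roughly as follows. This is only a sketch: the helper names are hypothetical, the library path is the one quoted in the tip and differs between distributions, and the command to run is left to the caller.

```python
import os
import subprocess

# Path as quoted in the tip above; adjust it for your distribution.
TCMALLOC = "/usr/lib/libtcmalloc_minimal.so.4"


def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts a colon-separated (or space-separated) list.
    env["LD_PRELOAD"] = lib_path + (":" + existing if existing else "")
    return env


def run_with_allocator(cmd, lib_path=TCMALLOC):
    """Run cmd (a list of arguments) with the allocator preloaded."""
    return subprocess.run(cmd, env=preload_env(lib_path), check=False)


if __name__ == "__main__":
    print(preload_env(TCMALLOC, {})["LD_PRELOAD"])
```

Running the same job once through run_with_allocator and once without it gives a like-for-like comparison of the allocator's effect on latency.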

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
(QEMU has also supported tcmalloc for some months; I'll benchmark it too: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html ) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'll try to send full benchmark results soon, from 1 to 18 SSD OSDs. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- Original message ----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
From: "Mark Nelson" < mnelson@redhat.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
To: "aderumier" < aderumier@odiso.com >, "pushpesh sharma" < pushpesh.eck@gmail.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Sent: Tuesday, June 9, 2015 13:36:31 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Hi All, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
In the past we've hit some performance issues with RBD cache that we've 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fixed, but we've never really tried pushing a single VM beyond 40+K read 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IOPS in testing (or at least I never have). I suspect there's a couple 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
of possibilities as to why it might be slower, but perhaps joshd can 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
chime in as he's more familiar with what that code looks like. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Frankly, I'm a little impressed that without RBD cache we can hit 80K 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IOPS from 1 VM! How fast are the SSDs in those 3 OSDs? 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Mark 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
It seems that the limit mainly shows up at high queue depths (roughly > 16). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Here are the results in IOPS with 1 client, 4k randread, 3 OSDs, at different queue depths. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
With rbd_cache, performance is almost the same as without cache at queue depths < 16. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cache 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd1: 1651 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd2: 3482 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd4: 7958 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd8: 17912 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd16: 36020 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd32: 42765 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd64: 46169 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
no cache 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd1: 1748 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd2: 3570 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd4: 8356 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd8: 17732 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd16: 41396 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd32: 78633 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd64: 79063 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
qd128: 79550 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END
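To make the gap between the two tables above easier to see, the following sketch computes the cache/no-cache IOPS ratio at each queue depth. The numbers are copied from the tables; the variable and function names are only illustrative.

```python
# IOPS per queue depth, copied from the tables above
# (1 client, 4k randread, 3 OSDs).
CACHE = {1: 1651, 2: 3482, 4: 7958, 8: 17912, 16: 36020, 32: 42765, 64: 46169}
NO_CACHE = {1: 1748, 2: 3570, 4: 8356, 8: 17732, 16: 41396,
            32: 78633, 64: 79063, 128: 79550}


def cache_ratio(cache, no_cache):
    """Return {queue depth: cached IOPS / uncached IOPS} for shared depths."""
    return {qd: cache[qd] / no_cache[qd] for qd in sorted(cache) if qd in no_cache}


if __name__ == "__main__":
    for qd, ratio in cache_ratio(CACHE, NO_CACHE).items():
        print(f"qd{qd}: {ratio:.2f}")
```

The ratio stays near 1 up to qd8, dips at qd16, and falls to a little over half at qd32 and qd64, which matches the observation that the limit only appears at high queue depths.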

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- Original message ----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
From: "aderumier" < aderumier@odiso.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
To: "pushpesh sharma" < pushpesh.eck@gmail.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Sent: Tuesday, June 9, 2015 09:28:21 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Hi, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
We tried adding more RBDs to single VM, but no luck. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
If you want to scale with more disks in a single QEMU VM, you need to use QEMU's iothread feature and assign one iothread per disk (works with virtio-blk). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
It works for me; I can scale by adding more disks. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END
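For illustration, one iothread per virtio-blk disk could be expressed on the QEMU command line roughly as in the sketch below. The helper name and the pool/image names are hypothetical, and only the options relevant to iothreads are shown; check the exact syntax against your QEMU version.

```python
def iothread_disk_args(rbd_images):
    """Build QEMU arguments giving each RBD-backed virtio-blk disk its own iothread.

    rbd_images is a list of "pool/image" strings (hypothetical names).
    """
    args = []
    for i, image in enumerate(rbd_images):
        args += [
            # One dedicated I/O thread object per disk.
            "-object", f"iothread,id=iothread{i}",
            # Backend drive, not attached to any bus yet (if=none).
            "-drive", f"file=rbd:{image},format=raw,if=none,id=drive{i}",
            # virtio-blk front end bound to its own iothread.
            "-device", f"virtio-blk-pci,drive=drive{i},iothread=iothread{i}",
        ]
    return args


if __name__ == "__main__":
    print(" ".join(iothread_disk_args(["rbd/vm-disk-0", "rbd/vm-disk-1"])))
```

Without the iothread= property, the disks would share QEMU's main loop, which is consistent with adding more RBDs to a single VM not scaling.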

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
My benchmarks here are done with fio-rbd on the host. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I can scale up to 400k IOPS with 10 clients and rbd_cache=off on a single host, and to around 250k IOPS with 10 clients and rbd_cache=on. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I just wonder why I don't see a performance decrease around 30k IOPS with 1 OSD. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'm going to see if this tracker 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
http://tracker.ceph.com/issues/11056 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
could be the cause. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
(My master build was done some weeks ago.) 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- Original message ----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
From: "pushpesh sharma" < pushpesh.eck@gmail.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
To: "aderumier" < aderumier@odiso.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Cc: "ceph-devel" < ceph-devel@vger.kernel.org >, "ceph-users" < ceph-users@lists.ceph.com > 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Sent: Tuesday, June 9, 2015 09:21:04 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Subject: Re: rbd_cache, limiting read on high iops around 40k 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Hi Alexandre, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
We have also seen something very similar on Hammer (0.94-1). We were benchmarking VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each Ubuntu VM has one RBD as its root disk and one RBD as additional storage. For some strange reason, 4K random-read (RR) IOPS on each VM would not scale beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. Beyond that, we got little benefit from adding more VMs. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, and each VM has 1 RBD: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
VDbench was used as the benchmarking tool. We were not saturating the network or the CPUs on the OSD nodes. We were not able to saturate the CPUs on the hypervisors either, which is where we suspected some throttling effect. However, we haven't set any such limits on the nova or KVM side. We tried some CPU pinning and other KVM-related tuning as well, but no luck. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
We tried the same experiment on bare metal. There, 4K RR IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). Beyond that point, rather than scaling further, the numbers actually degraded (single pipe, more congestion effect). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
We never suspected that enabling the rbd cache could be detrimental to performance. It would be nice to root-cause the problem if that is the case. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Hi, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
I'm running a benchmark (ceph master branch) with randread 4k, qdepth=32, 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
and rbd_cache=true seems to limit the IOPS to around 40k. 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
no cache 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=false - 1osd : 38300 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=false - 2osd : 69073 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=false - 3osd : 78292 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cache 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
----- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=true - 1osd : 38100 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=true - 2osd : 42457 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
1 client - rbd_cache=true - 3osd : 45823 iops 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Is it expected ? 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio result rbd_cache=false 3 osd 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-------------------------------- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd engine: RBD version: 0.1.9 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=5, max=530, avg=11.77, stdev= 6.77 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=70, max=2240, avg=336.08, stdev=94.82 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=101, max=2247, avg=347.84, stdev=95.49 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 1176] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=0.03%, 4=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio result rbd_cache=true 3osd 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
------------------------------ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
fio-2.1.11 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Starting 1 process 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd engine: RBD version: 0.1.9 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
slat (usec): min=7, max=805, avg=21.26, stdev=15.84 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat (usec): min=101, max=4602, avg=478.55, stdev=143.73 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec): min=123, max=4669, avg=499.80, stdev=146.03 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
clat percentiles (usec): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288], 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
| 99.99th=[ 2192] 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
lat (msec) : 2=0.19%, 4=0.01%, 10=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
latency : target=0, window=0, percentile=100.00%, depth=32 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Run status group 0 (all jobs): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Disk stats (read/write): 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01% 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
_______________________________________________ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users mailing list 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users@lists.ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
_______________________________________________ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users mailing list 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users@lists.ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-- 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
С уважением, Фасихов Ирек Нургаязович 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
Моб.: +79229045757 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
_______________________________________________ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users mailing list 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
ceph-users@lists.ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
________________________________ 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). 

BQ_END

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-- 

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN

BQ_BEGIN
-Pushpesh 

BQ_END

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN


BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN
-- 

BQ_END

BQ_END

BQ_BEGIN

BQ_BEGIN
-Pushpesh 

BQ_END

BQ_END

BQ_BEGIN


BQ_END

BQ_BEGIN


BQ_END

BQ_BEGIN


BQ_END

BQ_BEGIN
-- 

BQ_END

BQ_BEGIN
-Pushpesh 

BQ_END

BQ_BEGIN


BQ_END

BQ_BEGIN


BQ_END

BQ_BEGIN
-- 

BQ_END

BQ_BEGIN
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 

BQ_END






-- 
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757








-- 
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 

BQ_END

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
  2015-06-09 11:36             ` Mark Nelson
  2015-06-09 12:02               ` [ceph-users] " Alexandre DERUMIER
@ 2015-06-09 13:39               ` Jason Dillaman
       [not found]                 ` <1569135212.13362835.1433857190455.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Jason Dillaman @ 2015-06-09 13:39 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Alexandre DERUMIER, pushpesh sharma, ceph-devel, ceph-users

> In the past we've hit some performance issues with RBD cache that we've
> fixed, but we've never really tried pushing a single VM beyond 40+K read
> IOPS in testing (or at least I never have).  I suspect there's a couple
> of possibilities as to why it might be slower, but perhaps joshd can
> chime in as he's more familiar with what that code looks like.
> 

At high queue-depths and high IOPS, I would suspect that the bottleneck is the single, coarse-grained mutex protecting the cache data structures.  It's been a back burner item to refactor the current cache mutex into finer-grained locks.

Jason
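
To make the locking point above concrete, here is a minimal sketch contrasting a single coarse-grained lock with several finer-grained ones. It is not the librbd cache code: `CoarseCache`, `ShardedCache`, `fill` and `run_workers` are hypothetical names invented for illustration, and the shard count of 8 is arbitrary. Because of Python's global interpreter lock, the sketch only shows the structural difference and says nothing about real throughput.

```python
import threading


class CoarseCache:
    """One lock guards the whole map, so every reader and writer serializes on it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value


class ShardedCache:
    """Keys are spread over several shards, each protected by its own lock."""

    def __init__(self, num_shards=8):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, key):
        # Requests for keys in different shards never contend on the same lock.
        return self._shards[hash(key) % len(self._shards)]

    def get(self, key):
        data, lock = self._shard(key)
        with lock:
            return data.get(key)

    def put(self, key, value):
        data, lock = self._shard(key)
        with lock:
            data[key] = value


def fill(cache, start, count):
    """Insert `count` consecutive block numbers, mapping each to a byte offset."""
    for block in range(start, start + count):
        cache.put(block, block * 4096)


def run_workers(cache, workers=4, per_worker=1000):
    """Fill the cache from several threads at once and return it."""
    threads = [
        threading.Thread(target=fill, args=(cache, i * per_worker, per_worker))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache
```

Both classes expose the same `get`/`put` interface, so `run_workers(CoarseCache())` and `run_workers(ShardedCache())` end up with identical contents. The difference is only in how many threads can be inside the cache at the same time, which is the property that would matter at high queue depth.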

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <1569135212.13362835.1433857190455.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* Re: rbd_cache, limiting read on high iops around 40k
       [not found]                 ` <1569135212.13362835.1433857190455.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-09 16:52                   ` Alexandre DERUMIER
  0 siblings, 0 replies; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-09 16:52 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: ceph-devel, pushpesh sharma, ceph-users

>>At high queue-depths and high IOPS, I would suspect that the bottleneck is the single, coarse-grained mutex protecting the cache data structures. It's been a back burner item to refactor the current cache mutex into finer-grained locks.
>>
>>Jason 

Thanks for the explanation, Jason.

Anyway, inside qemu I'm at around 35-40k IOPS with or without rbd_cache, so it doesn't make much difference at the moment
(maybe there is some other qemu bottleneck).
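
For reference, the fio figures from the first message in this thread (1 client, 4k randread, qdepth=32, outside qemu) can be compared with a few lines of arithmetic. The sketch below only restates those reported numbers; the helper names `cache_ratio` and `scaling` are made up for illustration.

```python
# IOPS reported in the original post, keyed by number of OSDs.
no_cache = {1: 38300, 2: 69073, 3: 78292}
with_cache = {1: 38100, 2: 42457, 3: 45823}


def cache_ratio(osds):
    """Cached IOPS as a fraction of uncached IOPS for the same OSD count."""
    return with_cache[osds] / no_cache[osds]


def scaling(results):
    """Speed-up when going from 1 OSD to 3 OSDs."""
    return results[3] / results[1]


for osds in sorted(no_cache):
    print(f"{osds} osd: cache/no-cache = {cache_ratio(osds):.2f}")
print(f"1->3 osd scaling without cache: {scaling(no_cache):.2f}x")
print(f"1->3 osd scaling with cache:    {scaling(with_cache):.2f}x")
```

With one OSD the two configurations are nearly identical, while with three OSDs the cached run reaches roughly 59% of the uncached one, which fits the picture of a client-side limit near 40k IOPS rather than an OSD-side one.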
 

----- Original message -----
From: "Jason Dillaman" <dillaman@redhat.com>
To: "Mark Nelson" <mnelson@redhat.com>
Cc: "aderumier" <aderumier@odiso.com>, "pushpesh sharma" <pushpesh.eck@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Tuesday, June 9, 2015 15:39:50
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

> In the past we've hit some performance issues with RBD cache that we've 
> fixed, but we've never really tried pushing a single VM beyond 40+K read 
> IOPS in testing (or at least I never have). I suspect there's a couple 
> of possibilities as to why it might be slower, but perhaps joshd can 
> chime in as he's more familiar with what that code looks like. 
> 

At high queue-depths and high IOPS, I would suspect that the bottleneck is the single, coarse-grained mutex protecting the cache data structures. It's been a back burner item to refactor the current cache mutex into finer-grained locks. 

Jason 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-06-22  9:57 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-06-09  5:51 rbd_cache, limiting read on high iops around 40k Alexandre DERUMIER
     [not found] ` <1684793881.1564583.1433829106394.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-09  7:21   ` pushpesh sharma
     [not found]     ` <CAMc8nAWo-jnAHS5cLw5gDt57T3vZpiN79vFXc=pz=+Cjm6Ra6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-09  7:28       ` Alexandre DERUMIER
2015-06-09  8:36         ` [ceph-users] " Alexandre DERUMIER
     [not found]           ` <1897614581.1694878.1433838989184.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-09 11:36             ` Mark Nelson
2015-06-09 12:02               ` [ceph-users] " Alexandre DERUMIER
     [not found]                 ` <1208111516.1790161.1433851367996.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-09 16:00                   ` Robert LeBlanc
2015-06-09 16:47                     ` [ceph-users] " Alexandre DERUMIER
     [not found]                       ` <1058039366.2034449.1433868447253.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-10  4:10                         ` Alexandre DERUMIER
     [not found]                           ` <284297771.2095666.1433909407567.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-10  5:21                             ` Irek Fasikhov
     [not found]                               ` <CAF-rypxjbsH3GdUG474OgSZVjdzKyf_0n8-zAkAuGhk83TXQhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-10  5:41                                 ` Alexandre DERUMIER
     [not found]                                   ` <2010200873.2102614.1433914918985.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-10  7:06                                     ` Somnath Roy
2015-06-10  7:29                                       ` Alexandre DERUMIER
2015-06-12  5:52                                         ` pushpesh sharma
2015-06-12  6:03                                           ` Alexandre DERUMIER
2015-06-12  6:58                                             ` pushpesh sharma
2015-06-16 16:38                                               ` Alexandre DERUMIER
2015-06-22  5:58                                                 ` pushpesh sharma
2015-06-22  7:08                                                   ` Alexandre DERUMIER
2015-06-22  7:12                                                     ` Stefan Priebe - Profihost AG
     [not found]                                                       ` <942E436A-5668-4F76-91E7-FAA08CC0F48A-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
2015-06-22  7:22                                                         ` Irek Fasikhov
2015-06-22  8:54                                                           ` Alexandre DERUMIER
     [not found]                                                             ` <1581092206.1667776.1434963299884.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-22  9:04                                                               ` Irek Fasikhov
2015-06-22  9:26                                                                 ` Alexandre DERUMIER
     [not found]                                                                   ` <43279853.1688973.1434965164602.JavaMail.zimbra-M8QNeUgB6UTyG1zEObXtfA@public.gmane.org>
2015-06-22  9:28                                                                     ` Stefan Priebe - Profihost AG
     [not found]                                                                       ` <B7D8B5F0-4AB9-449A-895D-CF87AE49BCF6-2Lf/h1ldwEHR5kwTpVNS9A@public.gmane.org>
2015-06-22  9:57                                                                         ` Alexandre DERUMIER
2015-06-09 13:39               ` [ceph-users] " Jason Dillaman
     [not found]                 ` <1569135212.13362835.1433857190455.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-09 16:52                   ` Alexandre DERUMIER

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.