From: Alexandre DERUMIER
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
Date: Tue, 9 Jun 2015 18:47:27 +0200 (CEST)
Message-ID: <1058039366.2034449.1433868447253.JavaMail.zimbra@oxygem.tv>
To: Robert LeBlanc
Cc: Mark Nelson, ceph-devel, pushpesh sharma, ceph-users

Hi Robert,

>>What I found was that Ceph OSDs performed well with either
>>tcmalloc or jemalloc (except when RocksDB was built with jemalloc
>>instead of tcmalloc, I'm still working to dig into why that might be
>>the case).

Yes, from my tests, tcmalloc is a little faster than jemalloc for the OSDs (but only very slightly).

>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>better for QEMU/KVM in the tests that we ran. [1]

I have just done a QEMU test (4k randread, rbd_cache=off), and I don't see a speed regression with tcmalloc.
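As a cross-check on the percentages in the summary that follows, here is a minimal Python sketch (not part of the original benchmark) that recomputes each allocator's change relative to the glibc run in the same iothread mode. The dictionary layout and the function name relative_change are illustrative assumptions; the IOPS values are the ones listed in the summary.

```python
# IOPS figures copied from the summary in this message
# (QEMU guest, fio 4k randread, iodepth=32, rbd_cache=off).
# Keys are (iothread mode, allocator); glibc is the baseline for each mode.
iops = {
    ("no-iothread", "glibc"): 33395,
    ("no-iothread", "tcmalloc"): 34516,
    ("no-iothread", "jemalloc"): 42226,
    ("iothread", "glibc"): 34516,
    ("iothread", "tcmalloc"): 38676,
    ("iothread", "jemalloc"): 28023,
}


def relative_change(mode, allocator):
    """Percentage change in IOPS versus glibc in the same iothread mode."""
    baseline = iops[(mode, "glibc")]
    return (iops[(mode, allocator)] / baseline - 1.0) * 100.0


for mode in ("no-iothread", "iothread"):
    for allocator in ("tcmalloc", "jemalloc"):
        change = relative_change(mode, allocator)
        print(f"qemu : {mode} : {allocator} : {change:+.0f}%")
```

Rounded to whole percent, this gives the same figures as the summary. It only restates the arithmetic and says nothing about why the allocators behave differently with and without iothreads.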
With a QEMU iothread, tcmalloc gives a speed increase over glibc.
With a QEMU iothread, jemalloc gives a speed decrease.
Without an iothread, jemalloc gives a big speed increase.

This is with:
- qemu 2.3
- tcmalloc 2.2.1
- jemalloc 3.6
- libc6 2.19

qemu : no-iothread : glibc    : iops=33395
qemu : no-iothread : tcmalloc : iops=34516 (+3%)
qemu : no-iothread : jemalloc : iops=42226 (+26%)
qemu : iothread    : glibc    : iops=34516
qemu : iothread    : tcmalloc : iops=38676 (+12%)
qemu : iothread    : jemalloc : iops=28023 (-19%)

(The benefit of iothreads is that we can scale with more disks in one VM.)

fio results:
------------

qemu : iothread : tcmalloc : iops=38676
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 2015
  read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
    slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
    clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
     lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
    clat percentiles (usec):
     | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
     | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
     | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
     | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
     | 99.99th=[ 3888]
    bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
    lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
    lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
  cpu          : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, mint=33889msec, maxt=33889msec

Disk stats (read/write):
  vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%


qemu : no-iothread : tcmalloc : iops=34516
---------------------------------------------
Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 2015
  read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
    slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
    clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
     lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
    clat percentiles (usec):
     | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
     | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
     | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
     | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
     | 99.99th=[ 4320]
    bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
    lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
    lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
  cpu          : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, mint=37974msec, maxt=37974msec

Disk stats (read/write):
  vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%


qemu : iothread : glibc : iops=34516
-------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 2015
  read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
    slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
    clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
     lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
    clat percentiles (usec):
     | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
     | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
     | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
     | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
     | 99.99th=[ 3984]
    bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
    lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
    lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
  cpu          : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, mint=38051msec, maxt=38051msec

Disk stats (read/write):
  vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%


qemu : no-iothread : glibc : iops=33395
-----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 2015
  read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
    slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
    clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
     lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
    clat percentiles (usec):
     | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
     | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
     | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
     | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
     | 99.99th=[ 4832]
    bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
    lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
    lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
  cpu          : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, mint=39248msec, maxt=39248msec

Disk stats (read/write):
  vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%


qemu : iothread : jemalloc : iops=28023
----------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 00m:01s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 2015
  read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
    slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
    clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
     lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
    clat percentiles (usec):
     | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
     | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
     | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
     | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
     | 99.99th=[ 3760]
    bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
    lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
    lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
  cpu          : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, mint=46772msec, maxt=46772msec

Disk stats (read/write):
  vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%


qemu : no-iothread : jemalloc : iops=42226
--------------------------------------------
rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 2015
  read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
    slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
    clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
     lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
    clat percentiles (usec):
     | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
     | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
     | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
     | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
     | 99.99th=[ 2608]
    bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
    lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
    lat (msec) : 2=10.30%, 4=0.07%
  cpu          : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, mint=29599msec, maxt=29599msec

Disk stats (read/write):
  vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%


----- Original message -----
From: "Robert LeBlanc"
To: "aderumier"
Cc: "Mark Nelson", "ceph-devel", "pushpesh sharma", "ceph-users"
Sent: Tuesday 9 June 2015 18:00:29
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

I also saw a similar performance increase by using alternative memory
allocators. What I found was that Ceph OSDs performed well with either
tcmalloc or jemalloc (except when RocksDB was built with jemalloc
instead of tcmalloc, I'm still working to dig into why that might be
the case).

However, I found that tcmalloc with QEMU/KVM was very detrimental to
small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
better for QEMU/KVM in the tests that we ran. [1]

I'm currently looking into I/O bottlenecks around the 16KB range and
I'm seeing a lot of time in thread creation and destruction; the
memory allocators are quite a bit down the list (both fio with
ioengine rbd and on the OSDs). I wonder what the difference can be.
I've tried using the async messenger but there wasn't a huge
difference. [2]

Further down the rabbit hole....

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
[2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html

----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER wrote:
>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>IOPS from 1 VM!
>
> Note that these results are not from a VM (fio-rbd on the host), so in a VM we'll have overhead.
> (I'm planning to send QEMU results soon.)
>
>>>How fast are the SSDs in those 3 OSDs?
>
> These results are with the data in the buffer memory of the OSD nodes.
>
> When reading fully from SSD (Intel S3500),
>
> for 1 client,
>
> I'm around 33k iops without cache and 32k iops with cache, with 1 OSD.
> I'm around 55k iops without cache and 38k iops with cache, with 3 OSDs.
>
> With multiple client jobs, I can reach around 70k iops per OSD, and 250k iops per OSD when the data is in buffer.
>
> (Server and client CPUs are 2x 10-core 3.1 GHz E5 Xeon.)
>
> Small tip:
> I'm using tcmalloc for fio-rbd or rados bench to improve latencies by around 20%
>
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>
> as a lot of time is spent in malloc/free.
>
> (QEMU has also supported tcmalloc for some months, I'll bench it too:
> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html)
>
> I'll try to send full bench results soon, from 1 to 18 SSD OSDs.
>
> ----- Original message -----
> From: "Mark Nelson"
> To: "aderumier", "pushpesh sharma"
> Cc: "ceph-devel", "ceph-users"
> Sent: Tuesday 9 June 2015 13:36:31
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi All,
>
> In the past we've hit some performance issues with RBD cache that we've
> fixed, but we've never really tried pushing a single VM beyond 40+K read
> IOPS in testing (or at least I never have). I suspect there's a couple
> of possibilities as to why it might be slower, but perhaps joshd can
> chime in as he's more familiar with what that code looks like.
>
> Frankly, I'm a little impressed that without RBD cache we can hit 80K
> IOPS from 1 VM!
> How fast are the SSDs in those 3 OSDs?
>
> Mark
>
> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>> It seems that the limit mainly shows up at high queue depth (roughly > 16).
>>
>> Here are the results in iops with 1 client, 4k randread, 3 OSDs, at different queue depths.
>> With queue depth < 16, rbd_cache gives almost the same result as no cache.
>>
>> cache
>> -----
>> qd1: 1651
>> qd2: 3482
>> qd4: 7958
>> qd8: 17912
>> qd16: 36020
>> qd32: 42765
>> qd64: 46169
>>
>> no cache
>> --------
>> qd1: 1748
>> qd2: 3570
>> qd4: 8356
>> qd8: 17732
>> qd16: 41396
>> qd32: 78633
>> qd64: 79063
>> qd128: 79550
>>
>>
>> ----- Original message -----
>> From: "aderumier"
>> To: "pushpesh sharma"
>> Cc: "ceph-devel", "ceph-users"
>> Sent: Tuesday 9 June 2015 09:28:21
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi,
>>
>>>> We tried adding more RBDs to single VM, but no luck.
>>
>> If you want to scale with more disks in a single QEMU VM, you need to use the iothread feature of QEMU and assign one iothread per disk (works with virtio-blk).
>> It's working for me, I can scale by adding more disks.
>>
>> My benchmarks here are done with fio-rbd on the host.
>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single host, and around 250k iops with 10 clients and rbd_cache=on.
>>
>> I just wonder why I don't see a performance decrease around 30k iops with 1 OSD.
>>
>> I'm going to see if this tracker
>> http://tracker.ceph.com/issues/11056
>>
>> could be the cause.
>>
>> (My master build was done some weeks ago.)
>>
>>
>> ----- Original message -----
>> From: "pushpesh sharma"
>> To: "aderumier"
>> Cc: "ceph-devel", "ceph-users"
>> Sent: Tuesday 9 June 2015 09:21:04
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Hi Alexandre,
>>
>> We have also seen something very similar on Hammer (0.94-1). We were doing some benchmarking for VMs hosted on a hypervisor (QEMU-KVM, openstack-juno). Each Ubuntu VM has an RBD as root disk and 1 RBD as additional storage. For some strange reason it was not able to scale 4K random-read iops on each VM beyond 35-40k. We tried adding more RBDs to a single VM, but no luck. However, increasing the number of VMs to 4 on a single hypervisor did scale to some extent. After this we did not get much benefit from adding more VMs.
>>
>> Here is the trend we have seen; the x-axis is the number of hypervisors, each hypervisor has 4 VMs, each VM has 1 RBD:
>>
>> VDbench is used as the benchmarking tool. We were not saturating the network or the CPUs at the OSD nodes. We were not able to saturate the CPUs at the hypervisors, and that is where we suspected some throttling effect. However, we haven't set any such limits from the nova or kvm end. We tried some CPU pinning and other KVM-related tuning as well, but no luck.
>>
>> We tried the same experiment on bare metal. There, 4K random-read IOPS scaled from 40K (1 RBD) to 180K (4 RBDs). But after that, rather than scaling beyond that point, the numbers were actually degrading. (Single pipe, more congestion effect.)
>>
>> We never suspected that enabling rbd cache could be detrimental to performance.
>> It would be nice to root-cause the problem if that is the case.
>>
>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>
>> Hi,
>>
>> I'm doing a benchmark (ceph master branch) with randread 4k qdepth=32,
>> and rbd_cache=true seems to limit the iops to around 40k.
>>
>> no cache
>> --------
>> 1 client - rbd_cache=false - 1osd : 38300 iops
>> 1 client - rbd_cache=false - 2osd : 69073 iops
>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>
>> cache
>> -----
>> 1 client - rbd_cache=true - 1osd : 38100 iops
>> 1 client - rbd_cache=true - 2osd : 42457 iops
>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>
>> Is this expected?
>>
>> fio result rbd_cache=false 3 osd
>> --------------------------------
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 07:48:42 2015
>>   read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>     slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>     clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>      lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>     clat percentiles (usec):
>>      | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>      | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>      | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>      | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>      | 99.99th=[ 1176]
>>     bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, stdev=25196.21
>>     lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>     lat (msec) : 2=0.03%, 4=0.01%
>>   cpu          : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>    READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>   sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>
>>
>> fio result rbd_cache=true 3osd
>> ------------------------------
>>
>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] [eta 00m:00s]
>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 07:47:30 2015
>>   read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>     slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>     clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>      lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>     clat percentiles (usec):
>>      | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>      | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>      | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>      | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>      | 99.99th=[ 2192]
>>     bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, stdev=15079.93
>>     lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>     lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>   cpu          : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>    READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, mint=55866msec, maxt=55866msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>>   sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%