* Ceph write performance
@ 2012-07-20 10:24 George Shuklin
[not found] ` <20120720104150.GA16630@oder.kd-bie.de>
` (3 more replies)
0 siblings, 4 replies; 31+ messages in thread
From: George Shuklin @ 2012-07-20 10:24 UTC (permalink / raw)
To: ceph-devel
Good day.
I've started to play with Ceph, and I found some rather strange
performance issues. I'm not sure whether this is due to a Ceph limitation
or to my bad setup.
Setup:
osd - xfs on ramdisk (only one osd)
mds - raid0 on 10 disks
mon - second raid0 on 10 disks
I've mounted the Ceph share on localhost and run fio (randwrite, 4k, iodepth=32).
What I got: 1900 IOPS on writes (4k block, 1 GB span).
Normally fio shows about 200k IOPS when writing to the ramdisk.
Why was it so slow? I did the setup exactly as described here:
http://ceph.com/docs/master/start/quick-start/#start-the-ceph-cluster
(but one osd).
Thanks.
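As a rough sanity check on the two figures in the message above, the IOPS numbers can be converted to throughput. This is only a sketch: it assumes "4k" means 4096-byte requests, and the helper name `iops_to_mib_s` is illustrative, not part of fio or Ceph.

```python
# Convert the IOPS figures quoted in the message into throughput.
# Assumption: "4k" means 4096-byte requests.
# iops_to_mib_s is an illustrative helper, not a fio or Ceph function.

BLOCK_SIZE = 4096  # bytes per request (fio blocksize=4k)


def iops_to_mib_s(iops: int, block_size: int = BLOCK_SIZE) -> float:
    """Return throughput in MiB/s for a given IOPS rate and block size."""
    return iops * block_size / (1024 * 1024)


ceph_iops = 1900        # measured on the Ceph mount
ramdisk_iops = 200_000  # "about 200kIOPS" on the bare ramdisk

ceph_mib_s = iops_to_mib_s(ceph_iops)
ramdisk_mib_s = iops_to_mib_s(ramdisk_iops)
slowdown = ramdisk_iops / ceph_iops

print(f"Ceph mount: {ceph_mib_s:.1f} MiB/s")
print(f"ramdisk:    {ramdisk_mib_s:.1f} MiB/s")
print(f"ratio:      {slowdown:.0f}x")
```

At this block size the Ceph mount moves only a few MiB/s, so the gap is about per-request overhead rather than raw bandwidth.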
* Re: Ceph write performance
[not found] ` <20120720104150.GA16630@oder.kd-bie.de>
@ 2012-07-20 10:48 ` George Shuklin
2012-07-20 11:49 ` Mark Nelson
0 siblings, 1 reply; 31+ messages in thread
From: George Shuklin @ 2012-07-20 10:48 UTC (permalink / raw)
To: Dieter Kasper (KD), ceph-devel

On 20.07.2012 14:41, Dieter Kasper (KD) wrote:

Good day.

Thank you for your attention.

ramdisk size: ~70 GB (modprobe brd rd_size=70000000)
journal: seems to be on the same device as the storage
size of OSD: unchanged (meaning I created it manually and did not
make any specific changes)

During the test I watched the IO load closely; IO on MDS/MON was insignificant
(zero most of the time, with a few very mild peaks).

Just in case, the configs:

ceph.conf:

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[mon.a]
    host = srv1
    mon addr = 192.168.0.1:6789

[osd.0]
    host = srv1

[mds.a]
    host = srv1

fio.ini:
[test]
blocksize=4k
filename=/media/test
size=16g
fallocate=posix
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=32

Thanks for the advice, I'll recheck with the new settings.

> George,
>
> please share more details of your config:
> - RAM size of your system
> - location of the journal
> - size of your OSD
>
> Can you try (just for the 1st test) to
> .. put the journal on RAM disk
> .. put the MDS on RAM disk
> .. put the MON on RAM disk
> .. use btrfs for OSD
>
> As an alternative to isolate the bottleneck you can try to
> - run without a journal
> - use RBD instead of Ceph-FS
>   + create a File System on top of the /dev/rbd0
>
> Regards,
> Dieter Kasper
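One detail in the posted job file is worth checking: it sets rw=randread, while the first message describes a randwrite run. The thread does not say whether this is a paste slip or the job that was actually run. The sketch below parses the job text as posted, using Python's standard configparser as a stand-in for fio's own parser (an assumption; fio job files are INI-like but not identical), and flags the difference.

```python
import configparser

# Job file exactly as posted in the message above.
FIO_INI = """\
[test]
blocksize=4k
filename=/media/test
size=16g
fallocate=posix
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=32
"""

# configparser is only an approximation of fio's job-file parser.
parser = configparser.ConfigParser()
parser.read_string(FIO_INI)
job = dict(parser["test"])

described_rw = "randwrite"  # workload named in the first message
mismatch = job["rw"] != described_rw

print(f"job file rw={job['rw']}, described rw={described_rw}, mismatch={mismatch}")
```

If the run really used rw=randread, the 1900 IOPS figure would describe reads rather than writes, so it is worth confirming before comparing numbers.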
* Re: Ceph write performance
2012-07-20 10:48 ` George Shuklin
@ 2012-07-20 11:49 ` Mark Nelson
2012-07-20 20:36 ` Ceph write performance on RAM-DISK Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Mark Nelson @ 2012-07-20 11:49 UTC (permalink / raw)
To: George Shuklin; +Cc: Dieter Kasper (KD), ceph-devel

Hi George,

I think you may find that the limitation is in the filestore. It's one of
the things I've been working on trying to track down, as I've seen low
performance on SSDs with small request sizes as well. You can use
test_filestore_workloadgen to specifically test the filestore code with
small requests if you'd like. I'm not sure if it is included with the
binary distribution, but it can be compiled if you download the source.
I think it's "make test_filestore_workloadgen" in the src directory.

Mark
* Re: Ceph write performance on RAM-DISK
2012-07-20 11:49 ` Mark Nelson
@ 2012-07-20 20:36 ` Dieter Kasper
2012-07-20 21:28 ` Mark Nelson
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-07-20 20:36 UTC (permalink / raw)
To: Mark Nelson; +Cc: George Shuklin, ceph-devel, Dieter Kasper (KD)
[-- Attachment #1: Type: text/plain, Size: 5220 bytes --]

Hi Mark, George,

I can observe a similar (poor) performance on my system with fio on /dev/rbd1.

#--- seq. write RBD
RX37-0:~ # dd if=/dev/zero of=/dev/rbd1 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 41.1819 s, 255 MB/s

#--- seq. read RBD
RX37-0:~ # dd of=/dev/zero if=/dev/rbd1 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40.9595 s, 256 MB/s

#--- seq. read /dev/ramX
RX37-0:~ # dd of=/dev/zero if=/dev/ram0 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 4.68389 s, 2.2 GB/s

Does ceph-osd/filestore 'eat' 90% of my resources/bandwidth/latency ?

RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
(...)
  write: io=461592KB, bw=15371KB/s, iops=3842 , runt= 30030msec
  write: io=5120.0MB, bw=893927KB/s, iops=223481 , runt= 5865msec (on /dev/ram0)

RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randread --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
(...)
  read : io=698356KB, bw=23240KB/s, iops=5809 , runt= 30050msec
  read : io=5120.0MB, bw=1631.1MB/s, iops=417559 , runt= 3139msec (on /dev/ram0)

RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=1m --size=5G --numjobs=4 --runtime=10 --group_reporting --name=file1
(...)
  write: io=6377.0MB, bw=217125KB/s, iops=212 , runt= 30075msec
  write: io=5120.0MB, bw=2114.9MB/s, iops=2114 , runt= 2421msec (on /dev/ram0)

Where is the bottleneck ?
What is filestore doing ?
How can I disable the journal and write only to the btrfs OSDs ? (as if they were SSDs)
How can I get better performance ?

Regards,
Dieter

P.S. I will try to get the "test_filestore_workloadgen"

[-- Attachment #2: ceph.conf --]
[-- Type: text/plain, Size: 864 bytes --]

[global]
    pid file = /var/run/ceph/$name.pid
    debug ms = 0
    auth supported = cephx
    keyring = /etc/ceph/keyring.client
[mon]
    mon data = /tmp/mon$id
[mon.a]
    host = localhost
    mon addr = 127.0.0.1:6789
[osd]
    journal dio = false
    osd data = /data/$name
    osd journal = /mnt/osd.journal/$name/journal
    osd journal size = 1000
    keyring = /etc/ceph/keyring.$name
#   debug osd = 20
#   debug ms = 1 ; message traffic
#   debug filestore = 20 ; local object storage
#   debug journal = 20 ; local journaling
#   debug monc = 5 ; monitor interaction, startup
[osd.0]
    host = localhost
    btrfs devs = /dev/ram0
[osd.1]
    host = localhost
    btrfs devs = /dev/ram1
[osd.2]
    host = localhost
    btrfs devs = /dev/ram2
[mds.a]
    host = localhost
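To put the "90%" question from the message above in numbers, the fraction of ramdisk performance that survives on /dev/rbd1 can be computed from the posted results. The sketch below uses only figures copied from that message; the 2.2 GB/s dd result is taken as 2200 MB/s, and `fraction_retained` is an illustrative helper name.

```python
# Figures copied from the message above:
# (label, value on /dev/rbd1, value on /dev/ram0)
results = [
    ("fio randwrite 4k, IOPS", 3842, 223481),
    ("fio randread 4k, IOPS", 5809, 417559),
    ("fio randwrite 1m, IOPS", 212, 2114),
    ("dd seq. read, MB/s", 256, 2200),  # assumption: 2.2 GB/s taken as 2200 MB/s
]


def fraction_retained(rbd: float, ram: float) -> float:
    """Share of the raw ramdisk figure that is still seen through RBD."""
    return rbd / ram


summary = {label: fraction_retained(rbd, ram) for label, rbd, ram in results}

for label, frac in summary.items():
    print(f"{label:<24} {frac:6.1%} of ramdisk, {1 - frac:6.1%} lost")
```

By this measure the large-block and sequential cases lose roughly 90%, while the 4k random cases lose more than 98%, which is consistent with the small-request slowdown Mark describes.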
* Re: Ceph write performance on RAM-DISK
2012-07-20 20:36 ` Ceph write performance on RAM-DISK Dieter Kasper
@ 2012-07-20 21:28 ` Mark Nelson
0 siblings, 0 replies; 31+ messages in thread
From: Mark Nelson @ 2012-07-20 21:28 UTC (permalink / raw)
To: Dieter Kasper; +Cc: George Shuklin, ceph-devel

On 07/20/2012 03:36 PM, Dieter Kasper wrote:
> Does ceph-osd/filestore 'eat' 90% of my resources/bandwidth/latency ?

Well, there are multiple layers involved here, so it's possible that some
of the code for RBD is playing a part in this too. I have specifically
seen slow performance with smaller requests with the filestore though, so
that is where I'm focusing my energy right now.

> Where is the bottleneck ?
> What is filestore doing ?
> How can I disable the journal and write only to the btrfs OSDs ? (like as they would be SSDs)
> How can I get better performance ?

Not yet sure where the bottleneck is, but we are actively looking into it.
Sadly the process has been complicated by a potential bottleneck in our
test hardware that could be masking real issues in the code.

--
Mark Nelson
Performance Engineer
Inktank
* Re: Ceph write performance
2012-07-20 10:24 Ceph write performance George Shuklin
[not found] ` <20120720104150.GA16630@oder.kd-bie.de>
@ 2012-07-20 15:53 ` Matthew Richardson
2012-07-20 16:37 ` Gregory Farnum
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
3 siblings, 0 replies; 31+ messages in thread
From: Matthew Richardson @ 2012-07-20 15:53 UTC (permalink / raw)
To: ceph-devel
[-- Attachment #1: Type: text/plain, Size: 793 bytes --]

On 20/07/12 11:24, George Shuklin wrote:
> Good day.
>
> I've start to play with Ceph... And I found some kinda strange
> performance issues. I'm not sure if this is due ceph limitation or my
> bad setup.

I'm seeing a similar problem which looks like a potential bug, which
someone else seems to have already reported
(http://www.spinics.net/lists/ceph-devel/msg07335.html and
http://www.spinics.net/lists/ceph-devel/msg07691.html)

The problem only seems to hit for me when I do random writes - can you
try fio with sequential writes (rw=write) and see if your problem also
disappears? It might help confirm this as an issue.

Thanks,

Matthew

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
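The comparison suggested above (random versus sequential writes) is easiest to read when the two fio jobs differ in nothing but rw. A minimal sketch that generates such a pair of job files follows. The options are a subset of those in the fio.ini posted earlier in the thread; the job names and the `make_job` helper are illustrative assumptions.

```python
# Build two fio job files that differ only in the rw setting.
# Options are a subset of the fio.ini posted earlier in the thread;
# job names and make_job are illustrative, not taken from the thread.
BASE_OPTIONS = {
    "blocksize": "4k",
    "filename": "/media/test",
    "size": "16g",
    "direct": "1",
    "ioengine": "libaio",
    "iodepth": "32",
}


def make_job(name: str, rw: str) -> str:
    """Return the text of a single-job fio file with the given rw mode."""
    lines = [f"[{name}]"]
    lines += [f"{key}={value}" for key, value in BASE_OPTIONS.items()]
    lines.append(f"rw={rw}")
    return "\n".join(lines) + "\n"


random_job = make_job("randwrite-test", "randwrite")
sequential_job = make_job("seqwrite-test", "write")

print(random_job)
print(sequential_job)
```

Each text would be saved to its own file and run separately, so that the two workloads do not compete for the same OSD during the measurement.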
* Re: Ceph write performance
2012-07-20 10:24 Ceph write performance George Shuklin
[not found] ` <20120720104150.GA16630@oder.kd-bie.de>
2012-07-20 15:53 ` Ceph write performance Matthew Richardson
@ 2012-07-20 16:37 ` Gregory Farnum
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
3 siblings, 0 replies; 31+ messages in thread
From: Gregory Farnum @ 2012-07-20 16:37 UTC (permalink / raw)
To: George Shuklin; +Cc: ceph-devel

On Fri, Jul 20, 2012 at 3:24 AM, George Shuklin <shuklin@selectel.ru> wrote:
> Good day.
>
> I've start to play with Ceph... And I found some kinda strange performance
> issues. I'm not sure if this is due ceph limitation or my bad setup.
>
> Setup:
>
> osd - xfs on ramdisk (only one osd)
> mds - raid0 on 10 disks
> mon - second raid0 on 10 disks

I'm not going to butt in on the performance discussion, but just FYI,
the MDS does not use any local storage; it puts everything on the
OSDs. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* RBD performance - tuning hints
2012-07-20 10:24 Ceph write performance George Shuklin
` (2 preceding siblings ...)
2012-07-20 16:37 ` Gregory Farnum
@ 2012-08-28 17:48 ` Dieter Kasper
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
2012-08-29 8:50 ` Alexandre DERUMIER
3 siblings, 2 replies; 31+ messages in thread
From: Dieter Kasper @ 2012-08-28 17:48 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org; +Cc: Dieter Kasper (KD)
[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

Hi,

on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
I can observe a pretty nice rados bench performance
(see bench-rados.txt for details):

Bandwidth (MB/sec):     961.710
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772

Also the bandwidth performance generated with

fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}

.... is acceptable, e.g.
fio_write_4m_16         795 MB/s
fio_randwrite_8m_128    717 MB/s
fio_randwrite_8m_16     714 MB/s
fio_randwrite_2m_32     692 MB/s

But, the write IOPS seems to be limited around 19k ...
RBD                        4M     64k (= optimal_io_size)
fio_randread_512_128    53286   55925
fio_randread_4k_128     51110   44382
fio_randread_8k_128     30854   29938
fio_randwrite_512_128   18888    2386
fio_randwrite_512_64    18844    2582
fio_randwrite_8k_64     17350    2445
(...)
fio_read_4k_128         10073   53151
fio_read_4k_64           9500   39757
fio_read_4k_32           9220   23650
(...)
fio_read_4k_16           9122   14322
fio_write_4k_128         2190   14306
fio_read_8k_32            706   13894
fio_write_4k_64          2197   12297
fio_write_8k_64          3563   11705
fio_write_8k_128         3444   11219

Any hints for tuning the IOPS (read and/or write) would be appreciated.

How can I set the variables that control when the journal data has to go to the OSD ?
(after X seconds and/or when Y %-full)

Kind Regards,
-Dieter

[-- Attachment #2: bench-rados.txt --]
[-- Type: text/plain, Size: 1746 bytes --]

rados bench -p pbench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat    avg lat
    0       0         0         0         0         0         -          0
    1      16       228       212   847.857       848  0.042984  0.0684383
    2      16       451       435    869.88       892  0.084162  0.0700566
    3      16       695       679   905.223       976  0.057677  0.0695337
    4      16       942       926   925.894       988  0.038117  0.0685357
    5      16      1162      1146     916.7       880  0.042098  0.0693864
    6      16      1400      1384   922.569       952  0.063983  0.0689167
    7      16      1644      1628   930.189       976  0.065745  0.0684646
    8      16      1895      1879   939.404      1004  0.051277  0.0677953
    9      16      2145      2129   946.127      1000  0.055165   0.067354
(...)
   57      16     13704     13688    960.47       996  0.082716  0.0665862
   58      16     13954     13938    961.15      1000  0.041879  0.0665307
   59      16     14194     14178   961.129       960  0.046657  0.0664642
2012-08-28 17:32:18.620060min lat: 0.030234 max lat: 3.17834 avg lat: 0.0664676
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat    avg lat
   60      16     14446     14430   961.909      1008  0.051635  0.0664676
Total time run:         60.084612
Total writes made:      14446
Write size:             4194304
Bandwidth (MB/sec):     961.710
Stddev Bandwidth:       54.0809
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772
Average Latency:        0.0665337
Stddev Latency:         0.0800225
Max latency:            3.17834
Min latency:            0.030234

[-- Attachment #3: bench-config.txt --]
[-- Type: text/plain, Size: 26557 bytes --]

--- RX37-3c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-3 3.0.41-5.1-default #1 SMP Wed Aug 22 00:54:03 UTC 2012 (9c63123) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856332 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdm [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 198232151949312 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 39 C Blocks sent to initiator = 188127268306944 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 42 C Blocks sent to initiator = 241646771896320 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 33 C Blocks sent to initiator = 202151376715776 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 186279543177216 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 200414079221760 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 40 C Blocks sent to initiator = 301595287879680 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 30 C Blocks sent to initiator = 190686448058368 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] 
deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdm on /data/osd.30 type btrfs (rw,noatime) /dev/sdn on /data/osd.31 type btrfs (rw,noatime) /dev/sdo on /data/osd.32 type btrfs (rw,noatime) /dev/sdp on /data/osd.33 type btrfs (rw,noatime) /dev/sdq on /data/osd.34 type btrfs (rw,noatime) /dev/sdr on /data/osd.35 type btrfs (rw,noatime) /dev/sds on /data/osd.36 type btrfs (rw,noatime) /dev/sdt on /data/osd.37 type btrfs (rw,noatime) --- RX37-4c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-4 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). MemTotal: 32856432 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdd [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sde [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdf [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdg [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdh [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdi [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdj [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdk Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 33 C Blocks sent to initiator = 326270260871168 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 29 C Blocks sent to initiator = 230247207272448 Device: 
INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 168513041858560 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 171904673513472 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 30 C Blocks sent to initiator = 175995797635072 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 206814587125760 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 26 C Blocks sent to initiator = 239652363567104 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 32 C Blocks sent to initiator = 221954917269504 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdd on /data/osd.40 type btrfs (rw,noatime) /dev/sde on /data/osd.41 type btrfs (rw,noatime) /dev/sdf on /data/osd.42 type btrfs (rw,noatime) /dev/sdg on /data/osd.43 type btrfs (rw,noatime) /dev/sdh on /data/osd.44 type btrfs (rw,noatime) /dev/sdi on /data/osd.45 type btrfs (rw,noatime) /dev/sdj on /data/osd.46 type btrfs (rw,noatime) /dev/sdk on /data/osd.47 type btrfs (rw,noatime) --- RX37-5c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-5 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). 
MemTotal: 74226012 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 195550280417280 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 177656960122880 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 41 C Blocks sent to initiator = 238550402465792 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 31 C Blocks sent to initiator = 226579741409280 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 33 C Blocks sent to initiator = 186652383248384 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 219684389519360 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 39 C Blocks sent to initiator = 223471107833856 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 29 C Blocks sent to initiator = 190300723085312 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] 
deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdo on /data/osd.50 type btrfs (rw,noatime) /dev/sdp on /data/osd.51 type btrfs (rw,noatime) /dev/sdq on /data/osd.52 type btrfs (rw,noatime) /dev/sdr on /data/osd.53 type btrfs (rw,noatime) /dev/sds on /data/osd.54 type btrfs (rw,noatime) /dev/sdt on /data/osd.55 type btrfs (rw,noatime) /dev/sdu on /data/osd.56 type btrfs (rw,noatime) /dev/sdv on /data/osd.57 type btrfs (rw,noatime) --- RX37-6c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-6 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). 
MemTotal: 32856344 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 41 C Blocks sent to initiator = 195597608943616 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 197325225984000 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 42 C Blocks sent to initiator = 182463498289152 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 45 C Blocks sent to initiator = 250870398713856 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 209343584665600 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 33 C Blocks sent to initiator = 226728102330368 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 43 C Blocks sent to initiator = 213839006138368 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 38 C Blocks sent to initiator = 179503745728512 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] 
deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdn on /data/osd.60 type btrfs (rw,noatime) /dev/sdo on /data/osd.61 type btrfs (rw,noatime) /dev/sdp on /data/osd.62 type btrfs (rw,noatime) /dev/sdq on /data/osd.63 type btrfs (rw,noatime) /dev/sdr on /data/osd.64 type btrfs (rw,noatime) /dev/sds on /data/osd.65 type btrfs (rw,noatime) /dev/sdt on /data/osd.66 type btrfs (rw,noatime) /dev/sdu on /data/osd.67 type btrfs (rw,noatime) --- RX37-7c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-7 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). MemTotal: 32856344 kB optimal_io_size: 4194304 65536 scheduler: [noop] deadline cfq noop deadline [cfq] [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq --- RX37-8c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-8 3.0.36-16-default #1 SMP Wed Jul 18 00:18:54 UTC 2012 (544e41f) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). 
MemTotal: 65952088 kB optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq -------------------------------------------------------------------------------- dumped osdmap epoch 15 epoch 15 fsid 7ab4662b-0575-4875-b59d-3bef85bb918d created 2012-08-26 15:10:43.529294 modifed 2012-08-26 15:11:09.537529 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 max_osd 68 osd.30 up in weight 1 up_from 2 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6800/7884 192.168.114.52:6800/7884 192.168.114.52:6801/7884 exists,up f1912b6b-2abf-4eef-83e0-8657d78e48f8 osd.31 up in weight 1 up_from 4 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6801/8057 192.168.114.52:6802/8057 192.168.114.52:6803/8057 exists,up 2a254612-5242-4ae8-8ba7-3fe2eaa3eec5 osd.32 up in weight 1 up_from 3 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6802/8225 192.168.114.52:6804/8225 192.168.114.52:6805/8225 exists,up d41508ee-131c-47b8-9218-8f81bc7f7716 osd.33 up in weight 1 up_from 3 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6803/8415 192.168.114.52:6806/8415 192.168.114.52:6807/8415 exists,up 2e5a96be-ca3a-4c7d-8895-b61c07d858ac osd.34 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6804/8588 192.168.114.52:6808/8588 192.168.114.52:6809/8588 exists,up 214d8253-ad9b-4268-ba67-365ae9bc612a osd.35 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6805/8777 192.168.114.52:6810/8777 192.168.114.52:6811/8777 exists,up 9d328117-581a-4fdb-bee8-e373e74ee013 osd.36 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 
192.168.113.52:6806/8966 192.168.114.52:6812/8966 192.168.114.52:6813/8966 exists,up 0d046c45-ddd3-4c24-814c-36ace0632167 osd.37 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6807/9155 192.168.114.52:6814/9155 192.168.114.52:6815/9155 exists,up 2265a65a-624c-4729-bf64-47850270b4a9 osd.40 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6800/14455 192.168.114.53:6800/14455 192.168.114.53:6801/14455 exists,up e782364f-c5ee-4181-98ba-8e8009a789db osd.41 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6801/14639 192.168.114.53:6802/14639 192.168.114.53:6803/14639 exists,up 3154b1e5-e49a-417a-9b80-d64995afb2c8 osd.42 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6802/14816 192.168.114.53:6804/14816 192.168.114.53:6805/14816 exists,up a7cab833-70b2-4067-83a3-a8a7b7ccb1c2 osd.43 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6803/15013 192.168.114.53:6806/15013 192.168.114.53:6807/15013 exists,up 5afeea03-5a5d-4643-bbde-aaadda1bde01 osd.44 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6804/15190 192.168.114.53:6808/15190 192.168.114.53:6809/15190 exists,up 5b1a90a2-596d-40d4-b33d-cf74142f7e96 osd.45 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6805/15420 192.168.114.53:6810/15420 192.168.114.53:6811/15420 exists,up e4d85019-c8d4-4dc8-bec3-ceaddab60b99 osd.46 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6806/15623 192.168.114.53:6812/15623 192.168.114.53:6813/15623 exists,up 0a1b6a02-1b70-457f-9602-8f02e00d7ae1 osd.47 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6807/15826 192.168.114.53:6814/15826 192.168.114.53:6815/15826 exists,up 7be9d381-8c38-440c-ae22-fc29a9349351 osd.50 up in weight 1 up_from 5 up_thru 12 
down_at 0 last_clean_interval [0,0) 192.168.113.54:6800/1915 192.168.114.54:6800/1915 192.168.114.54:6801/1915 exists,up 7653343d-5602-4a6e-ac69-a278dab28c8c osd.51 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6801/2155 192.168.114.54:6802/2155 192.168.114.54:6803/2155 exists,up a58bfbfb-8f21-4939-8ca1-b8209be68a30 osd.52 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6802/2322 192.168.114.54:6804/2322 192.168.114.54:6805/2322 exists,up 81daeb73-23f4-4f68-b56b-7d5a1b95e7e0 osd.53 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6803/2515 192.168.114.54:6806/2515 192.168.114.54:6807/2515 exists,up b3978c52-f689-45e8-9ee2-681e3bdeeeb2 osd.54 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6804/2702 192.168.114.54:6808/2702 192.168.114.54:6809/2702 exists,up 205b59d3-176a-4048-84c5-81dd181a8e71 osd.55 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6805/2889 192.168.114.54:6810/2889 192.168.114.54:6811/2889 exists,up cd4d82de-0da8-48b0-a54f-d1372b611958 osd.56 up in weight 1 up_from 6 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.54:6806/3082 192.168.114.54:6812/3082 192.168.114.54:6813/3082 exists,up b82b38a6-64ad-487a-899b-6c62ebe6bb13 osd.57 up in weight 1 up_from 6 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.54:6807/3269 192.168.114.54:6814/3269 192.168.114.54:6815/3269 exists,up c155cf46-d287-4439-a39e-ff80c22e0caa osd.60 up in weight 1 up_from 7 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6800/30607 192.168.114.55:6800/30607 192.168.114.55:6801/30607 exists,up ab8370bf-c722-4eab-9842-498b6dfef765 osd.61 up in weight 1 up_from 7 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6801/30801 192.168.114.55:6802/30801 192.168.114.55:6803/30801 exists,up a189a254-efcd-4129-867e-384cd0765d19 osd.62 up in weight 1 
up_from 8 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6802/30946 192.168.114.55:6804/30946 192.168.114.55:6805/30946 exists,up 2ddc9000-a5be-4c7f-9362-2c525b93db7f
osd.63 up in weight 1 up_from 9 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6803/31139 192.168.114.55:6806/31139 192.168.114.55:6807/31139 exists,up 5c4661fb-4c6c-411d-bf46-b4ead15a019a
osd.64 up in weight 1 up_from 9 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6804/31332 192.168.114.55:6808/31332 192.168.114.55:6809/31332 exists,up b67f9e9b-d0f6-41b9-ac7f-0c355950316f
osd.65 up in weight 1 up_from 10 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6805/31525 192.168.114.55:6810/31525 192.168.114.55:6811/31525 exists,up 9e179b5f-b0ca-4799-8b02-13fc3a78eda5
osd.66 up in weight 1 up_from 10 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6806/31814 192.168.114.55:6812/31814 192.168.114.55:6813/31814 exists,up e300060b-ac96-4ed0-9670-ffe3d7547a18
osd.67 up in weight 1 up_from 11 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6807/32063 192.168.114.55:6814/32063 192.168.114.55:6815/32063 exists,up f87f78b3-61ba-403a-b012-ddd055ced47f

ceph.conf
---content---
# global
[global]
    # enable secure authentication
    auth supported = none

    # allow ourselves to open a lot of files
    #max open files = 1100000
    max open files = 131072

    # set log file
    log file = /ceph/log/$name.log
    # log_to_syslog = true    # uncomment this line to log to syslog

    # set up pid files
    pid file = /var/run/ceph/$name.pid

    # If you want to run a IPv6 cluster, set this to true. Dual-stack isn't possible
    #ms bind ipv6 = true

    public network = 192.168.113.0/24
    cluster network = 192.168.114.0/24

# monitors
# You need at least one. You need at least three if you want to
# tolerate any node failures. Always create an odd number.
[mon]
    mon data = /ceph/$name

    # If you are using for example the RADOS Gateway and want to have your newly created
    # pools a higher replication level, you can set a default
    #osd pool default size = 3

    # You can also specify a CRUSH rule for new pools
    # Wiki: http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
    #osd pool default crush rule = 0

    # Timing is critical for monitors, but if you want to allow the clocks to drift a
    # bit more, you can specify the max drift.
    #mon clock drift allowed = 1

    # Tell the monitor to backoff from this warning for 30 seconds
    #mon clock drift warn backoff = 30

    # logging, for debugging monitor crashes, in order of
    # their likelihood of being helpful :)
    #debug ms = 1
    #debug mon = 20
    #debug paxos = 20
    #debug auth = 20
    debug optracker = 0

[mon.0]
    host = RX37-3c
    mon addr = 192.168.113.52:6789

[mon.1]
    host = RX37-7c
    mon addr = 192.168.113.56:6789

[mon.2]
    host = RX37-8c
    mon addr = 192.168.113.57:6789

# mds
# You need at least one. Define two to get a standby.
[mds]
    # mds data = /ceph/$name
    # where the mds keeps it's secret encryption keys
    #keyring = /data/keyring.$name

    # mds logging to debug issues.
    #debug ms = 1
    #debug mds = 20
    debug optracker = 0

[mds.0]
    host = RX37-8c

# osd
# You need at least one. Two if you want data to be replicated.
# Define as many as you like.
[osd]
    # This is where the btrfs volume will be mounted.
    osd data = /data/$name

    # journal dio = true
    # osd op threads = 24
    # osd disk threads = 24
    # filestore op threads = 6
    # filestore queue max ops = 24

    # Ideally, make this a separate disk or partition. A few
    # hundred MB should be enough; more if you have fast or many
    # disks. You can use a file under the osd data dir if need be
    # (e.g. /data/$name/journal), but it will be slower than a
    # separate disk or partition.
    # This is an example of a file-based journal.
    # osd journal = /ceph/$name/journal
    # osd journal size = 2048    # journal size, in megabytes

    # If you want to run the journal on a tmpfs, disable DirectIO
    #journal dio = false

    # You can change the number of recovery operations to speed up recovery
    # or slow it down if your machines can't handle it
    # osd recovery max active = 3

    # osd logging to debug osd issues, in order of likelihood of being
    # helpful
    #debug ms = 1
    #debug osd = 20
    #debug filestore = 20
    #debug journal = 20
    debug optracker = 0

    fstype = btrfs

[osd.30]
    host = RX37-3c
    devs = /dev/sdm
    osd journal = /dev/ram0
[osd.31]
    host = RX37-3c
    devs = /dev/sdn
    osd journal = /dev/ram1
[osd.32]
    host = RX37-3c
    devs = /dev/sdo
    osd journal = /dev/ram2
[osd.33]
    host = RX37-3c
    devs = /dev/sdp
    osd journal = /dev/ram3
[osd.34]
    host = RX37-3c
    devs = /dev/sdq
    osd journal = /dev/ram4
[osd.35]
    host = RX37-3c
    devs = /dev/sdr
    osd journal = /dev/ram5
[osd.36]
    host = RX37-3c
    devs = /dev/sds
    osd journal = /dev/ram6
[osd.37]
    host = RX37-3c
    devs = /dev/sdt
    osd journal = /dev/ram7

[osd.40]
    host = RX37-4c
    devs = /dev/sdd
    osd journal = /dev/ram0
[osd.41]
    host = RX37-4c
    devs = /dev/sde
    osd journal = /dev/ram1
[osd.42]
    host = RX37-4c
    devs = /dev/sdf
    osd journal = /dev/ram2
[osd.43]
    host = RX37-4c
    devs = /dev/sdg
    osd journal = /dev/ram3
[osd.44]
    host = RX37-4c
    devs = /dev/sdh
    osd journal = /dev/ram4
[osd.45]
    host = RX37-4c
    devs = /dev/sdi
    osd journal = /dev/ram5
[osd.46]
    host = RX37-4c
    devs = /dev/sdj
    osd journal = /dev/ram6
[osd.47]
    host = RX37-4c
    devs = /dev/sdk
    osd journal = /dev/ram7

[osd.50]
    host = RX37-5c
    devs = /dev/sdo
    osd journal = /dev/ram0
[osd.51]
    host = RX37-5c
    devs = /dev/sdp
    osd journal = /dev/ram1
[osd.52]
    host = RX37-5c
    devs = /dev/sdq
    osd journal = /dev/ram2
[osd.53]
    host = RX37-5c
    devs = /dev/sdr
    osd journal = /dev/ram3
[osd.54]
    host = RX37-5c
    devs = /dev/sds
    osd journal = /dev/ram4
[osd.55]
    host = RX37-5c
    devs = /dev/sdt
    osd journal = /dev/ram5
[osd.56]
    host = RX37-5c
    devs = /dev/sdu
    osd journal = /dev/ram6
[osd.57]
    host = RX37-5c
    devs = /dev/sdv
    osd journal = /dev/ram7

[osd.60]
    host = RX37-6c
    devs = /dev/sdn
    osd journal = /dev/ram0
[osd.61]
    host = RX37-6c
    devs = /dev/sdo
    osd journal = /dev/ram1
[osd.62]
    host = RX37-6c
    devs = /dev/sdp
    osd journal = /dev/ram2
[osd.63]
    host = RX37-6c
    devs = /dev/sdq
    osd journal = /dev/ram3
[osd.64]
    host = RX37-6c
    devs = /dev/sdr
    osd journal = /dev/ram4
[osd.65]
    host = RX37-6c
    devs = /dev/sds
    osd journal = /dev/ram5
[osd.66]
    host = RX37-6c
    devs = /dev/sdt
    osd journal = /dev/ram6
[osd.67]
    host = RX37-6c
    devs = /dev/sdu
    osd journal = /dev/ram7
    devs = /dev/sdc

[client.01]
    client hostname = RX37-7c

^ permalink raw reply [flat|nested] 31+ messages in thread
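The 'scheduler:' and 'optimal_io_size:' lines in the dump above look like values read from the per-device queue attributes in sysfs. A minimal sketch of how such a listing could be collected follows. The function name show_queue_settings and its optional root-directory argument are assumptions made for illustration; they are not taken from the script that produced bench-config.txt.

```shell
#!/bin/sh
# Print the I/O scheduler and optimal_io_size of every block device.
# The first argument is the sysfs root (defaults to /sys); it exists only so
# the function can be pointed at a scratch directory. Hypothetical helper.
show_queue_settings() {
    root="${1:-/sys}"
    for q in "$root"/block/*/queue; do
        # Skip the unexpanded glob when no device directory exists.
        [ -d "$q" ] || continue
        dev=$(basename "$(dirname "$q")")
        sched=$(cat "$q/scheduler" 2>/dev/null)
        optio=$(cat "$q/optimal_io_size" 2>/dev/null)
        printf '%s scheduler: %s optimal_io_size: %s\n' "$dev" "$sched" "$optio"
    done
}

show_queue_settings "$@"
```

The active scheduler is the one shown in square brackets, as in the '[noop] deadline cfq' entries above.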
* Re: RBD performance - tuning hints
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
@ 2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
2012-08-28 19:04 ` Dieter Kasper
2012-08-29 8:50 ` Alexandre DERUMIER
1 sibling, 1 reply; 31+ messages in thread
From: Smart Weblications GmbH - Florian Wiessner @ 2012-08-28 18:53 UTC (permalink / raw)
To: Dieter Kasper, ceph-devel
On 28.08.2012 19:48, Dieter Kasper wrote:
> Hi,
>
> on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> I can observe a pretty nice rados bench performance
> (see bench-rados.txt for details):

I'd like to know which 10GE switch you used. Do you use 10GBASE-T?

--

Kind regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

phone: +49 9282 9638 200
fax: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0.99 EUR/min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register: HRB 3840, Amtsgericht Hof
*from German landlines; prices from mobile networks may differ
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-08-28 19:04 ` Dieter Kasper
0 siblings, 0 replies; 31+ messages in thread
From: Dieter Kasper @ 2012-08-28 19:04 UTC (permalink / raw)
To: Smart Weblications GmbH - Florian Wiessner; +Cc: ceph-devel@vger.kernel.org
On Tue, Aug 28, 2012 at 08:53:46PM +0200, Smart Weblications GmbH - Florian Wiessner wrote:
> On 28.08.2012 19:48, Dieter Kasper wrote:
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
>
> I'd like to know which 10GE switch you used. Do you use 10GBASE-T?

http://www.brocade.com/products/all/switches/product-details/turboiron-24x-switch/index.page

Kind regards,
Dieter Kasper

> --
>
> Kind regards,
>
> Florian Wiessner
>
> Smart Weblications GmbH
> Martinsberger Str. 1
> D-95119 Naila
>
> phone: +49 9282 9638 200
> fax: +49 9282 9638 205
> 24/7: +49 900 144 000 00 - 0.99 EUR/min*
> http://www.smart-weblications.de
>
> --
> Registered office: Naila
> Managing director: Florian Wiessner
> Commercial register: HRB 3840, Amtsgericht Hof
> *from German landlines; prices from mobile networks may differ
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-08-29 8:50 ` Alexandre DERUMIER
2012-08-29 17:37 ` Josh Durgin
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
1 sibling, 2 replies; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-29 8:50 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel
Nice results!
(Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)

>>How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
I think you can try tuning these values:

filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 10000

----- Original message -----

From: "Dieter Kasper" <d.kasper@kabelmail.de>
To: ceph-devel@vger.kernel.org
Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
Sent: Tuesday, 28 August 2012 19:48:42
Subject: RBD performance - tuning hints

Hi,

on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
I can observe a pretty nice rados bench performance
(see bench-rados.txt for details):

Bandwidth (MB/sec): 961.710
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772

Also the bandwidth performance generated with
fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}

.... is acceptable, e.g.
fio_write_4m_16          795 MB/s
fio_randwrite_8m_128     717 MB/s
fio_randwrite_8m_16      714 MB/s
fio_randwrite_2m_32      692 MB/s

But the write IOPS seem to be limited to around 19k ...
RBD                          4M      64k   (= optimal_io_size)
fio_randread_512_128      53286    55925
fio_randread_4k_128       51110    44382
fio_randread_8k_128       30854    29938
fio_randwrite_512_128     18888     2386
fio_randwrite_512_64      18844     2582
fio_randwrite_8k_64       17350     2445
(...)
fio_read_4k_128           10073    53151
fio_read_4k_64             9500    39757
fio_read_4k_32             9220    23650
(...)
fio_read_4k_16             9122    14322
fio_write_4k_128           2190    14306
fio_read_8k_32              706    13894
fio_write_4k_64            2197    12297
fio_write_8k_64            3563    11705
fio_write_8k_128           3444    11219

Any hints for tuning the IOPS (read and/or write) would be appreciated.

How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)

Kind Regards,
-Dieter

--

Alexandre Derumier
Systems and Network Engineer

Phone: 03 20 68 88 85
Fax: 03 20 68 90 88

45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 31+ messages in thread
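The fio command quoted above is parameterised by $io, $bs and $threads, and the result names (for example fio_randwrite_4k_64) suggest it was run in a loop over I/O pattern, block size and queue depth. A hedged sketch of such a sweep follows. The loop values are picked from the result names in this thread, not from the original script, and both the fio_cmd helper and the DRY_RUN switch are hypothetical additions so the command lines can be inspected without touching a device.

```shell
#!/bin/sh
# Build the fio command line from the thread for one combination of
# I/O pattern ($1), block size ($2) and queue depth ($3).
# fio_cmd is a hypothetical helper; the options are copied from the quoted command.
fio_cmd() {
    echo "fio --filename=/dev/rbd1 --direct=1 --rw=$1 --bs=$2 --size=2G" \
         "--iodepth=$3 --ioengine=libaio --runtime=60 --group_reporting" \
         "--name=file1 --output=fio_${1}_${2}_${3}"
}

# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
# Running the write patterns against /dev/rbd1 overwrites the data on that device.
DRY_RUN=${DRY_RUN:-1}
for io in randread randwrite read write; do
    for bs in 512 4k 8k; do
        for threads in 16 32 64 128; do
            cmd=$(fio_cmd "$io" "$bs" "$threads")
            if [ "$DRY_RUN" = 1 ]; then
                echo "$cmd"
            else
                $cmd
            fi
        done
    done
done
```

Each run writes its report to a file named after the three parameters, which is where row labels such as fio_randread_512_128 in the table would come from under this assumption.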
* Re: RBD performance - tuning hints
2012-08-29 8:50 ` Alexandre DERUMIER
@ 2012-08-29 17:37 ` Josh Durgin
2012-08-29 19:29 ` RBD performance - tuning hints / parameter doc Dieter Kasper
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
1 sibling, 1 reply; 31+ messages in thread
From: Josh Durgin @ 2012-08-29 17:37 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Dieter Kasper, ceph-devel
On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
> Nice results!
> (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>
>>> How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> I think you can try tuning these values:
>
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000

Increasing filestore_op_threads might help as well.

> ----- Original message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: ceph-devel@vger.kernel.org
> Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> Sent: Tuesday, 28 August 2012 19:48:42
> Subject: RBD performance - tuning hints
>
> Hi,
>
> on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> I can observe a pretty nice rados bench performance
> (see bench-rados.txt for details):
>
> Bandwidth (MB/sec): 961.710
> Max bandwidth (MB/sec): 1040
> Min bandwidth (MB/sec): 772
>
> Also the bandwidth performance generated with
> fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>
> .... is acceptable, e.g.
> fio_write_4m_16          795 MB/s
> fio_randwrite_8m_128     717 MB/s
> fio_randwrite_8m_16      714 MB/s
> fio_randwrite_2m_32      692 MB/s
>
> But the write IOPS seem to be limited to around 19k ...
> RBD                          4M      64k   (= optimal_io_size)
> fio_randread_512_128      53286    55925
> fio_randread_4k_128       51110    44382
> fio_randread_8k_128       30854    29938
> fio_randwrite_512_128     18888     2386
> fio_randwrite_512_64      18844     2582
> fio_randwrite_8k_64       17350     2445
> (...)
> fio_read_4k_128           10073    53151
> fio_read_4k_64             9500    39757
> fio_read_4k_32             9220    23650
> (...)
> fio_read_4k_16             9122    14322
> fio_write_4k_128           2190    14306
> fio_read_8k_32              706    13894
> fio_write_4k_64            2197    12297
> fio_write_8k_64            3563    11705
> fio_write_8k_128           3444    11219
>
> Any hints for tuning the IOPS (read and/or write) would be appreciated.
>
> How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>
> Kind Regards,
> -Dieter
^ permalink raw reply [flat|nested] 31+ messages in thread
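To set the IOPS figures quoted above next to the bandwidth figures, IOPS can be multiplied by the block size. The sketch below is pure arithmetic on two rows of the table. The helper name iops_to_mib is hypothetical, and MiB (2^20 bytes) is assumed as the unit, which may differ slightly from the MB/s that fio and rados bench report.

```shell
#!/bin/sh
# Convert an IOPS figure ($1) and a block size in bytes ($2) to MiB/s.
# Hypothetical helper for illustration only.
iops_to_mib() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1048576 }'
}

iops_to_mib 18888 512    # fio_randwrite_512_128, RBD 4M column -> 9.2
iops_to_mib 51110 4096   # fio_randread_4k_128, RBD 4M column  -> 199.6
```

Under these assumptions the 512-byte random-write case moves only about 9 MiB/s, far below the roughly 700 to 800 MB/s reported for multi-megabyte blocks, which would be consistent with a per-operation limit rather than a bandwidth limit.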
* Re: RBD performance - tuning hints / parameter doc
2012-08-29 17:37 ` Josh Durgin
@ 2012-08-29 19:29 ` Dieter Kasper
2012-08-29 22:34 ` Samuel Just
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-29 19:29 UTC (permalink / raw)
To: Josh Durgin
Cc: Alexandre DERUMIER, ceph-devel@vger.kernel.org, Dieter Kasper (KD)
Hi Josh,

thanks for the hint.
Can you please say a few words about the meaning of these parameters?
- filestore min/max sync interval = int/float? seconds? of what?
- filestore flusher = false
- filestore queue max ops = 10000
  what is 'one op'? queue in front of what?
- filestore op threads =
  what are useful values here?

- journal dio = true/false
- osd op threads =
- osd disk threads =

Kind Regards,
-Dieter

On Wed, Aug 29, 2012 at 07:37:36PM +0200, Josh Durgin wrote:
> On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
> > Nice results!
> > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> > I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
> >
> >>> How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> > I think you can try tuning these values:
> >
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
>
> Increasing filestore_op_threads might help as well.
>
> > ----- Original message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: ceph-devel@vger.kernel.org
> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > Sent: Tuesday, 28 August 2012 19:48:42
> > Subject: RBD performance - tuning hints
> >
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
> >
> > Bandwidth (MB/sec): 961.710
> > Max bandwidth (MB/sec): 1040
> > Min bandwidth (MB/sec): 772
> >
> > Also the bandwidth performance generated with
> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> >
> > .... is acceptable, e.g.
> > fio_write_4m_16          795 MB/s
> > fio_randwrite_8m_128     717 MB/s
> > fio_randwrite_8m_16      714 MB/s
> > fio_randwrite_2m_32      692 MB/s
> >
> > But the write IOPS seem to be limited to around 19k ...
> > RBD                          4M      64k   (= optimal_io_size)
> > fio_randread_512_128      53286    55925
> > fio_randread_4k_128       51110    44382
> > fio_randread_8k_128       30854    29938
> > fio_randwrite_512_128     18888     2386
> > fio_randwrite_512_64      18844     2582
> > fio_randwrite_8k_64       17350     2445
> > (...)
> > fio_read_4k_128           10073    53151
> > fio_read_4k_64             9500    39757
> > fio_read_4k_32             9220    23650
> > (...)
> > fio_read_4k_16             9122    14322
> > fio_write_4k_128           2190    14306
> > fio_read_8k_32              706    13894
> > fio_write_4k_64            2197    12297
> > fio_write_8k_64            3563    11705
> > fio_write_8k_128           3444    11219
> >
> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> >
> > How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> >
> > Kind Regards,
> > -Dieter
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / parameter doc 2012-08-29 19:29 ` RBD performance - tuning hints / parameter doc Dieter Kasper @ 2012-08-29 22:34 ` Samuel Just 2012-08-30 15:08 ` Dieter Kasper 0 siblings, 1 reply; 31+ messages in thread From: Samuel Just @ 2012-08-29 22:34 UTC (permalink / raw) To: Dieter Kasper; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org filestore [min|max] sync interval: Periodically, the filestore needs to quiesce writes and do a syncfs in order to create a consistent commit point up to which it can free journal entries. Syncing more frequently tends to reduce the time required to do the sync, and reduces the amount of data that needs to remain in the journal. Less frequent syncs would allow the backing filesystem to better coalesce small writes and metadata updates hopefully resulting in more efficient syncs. 'filestore max sync interval' defines the maximum time period between syncs, 'filestore min sync interval' defines the minimum time period between syncs. filestore flusher: The filestore flusher forces data from large writes to be written out using sync_file_range before the sync in order to (hopefully) reduce the cost of the eventual sync. In practice, disabling 'filestore flusher' seems to improve performance in some cases. filestore queue max ops: 'filestore queue max ops' defines the number of in progress ops the filestore will accept before blocking on queueing new ones. This mostly shouldn't have much of an effect on performance and should probably be ignored. filestore op threads: 'filestore op threads' defines the number of threads used to submit filesystem operations in parallel. journal dio: 'journal dio' enables using O_DIRECT for writing to the journal. This should usually be enabled. If possible, 'journal aio' should also be enabled to allow use of libaio to do asynchronous writes. 
osd op threads: 'osd op threads' defines the size of the thread pool used to service OSD operations such as client requests. Increasing this may increase the rate of request processing. osd disk threads: 'osd disk threads' defines the number of threads used to perform background disk intensive osd operations such as scrubbing and snap trimming. On Wed, Aug 29, 2012 at 12:29 PM, Dieter Kasper <d.kasper@kabelmail.de> wrote: > Hi Josh, > > thanks for the hint. > Can you please spend a view words about the meaing of these parameters ? > - filestore min/max sync interval = int/float ? seconds ? of what ? > - filestore flusher = false > - filestore queue max ops = 10000 > what is 'one op' ? queue in front of what ? > - filestore op threads = > what are useful values here ? > > - journal dio = true/false > - osd op threads = > - osd disk threads = > > > Kind Regards, > -Dieter > > > On Wed, Aug 29, 2012 at 07:37:36PM +0200, Josh Durgin wrote: >> On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote: >> > Nice results ! >> > (can you make same benchmark from a qemu-kvm guest with virtio-driver ? >> > I have made some bench some month ago with stephan priebe, and we never be able to have more than 20000iops, with a full ssd 3nodes cluster) >> > >> >>> How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full) >> > I think you can try to tune these values >> > >> > filestore max sync interval = 30 >> > filestore min sync interval = 29 >> > filestore flusher = false >> > filestore queue max ops = 10000 >> >> Increasing filestore_op_threads might help as well. 
>>
>> > ----- Original message -----
>> >
>> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
>> > To: ceph-devel@vger.kernel.org
>> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
>> > Sent: Tuesday, 28 August 2012 19:48:42
>> > Subject: RBD performance - tuning hints
>> >
>> > Hi,
>> >
>> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
>> > I can observe a pretty nice rados bench performance
>> > (see bench-rados.txt for details):
>> >
>> > Bandwidth (MB/sec): 961.710
>> > Max bandwidth (MB/sec): 1040
>> > Min bandwidth (MB/sec): 772
>> >
>> >
>> > Also the bandwidth performance generated with
>> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>> >
>> > .... is acceptable, e.g.
>> > fio_write_4m_16        795 MB/s
>> > fio_randwrite_8m_128   717 MB/s
>> > fio_randwrite_8m_16    714 MB/s
>> > fio_randwrite_2m_32    692 MB/s
>> >
>> >
>> > But the write IOPS seem to be limited to around 19k ...
>> >                          RBD 4M   64k (= optimal_io_size)
>> > fio_randread_512_128      53286   55925
>> > fio_randread_4k_128       51110   44382
>> > fio_randread_8k_128       30854   29938
>> > fio_randwrite_512_128     18888    2386
>> > fio_randwrite_512_64      18844    2582
>> > fio_randwrite_8k_64       17350    2445
>> > (...)
>> > fio_read_4k_128           10073   53151
>> > fio_read_4k_64             9500   39757
>> > fio_read_4k_32             9220   23650
>> > (...)
>> > fio_read_4k_16             9122   14322
>> > fio_write_4k_128           2190   14306
>> > fio_read_8k_32              706   13894
>> > fio_write_4k_64            2197   12297
>> > fio_write_8k_64            3563   11705
>> > fio_write_8k_128           3444   11219
>> >
>> >
>> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
>> >
>> > How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y %-full)
>> >
>> >
>> > Kind Regards,
>> > -Dieter
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
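
For readers who want to try the settings discussed above, here is a minimal sketch (Python, standard library only) that renders them as an [osd] section for ceph.conf. Placing them under [osd] is an assumption based on the ceph.conf posted earlier in this thread. The values are simply the ones suggested by Alexandre, plus Sam's remark that 'journal dio' should usually be enabled; they are not tested recommendations. `render_osd_section` is a hypothetical helper name, not part of Ceph.

```python
import configparser
import io

# Values quoted in this thread; treat them as starting points only.
suggested = {
    "filestore max sync interval": "30",
    "filestore min sync interval": "29",
    "filestore flusher": "false",
    "filestore queue max ops": "10000",
    "journal dio": "true",  # "should usually be enabled" per Sam's description
}


def render_osd_section(options):
    """Return the options formatted as an [osd] section (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser["osd"] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


print(render_osd_section(suggested))
```

The printed section can be merged by hand into an existing ceph.conf; the OSDs would presumably need a restart to pick up the change.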
* Re: RBD performance - tuning hints / parameter doc
2012-08-29 22:34 ` Samuel Just
@ 2012-08-30 15:08 ` Dieter Kasper
2012-08-30 20:39 ` Samuel Just
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 15:08 UTC (permalink / raw)
To: Samuel Just; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org

Samuel,

thank you very much for this explicit description!

As far as I understand, the journal acts as a ring buffer in front of the OSD.
Using time as a parameter to trigger sync might not be the best for
a dynamic storage subsystem. Under a high workload, e.g. 10/20 for min/max
might be optimal for 4 nodes with 10 OSDs each,
but not after adding 4 additional nodes.

Are there parameters to trigger the syncs to OSD
in relation to the fill level of the journal?
e.g.
filestore [min|max] sync percent:

Do not sync before min-% full; sync after max-% full

What would happen if I set "filestore [min|max] sync interval" to 999999?
Will the journal sync start at 100% full or at X%?
What is 'X' by default?
How can I set 'X'?

Best Regards,
-Dieter


On Thu, Aug 30, 2012 at 12:34:43AM +0200, Samuel Just wrote:
> filestore [min|max] sync interval:
>
> Periodically, the filestore needs to quiesce writes and do a syncfs in
> order to create a consistent commit point up to which it can free
> journal entries. Syncing more frequently tends to reduce the time
> required to do the sync, and reduces the amount of data that needs to
> remain in the journal. Less frequent syncs would allow the backing
> filesystem to better coalesce small writes and metadata updates
> hopefully resulting in more efficient syncs. 'filestore max sync
> interval' defines the maximum time period between syncs, 'filestore
> min sync interval' defines the minimum time period between syncs.
* Re: RBD performance - tuning hints / parameter doc
2012-08-30 15:08 ` Dieter Kasper
@ 2012-08-30 20:39 ` Samuel Just
0 siblings, 0 replies; 31+ messages in thread
From: Samuel Just @ 2012-08-30 20:39 UTC (permalink / raw)
To: Dieter Kasper; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org

Ah, those are just min and max. Sync is also triggered when the
journal hits the half-full mark. We could make the percentage
configurable in the future.
-Sam

On Thu, Aug 30, 2012 at 8:08 AM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
> Samuel,
>
> thank you very much for this explicit description!
>
> As far as I understand, the journal acts as a ring buffer in front of the OSD.
> Using time as a parameter to trigger sync might not be the best for
> a dynamic storage subsystem. Under a high workload, e.g. 10/20 for min/max
> might be optimal for 4 nodes with 10 OSDs each,
> but not after adding 4 additional nodes.
>
> Are there parameters to trigger the syncs to OSD
> in relation to the fill level of the journal?
> e.g.
> filestore [min|max] sync percent:
>
> Do not sync before min-% full; sync after max-% full
>
> What would happen if I set "filestore [min|max] sync interval" to 999999?
> Will the journal sync start at 100% full or at X%?
> What is 'X' by default?
> How can I set 'X'?
>
> Best Regards,
> -Dieter
>
>
> On Thu, Aug 30, 2012 at 12:34:43AM +0200, Samuel Just wrote:
>> filestore [min|max] sync interval:
>>
>> Periodically, the filestore needs to quiesce writes and do a syncfs in
>> order to create a consistent commit point up to which it can free
>> journal entries. Syncing more frequently tends to reduce the time
>> required to do the sync, and reduces the amount of data that needs to
>> remain in the journal. Less frequent syncs would allow the backing
>> filesystem to better coalesce small writes and metadata updates
>> hopefully resulting in more efficient syncs.
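
The trigger rule described in this exchange can be sketched as a small decision function. This is only a toy model in Python, not the filestore code: a sync is due once the max interval has elapsed, or earlier once the journal reaches the half-full mark. How 'filestore min sync interval' interacts with the half-full trigger is not stated in the thread, so it is deliberately left out. `sync_due` and its parameter names are hypothetical.

```python
def sync_due(elapsed_s, journal_fill, max_interval_s, fill_threshold=0.5):
    """Toy model of the sync trigger described in the thread.

    elapsed_s      -- seconds since the last sync
    journal_fill   -- fraction of the journal in use, 0.0 to 1.0
    max_interval_s -- 'filestore max sync interval'
    fill_threshold -- the half-full mark (assumed fixed at 0.5, per Sam's reply)
    """
    if elapsed_s >= max_interval_s:
        return True   # timer-driven sync
    if journal_fill >= fill_threshold:
        return True   # journal-pressure sync, independent of the timer
    return False


# With a huge interval (the 999999 case asked about), the half-full mark
# is what ends up triggering the sync in this model.
print(sync_due(elapsed_s=5.0, journal_fill=0.6, max_interval_s=999999.0))
print(sync_due(elapsed_s=5.0, journal_fill=0.1, max_interval_s=30.0))
```

Under this reading, setting both intervals to 999999 would not let the journal run to 100% full; the fill-based trigger would fire first.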
* Re: RBD performance - tuning hints
2012-08-29 8:50 ` Alexandre DERUMIER
2012-08-29 17:37 ` Josh Durgin
@ 2012-08-30 14:56 ` Dieter Kasper
2012-08-30 15:28 ` Alexandre DERUMIER
1 sibling, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 14:56 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org

Hi Alexandre,

with the 4 filestore parameters below, some fio values could be increased:
filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 10000

###### IOPS
fio_read_4k_64: 9373
fio_read_4k_128: 9939
fio_randwrite_8k_16: 12376
fio_randwrite_4k_16: 13315
fio_randwrite_512_32: 13660
fio_randwrite_8k_32: 17318
fio_randwrite_4k_32: 18057
fio_randwrite_8k_64: 19693
fio_randwrite_512_64: 20015 <<<
fio_randwrite_4k_64: 20024 <<<
fio_randwrite_8k_128: 20547 <<<
fio_randwrite_4k_128: 20839 <<<
fio_randwrite_512_128: 21417 <<<
fio_randread_8k_128: 48872
fio_randread_4k_128: 50002
fio_randread_512_128: 51202

###### MB/s
fio_randread_2m_32: 628
fio_read_4m_64: 630
fio_randread_8m_32: 633
fio_read_2m_32: 637
fio_read_4m_16: 640
fio_randread_4m_16: 652
fio_write_2m_32: 660
fio_randread_4m_32: 677
fio_read_4m_32: 678
(...)
fio_write_4m_64: 771
fio_randwrite_2m_64: 789
fio_write_8m_128: 796
fio_write_4m_32: 802
fio_randwrite_4m_128: 807 <<<
fio_randwrite_2m_32: 811 <<<
fio_write_2m_128: 833 <<<
fio_write_8m_64: 901 <<<

Best Regards,
-Dieter


On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> Nice results!
> (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> I ran some benchmarks a few months ago with stephan priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>
> >>How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y %-full)
> I think you can try to tune these values
>
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
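
To put the before/after figures side by side, here is a small sketch (Python) that computes the relative change for the tests present in both reports. It assumes the baseline is the "RBD 4M" column of the first report in this thread and that nothing else changed between the two runs; both are assumptions, so the percentages are indicative only. `percent_change` is a hypothetical helper.

```python
# IOPS copied from the two reports in this thread.
baseline = {  # first report, "RBD 4M" column (assumed comparable setup)
    "fio_randwrite_512_128": 18888,
    "fio_randwrite_512_64": 18844,
    "fio_randwrite_8k_64": 17350,
    "fio_randread_512_128": 53286,
    "fio_randread_4k_128": 51110,
    "fio_randread_8k_128": 30854,
    "fio_read_4k_128": 10073,
    "fio_read_4k_64": 9500,
}
tuned = {  # report above, with the four filestore settings applied
    "fio_randwrite_512_128": 21417,
    "fio_randwrite_512_64": 20015,
    "fio_randwrite_8k_64": 19693,
    "fio_randread_512_128": 51202,
    "fio_randread_4k_128": 50002,
    "fio_randread_8k_128": 48872,
    "fio_read_4k_128": 9939,
    "fio_read_4k_64": 9373,
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


changes = {name: percent_change(baseline[name], tuned[name]) for name in baseline}

# Print the tests from largest drop to largest gain.
for name, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{name:24s} {baseline[name]:6d} -> {tuned[name]:6d}  {delta:+6.1f}%")
```

In these numbers the small random writes gain while most of the read tests move only slightly, which would be consistent with the settings mainly affecting the write path.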
* Re: RBD performance - tuning hints
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
@ 2012-08-30 15:28 ` Alexandre DERUMIER
2012-08-30 15:33 ` Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 15:28 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel

Thanks for the report!

Compared with your first benchmark, is this with RBD 4M or 64K?

(How many SSDs per node?)

--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 15:28 ` Alexandre DERUMIER
@ 2012-08-30 15:33 ` Dieter Kasper
2012-08-30 15:46 ` Alexandre DERUMIER
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 15:33 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5048 bytes --]

On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> Thanks for the report!
>
> Compared with your first benchmark, is this with RBD 4M or 64K?
with 4MB (see attached config info)

Cheers,
-Dieter

>
> (How many SSDs per node?)
8x SSD, 200GB each

[-- Attachment #2: hwconf.txt --]
[-- Type: text/plain, Size: 26784 bytes --]

--- RX37-3c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-3 3.0.41-5.1-default #1 SMP Wed Aug 22 00:54:03 UTC 2012 (9c63123) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856332 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdm [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 38 C Blocks sent to initiator = 257379169992704 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 40 C Blocks sent to initiator = 238453816033280 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 43 C Blocks sent to initiator = 297650494636032 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 254438979665920 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 35 C Blocks sent to initiator = 238876987752448 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 259011676995584 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 41 C Blocks sent to initiator = 359638046343168 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 31 C Blocks sent to initiator = 247008082264064 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] 
deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdm on /data/osd.30 type xfs (rw,noatime) /dev/sdn on /data/osd.31 type xfs (rw,noatime) /dev/sdo on /data/osd.32 type xfs (rw,noatime) /dev/sdp on /data/osd.33 type xfs (rw,noatime) /dev/sdq on /data/osd.34 type xfs (rw,noatime) /dev/sdr on /data/osd.35 type xfs (rw,noatime) /dev/sds on /data/osd.36 type xfs (rw,noatime) /dev/sdt on /data/osd.37 type xfs (rw,noatime) --- RX37-4c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-4 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). MemTotal: 32856432 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdd [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sde [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdf [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdg [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdh [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdi [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdj [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdk Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 389173798240256 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 30 C Blocks sent to initiator = 286249688498176 Device: INTEL(R) SSD 910 
200GB Version: a411 Current Drive Temperature: 35 C Blocks sent to initiator = 220455000604672 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 38 C Blocks sent to initiator = 223169319272448 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 31 C Blocks sent to initiator = 232096593346560 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 264802534424576 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 27 C Blocks sent to initiator = 288896512425984 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 32 C Blocks sent to initiator = 282331621359616 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdd on /data/osd.40 type xfs (rw,noatime) /dev/sde on /data/osd.41 type xfs (rw,noatime) /dev/sdf on /data/osd.42 type xfs (rw,noatime) /dev/sdg on /data/osd.43 type xfs (rw,noatime) /dev/sdh on /data/osd.44 type xfs (rw,noatime) /dev/sdi on /data/osd.45 type xfs (rw,noatime) /dev/sdj on /data/osd.46 type xfs (rw,noatime) /dev/sdk on /data/osd.47 type xfs (rw,noatime) --- RX37-5c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-5 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). 
MemTotal: 74226012 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 247461838848000 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 38 C Blocks sent to initiator = 231320898764800 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 41 C Blocks sent to initiator = 290086906232832 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 32 C Blocks sent to initiator = 287719053852672 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 33 C Blocks sent to initiator = 243922265702400 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 35 C Blocks sent to initiator = 272285122428928 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 40 C Blocks sent to initiator = 279561266790400 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 29 C Blocks sent to initiator = 247978778427392 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] 
deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdo on /data/osd.50 type xfs (rw,noatime) /dev/sdp on /data/osd.51 type xfs (rw,noatime) /dev/sdq on /data/osd.52 type xfs (rw,noatime) /dev/sdr on /data/osd.53 type xfs (rw,noatime) /dev/sds on /data/osd.54 type xfs (rw,noatime) /dev/sdt on /data/osd.55 type xfs (rw,noatime) /dev/sdu on /data/osd.56 type xfs (rw,noatime) /dev/sdv on /data/osd.57 type xfs (rw,noatime) --- RX37-6c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-6 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). MemTotal: 32856344 kB Disk /dev/ram0: 2048 MB, 2048000000 bytes Disk /dev/ram1: 2048 MB, 2048000000 bytes Disk /dev/ram2: 2048 MB, 2048000000 bytes Disk /dev/ram3: 2048 MB, 2048000000 bytes Disk /dev/ram4: 2048 MB, 2048000000 bytes Disk /dev/ram5: 2048 MB, 2048000000 bytes Disk /dev/ram6: 2048 MB, 2048000000 bytes Disk /dev/ram7: 2048 MB, 2048000000 bytes [10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn [10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo [10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp [10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq [11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr [11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds [11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt [11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 41 C Blocks sent to initiator = 259148495192064 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 36 C Blocks sent to initiator = 
250183472381952 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 43 C Blocks sent to initiator = 232864704626688 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 46 C Blocks sent to initiator = 313614921629696 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 37 C Blocks sent to initiator = 269851218149376 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 34 C Blocks sent to initiator = 278551060283392 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 43 C Blocks sent to initiator = 267839076302848 Device: INTEL(R) SSD 910 200GB Version: a411 Current Drive Temperature: 39 C Blocks sent to initiator = 233988811653120 optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq /dev/sdn on /data/osd.60 type xfs (rw,noatime) /dev/sdo on /data/osd.61 type xfs (rw,noatime) /dev/sdp on /data/osd.62 type xfs (rw,noatime) /dev/sdq on /data/osd.63 type xfs (rw,noatime) /dev/sdr on /data/osd.64 type xfs (rw,noatime) /dev/sds on /data/osd.65 type xfs (rw,noatime) /dev/sdt on /data/osd.66 type xfs (rw,noatime) /dev/sdu on /data/osd.67 type xfs (rw,noatime) --- RX37-7c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-7 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 1.20 GHz (asserted by call to hardware). 
MemTotal: 32856344 kB optimal_io_size: 4194304 4194304 4194304 scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] noop deadline [cfq] --- RX37-8c -------------------------------------------------------------------- ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) Linux RX37-8 3.0.36-16-default #1 SMP Wed Jul 18 00:18:54 UTC 2012 (544e41f) x86_64 x86_64 x86_64 GNU/Linux model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz Logial CPUs: 12 current CPU frequency is 2.30 GHz (asserted by call to hardware). MemTotal: 65952088 kB optimal_io_size: scheduler: [noop] deadline cfq [noop] deadline cfq [noop] deadline cfq -------------------------------------------------------------------------------- dumped osdmap epoch 19 epoch 19 fsid 31dc8e8c-45cb-4b94-b581-a9258964f1a6 created 2012-08-29 22:08:58.870313 modifed 2012-08-29 22:09:50.084564 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 pool 3 'pbench' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 768 pgp_num 768 last_change 18 owner 0 max_osd 68 osd.30 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6800/24876 192.168.114.52:6800/24876 192.168.114.52:6801/24876 exists,up 0a9a6db3-1c0d-4d66-ac99-bd900076c42c osd.31 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6801/25090 192.168.114.52:6802/25090 192.168.114.52:6803/25090 exists,up 
0adab61b-c1c3-479f-b58e-42bec92bd5b0 osd.32 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6802/25276 192.168.114.52:6804/25276 192.168.114.52:6805/25276 exists,up 331bf096-d785-4ae8-b790-d746a0abb694 osd.33 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6803/25464 192.168.114.52:6806/25464 192.168.114.52:6807/25464 exists,up a1f9ea5b-e0db-474c-b7bc-6cb3d3a213a4 osd.34 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6804/25650 192.168.114.52:6808/25650 192.168.114.52:6809/25650 exists,up dcbe68e7-fef3-430d-a857-560db28de27f osd.35 up in weight 1 up_from 2 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6805/25838 192.168.114.52:6810/25838 192.168.114.52:6811/25838 exists,up ab1589d0-e725-4484-8f5d-f65bc5c64643 osd.36 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6806/26026 192.168.114.52:6812/26026 192.168.114.52:6813/26026 exists,up 2eea079f-bcfe-48a4-abb5-a15c7daf80ba osd.37 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6807/26218 192.168.114.52:6814/26218 192.168.114.52:6815/26218 exists,up 9822d872-79a6-4cd3-898f-2e905fbce44a osd.40 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6800/18525 192.168.114.53:6800/18525 192.168.114.53:6801/18525 exists,up 0f0c61ea-4d78-429c-9928-b3422ad2dec7 osd.41 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6801/18750 192.168.114.53:6802/18750 192.168.114.53:6803/18750 exists,up 3935c6a7-61ff-4c97-88b9-472051ba8b6c osd.42 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6802/18946 192.168.114.53:6804/18946 192.168.114.53:6805/18946 exists,up 3efc6383-5097-4e95-9af2-e0e7bc9ddc10 osd.43 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6803/19154 
192.168.114.53:6806/19154 192.168.114.53:6807/19154 exists,up cdb8cf82-077b-40c2-adbc-fae29ba41645 osd.44 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6804/19350 192.168.114.53:6808/19350 192.168.114.53:6809/19350 exists,up 5ab69e45-a73a-4cd4-9837-2d54fb4ea4ec osd.45 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6805/19546 192.168.114.53:6810/19546 192.168.114.53:6811/19546 exists,up ec3d2118-6f46-4ef8-a431-553710f33a18 osd.46 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6806/19766 192.168.114.53:6812/19766 192.168.114.53:6813/19766 exists,up dcd94df3-b679-46a6-b670-5269a29913c1 osd.47 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6807/19968 192.168.114.53:6814/19968 192.168.114.53:6815/19968 exists,up 41019d97-c4f3-4c8d-9189-bae642c31678 osd.50 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6800/3848 192.168.114.54:6800/3848 192.168.114.54:6801/3848 exists,up 0b9ebe8e-9cb8-440d-948e-d4c8aa16b407 osd.51 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6801/4061 192.168.114.54:6802/4061 192.168.114.54:6803/4061 exists,up 3c2e8031-d01d-4bf9-965e-1b77563d5f8f osd.52 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6802/4248 192.168.114.54:6804/4248 192.168.114.54:6805/4248 exists,up 4d641c3c-0a7a-4b20-b047-9042b61685bb osd.53 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6803/4446 192.168.114.54:6806/4446 192.168.114.54:6807/4446 exists,up e335a6e9-9c32-48c6-8f15-11aa84a6287d osd.54 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6804/4632 192.168.114.54:6808/4632 192.168.114.54:6809/4632 exists,up 16f3955c-9eee-442b-86d8-cbbc5938efbf osd.55 up in weight 1 up_from 6 up_thru 18 down_at 0 last_clean_interval [0,0) 
192.168.113.54:6805/4836 192.168.114.54:6810/4836 192.168.114.54:6811/4836 exists,up 83e59145-9ff8-4c0b-b066-2b2e4e9c9953 osd.56 up in weight 1 up_from 6 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6806/5029 192.168.114.54:6812/5029 192.168.114.54:6813/5029 exists,up dfdeb186-5c96-4466-b4d3-5f32fa712792 osd.57 up in weight 1 up_from 7 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6807/5351 192.168.114.54:6814/5351 192.168.114.54:6815/5351 exists,up adf7a484-b0f1-4bf7-a8e7-2c1e64dfb77f osd.60 up in weight 1 up_from 7 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6800/31038 192.168.114.55:6800/31038 192.168.114.55:6801/31038 exists,up e9b949c8-1b47-4749-9408-1e9f7b89b0e6 osd.61 up in weight 1 up_from 8 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6801/31257 192.168.114.55:6802/31257 192.168.114.55:6803/31257 exists,up 19fcad53-d951-4645-a6d5-7dad1deba6fb osd.62 up in weight 1 up_from 8 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6802/31449 192.168.114.55:6804/31449 192.168.114.55:6805/31449 exists,up 7e98db0e-2ae2-473d-9b03-798ec472b29b osd.63 up in weight 1 up_from 9 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6803/31641 192.168.114.55:6806/31641 192.168.114.55:6807/31641 exists,up 9abc714c-06e4-40ba-8afe-8465209e0272 osd.64 up in weight 1 up_from 9 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6804/31937 192.168.114.55:6808/31937 192.168.114.55:6809/31937 exists,up 6a20e4b1-d1e9-4f69-b903-b403136ddb1d osd.65 up in weight 1 up_from 10 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6805/32175 192.168.114.55:6810/32175 192.168.114.55:6811/32175 exists,up e95ad5b2-6866-4161-8060-781a31d7ece2 osd.66 up in weight 1 up_from 10 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6806/32487 192.168.114.55:6812/32487 192.168.114.55:6813/32487 exists,up f3126979-ecd6-45de-b0bf-54cb2b0af042 osd.67 up in weight 1 up_from 11 up_thru 18 
down_at 0 last_clean_interval [0,0) 192.168.113.55:6807/32679 192.168.114.55:6814/32679 192.168.114.55:6815/32679 exists,up 37d3f121-b6f4-4c6f-ac9b-30533e8fa60a ceph.conf ---content--- # global [global] # enable secure authentication auth supported = none # allow ourselves to open a lot of files #max open files = 1100000 max open files = 131072 # set log file log file = /ceph/log/$name.log # log_to_syslog = true # uncomment this line to log to syslog # set up pid files pid file = /var/run/ceph/$name.pid # If you want to run a IPv6 cluster, set this to true. Dual-stack isn't possible #ms bind ipv6 = true public network = 192.168.113.0/24 cluster network = 192.168.114.0/24 # monitors # You need at least one. You need at least three if you want to # tolerate any node failures. Always create an odd number. [mon] mon data = /ceph/$name # If you are using for example the RADOS Gateway and want to have your newly created # pools a higher replication level, you can set a default #osd pool default size = 3 # You can also specify a CRUSH rule for new pools # Wiki: http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH #osd pool default crush rule = 0 # Timing is critical for monitors, but if you want to allow the clocks to drift a # bit more, you can specify the max drift. #mon clock drift allowed = 1 # Tell the monitor to backoff from this warning for 30 seconds #mon clock drift warn backoff = 30 # logging, for debugging monitor crashes, in order of # their likelihood of being helpful :) #debug ms = 1 #debug mon = 20 #debug paxos = 20 #debug auth = 20 debug optracker = 0 [mon.0] host = RX37-3c mon addr = 192.168.113.52:6789 [mon.1] host = RX37-7c mon addr = 192.168.113.56:6789 [mon.2] host = RX37-8c mon addr = 192.168.113.57:6789 # mds # You need at least one. Define two to get a standby. [mds] # mds data = /ceph/$name # where the mds keeps it's secret encryption keys #keyring = /data/keyring.$name # mds logging to debug issues. 
#debug ms = 1 #debug mds = 20 debug optracker = 0 [mds.0] host = RX37-8c # osd # You need at least one. Two if you want data to be replicated. # Define as many as you like. [osd] # This is where the btrfs volume will be mounted. osd data = /data/$name # journal dio = true # osd op threads = 24 # osd disk threads = 24 # filestore op threads = 6 # filestore queue max ops = 24 filestore max sync interval = 30 filestore min sync interval = 29 filestore flusher = false filestore queue max ops = 10000 # Ideally, make this a separate disk or partition. A few # hundred MB should be enough; more if you have fast or many # disks. You can use a file under the osd data dir if need be # (e.g. /data/$name/journal), but it will be slower than a # separate disk or partition. # This is an example of a file-based journal. # osd journal = /ceph/$name/journal # osd journal size = 2048 # journal size, in megabytes # If you want to run the journal on a tmpfs, disable DirectIO #journal dio = false # You can change the number of recovery operations to speed up recovery # or slow it down if your machines can't handle it # osd recovery max active = 3 # osd logging to debug osd issues, in order of likelihood of being # helpful #debug ms = 1 #debug osd = 20 #debug filestore = 20 #debug journal = 20 debug optracker = 0 fstype = xfs [osd.30] host = RX37-3c devs = /dev/sdm osd journal = /dev/ram0 [osd.31] host = RX37-3c devs = /dev/sdn osd journal = /dev/ram1 [osd.32] host = RX37-3c devs = /dev/sdo osd journal = /dev/ram2 [osd.33] host = RX37-3c devs = /dev/sdp osd journal = /dev/ram3 [osd.34] host = RX37-3c devs = /dev/sdq osd journal = /dev/ram4 [osd.35] host = RX37-3c devs = /dev/sdr osd journal = /dev/ram5 [osd.36] host = RX37-3c devs = /dev/sds osd journal = /dev/ram6 [osd.37] host = RX37-3c devs = /dev/sdt osd journal = /dev/ram7 [osd.40] host = RX37-4c devs = /dev/sdd osd journal = /dev/ram0 [osd.41] host = RX37-4c devs = /dev/sde osd journal = /dev/ram1 [osd.42] host = RX37-4c devs = 
/dev/sdf osd journal = /dev/ram2 [osd.43] host = RX37-4c devs = /dev/sdg osd journal = /dev/ram3 [osd.44] host = RX37-4c devs = /dev/sdh osd journal = /dev/ram4 [osd.45] host = RX37-4c devs = /dev/sdi osd journal = /dev/ram5 [osd.46] host = RX37-4c devs = /dev/sdj osd journal = /dev/ram6 [osd.47] host = RX37-4c devs = /dev/sdk osd journal = /dev/ram7 [osd.50] host = RX37-5c devs = /dev/sdo osd journal = /dev/ram0 [osd.51] host = RX37-5c devs = /dev/sdp osd journal = /dev/ram1 [osd.52] host = RX37-5c devs = /dev/sdq osd journal = /dev/ram2 [osd.53] host = RX37-5c devs = /dev/sdr osd journal = /dev/ram3 [osd.54] host = RX37-5c devs = /dev/sds osd journal = /dev/ram4 [osd.55] host = RX37-5c devs = /dev/sdt osd journal = /dev/ram5 [osd.56] host = RX37-5c devs = /dev/sdu osd journal = /dev/ram6 [osd.57] host = RX37-5c devs = /dev/sdv osd journal = /dev/ram7 [osd.60] host = RX37-6c devs = /dev/sdn osd journal = /dev/ram0 [osd.61] host = RX37-6c devs = /dev/sdo osd journal = /dev/ram1 [osd.62] host = RX37-6c devs = /dev/sdp osd journal = /dev/ram2 [osd.63] host = RX37-6c devs = /dev/sdq osd journal = /dev/ram3 [osd.64] host = RX37-6c devs = /dev/sdr osd journal = /dev/ram4 [osd.65] host = RX37-6c devs = /dev/sds osd journal = /dev/ram5 [osd.66] host = RX37-6c devs = /dev/sdt osd journal = /dev/ram6 [osd.67] host = RX37-6c devs = /dev/sdu osd journal = /dev/ram7 devs = /dev/sdc [client.01] client hostname = RX37-7c ^ permalink raw reply [flat|nested] 31+ messages in thread
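The result names used in this message (fio_randwrite_4k_64 and so on) follow the --output=fio_${io}_${bs}_${threads} pattern of the fio command line quoted above. A minimal sketch of how such a sweep could be generated is shown here. The value lists are assumptions collected from the result names that appear in the thread (the full lists are not given), and fio_sweep_cmds is a hypothetical helper name. The function only prints the commands, it does not run them.

```shell
#!/bin/sh
# Print (do not run) one fio command per combination, mirroring the command
# line quoted in the thread. The value lists are assumptions taken from the
# result names that appear in the thread; fio_sweep_cmds is a made-up name.
fio_sweep_cmds() {
    dev=$1
    for io in read write randread randwrite; do
        for bs in 512 4k 8k 2m 4m 8m; do
            for threads in 16 32 64 128; do
                echo "fio --filename=$dev --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}"
            done
        done
    done
}

# Show the first few generated commands for review.
fio_sweep_cmds /dev/rbd1 | head -n 3
```

Piping the full list to sh would run the whole sweep; the write tests overwrite the target device, so this is only sensible on a scratch RBD image.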
* Re: RBD performance - tuning hints
2012-08-30 15:33 ` Dieter Kasper
@ 2012-08-30 15:46 ` Alexandre DERUMIER
2012-08-30 16:02 ` Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 15:46 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel

Thanks

>> 8x SSD, 200GB each

20000 iops seems pretty low, no ?

for @inktank:

Is there a bottleneck somewhere in ceph ?

I ask because I would like to know whether it scales when adding new nodes.

Has Inktank already done some random iops benchmarks ? (I always see sequential throughput benchmarks on the mailing list)

----- Original Message -----

From: "Dieter Kasper" <d.kasper@kabelmail.de>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org
Sent: Thursday, 30 August 2012 17:33:42
Subject: Re: RBD performance - tuning hints

On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> Thanks for the report !
>
> vs your first benchmark, it's with RBD 4M or 64K ?
with 4MB (see attached config info)

Cheers,
-Dieter

>
> (how much ssd by node?)
8x SSD, 200GB each
>
> ----- Original Message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, 30 August 2012 16:56:34
> Subject: Re: RBD performance - tuning hints
>
> Hi Alexandre,
>
> with the 4 filestore parameters below some fio values could be increased:
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000
>
> ###### IOPS
> fio_read_4k_64: 9373
> fio_read_4k_128: 9939
> fio_randwrite_8k_16: 12376
> fio_randwrite_4k_16: 13315
> fio_randwrite_512_32: 13660
> fio_randwrite_8k_32: 17318
> fio_randwrite_4k_32: 18057
> fio_randwrite_8k_64: 19693
> fio_randwrite_512_64: 20015 <<<
> fio_randwrite_4k_64: 20024 <<<
> fio_randwrite_8k_128: 20547 <<<
> fio_randwrite_4k_128: 20839 <<<
> fio_randwrite_512_128: 21417 <<<
> fio_randread_8k_128: 48872
> fio_randread_4k_128: 50002
> fio_randread_512_128: 51202
>
> ###### MB/s
> fio_randread_2m_32: 628
> fio_read_4m_64: 630
> fio_randread_8m_32: 633
> fio_read_2m_32: 637
> fio_read_4m_16: 640
> fio_randread_4m_16: 652
> fio_write_2m_32: 660
> fio_randread_4m_32: 677
> fio_read_4m_32: 678
> (...)
> fio_write_4m_64: 771
> fio_randwrite_2m_64: 789
> fio_write_8m_128: 796
> fio_write_4m_32: 802
> fio_randwrite_4m_128: 807 <<<
> fio_randwrite_2m_32: 811 <<<
> fio_write_2m_128: 833 <<<
> fio_write_8m_64: 901 <<<
>
> Best Regards,
> -Dieter
>
>
> On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> > Nice results !
> > (can you run the same benchmark from a qemu-kvm guest with the virtio driver ?
> > I made some benchmarks some months ago with Stephan Priebe, and we were never able to get more than 20000 iops with a full-SSD 3-node cluster)
> >
> > >>How can I set the variables when the Journal data have to go to the OSD ? (after X seconds and/or when Y %-full)
> > I think you can try to tune these values
> >
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> >
> > ----- Original Message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: ceph-devel@vger.kernel.org
> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > Sent: Tuesday, 28 August 2012 19:48:42
> > Subject: RBD performance - tuning hints
> >
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
> >
> > Bandwidth (MB/sec): 961.710
> > Max bandwidth (MB/sec): 1040
> > Min bandwidth (MB/sec): 772
> >
> > Also the bandwidth performance generated with
> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> >
> > .... is acceptable, e.g.
> > fio_write_4m_16 795 MB/s
> > fio_randwrite_8m_128 717 MB/s
> > fio_randwrite_8m_16 714 MB/s
> > fio_randwrite_2m_32 692 MB/s
> >
> > But, the write IOPS seem to be limited around 19k ...
> >                          RBD 4M    64k (= optimal_io_size)
> > fio_randread_512_128     53286     55925
> > fio_randread_4k_128      51110     44382
> > fio_randread_8k_128      30854     29938
> > fio_randwrite_512_128    18888      2386
> > fio_randwrite_512_64     18844      2582
> > fio_randwrite_8k_64      17350      2445
> > (...)
> > fio_read_4k_128          10073     53151
> > fio_read_4k_64            9500     39757
> > fio_read_4k_32            9220     23650
> > (...)
> > fio_read_4k_16            9122     14322
> > fio_write_4k_128          2190     14306
> > fio_read_8k_32             706     13894
> > fio_write_4k_64           2197     12297
> > fio_write_8k_64           3563     11705
> > fio_write_8k_128          3444     11219
> >
> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> >
> > How can I set the variables when the Journal data have to go to the OSD ? (after X seconds and/or when Y %-full)
> >
> > Kind Regards,
> > -Dieter

--

Alexandre Derumier
Systems and Network Engineer

Phone : 03 20 68 88 85
Fax : 03 20 68 90 88

45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris

^ permalink raw reply [flat|nested] 31+ messages in thread
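Whether 20000 iops is "low" can be put in perspective with a little arithmetic on the quoted numbers. The sketch below is only a rough estimate, under the assumption that fio keeps the queue full at the configured iodepth: Little's law (requests in flight = throughput x latency) then gives the mean time a request spends in flight, and IOPS times block size gives the payload bandwidth. The function names mean_latency_us and bandwidth_kib_s are made up for this illustration.

```shell
#!/bin/sh
# Mean time a request spends in flight, in microseconds, from Little's law.
# Assumes the queue really holds `iodepth` requests for the whole run.
mean_latency_us() {
    iodepth=$1
    iops=$2
    echo $(( iodepth * 1000000 / iops ))
}

# Payload bandwidth in KiB/s for an IOPS figure and a block size in bytes.
bandwidth_kib_s() {
    iops=$1
    bs_bytes=$2
    echo $(( iops * bs_bytes / 1024 ))
}

# fio_randwrite_4k_128 from the thread: 20839 IOPS at iodepth 128
mean_latency_us 128 20839     # prints 6142
bandwidth_kib_s 20839 4096    # prints 83356
```

Under these assumptions a 4k random write spends about 6 ms in flight and the run moves roughly 81 MiB/s, far below the several hundred MB/s the large-block tests reach on the same cluster. That would be consistent with a per-request cost rather than raw bandwidth limiting the small-block results, though the thread itself does not confirm where the time goes.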
* Re: RBD performance - tuning hints
2012-08-30 15:46 ` Alexandre DERUMIER
@ 2012-08-30 16:02 ` Dieter Kasper
2012-08-30 16:12 ` Alexandre DERUMIER
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 16:02 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org, Andreas Bluemle

On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
> Thanks
>
> >> 8x SSD, 200GB each
>
> 20000 iops seems pretty low, no ?
well, you have to compare
- a pure SSD (via PCIe or SAS-6G) vs.
- Ceph-Journal, which goes 2x over 10GbE with IP
  Client -> primary-copy -> 2nd-copy
  (= redundancy over Ethernet distance)

I'm curious about the answer from Inktank,
-Dieter

>
>
> for @inktank:
>
> Is there a bottleneck somewhere in ceph ?
Maybe "SimpleMessenger dispatching: cause of performance problems?"
from Thu, 16 Aug 2012 18:08:39 +0200
by <andreas.bluemle@itxperts.de>
can be an answer.
Especially if a small number of OSDs is used.

>
> I ask because I would like to know whether it scales when adding new nodes.
>
> Has Inktank already done some random iops benchmarks ? (I always see sequential throughput benchmarks on the mailing list)
>
>
> ----- Original Message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, 30 August 2012 17:33:42
> Subject: Re: RBD performance - tuning hints
>
> On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> > Thanks for the report !
> >
> > vs your first benchmark, it's with RBD 4M or 64K ?
> with 4MB (see attached config info)
>
> Cheers,
> -Dieter
>
> >
> > (how much ssd by node?)
> 8x SSD, 200GB each > > > > > > > > > ----- Mail original ----- > > > > De: "Dieter Kasper" <d.kasper@kabelmail.de> > > À: "Alexandre DERUMIER" <aderumier@odiso.com> > > Cc: ceph-devel@vger.kernel.org > > Envoyé: Jeudi 30 Août 2012 16:56:34 > > Objet: Re: RBD performance - tuning hints > > > > Hi Alexandre, > > > > with the 4 filestore parameter below some fio values could be increased: > > filestore max sync interval = 30 > > filestore min sync interval = 29 > > filestore flusher = false > > filestore queue max ops = 10000 > > > > ###### IOPS > > fio_read_4k_64: 9373 > > fio_read_4k_128: 9939 > > fio_randwrite_8k_16: 12376 > > fio_randwrite_4k_16: 13315 > > fio_randwrite_512_32: 13660 > > fio_randwrite_8k_32: 17318 > > fio_randwrite_4k_32: 18057 > > fio_randwrite_8k_64: 19693 > > fio_randwrite_512_64: 20015 <<< > > fio_randwrite_4k_64: 20024 <<< > > fio_randwrite_8k_128: 20547 <<< > > fio_randwrite_4k_128: 20839 <<< > > fio_randwrite_512_128: 21417 <<< > > fio_randread_8k_128: 48872 > > fio_randread_4k_128: 50002 > > fio_randread_512_128: 51202 > > > > ###### MB/s > > fio_randread_2m_32: 628 > > fio_read_4m_64: 630 > > fio_randread_8m_32: 633 > > fio_read_2m_32: 637 > > fio_read_4m_16: 640 > > fio_randread_4m_16: 652 > > fio_write_2m_32: 660 > > fio_randread_4m_32: 677 > > fio_read_4m_32: 678 > > (...) > > fio_write_4m_64: 771 > > fio_randwrite_2m_64: 789 > > fio_write_8m_128: 796 > > fio_write_4m_32: 802 > > fio_randwrite_4m_128: 807 <<< > > fio_randwrite_2m_32: 811 <<< > > fio_write_2m_128: 833 <<< > > fio_write_8m_64: 901 <<< > > > > Best Regards, > > -Dieter > > > > > > On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote: > > > Nice results ! > > > (can you make same benchmark from a qemu-kvm guest with virtio-driver ? 
> > > I have made some bench some month ago with stephan priebe, and we never be able to have more than 20000iops, with a full ssd 3nodes cluster) > > > > > > >>How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full) > > > I think you can try to tune these values > > > > > > filestore max sync interval = 30 > > > filestore min sync interval = 29 > > > filestore flusher = false > > > filestore queue max ops = 10000 > > > > > > > > > > > > ----- Mail original ----- > > > > > > De: "Dieter Kasper" <d.kasper@kabelmail.de> > > > À: ceph-devel@vger.kernel.org > > > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de> > > > Envoyé: Mardi 28 Août 2012 19:48:42 > > > Objet: RBD performance - tuning hints > > > > > > Hi, > > > > > > on my 4-node system (SSD + 10GbE, see bench-config.txt for details) > > > I can observe a pretty nice rados bench performance > > > (see bench-rados.txt for details): > > > > > > Bandwidth (MB/sec): 961.710 > > > Max bandwidth (MB/sec): 1040 > > > Min bandwidth (MB/sec): 772 > > > > > > > > > Also the bandwidth performance generated with > > > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads} > > > > > > .... is acceptable, e.g. > > > fio_write_4m_16 795 MB/s > > > fio_randwrite_8m_128 717 MB/s > > > fio_randwrite_8m_16 714 MB/s > > > fio_randwrite_2m_32 692 MB/s > > > > > > > > > But, the write IOPS seems to be limited around 19k ... > > > RBD 4M 64k (= optimal_io_size) > > > fio_randread_512_128 53286 55925 > > > fio_randread_4k_128 51110 44382 > > > fio_randread_8k_128 30854 29938 > > > fio_randwrite_512_128 18888 2386 > > > fio_randwrite_512_64 18844 2582 > > > fio_randwrite_8k_64 17350 2445 > > > (...) > > > fio_read_4k_128 10073 53151 > > > fio_read_4k_64 9500 39757 > > > fio_read_4k_32 9220 23650 > > > (...) 
> > > fio_read_4k_16 9122 14322 > > > fio_write_4k_128 2190 14306 > > > fio_read_8k_32 706 13894 > > > fio_write_4k_64 2197 12297 > > > fio_write_8k_64 3563 11705 > > > fio_write_8k_128 3444 11219 > > > > > > > > > Any hints for tuning the IOPS (read and/or write) would be appreciated. > > > > > > How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full) > > > > > > > > > Kind Regards, > > > -Dieter > > > > > > > > > > > > -- > > > > > > -- > > > > > > > > > > > > > > > > > > Alexandre D e rumier > > > > > > Ingénieur Systèmes et Réseaux > > > > > > > > > Fixe : 03 20 68 88 85 > > > > > > Fax : 03 20 68 90 88 > > > > > > > > > 45 Bvd du Général Leclerc 59100 Roubaix > > > 12 rue Marivaux 75002 Paris > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > -- > > > > -- > > > > > > > > > > > > Alexandre D e rumier > > > > Ingénieur Systèmes et Réseaux > > > > > > Fixe : 03 20 68 88 85 > > > > Fax : 03 20 68 90 88 > > > > > > 45 Bvd du Général Leclerc 59100 Roubaix > > 12 rue Marivaux 75002 Paris > > > > > > -- > > -- > > > > > > Alexandre D e rumier > > Ingénieur Systèmes et Réseaux > > > Fixe : 03 20 68 88 85 > > Fax : 03 20 68 90 88 > > > 45 Bvd du Général Leclerc 59100 Roubaix > 12 rue Marivaux 75002 Paris > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
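A rough way to read the tuned IOPS figures quoted above is to convert them into bandwidth and into an average per-request latency. The sketch below does that in Python. The function names (`parse_job`, `estimate`) are made up for illustration, the job names follow the `fio_${io}_${bs}_${threads}` output naming from the fio command in the thread, and the latency figure relies on Little's law under the assumption that fio really keeps `iodepth` requests in flight for the whole run. It is an estimate only, not something fio reported.

```python
# Illustrative helper (hypothetical names). The IOPS values are copied from
# the tuned results quoted in the thread; everything else is an estimate.

UNITS = {"k": 1024, "m": 1024 * 1024}


def parse_job(name):
    """Split 'fio_<rw>_<bs>_<iodepth>' into (rw, block size in bytes, iodepth)."""
    _, rw, bs, depth = name.split("_")
    suffix = bs[-1].lower()
    if suffix in UNITS:
        size = int(bs[:-1]) * UNITS[suffix]
    else:
        size = int(bs)  # plain byte count, e.g. "512"
    return rw, size, int(depth)


def estimate(name, iops):
    """Return (MiB/s, mean latency in ms), assuming the queue stays full."""
    _, size, depth = parse_job(name)
    mib_per_s = iops * size / (1024 * 1024)
    # Little's law: requests in flight = throughput * time per request
    latency_ms = depth / iops * 1000.0
    return mib_per_s, latency_ms


results = {
    "fio_randwrite_4k_16": 13315,
    "fio_randwrite_4k_32": 18057,
    "fio_randwrite_4k_64": 20024,
    "fio_randwrite_4k_128": 20839,
    "fio_randread_4k_128": 50002,
}

for name, iops in results.items():
    mib, lat = estimate(name, iops)
    print(f"{name:22s} {iops:6d} IOPS  {mib:6.1f} MiB/s  ~{lat:5.2f} ms per I/O")
```

Under these assumptions the 4k random-write rate of about 20000 IOPS corresponds to only about 80 MiB/s, roughly a tenth of what the same cluster delivers with 2m to 8m blocks, so the small-block limit looks like a per-operation cost rather than a bandwidth ceiling. Going from iodepth 16 to 128 raises the estimated time per request from roughly 1.2 ms to roughly 6.1 ms while IOPS barely move, which is what a saturated stage somewhere in the write path would look like.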
* Re: RBD performance - tuning hints
2012-08-30 16:02 ` Dieter Kasper
@ 2012-08-30 16:12 ` Alexandre DERUMIER
2012-08-30 16:16 ` Josh Durgin
2012-08-30 16:48 ` Dieter Kasper
0 siblings, 2 replies; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 16:12 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel, Andreas Bluemle

>>well, you have to compare
>>- a pure SSD (via PCIe or SAS-6G) vs.
>>- Ceph-Journal, which goes 2x over 10GbE with IP
>>  Client -> primary-copy -> 2nd-copy
>>  (= redundancy over Ethernet distance)

Sure, but the first OSD acks to the client before replicating to the other OSDs.

Client -> primary-copy -> 2nd-copy
       <-ack
          primary-copy -> 2nd-copy
                       -> 3rd-copy

Or am I wrong?

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:12 ` Alexandre DERUMIER
@ 2012-08-30 16:16 ` Josh Durgin
2012-08-31 7:46 ` Alexandre DERUMIER
2012-08-30 16:48 ` Dieter Kasper
1 sibling, 1 reply; 31+ messages in thread
From: Josh Durgin @ 2012-08-30 16:16 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Dieter Kasper, ceph-devel, Andreas Bluemle

On 08/30/2012 09:12 AM, Alexandre DERUMIER wrote:
>>> well, you have to compare
>>> - a pure SSD (via PCIe or SAS-6G) vs.
>>> - Ceph-Journal, which goes 2x over 10GbE with IP
>>>   Client -> primary-copy -> 2nd-copy
>>>   (= redundancy over Ethernet distance)
>
> Sure, but the first OSD acks to the client before replicating to the other OSDs.
>
> Client -> primary-copy -> 2nd-copy
>        <-ack
>           primary-copy -> 2nd-copy
>                        -> 3rd-copy
>
> Or am I wrong?

RBD waits for the data to be on disk on all replicas. It's pretty easy
to relax this to in memory on all replicas, but there's no option for
that right now.

Josh

^ permalink raw reply [flat|nested] 31+ messages in thread
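The difference between the two acknowledgement schemes discussed above (ack as soon as the primary has the data, versus waiting for all replicas as Josh describes) can be put into a toy latency model. The Python sketch below is only that: `write_latency_ms` is a made-up name, the per-hop network and commit times are placeholder numbers rather than measurements from this cluster, and it assumes the primary forwards the write to the other copies in parallel with its own commit and that every OSD commits equally fast.

```python
# Toy model of client-visible write latency. All timings are hypothetical.


def write_latency_ms(net_ms, commit_ms, replicas, ack_after_primary=False):
    """Estimate the latency of one replicated write as seen by the client.

    net_ms            one-way network latency per hop (assumed equal everywhere)
    commit_ms         time for one OSD to make the write durable (assumed equal)
    replicas          total number of copies, including the primary
    ack_after_primary True models the scheme Alexandre sketched; False models
                      waiting for every replica before acknowledging
    """
    if replicas < 1:
        raise ValueError("need at least one copy")
    # client -> primary, then the primary commits locally
    primary_path = net_ms + commit_ms
    if ack_after_primary or replicas == 1:
        return primary_path + net_ms  # ack travels back to the client
    # client -> primary, primary -> replica, replica commits, replica -> primary
    replica_path = net_ms + (net_ms + commit_ms + net_ms)
    # the primary can only acknowledge once the slowest path has finished
    return max(primary_path, replica_path) + net_ms


if __name__ == "__main__":
    for early_ack in (True, False):
        lat = write_latency_ms(net_ms=0.1, commit_ms=0.2, replicas=2,
                               ack_after_primary=early_ack)
        print(f"ack_after_primary={early_ack}: {lat:.2f} ms")
```

In this model, waiting for all replicas adds two extra network hops to every write (four hops plus one commit, instead of two hops plus one commit). That is the extra Ethernet distance Dieter pointed at. The model says nothing about where the time actually goes inside an OSD.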
* Re: RBD performance - tuning hints
2012-08-30 16:16 ` Josh Durgin
@ 2012-08-31 7:46 ` Alexandre DERUMIER
2012-08-31 8:11 ` Dietmar Maurer
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-31 7:46 UTC (permalink / raw)
To: Josh Durgin; +Cc: Dieter Kasper, ceph-devel, Andreas Bluemle

>>RBD waits for the data to be on disk on all replicas. It's pretty easy
>>to relax this to in memory on all replicas, but there's no option for
>>that right now.

OK, thanks, I missed that.

When you say disk, do you mean the journal?

^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: RBD performance - tuning hints
2012-08-31 7:46 ` Alexandre DERUMIER
@ 2012-08-31 8:11 ` Dietmar Maurer
2012-08-31 8:48 ` Mark Kirkwood
2012-08-31 10:58 ` RBD performance - tuning hints Jerker Nyberg
0 siblings, 2 replies; 31+ messages in thread
From: Dietmar Maurer @ 2012-08-31 8:11 UTC (permalink / raw)
To: Alexandre DERUMIER, Josh Durgin
Cc: Dieter Kasper, ceph-devel@vger.kernel.org, Andreas Bluemle

>>RBD waits for the data to be on disk on all replicas. It's pretty easy
>>to relax this to in memory on all replicas, but there's no option for
>>that right now.

I thought that is dangerous, because you can lose data?

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-31 8:11 ` Dietmar Maurer
@ 2012-08-31 8:48 ` Mark Kirkwood
2012-08-31 9:49 ` RBD performance - tuning hints / major slowdown effect(s) Dieter Kasper
2012-08-31 10:58 ` RBD performance - tuning hints Jerker Nyberg
1 sibling, 1 reply; 31+ messages in thread
From: Mark Kirkwood @ 2012-08-31 8:48 UTC (permalink / raw)
To: Dietmar Maurer
Cc: Alexandre DERUMIER, Josh Durgin, Dieter Kasper, ceph-devel@vger.kernel.org, Andreas Bluemle

On 31/08/12 20:11, Dietmar Maurer wrote:
>>> RBD waits for the data to be on disk on all replicas. It's pretty easy
>>> to relax this to in memory on all replicas, but there's no option for
>>> that right now.
> I thought that is dangerous, because you can lose data?

And it is not immediately obvious that this is the bottleneck - from
what I can see the 'sync' call being used (sync_file_range) is
extremely fast and is *not* the major slowdown effect...

Regards

Mark

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / major slowdown effect(s)
2012-08-31 8:48 ` Mark Kirkwood
@ 2012-08-31 9:49 ` Dieter Kasper
2012-08-31 10:16 ` Mark Kirkwood
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-31 9:49 UTC (permalink / raw)
To: Mark Kirkwood
Cc: Dietmar Maurer, Alexandre DERUMIER, Josh Durgin, ceph-devel@vger.kernel.org, Andreas Bluemle
Mark, Inktank,
OK, it is very likely that 'sync_file_range' is not the major slowdown 'culprit'.
But which areas (design, current implementation, protocol, interconnect, tuning parameters, ...)
would you rate as 'major slowdown effect(s)'?
Best Regards,
-Dieter
On Fri, Aug 31, 2012 at 08:48:34PM +1200, Mark Kirkwood wrote:
> On 31/08/12 20:11, Dietmar Maurer wrote:
> >>>RBD waits for the data to be on disk on all replicas. It's pretty easy
> >>>to relax this to in memory on all replicas, but there's no option for
> >>>that right now.
> >I thought that is dangerous, because you can lose data?
>
> And it is not immediately obvious that this is the bottleneck - from
> what I can see the 'sync' call being used (sync_file_range) is
> extremely fast and is *not* the major slowdown effect...
>
> Regards
>
> Mark
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / major slowdown effect(s)
2012-08-31 9:49 ` RBD performance - tuning hints / major slowdown effect(s) Dieter Kasper
@ 2012-08-31 10:16 ` Mark Kirkwood
0 siblings, 0 replies; 31+ messages in thread
From: Mark Kirkwood @ 2012-08-31 10:16 UTC (permalink / raw)
To: Dieter Kasper
Cc: Dietmar Maurer, Alexandre DERUMIER, Josh Durgin, ceph-devel@vger.kernel.org, Andreas Bluemle
Sorry Dieter,
Not trying to say "you are wrong" or anything like that - just trying to
add to the problem-solving body of knowledge: from what *I* have tried
out, the 'sync' issue does not look to be the bad guy here - although more
analysis is always welcome (usual story - my findings should be
confirmable by others doing similar tests)!
regards
Mark
On 31/08/12 21:49, Dieter Kasper wrote:
> Mark, Inktank,
>
> OK, it is very likely that 'sync_file_range' is not the major slowdown 'culprit'.
>
> But which areas (design, current implementation, protocol, interconnect, tuning parameters, ...)
> would you rate as 'major slowdown effect(s)'?
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: RBD performance - tuning hints
2012-08-31 8:11 ` Dietmar Maurer
2012-08-31 8:48 ` Mark Kirkwood
@ 2012-08-31 10:58 ` Jerker Nyberg
1 sibling, 0 replies; 31+ messages in thread
From: Jerker Nyberg @ 2012-08-31 10:58 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org
On Fri, 31 Aug 2012, Dietmar Maurer wrote:
>>> RBD waits for the data to be on disk on all replicas. It's pretty easy
>>> to relax this to in memory on all replicas, but there's no option for
>>> that right now.
>
> I thought that is dangerous, because you can lose data?
If the journal is put in a tmpfs, data written to the journal does
not hit disk. If all replicas fail, data will be lost. For some use
cases that might be OK, for example incremental backups, fast
scratch space, or volatile virtual machines.
Also see this previous discussion:
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg06070.html
--jerker
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:12 ` Alexandre DERUMIER
2012-08-30 16:16 ` Josh Durgin
@ 2012-08-30 16:48 ` Dieter Kasper
2012-08-30 18:10 ` Gregory Farnum
1 sibling, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 16:48 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org, Andreas Bluemle
[-- Attachment #1: Type: text/plain, Size: 10043 bytes --]
On Thu, Aug 30, 2012 at 06:12:11PM +0200, Alexandre DERUMIER wrote:
> >>well, you have to compare
> >>- a pure SSD (via PCIe or SAS-6G) vs.
> >>- Ceph-Journal, which goes 2x over 10GbE with IP
> >>  Client -> primary-copy -> 2nd-copy
> >>  (= redundancy over Ethernet distance)
>
> Sure, but the first OSD acks to the client before replicating to the other OSDs.
no
>
> Client -> primary-copy -> 2nd-copy
>        <-ack
>           primary-copy -> 2nd-copy
>                        -> 3rd-copy
>
> Or am I wrong ?
yes,
please have a look at the attached file: ceph-replication-acks.png
The client usually will continue on 'ACK' and not wait for the 'commit'.
BTW. all my journals are in RAM (/dev/ramX)
32x 2GB = 32GB of data with replica 2x
If "filestore min/max sync interval" is set to 99999999
data should 'never' be written to OSD
('never' at least during the tests if the written data is < 32GB)
In such a configuration only the Ceph-Code and the Interconnect (10GbE/IP) would be the limiting factors.
Cheers,
-Dieter
>
> ----- Original message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org, "Andreas Bluemle" <andreas.bluemle@itxperts.de>
> Sent: Thursday 30 August 2012 18:02:05
> Subject: Re: RBD performance - tuning hints
>
> On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
> > Thanks
> >
> > >> 8x SSD, 200GB each
> >
> > 20000 iops seem pretty low, no ?
> well, you have to compare
> - a pure SSD (via PCIe or SAS-6G) vs.
> - Ceph-Journal, which goes 2x over 10GbE with IP
>   Client -> primary-copy -> 2nd-copy
>   (= redundancy over Ethernet distance)
>
> I'm curious about the answer from Inktank,
>
> -Dieter
>
> > for @inktank:
> >
> > Is there a bottleneck somewhere in ceph ?
> Maybe "SimpleMessenger dispatching: cause of performance problems?"
> from Thu, 16 Aug 2012 18:08:39 +0200
> by <andreas.bluemle@itxperts.de>
> can be an answer.
> Especially if a small number of OSDs is used.
>
> > I ask because I would like to know whether it scales by adding new nodes.
> >
> > Has Inktank already done some random iops benchmarks ? (I always see sequential throughput benchmarks on the mailing list)
> >
> > ----- Original message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: "Alexandre DERUMIER" <aderumier@odiso.com>
> > Cc: ceph-devel@vger.kernel.org
> > Sent: Thursday 30 August 2012 17:33:42
> > Subject: Re: RBD performance - tuning hints
> >
> > On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> > > Thanks for the report !
> > >
> > > vs your first benchmark, it's with RBD 4M or 64K ?
> > with 4MB (see attached config info)
> >
> > Cheers,
> > -Dieter
> >
> > > (how many SSDs per node?)
> > 8x SSD, 200GB each
> >
> > > ----- Original message -----
> > >
> > > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > > To: "Alexandre DERUMIER" <aderumier@odiso.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Sent: Thursday 30 August 2012 16:56:34
> > > Subject: Re: RBD performance - tuning hints
> > >
> > > Hi Alexandre,
> > >
> > > with the 4 filestore parameters below some fio values could be increased:
> > > filestore max sync interval = 30
> > > filestore min sync interval = 29
> > > filestore flusher = false
> > > filestore queue max ops = 10000
> > >
> > > ###### IOPS
> > > fio_read_4k_64: 9373
> > > fio_read_4k_128: 9939
> > > fio_randwrite_8k_16: 12376
> > > fio_randwrite_4k_16: 13315
> > > fio_randwrite_512_32: 13660
> > > fio_randwrite_8k_32: 17318
> > > fio_randwrite_4k_32: 18057
> > > fio_randwrite_8k_64: 19693
> > > fio_randwrite_512_64: 20015 <<<
> > > fio_randwrite_4k_64: 20024 <<<
> > > fio_randwrite_8k_128: 20547 <<<
> > > fio_randwrite_4k_128: 20839 <<<
> > > fio_randwrite_512_128: 21417 <<<
> > > fio_randread_8k_128: 48872
> > > fio_randread_4k_128: 50002
> > > fio_randread_512_128: 51202
> > >
> > > ###### MB/s
> > > fio_randread_2m_32: 628
> > > fio_read_4m_64: 630
> > > fio_randread_8m_32: 633
> > > fio_read_2m_32: 637
> > > fio_read_4m_16: 640
> > > fio_randread_4m_16: 652
> > > fio_write_2m_32: 660
> > > fio_randread_4m_32: 677
> > > fio_read_4m_32: 678
> > > (...)
> > > fio_write_4m_64: 771
> > > fio_randwrite_2m_64: 789
> > > fio_write_8m_128: 796
> > > fio_write_4m_32: 802
> > > fio_randwrite_4m_128: 807 <<<
> > > fio_randwrite_2m_32: 811 <<<
> > > fio_write_2m_128: 833 <<<
> > > fio_write_8m_64: 901 <<<
> > >
> > > Best Regards,
> > > -Dieter
> > >
> > > On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> > > > Nice results !
> > > > (can you run the same benchmark from a qemu-kvm guest with the virtio driver ?
> > > > I ran some benchmarks some months ago with stephan priebe, and we were never able to get more than 20000 iops with a full-SSD 3-node cluster)
> > > >
> > > > >>How can I set the variables when the Journal data have to go to the OSD ? (after X seconds and/or when Y %-full)
> > > > I think you can try to tune these values
> > > >
> > > > filestore max sync interval = 30
> > > > filestore min sync interval = 29
> > > > filestore flusher = false
> > > > filestore queue max ops = 10000
> > > >
> > > > ----- Original message -----
> > > >
> > > > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > > > To: ceph-devel@vger.kernel.org
> > > > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > > > Sent: Tuesday 28 August 2012 19:48:42
> > > > Subject: RBD performance - tuning hints
> > > >
> > > > Hi,
> > > >
> > > > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > > > I can observe a pretty nice rados bench performance
> > > > (see bench-rados.txt for details):
> > > >
> > > > Bandwidth (MB/sec): 961.710
> > > > Max bandwidth (MB/sec): 1040
> > > > Min bandwidth (MB/sec): 772
> > > >
> > > > Also the bandwidth performance generated with
> > > > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> > > >
> > > > .... is acceptable, e.g.
> > > > fio_write_4m_16        795 MB/s
> > > > fio_randwrite_8m_128   717 MB/s
> > > > fio_randwrite_8m_16    714 MB/s
> > > > fio_randwrite_2m_32    692 MB/s
> > > >
> > > > But, the write IOPS seems to be limited around 19k ...
> > > >                        RBD 4M   64k (= optimal_io_size)
> > > > fio_randread_512_128    53286   55925
> > > > fio_randread_4k_128     51110   44382
> > > > fio_randread_8k_128     30854   29938
> > > > fio_randwrite_512_128   18888    2386
> > > > fio_randwrite_512_64    18844    2582
> > > > fio_randwrite_8k_64     17350    2445
> > > > (...)
> > > > fio_read_4k_128         10073   53151
> > > > fio_read_4k_64           9500   39757
> > > > fio_read_4k_32           9220   23650
> > > > (...)
> > > > fio_read_4k_16           9122   14322
> > > > fio_write_4k_128         2190   14306
> > > > fio_read_8k_32            706   13894
> > > > fio_write_4k_64          2197   12297
> > > > fio_write_8k_64          3563   11705
> > > > fio_write_8k_128         3444   11219
> > > >
> > > > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> > > >
> > > > How can I set the variables when the Journal data have to go to the OSD ? (after X seconds and/or when Y %-full)
> > > >
> > > > Kind Regards,
> > > > -Dieter
> > > >
> > > > --
> > > > Alexandre Derumier
> > > > Systems and Network Engineer
> > > > Landline: 03 20 68 88 85
> > > > Fax: 03 20 68 90 88
> > > > 45 Bvd du Général Leclerc 59100 Roubaix
> > > > 12 rue Marivaux 75002 Paris
--
Principal Consultant, Data Center Storage Architecture and Technology
FTS CTO
FUJITSU TECHNOLOGY SOLUTIONS GMBH
Mies-van-der-Rohe-Straße 8 / 4F
80807 München
Germany
Telephone: +49 89 62060 1898
Telefax: +49 89 62060 329 1898
Mobile: +49 170 8563173
Email: dieter.kasper@ts.fujitsu.com
Internet: http://ts.fujitsu.com
Company Details: http://ts.fujitsu.com/imprint.html
[-- Attachment #2: ceph-replication-acks.png --]
[-- Type: image/png, Size: 18144 bytes --]
^ permalink raw reply [flat|nested] 31+ messages in thread
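One way to read the 4k random-write figures quoted in this message is through Little's law (requests in flight = throughput x latency). The sketch below is not part of the thread and is only a rough reading aid. It assumes the trailing number in each result name is the fio iodepth (as the `--iodepth=$threads` and `--output=fio_${io}_${bs}_${threads}` arguments of the quoted command suggest) and that fio kept the queue full for the whole run. The function names are made up for illustration and belong to neither fio nor Ceph.

```python
# 4k random-write IOPS copied from the quoted results (run with the four
# filestore parameters applied), keyed by the assumed fio iodepth.
RANDWRITE_4K_IOPS = {16: 13315, 32: 18057, 64: 20024, 128: 20839}

BLOCK_BYTES = 4096  # 4k blocks, as in the result names


def mean_latency_ms(iodepth: int, iops: float) -> float:
    """Little's law: in-flight requests = IOPS * latency, solved for latency."""
    return iodepth / iops * 1000.0


def payload_mb_s(iops: float, block_bytes: int) -> float:
    """Payload throughput in MB/s (10**6 bytes) implied by an IOPS figure."""
    return iops * block_bytes / 1e6


for depth, iops in sorted(RANDWRITE_4K_IOPS.items()):
    print(
        f"iodepth {depth:>3}: {iops:>6} IOPS, "
        f"~{mean_latency_ms(depth, iops):.2f} ms per write, "
        f"{payload_mb_s(iops, BLOCK_BYTES):.1f} MB/s payload"
    )
```

Under these assumptions the implied per-write latency nearly doubles each time the queue depth doubles beyond 32, while IOPS barely move and the payload rate stays far below the MB/s figures reported for large blocks. That pattern is what a per-operation ceiling, rather than a bandwidth limit, would look like.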
* Re: RBD performance - tuning hints
2012-08-30 16:48 ` Dieter Kasper
@ 2012-08-30 18:10 ` Gregory Farnum
0 siblings, 0 replies; 31+ messages in thread
From: Gregory Farnum @ 2012-08-30 18:10 UTC (permalink / raw)
To: Dieter Kasper
Cc: Alexandre DERUMIER, ceph-devel@vger.kernel.org, Andreas Bluemle, Samuel Just
On Thu, Aug 30, 2012 at 9:48 AM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
> On Thu, Aug 30, 2012 at 06:12:11PM +0200, Alexandre DERUMIER wrote:
>> >>well, you have to compare
>> >>- a pure SSD (via PCIe or SAS-6G) vs.
>> >>- Ceph-Journal, which goes 2x over 10GbE with IP
>> >>  Client -> primary-copy -> 2nd-copy
>> >>  (= redundancy over Ethernet distance)
>>
>> Sure, but the first OSD acks to the client before replicating to the other OSDs.
> no
>
>> Client -> primary-copy -> 2nd-copy
>>        <-ack
>>           primary-copy -> 2nd-copy
>>                        -> 3rd-copy
>>
>> Or am I wrong ?
> yes,
> please have a look at the attached file: ceph-replication-acks.png
> The client usually will continue on 'ACK' and not wait for the 'commit'.
>
> BTW. all my journals are in RAM (/dev/ramX)
> 32x 2GB = 32GB of data with replica 2x
>
> If "filestore min/max sync interval" is set to 99999999
> data should 'never' be written to OSD
> ('never' at least during the tests if the written data is < 32GB)
I believe it actually will start syncing to disk when the journal is
half full (right, Sam?) - and even if it doesn't sync, there's a
reasonable chance that some of the data will be written out to disk in
the background (though that shouldn't slow anything down, of course).
:)
-Greg
>
> In such a configuration only the Ceph-Code and the Interconnect (10GbE/IP) would be the limiting factors.
>
> Cheers,
> -Dieter
^ permalink raw reply [flat|nested] 31+ messages in thread
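To get a feel for how quickly the half-full point Greg mentions could be reached in Dieter's setup, here is a back-of-the-envelope sketch. It is not from the thread and rests on several assumptions: the threshold is exactly half of the journal (Greg himself only "believes" this), writes spread evenly over all 32 journals, every client byte is journaled once per replica, and 1 GB means 10**9 bytes. The client write rates are illustrative values in the range of the fio results quoted in the thread, and the function name is hypothetical.

```python
def seconds_until_half_full(
    journal_count: int,
    journal_bytes: int,
    replicas: int,
    client_rate_bytes_s: float,
) -> float:
    """Time until the journals are half full.

    Assumes an even spread of writes over all journals and that each
    client byte lands in one journal per replica.
    """
    total_journal_bytes = journal_count * journal_bytes
    journaled_rate = client_rate_bytes_s * replicas
    return (total_journal_bytes / 2) / journaled_rate


GB = 10**9
MB = 10**6

# Dieter's setup as described: 32 journals of 2 GB each, 2 replicas.
for label, rate in (("large sequential writes", 800 * MB),
                    ("4k random writes", 85 * MB)):
    t = seconds_until_half_full(32, 2 * GB, 2, rate)
    print(f"{label} at {rate // MB} MB/s: half full after about {t:.0f} s")
```

If the assumptions hold, a 60-second run of large writes would cross the half-full mark well before it finishes, so the filestore would start syncing even with very large sync intervals, while a small-block random-write run of the same length would not get there.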
end of thread, other threads:[~2012-08-31 10:58 UTC | newest]
Thread overview: 31+ messages
2012-07-20 10:24 Ceph write performance George Shuklin
[not found] ` <20120720104150.GA16630@oder.kd-bie.de>
2012-07-20 10:48 ` George Shuklin
2012-07-20 11:49 ` Mark Nelson
2012-07-20 20:36 ` Ceph write performance on RAM-DISK Dieter Kasper
2012-07-20 21:28 ` Mark Nelson
2012-07-20 15:53 ` Ceph write performance Matthew Richardson
2012-07-20 16:37 ` Gregory Farnum
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
2012-08-28 19:04 ` Dieter Kasper
2012-08-29 8:50 ` Alexandre DERUMIER
2012-08-29 17:37 ` Josh Durgin
2012-08-29 19:29 ` RBD performance - tuning hints / parameter doc Dieter Kasper
2012-08-29 22:34 ` Samuel Just
2012-08-30 15:08 ` Dieter Kasper
2012-08-30 20:39 ` Samuel Just
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
2012-08-30 15:28 ` Alexandre DERUMIER
2012-08-30 15:33 ` Dieter Kasper
2012-08-30 15:46 ` Alexandre DERUMIER
2012-08-30 16:02 ` Dieter Kasper
2012-08-30 16:12 ` Alexandre DERUMIER
2012-08-30 16:16 ` Josh Durgin
2012-08-31 7:46 ` Alexandre DERUMIER
2012-08-31 8:11 ` Dietmar Maurer
2012-08-31 8:48 ` Mark Kirkwood
2012-08-31 9:49 ` RBD performance - tuning hints / major slowdown effect(s) Dieter Kasper
2012-08-31 10:16 ` Mark Kirkwood
2012-08-31 10:58 ` RBD performance - tuning hints Jerker Nyberg
2012-08-30 16:48 ` Dieter Kasper
2012-08-30 18:10 ` Gregory Farnum