* Ceph write performance
From: George Shuklin @ 2012-07-20 10:24 UTC (permalink / raw)
To: ceph-devel
Good day.
I've started to play with Ceph, and I've found some rather strange
performance issues. I'm not sure whether this is due to a Ceph limitation or my
bad setup.
Setup:
osd - xfs on ramdisk (only one osd)
mds - raid0 on 10 disks
mon - second raid0 on 10 disks
I've mounted the ceph share on localhost and run fio (randwrite, 4k, iodepth=32).
What I got: 1900 IOPS on writes (4k block, 1 GB span).
Normally fio shows about 200k IOPS when writing to the ramdisk.
Why is it so slow? I set everything up exactly as described here:
http://ceph.com/docs/master/start/quick-start/#start-the-ceph-cluster
(but with one osd).
Thanks.
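To put the two figures above side by side, here is a minimal Python sketch. It only uses the numbers quoted in the message (1900 IOPS, about 200k IOPS, iodepth=32). The latency estimate relies on Little's law and assumes fio kept all 32 requests in flight for the whole run, which the message does not confirm.

```python
# Figures quoted in the message above.
ceph_iops = 1900         # CephFS mount, 4k randwrite, iodepth=32
ramdisk_iops = 200_000   # raw ramdisk, "about 200k IOPS"
iodepth = 32

# How many times slower the CephFS mount is than the raw ramdisk.
slowdown = ramdisk_iops / ceph_iops

# Little's law: requests in flight = throughput * latency.
# Assumption: the queue stayed full (32 outstanding requests) during the run.
latency_ms = iodepth / ceph_iops * 1000

print(f"slowdown: about {slowdown:.0f}x")
print(f"implied latency per 4k write: about {latency_ms:.1f} ms")
```

Under that assumption each 4k write takes on the order of 17 ms end to end, which points at per-request overhead rather than at the speed of the backing device.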
* Re: Ceph write performance
From: George Shuklin @ 2012-07-20 10:48 UTC (permalink / raw)
To: Dieter Kasper (KD), ceph-devel
On 20.07.2012 14:41, Dieter Kasper (KD) wrote:
Good day.
Thank you for your attention.
ramdisk size ~70 GB (modprobe brd rd_size=70000000)
the journal seems to be on the same device as the storage
the size of the OSD was unchanged (... meaning I created it following the manual
and did not make any specific changes)
During the test I watched the IO load closely; IO on MDS/MON was insignificant
(zero most of the time, with a few very mild peaks).
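For reference, the brd module takes rd_size in units of 1024 bytes, so the "~70 GB" figure above can be converted as in this small Python sketch. The variable names are only illustrative.

```python
# Value passed to "modprobe brd rd_size=..."; brd interprets it in KiB.
rd_size_kib = 70_000_000

size_bytes = rd_size_kib * 1024
size_gb = size_bytes / 10**9    # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes

print(f"{size_bytes} bytes = {size_gb:.2f} GB = {size_gib:.2f} GiB")
```

Either way the ramdisk is far larger than the 16 GB test file in the fio job, so the device size itself should not limit the test.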
Just in case, configs:
ceph.conf:
[osd]
osd journal size = 1000
filestore xattr use omap = true
[mon.a]
host = srv1
mon addr = 192.168.0.1:6789
[osd.0]
host = srv1
[mds.a]
host = srv1
fio.ini:
[test]
blocksize=4k
filename=/media/test
size=16g
fallocate=posix
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=32
Thanks for the advice; I'll recheck with the new settings.
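Note that the posted job file says rw=randread, while the test described in the first message was randwrite; the file may simply come from a different run. A minimal Python sketch of a sanity check follows. It treats the fio job file as ini-style text readable by the standard configparser module, and the check_job helper is hypothetical, not part of fio.

```python
import configparser

# Job file exactly as posted in the message above.
FIO_JOB = """\
[test]
blocksize=4k
filename=/media/test
size=16g
fallocate=posix
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=32
"""

def check_job(text, expected_rw):
    """Return a list of sections whose rw= setting differs from expected_rw."""
    # allow_no_value accepts bare fio options; interpolation is not wanted here.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    problems = []
    for section in parser.sections():
        # fio falls back to sequential reads when rw= is not given.
        rw = parser.get(section, "rw", fallback="read")
        if rw != expected_rw:
            problems.append(f"[{section}] rw={rw}, expected rw={expected_rw}")
    return problems

print(check_job(FIO_JOB, "randwrite"))
```

Running a check like this before a benchmark makes it harder to compare a read result against a write result by accident.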
> George,
>
> please share more details of your config:
> - RAM size of your system
> - location of the journal
> - size of your OSD
>
> Can you try (just for the 1st test) to
> .. put the journal on RAM disk
> .. put the MDS on RAM disk
> .. put the MON on RAM disk
> .. use btrfs for OSD
>
> As an alternative, to isolate the bottleneck you can try to
> - run without a journal
> - use RBD instead of Ceph-FS
> + create a file system on top of /dev/rbd0
>
> Regards,
> Dieter Kasper
>
>
* Re: Ceph write performance
From: Mark Nelson @ 2012-07-20 11:49 UTC (permalink / raw)
To: George Shuklin; +Cc: Dieter Kasper (KD), ceph-devel
Hi George,
I think you may find that the limitation is in the filestore. It's
one of the things I've been working on trying to track down as I've seen
low performance on SSDs with small request sizes as well. You can use
the test_filestore_workloadgen to specifically test the filestore code
with small requests if you'd like. I'm not sure if it is included with
the binary distribution but it can be compiled if you download the src.
I think it's "make test_filestore_workloadgen" in the src directory.
Mark
* Re: Ceph write performance
From: Matthew Richardson @ 2012-07-20 15:53 UTC (permalink / raw)
To: ceph-devel
On 20/07/12 11:24, George Shuklin wrote:
> Good day.
>
> I've start to play with Ceph... And I found some kinda strange
> performance issues. I'm not sure if this is due ceph limitation or my
> bad setup.
I'm seeing a similar problem which looks like a potential bug, which
someone else seems to have already reported
(http://www.spinics.net/lists/ceph-devel/msg07335.html and
http://www.spinics.net/lists/ceph-devel/msg07691.html)
The problem only seems to hit for me when I do random writes - can you
try fio with sequential writes (rw=write) and see if your problem also
disappears? It might help confirm this as an issue.
Thanks,
Matthew
* Re: Ceph write performance
From: Gregory Farnum @ 2012-07-20 16:37 UTC (permalink / raw)
To: George Shuklin; +Cc: ceph-devel
On Fri, Jul 20, 2012 at 3:24 AM, George Shuklin <shuklin@selectel.ru> wrote:
> Good day.
>
> I've start to play with Ceph... And I found some kinda strange performance
> issues. I'm not sure if this is due ceph limitation or my bad setup.
>
> Setup:
>
> osd - xfs on ramdisk (only one osd)
> mds - raid0 on 10 disks
> mon - second raid0 on 10 disks
I'm not going to butt in on the performance discussion, but just FYI,
the MDS does not use any local storage — it puts everything on the
OSDs. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Ceph write performance on RAM-DISK
From: Dieter Kasper @ 2012-07-20 20:36 UTC (permalink / raw)
To: Mark Nelson; +Cc: George Shuklin, ceph-devel, Dieter Kasper (KD)
Hi Mark, George,
I can observe similarly poor performance on my system with fio on /dev/rbd1
#--- seq. write RBD
RX37-0:~ # dd if=/dev/zero of=/dev/rbd1 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 41.1819 s, 255 MB/s
#--- seq. read RBD
RX37-0:~ # dd of=/dev/zero if=/dev/rbd1 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40.9595 s, 256 MB/s
#--- seq. read /dev/ramX
RX37-0:~ # dd of=/dev/zero if=/dev/ram0 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 4.68389 s, 2.2 GB/s
Does ceph-osd/filestore 'eat' 90% of my resources/bandwidth/latency ?
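The "90%" estimate can be checked against the dd figures above with a short Python sketch. It recomputes the rates from the byte counts and times that dd printed, using decimal megabytes as dd does. Only a read was measured on /dev/ram0, so the comparison below is read against read.

```python
def dd_rate_mb_s(nbytes, seconds):
    """Throughput in decimal MB/s, the unit dd prints."""
    return nbytes / seconds / 1e6

NBYTES = 10_485_760_000  # 10000 blocks of 1024k

rbd_write = dd_rate_mb_s(NBYTES, 41.1819)
rbd_read = dd_rate_mb_s(NBYTES, 40.9595)
ram_read = dd_rate_mb_s(NBYTES, 4.68389)

# Share of the raw ramdisk read rate that is lost when going through RBD.
lost = 1 - rbd_read / ram_read

print(f"RBD write {rbd_write:.0f} MB/s, RBD read {rbd_read:.0f} MB/s, "
      f"ram0 read {ram_read:.0f} MB/s")
print(f"sequential read rate lost through RBD: {lost:.1%}")
```

For sequential 1 MB reads the loss comes out a little under 90%, so the estimate in the question is about right for this workload.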
RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
(...)
write: io=461592KB, bw=15371KB/s, iops=3842 , runt= 30030msec
write: io=5120.0MB, bw=893927KB/s, iops=223481 , runt= 5865msec (on /dev/ram0)
RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randread --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
(...)
read : io=698356KB, bw=23240KB/s, iops=5809 , runt= 30050msec
read : io=5120.0MB, bw=1631.1MB/s, iops=417559 , runt= 3139msec (on /dev/ram0)
RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=1m --size=5G --numjobs=4 --runtime=10 --group_reporting --name=file1
(...)
write: io=6377.0MB, bw=217125KB/s, iops=212 , runt= 30075msec
write: io=5120.0MB, bw=2114.9MB/s, iops=2114 , runt= 2421msec (on /dev/ram0)
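The three pairs of fio summary lines above can be compared mechanically. The following Python sketch extracts the iops= field with a regular expression and prints the ratio between /dev/ram0 and /dev/rbd1 for each pair. The regular expression is written only for the line format shown here, not for fio output in general.

```python
import re

IOPS = re.compile(r"iops=(\d+)")

def iops(line):
    """Pull the iops= value out of one fio summary line."""
    return int(IOPS.search(line).group(1))

# (label, line measured on /dev/rbd1, line measured on /dev/ram0)
PAIRS = [
    ("randwrite 4k",
     "write: io=461592KB, bw=15371KB/s, iops=3842 , runt= 30030msec",
     "write: io=5120.0MB, bw=893927KB/s, iops=223481 , runt= 5865msec"),
    ("randread 4k",
     "read : io=698356KB, bw=23240KB/s, iops=5809 , runt= 30050msec",
     "read : io=5120.0MB, bw=1631.1MB/s, iops=417559 , runt= 3139msec"),
    ("randwrite 1m",
     "write: io=6377.0MB, bw=217125KB/s, iops=212 , runt= 30075msec",
     "write: io=5120.0MB, bw=2114.9MB/s, iops=2114 , runt= 2421msec"),
]

ratios = [iops(ram) / iops(rbd) for _, rbd, ram in PAIRS]
for (label, _, _), ratio in zip(PAIRS, ratios):
    print(f"{label}: ram0 is about {ratio:.0f}x faster than rbd1")
```

The gap is much larger for 4k requests than for 1 MB requests, which is consistent with a fixed per-request cost dominating small I/O.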
Where is the bottleneck ?
What is filestore doing ?
How can I disable the journal and write only to the btrfs OSDs ? (as if they were SSDs)
How can I get better performance ?
Regards,
Dieter
P.S. I will try to get the "test_filestore_workloadgen"
[-- Attachment #2: ceph.conf --]
[-- Type: text/plain, Size: 864 bytes --]
[global]
pid file = /var/run/ceph/$name.pid
debug ms = 0
auth supported = cephx
keyring = /etc/ceph/keyring.client
[mon]
mon data = /tmp/mon$id
[mon.a]
host = localhost
mon addr = 127.0.0.1:6789
[osd]
journal dio = false
osd data = /data/$name
osd journal = /mnt/osd.journal/$name/journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name
# debug osd = 20
# debug ms = 1 ; message traffic
# debug filestore = 20 ; local object storage
# debug journal = 20 ; local journaling
# debug monc = 5 ; monitor interaction, startup
[osd.0]
host = localhost
btrfs devs = /dev/ram0
[osd.1]
host = localhost
btrfs devs = /dev/ram1
[osd.2]
host = localhost
btrfs devs = /dev/ram2
[mds.a]
host = localhost
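The attached ceph.conf relies on the $name and $id metavariables, where $name stands for the daemon type and id joined by a dot (for example osd.0). The Python sketch below expands the paths from the [osd] and [mon] sections for the daemons defined in this file. The expand helper is hypothetical and handles only $name, $type and $id; Ceph itself knows further metavariables.

```python
def expand(value, daemon_type, daemon_id):
    """Expand $name, $type and $id in a ceph.conf value (illustrative subset)."""
    name = f"{daemon_type}.{daemon_id}"
    return (value.replace("$name", name)
                 .replace("$type", daemon_type)
                 .replace("$id", daemon_id))

# Path settings copied from the [osd] section of the attached ceph.conf.
OSD_SETTINGS = {
    "osd data": "/data/$name",
    "osd journal": "/mnt/osd.journal/$name/journal",
    "keyring": "/etc/ceph/keyring.$name",
}

for osd_id in ("0", "1", "2"):
    for key, value in OSD_SETTINGS.items():
        print(f"osd.{osd_id}: {key} = {expand(value, 'osd', osd_id)}")

# The [mon] section uses $id instead of $name.
print("mon.a: mon data =", expand("/tmp/mon$id", "mon", "a"))
```

Each OSD therefore keeps its journal under /mnt/osd.journal, separate from its btrfs data directory under /data. The thread does not say what is mounted at /mnt/osd.journal.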
* Re: Ceph write performance on RAM-DISK
From: Mark Nelson @ 2012-07-20 21:28 UTC (permalink / raw)
To: Dieter Kasper; +Cc: George Shuklin, ceph-devel
On 07/20/2012 03:36 PM, Dieter Kasper wrote:
> Hi Mark, George,
>
> I can observe a similar (poor) Performance on my system with fio on /dev/rbd1
>
> #--- seq. write RBD
> RX37-0:~ # dd if=/dev/zero of=/dev/rbd1 bs=1024k count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 41.1819 s, 255 MB/s
>
> #--- seq. read RBD
> RX37-0:~ # dd of=/dev/zero if=/dev/rbd1 bs=1024k count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 40.9595 s, 256 MB/s
>
> #--- seq. read /dev/ramX
> RX37-0:~ # dd of=/dev/zero if=/dev/ram0 bs=1024k count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 4.68389 s, 2.2 GB/s
>
> Does ceph-osd/filestore 'eat' 90% of my resources/bandwidth/latency ?
>
Well, there are multiple layers involved here, so it's possible that
some of the code for RBD is playing a part in this too. I have
specifically seen slow performance with smaller requests with the
filestore though, so that is where I'm focusing my energy right now.
>
> RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
> (...)
> write: io=461592KB, bw=15371KB/s, iops=3842 , runt= 30030msec
> write: io=5120.0MB, bw=893927KB/s, iops=223481 , runt= 5865msec (on /dev/ram0)
>
>
> RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randread --bs=4k --size=5G --numjobs=64 --runtime=30 --group_reporting --name=file1
> (...)
> read : io=698356KB, bw=23240KB/s, iops=5809 , runt= 30050msec
> read : io=5120.0MB, bw=1631.1MB/s, iops=417559 , runt= 3139msec (on /dev/ram0)
>
>
> RX37-0:~ # fio --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=1m --size=5G --numjobs=4 --runtime=10 --group_reporting --name=file1
> (...)
> write: io=6377.0MB, bw=217125KB/s, iops=212 , runt= 30075msec
> write: io=5120.0MB, bw=2114.9MB/s, iops=2114 , runt= 2421msec (on /dev/ram0)
>
>
> Where is the bottleneck ?
> What is filestore doing ?
> How can I disable the journal and write only to the btrfs OSDs ? (like as they would be SSDs)
> How can I get better performance ?
Not yet sure where the bottleneck is, but we are actively looking into
it. Sadly, the process has been complicated by a potential bottleneck in
our test hardware that could be masking real issues in the code.
--
Mark Nelson
Performance Engineer
Inktank
* RBD performance - tuning hints
From: Dieter Kasper @ 2012-08-28 17:48 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org; +Cc: Dieter Kasper (KD)
Hi,
on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
I can observe a pretty nice rados bench performance
(see bench-rados.txt for details):
Bandwidth (MB/sec): 961.710
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772
Also the bandwidth performance generated with
fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
.... is acceptable, e.g.
fio_write_4m_16 795 MB/s
fio_randwrite_8m_128 717 MB/s
fio_randwrite_8m_16 714 MB/s
fio_randwrite_2m_32 692 MB/s
But the write IOPS seem to be limited to around 19k ...
                          RBD 4M     64k (= optimal_io_size)
fio_randread_512_128       53286     55925
fio_randread_4k_128        51110     44382
fio_randread_8k_128        30854     29938
fio_randwrite_512_128      18888      2386
fio_randwrite_512_64       18844      2582
fio_randwrite_8k_64        17350      2445
(...)
fio_read_4k_128            10073     53151
fio_read_4k_64              9500     39757
fio_read_4k_32              9220     23650
(...)
fio_read_4k_16              9122     14322
fio_write_4k_128            2190     14306
fio_read_8k_32               706     13894
fio_write_4k_64             2197     12297
fio_write_8k_64             3563     11705
fio_write_8k_128            3444     11219
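To relate these IOPS figures to the MB/s results earlier in the message, the Python sketch below multiplies IOPS by the block size encoded in each result name (fio_<pattern>_<blocksize>_<iodepth>). It uses the first numeric column of a few rows only. The to_mib_s helper and the assumption that "k" and "m" mean binary units are mine, not stated in the message.

```python
UNITS = {"k": 1024, "m": 1024**2}

def block_size(token):
    """Convert a block-size token such as '512', '4k' or '8m' to bytes."""
    if token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)

def to_mib_s(result_name, iops):
    """Bandwidth in MiB/s implied by an IOPS figure and the name's block size."""
    _, _pattern, bs, _depth = result_name.split("_")
    return iops * block_size(bs) / 2**20

# A few rows from the table above (first numeric column).
ROWS = {
    "fio_randread_4k_128": 51110,
    "fio_randwrite_512_128": 18888,
    "fio_randwrite_8k_64": 17350,
    "fio_write_4k_128": 2190,
}

for name, iops in ROWS.items():
    print(f"{name}: {iops} IOPS = {to_mib_s(name, iops):.1f} MiB/s")
```

Even the best small-block write rows move only a small fraction of the roughly 700 to 800 MB/s seen with large blocks, so the limit here is the request rate and not the bandwidth.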
Any hints for tuning the IOPS (read and/or write) would be appreciated.
How can I set the variables that control when the journal data has to go to the OSD ? (after X seconds and/or when Y % full)
Kind Regards,
-Dieter
[-- Attachment #2: bench-rados.txt --]
[-- Type: text/plain, Size: 1746 bytes --]
rados bench -p pbench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 228 212 847.857 848 0.042984 0.0684383
2 16 451 435 869.88 892 0.084162 0.0700566
3 16 695 679 905.223 976 0.057677 0.0695337
4 16 942 926 925.894 988 0.038117 0.0685357
5 16 1162 1146 916.7 880 0.042098 0.0693864
6 16 1400 1384 922.569 952 0.063983 0.0689167
7 16 1644 1628 930.189 976 0.065745 0.0684646
8 16 1895 1879 939.404 1004 0.051277 0.0677953
9 16 2145 2129 946.127 1000 0.055165 0.067354
(...)
57 16 13704 13688 960.47 996 0.082716 0.0665862
58 16 13954 13938 961.15 1000 0.041879 0.0665307
59 16 14194 14178 961.129 960 0.046657 0.0664642
2012-08-28 17:32:18.620060min lat: 0.030234 max lat: 3.17834 avg lat: 0.0664676
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
60 16 14446 14430 961.909 1008 0.051635 0.0664676
Total time run: 60.084612
Total writes made: 14446
Write size: 4194304
Bandwidth (MB/sec): 961.710
Stddev Bandwidth: 54.0809
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772
Average Latency: 0.0665337
Stddev Latency: 0.0800225
Max latency: 3.17834
Min latency: 0.030234
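The summary figures above are consistent with each other, as the following Python sketch shows. It recomputes the bandwidth from the total writes, write size and run time, and then compares the measured operation rate with what Little's law predicts from 16 concurrent writes and the average latency. Treating "MB" in the rados bench output as 2^20 bytes is an assumption that happens to reproduce the reported number.

```python
# Values copied from the rados bench summary above.
total_writes = 14446
write_size = 4194304      # bytes per write (4 MiB)
run_time = 60.084612      # seconds
concurrency = 16          # "Maintaining 16 concurrent writes"
avg_latency = 0.0665337   # seconds

# Bandwidth, assuming rados bench counts 1 MB as 2**20 bytes.
bandwidth_mb_s = total_writes * write_size / 2**20 / run_time

# Measured operation rate versus Little's law (ops/s = in flight / latency).
ops_per_s = total_writes / run_time
littles_ops_per_s = concurrency / avg_latency

print(f"bandwidth: {bandwidth_mb_s:.3f} MB/s (reported: 961.710)")
print(f"measured {ops_per_s:.1f} ops/s, Little's law predicts {littles_ops_per_s:.1f} ops/s")
```

With 4 MB objects and about 67 ms per write, 16 writes in flight are enough to reach roughly 960 MB/s. The same arithmetic shows why small requests need either far lower latency or far more concurrency to reach high IOPS.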
[-- Attachment #3: bench-config.txt --]
[-- Type: text/plain, Size: 26557 bytes --]
--- RX37-3c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-3 3.0.41-5.1-default #1 SMP Wed Aug 22 00:54:03 UTC 2012 (9c63123) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856332 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdm
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 198232151949312
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 39 C
Blocks sent to initiator = 188127268306944
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 42 C
Blocks sent to initiator = 241646771896320
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 33 C
Blocks sent to initiator = 202151376715776
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 186279543177216
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 200414079221760
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 40 C
Blocks sent to initiator = 301595287879680
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 30 C
Blocks sent to initiator = 190686448058368
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdm on /data/osd.30 type btrfs (rw,noatime)
/dev/sdn on /data/osd.31 type btrfs (rw,noatime)
/dev/sdo on /data/osd.32 type btrfs (rw,noatime)
/dev/sdp on /data/osd.33 type btrfs (rw,noatime)
/dev/sdq on /data/osd.34 type btrfs (rw,noatime)
/dev/sdr on /data/osd.35 type btrfs (rw,noatime)
/dev/sds on /data/osd.36 type btrfs (rw,noatime)
/dev/sdt on /data/osd.37 type btrfs (rw,noatime)
--- RX37-4c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-4 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856432 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdd
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sde
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdf
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdg
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdh
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdi
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdj
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdk
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 33 C
Blocks sent to initiator = 326270260871168
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 29 C
Blocks sent to initiator = 230247207272448
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 168513041858560
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 171904673513472
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 30 C
Blocks sent to initiator = 175995797635072
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 206814587125760
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 26 C
Blocks sent to initiator = 239652363567104
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 32 C
Blocks sent to initiator = 221954917269504
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdd on /data/osd.40 type btrfs (rw,noatime)
/dev/sde on /data/osd.41 type btrfs (rw,noatime)
/dev/sdf on /data/osd.42 type btrfs (rw,noatime)
/dev/sdg on /data/osd.43 type btrfs (rw,noatime)
/dev/sdh on /data/osd.44 type btrfs (rw,noatime)
/dev/sdi on /data/osd.45 type btrfs (rw,noatime)
/dev/sdj on /data/osd.46 type btrfs (rw,noatime)
/dev/sdk on /data/osd.47 type btrfs (rw,noatime)
--- RX37-5c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-5 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 74226012 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 195550280417280
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 177656960122880
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 41 C
Blocks sent to initiator = 238550402465792
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 31 C
Blocks sent to initiator = 226579741409280
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 33 C
Blocks sent to initiator = 186652383248384
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 219684389519360
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 39 C
Blocks sent to initiator = 223471107833856
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 29 C
Blocks sent to initiator = 190300723085312
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdo on /data/osd.50 type btrfs (rw,noatime)
/dev/sdp on /data/osd.51 type btrfs (rw,noatime)
/dev/sdq on /data/osd.52 type btrfs (rw,noatime)
/dev/sdr on /data/osd.53 type btrfs (rw,noatime)
/dev/sds on /data/osd.54 type btrfs (rw,noatime)
/dev/sdt on /data/osd.55 type btrfs (rw,noatime)
/dev/sdu on /data/osd.56 type btrfs (rw,noatime)
/dev/sdv on /data/osd.57 type btrfs (rw,noatime)
--- RX37-6c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-6 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856344 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 41 C
Blocks sent to initiator = 195597608943616
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 197325225984000
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 42 C
Blocks sent to initiator = 182463498289152
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 45 C
Blocks sent to initiator = 250870398713856
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 209343584665600
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 33 C
Blocks sent to initiator = 226728102330368
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 43 C
Blocks sent to initiator = 213839006138368
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 38 C
Blocks sent to initiator = 179503745728512
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdn on /data/osd.60 type btrfs (rw,noatime)
/dev/sdo on /data/osd.61 type btrfs (rw,noatime)
/dev/sdp on /data/osd.62 type btrfs (rw,noatime)
/dev/sdq on /data/osd.63 type btrfs (rw,noatime)
/dev/sdr on /data/osd.64 type btrfs (rw,noatime)
/dev/sds on /data/osd.65 type btrfs (rw,noatime)
/dev/sdt on /data/osd.66 type btrfs (rw,noatime)
/dev/sdu on /data/osd.67 type btrfs (rw,noatime)
--- RX37-7c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-7 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856344 kB
optimal_io_size: 4194304
65536
scheduler: [noop] deadline cfq
noop deadline [cfq]
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
--- RX37-8c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-8 3.0.36-16-default #1 SMP Wed Jul 18 00:18:54 UTC 2012 (544e41f) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 65952088 kB
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
--------------------------------------------------------------------------------
dumped osdmap epoch 15
epoch 15
fsid 7ab4662b-0575-4875-b59d-3bef85bb918d
created 2012-08-26 15:10:43.529294
modifed 2012-08-26 15:11:09.537529
flags
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0
max_osd 68
osd.30 up in weight 1 up_from 2 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6800/7884 192.168.114.52:6800/7884 192.168.114.52:6801/7884 exists,up f1912b6b-2abf-4eef-83e0-8657d78e48f8
osd.31 up in weight 1 up_from 4 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6801/8057 192.168.114.52:6802/8057 192.168.114.52:6803/8057 exists,up 2a254612-5242-4ae8-8ba7-3fe2eaa3eec5
osd.32 up in weight 1 up_from 3 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6802/8225 192.168.114.52:6804/8225 192.168.114.52:6805/8225 exists,up d41508ee-131c-47b8-9218-8f81bc7f7716
osd.33 up in weight 1 up_from 3 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6803/8415 192.168.114.52:6806/8415 192.168.114.52:6807/8415 exists,up 2e5a96be-ca3a-4c7d-8895-b61c07d858ac
osd.34 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6804/8588 192.168.114.52:6808/8588 192.168.114.52:6809/8588 exists,up 214d8253-ad9b-4268-ba67-365ae9bc612a
osd.35 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6805/8777 192.168.114.52:6810/8777 192.168.114.52:6811/8777 exists,up 9d328117-581a-4fdb-bee8-e373e74ee013
osd.36 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6806/8966 192.168.114.52:6812/8966 192.168.114.52:6813/8966 exists,up 0d046c45-ddd3-4c24-814c-36ace0632167
osd.37 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.52:6807/9155 192.168.114.52:6814/9155 192.168.114.52:6815/9155 exists,up 2265a65a-624c-4729-bf64-47850270b4a9
osd.40 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6800/14455 192.168.114.53:6800/14455 192.168.114.53:6801/14455 exists,up e782364f-c5ee-4181-98ba-8e8009a789db
osd.41 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6801/14639 192.168.114.53:6802/14639 192.168.114.53:6803/14639 exists,up 3154b1e5-e49a-417a-9b80-d64995afb2c8
osd.42 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6802/14816 192.168.114.53:6804/14816 192.168.114.53:6805/14816 exists,up a7cab833-70b2-4067-83a3-a8a7b7ccb1c2
osd.43 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6803/15013 192.168.114.53:6806/15013 192.168.114.53:6807/15013 exists,up 5afeea03-5a5d-4643-bbde-aaadda1bde01
osd.44 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6804/15190 192.168.114.53:6808/15190 192.168.114.53:6809/15190 exists,up 5b1a90a2-596d-40d4-b33d-cf74142f7e96
osd.45 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6805/15420 192.168.114.53:6810/15420 192.168.114.53:6811/15420 exists,up e4d85019-c8d4-4dc8-bec3-ceaddab60b99
osd.46 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6806/15623 192.168.114.53:6812/15623 192.168.114.53:6813/15623 exists,up 0a1b6a02-1b70-457f-9602-8f02e00d7ae1
osd.47 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.53:6807/15826 192.168.114.53:6814/15826 192.168.114.53:6815/15826 exists,up 7be9d381-8c38-440c-ae22-fc29a9349351
osd.50 up in weight 1 up_from 5 up_thru 12 down_at 0 last_clean_interval [0,0) 192.168.113.54:6800/1915 192.168.114.54:6800/1915 192.168.114.54:6801/1915 exists,up 7653343d-5602-4a6e-ac69-a278dab28c8c
osd.51 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6801/2155 192.168.114.54:6802/2155 192.168.114.54:6803/2155 exists,up a58bfbfb-8f21-4939-8ca1-b8209be68a30
osd.52 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6802/2322 192.168.114.54:6804/2322 192.168.114.54:6805/2322 exists,up 81daeb73-23f4-4f68-b56b-7d5a1b95e7e0
osd.53 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6803/2515 192.168.114.54:6806/2515 192.168.114.54:6807/2515 exists,up b3978c52-f689-45e8-9ee2-681e3bdeeeb2
osd.54 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6804/2702 192.168.114.54:6808/2702 192.168.114.54:6809/2702 exists,up 205b59d3-176a-4048-84c5-81dd181a8e71
osd.55 up in weight 1 up_from 5 up_thru 11 down_at 0 last_clean_interval [0,0) 192.168.113.54:6805/2889 192.168.114.54:6810/2889 192.168.114.54:6811/2889 exists,up cd4d82de-0da8-48b0-a54f-d1372b611958
osd.56 up in weight 1 up_from 6 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.54:6806/3082 192.168.114.54:6812/3082 192.168.114.54:6813/3082 exists,up b82b38a6-64ad-487a-899b-6c62ebe6bb13
osd.57 up in weight 1 up_from 6 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.54:6807/3269 192.168.114.54:6814/3269 192.168.114.54:6815/3269 exists,up c155cf46-d287-4439-a39e-ff80c22e0caa
osd.60 up in weight 1 up_from 7 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6800/30607 192.168.114.55:6800/30607 192.168.114.55:6801/30607 exists,up ab8370bf-c722-4eab-9842-498b6dfef765
osd.61 up in weight 1 up_from 7 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6801/30801 192.168.114.55:6802/30801 192.168.114.55:6803/30801 exists,up a189a254-efcd-4129-867e-384cd0765d19
osd.62 up in weight 1 up_from 8 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6802/30946 192.168.114.55:6804/30946 192.168.114.55:6805/30946 exists,up 2ddc9000-a5be-4c7f-9362-2c525b93db7f
osd.63 up in weight 1 up_from 9 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6803/31139 192.168.114.55:6806/31139 192.168.114.55:6807/31139 exists,up 5c4661fb-4c6c-411d-bf46-b4ead15a019a
osd.64 up in weight 1 up_from 9 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6804/31332 192.168.114.55:6808/31332 192.168.114.55:6809/31332 exists,up b67f9e9b-d0f6-41b9-ac7f-0c355950316f
osd.65 up in weight 1 up_from 10 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6805/31525 192.168.114.55:6810/31525 192.168.114.55:6811/31525 exists,up 9e179b5f-b0ca-4799-8b02-13fc3a78eda5
osd.66 up in weight 1 up_from 10 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6806/31814 192.168.114.55:6812/31814 192.168.114.55:6813/31814 exists,up e300060b-ac96-4ed0-9670-ffe3d7547a18
osd.67 up in weight 1 up_from 11 up_thru 14 down_at 0 last_clean_interval [0,0) 192.168.113.55:6807/32063 192.168.114.55:6814/32063 192.168.114.55:6815/32063 exists,up f87f78b3-61ba-403a-b012-ddd055ced47f
ceph.conf
---content---
# global
[global]
# authentication mode ('none' disables authentication)
auth supported = none
# allow ourselves to open a lot of files
#max open files = 1100000
max open files = 131072
# set log file
log file = /ceph/log/$name.log
# log_to_syslog = true # uncomment this line to log to syslog
# set up pid files
pid file = /var/run/ceph/$name.pid
# If you want to run an IPv6 cluster, set this to true. Dual-stack isn't possible
#ms bind ipv6 = true
public network = 192.168.113.0/24
cluster network = 192.168.114.0/24
# monitors
# You need at least one. You need at least three if you want to
# tolerate any node failures. Always create an odd number.
[mon]
mon data = /ceph/$name
# If you are using, for example, the RADOS Gateway and want your newly created
# pools to have a higher replication level, you can set a default
#osd pool default size = 3
# You can also specify a CRUSH rule for new pools
# Wiki: http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
#osd pool default crush rule = 0
# Timing is critical for monitors, but if you want to allow the clocks to drift a
# bit more, you can specify the max drift.
#mon clock drift allowed = 1
# Tell the monitor to back off from this warning for 30 seconds
#mon clock drift warn backoff = 30
# logging, for debugging monitor crashes, in order of
# their likelihood of being helpful :)
#debug ms = 1
#debug mon = 20
#debug paxos = 20
#debug auth = 20
debug optracker = 0
[mon.0]
host = RX37-3c
mon addr = 192.168.113.52:6789
[mon.1]
host = RX37-7c
mon addr = 192.168.113.56:6789
[mon.2]
host = RX37-8c
mon addr = 192.168.113.57:6789
# mds
# You need at least one. Define two to get a standby.
[mds]
# mds data = /ceph/$name
# where the mds keeps its secret encryption keys
#keyring = /data/keyring.$name
# mds logging to debug issues.
#debug ms = 1
#debug mds = 20
debug optracker = 0
[mds.0]
host = RX37-8c
# osd
# You need at least one. Two if you want data to be replicated.
# Define as many as you like.
[osd]
# This is where the btrfs volume will be mounted.
osd data = /data/$name
# journal dio = true
# osd op threads = 24
# osd disk threads = 24
# filestore op threads = 6
# filestore queue max ops = 24
# Ideally, make this a separate disk or partition. A few
# hundred MB should be enough; more if you have fast or many
# disks. You can use a file under the osd data dir if need be
# (e.g. /data/$name/journal), but it will be slower than a
# separate disk or partition.
# This is an example of a file-based journal.
# osd journal = /ceph/$name/journal
# osd journal size = 2048
# journal size, in megabytes
# If you want to run the journal on a tmpfs, disable DirectIO
#journal dio = false
# You can change the number of recovery operations to speed up recovery
# or slow it down if your machines can't handle it
# osd recovery max active = 3
# osd logging to debug osd issues, in order of likelihood of being
# helpful
#debug ms = 1
#debug osd = 20
#debug filestore = 20
#debug journal = 20
debug optracker = 0
fstype = btrfs
[osd.30]
host = RX37-3c
devs = /dev/sdm
osd journal = /dev/ram0
[osd.31]
host = RX37-3c
devs = /dev/sdn
osd journal = /dev/ram1
[osd.32]
host = RX37-3c
devs = /dev/sdo
osd journal = /dev/ram2
[osd.33]
host = RX37-3c
devs = /dev/sdp
osd journal = /dev/ram3
[osd.34]
host = RX37-3c
devs = /dev/sdq
osd journal = /dev/ram4
[osd.35]
host = RX37-3c
devs = /dev/sdr
osd journal = /dev/ram5
[osd.36]
host = RX37-3c
devs = /dev/sds
osd journal = /dev/ram6
[osd.37]
host = RX37-3c
devs = /dev/sdt
osd journal = /dev/ram7
[osd.40]
host = RX37-4c
devs = /dev/sdd
osd journal = /dev/ram0
[osd.41]
host = RX37-4c
devs = /dev/sde
osd journal = /dev/ram1
[osd.42]
host = RX37-4c
devs = /dev/sdf
osd journal = /dev/ram2
[osd.43]
host = RX37-4c
devs = /dev/sdg
osd journal = /dev/ram3
[osd.44]
host = RX37-4c
devs = /dev/sdh
osd journal = /dev/ram4
[osd.45]
host = RX37-4c
devs = /dev/sdi
osd journal = /dev/ram5
[osd.46]
host = RX37-4c
devs = /dev/sdj
osd journal = /dev/ram6
[osd.47]
host = RX37-4c
devs = /dev/sdk
osd journal = /dev/ram7
[osd.50]
host = RX37-5c
devs = /dev/sdo
osd journal = /dev/ram0
[osd.51]
host = RX37-5c
devs = /dev/sdp
osd journal = /dev/ram1
[osd.52]
host = RX37-5c
devs = /dev/sdq
osd journal = /dev/ram2
[osd.53]
host = RX37-5c
devs = /dev/sdr
osd journal = /dev/ram3
[osd.54]
host = RX37-5c
devs = /dev/sds
osd journal = /dev/ram4
[osd.55]
host = RX37-5c
devs = /dev/sdt
osd journal = /dev/ram5
[osd.56]
host = RX37-5c
devs = /dev/sdu
osd journal = /dev/ram6
[osd.57]
host = RX37-5c
devs = /dev/sdv
osd journal = /dev/ram7
[osd.60]
host = RX37-6c
devs = /dev/sdn
osd journal = /dev/ram0
[osd.61]
host = RX37-6c
devs = /dev/sdo
osd journal = /dev/ram1
[osd.62]
host = RX37-6c
devs = /dev/sdp
osd journal = /dev/ram2
[osd.63]
host = RX37-6c
devs = /dev/sdq
osd journal = /dev/ram3
[osd.64]
host = RX37-6c
devs = /dev/sdr
osd journal = /dev/ram4
[osd.65]
host = RX37-6c
devs = /dev/sds
osd journal = /dev/ram5
[osd.66]
host = RX37-6c
devs = /dev/sdt
osd journal = /dev/ram6
[osd.67]
host = RX37-6c
devs = /dev/sdu
osd journal = /dev/ram7
devs = /dev/sdc
[client.01]
client hostname = RX37-7c
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
@ 2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
2012-08-28 19:04 ` Dieter Kasper
2012-08-29 8:50 ` Alexandre DERUMIER
1 sibling, 1 reply; 31+ messages in thread
From: Smart Weblications GmbH - Florian Wiessner @ 2012-08-28 18:53 UTC (permalink / raw)
To: Dieter Kasper, ceph-devel
On 28.08.2012 19:48, Dieter Kasper wrote:
> Hi,
>
> on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> I can observe a pretty nice rados bench performance
> (see bench-rados.txt for details):
I'd like to know which 10GbE switch you used. Do you use 10GBASE-T?
--
Kind regards,
Florian Wiessner
Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila
phone: +49 9282 9638 200
fax: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de
--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register: HRB 3840, Amtsgericht Hof
*from German landlines; prices from mobile networks may differ
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-08-28 19:04 ` Dieter Kasper
0 siblings, 0 replies; 31+ messages in thread
From: Dieter Kasper @ 2012-08-28 19:04 UTC (permalink / raw)
To: Smart Weblications GmbH - Florian Wiessner; +Cc: ceph-devel@vger.kernel.org
On Tue, Aug 28, 2012 at 08:53:46PM +0200, Smart Weblications GmbH - Florian Wiessner wrote:
> On 28.08.2012 19:48, Dieter Kasper wrote:
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
>
> I'd like to know which 10GbE switch you used. Do you use 10GBASE-T?
http://www.brocade.com/products/all/switches/product-details/turboiron-24x-switch/index.page
Kind regards
Dieter Kasper
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-28 17:48 ` RBD performance - tuning hints Dieter Kasper
2012-08-28 18:53 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-08-29 8:50 ` Alexandre DERUMIER
2012-08-29 17:37 ` Josh Durgin
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
1 sibling, 2 replies; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-29 8:50 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel
Nice results!
(Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
I ran some benchmarks some months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>>How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
I think you can try tuning these values:
filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 10000
----- Original message -----
From: "Dieter Kasper" <d.kasper@kabelmail.de>
To: ceph-devel@vger.kernel.org
Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
Sent: Tuesday, 28 August 2012 19:48:42
Subject: RBD performance - tuning hints
Hi,
on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
I can observe a pretty nice rados bench performance
(see bench-rados.txt for details):
Bandwidth (MB/sec): 961.710
Max bandwidth (MB/sec): 1040
Min bandwidth (MB/sec): 772
Also the bandwidth performance generated with
fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
.... is acceptable, e.g.
fio_write_4m_16 795 MB/s
fio_randwrite_8m_128 717 MB/s
fio_randwrite_8m_16 714 MB/s
fio_randwrite_2m_32 692 MB/s
But the write IOPS seem to be limited to around 19k ...
test                    RBD 4M   RBD 64k   (= optimal_io_size)
fio_randread_512_128     53286     55925
fio_randread_4k_128      51110     44382
fio_randread_8k_128      30854     29938
fio_randwrite_512_128    18888      2386
fio_randwrite_512_64     18844      2582
fio_randwrite_8k_64      17350      2445
(...)
fio_read_4k_128          10073     53151
fio_read_4k_64            9500     39757
fio_read_4k_32            9220     23650
(...)
fio_read_4k_16            9122     14322
fio_write_4k_128          2190     14306
fio_read_8k_32             706     13894
fio_write_4k_64           2197     12297
fio_write_8k_64           3563     11705
fio_write_8k_128          3444     11219
Any hints for tuning the IOPS (read and/or write) would be appreciated.
How can I set the variables that control when the journal data has to go to the OSD (after X seconds and/or when Y % full)?
Kind Regards,
-Dieter
--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-29 8:50 ` Alexandre DERUMIER
@ 2012-08-29 17:37 ` Josh Durgin
2012-08-29 19:29 ` RBD performance - tuning hints / parameter doc Dieter Kasper
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
1 sibling, 1 reply; 31+ messages in thread
From: Josh Durgin @ 2012-08-29 17:37 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Dieter Kasper, ceph-devel
On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
> Nice results!
> (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> I ran some benchmarks some months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>
>>> How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> I think you can try tuning these values:
>
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000
Increasing filestore_op_threads might help as well.
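A minimal sketch of what that could look like in ceph.conf, assuming the option is set for all OSDs in the [osd] section. The value 6 is only illustrative (it is the value in the commented-out line of the posted config), not a tested recommendation. Ceph accepts option names written with spaces or with underscores, so 'filestore op threads' and 'filestore_op_threads' name the same setting:

```ini
[osd]
    # assumption: 6 is taken from the commented-out line in the posted
    # ceph.conf; the right value depends on the hardware and workload
    filestore op threads = 6
```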
> ----- Original message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: ceph-devel@vger.kernel.org
> Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> Sent: Tuesday, 28 August 2012 19:48:42
> Subject: RBD performance - tuning hints
>
> Hi,
>
> on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> I can observe a pretty nice rados bench performance
> (see bench-rados.txt for details):
>
> Bandwidth (MB/sec): 961.710
> Max bandwidth (MB/sec): 1040
> Min bandwidth (MB/sec): 772
>
>
> Also the bandwidth performance generated with
> fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>
> .... is acceptable, e.g.
> fio_write_4m_16 795 MB/s
> fio_randwrite_8m_128 717 MB/s
> fio_randwrite_8m_16 714 MB/s
> fio_randwrite_2m_32 692 MB/s
>
>
> But, the write IOPS seems to be limited around 19k ...
> RBD 4M 64k (= optimal_io_size)
> fio_randread_512_128 53286 55925
> fio_randread_4k_128 51110 44382
> fio_randread_8k_128 30854 29938
> fio_randwrite_512_128 18888 2386
> fio_randwrite_512_64 18844 2582
> fio_randwrite_8k_64 17350 2445
> (...)
> fio_read_4k_128 10073 53151
> fio_read_4k_64 9500 39757
> fio_read_4k_32 9220 23650
> (...)
> fio_read_4k_16 9122 14322
> fio_write_4k_128 2190 14306
> fio_read_8k_32 706 13894
> fio_write_4k_64 2197 12297
> fio_write_8k_64 3563 11705
> fio_write_8k_128 3444 11219
>
>
> Any hints for tuning the IOPS (read and/or write) would be appreciated.
>
> How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full)
>
>
> Kind Regards,
> -Dieter
>
>
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / parameter doc
2012-08-29 17:37 ` Josh Durgin
@ 2012-08-29 19:29 ` Dieter Kasper
2012-08-29 22:34 ` Samuel Just
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-29 19:29 UTC (permalink / raw)
To: Josh Durgin
Cc: Alexandre DERUMIER, ceph-devel@vger.kernel.org,
Dieter Kasper (KD)
Hi Josh,
thanks for the hint.
Can you please say a few words about the meaning of these parameters?
- filestore min/max sync interval = int/float ? seconds ? of what ?
- filestore flusher = false
- filestore queue max ops = 10000
what is 'one op' ? queue in front of what ?
- filestore op threads =
what are useful values here ?
- journal dio = true/false
- osd op threads =
- osd disk threads =
Kind Regards,
-Dieter
On Wed, Aug 29, 2012 at 07:37:36PM +0200, Josh Durgin wrote:
> On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
> > Nice results!
> > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> > I ran some benchmarks some months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
> >
> >>> How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> > I think you can try tuning these values:
> >
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
>
> Increasing filestore_op_threads might help as well.
>
> > ----- Original message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: ceph-devel@vger.kernel.org
> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > Sent: Tuesday, 28 August 2012 19:48:42
> > Subject: RBD performance - tuning hints
> >
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
> >
> > Bandwidth (MB/sec): 961.710
> > Max bandwidth (MB/sec): 1040
> > Min bandwidth (MB/sec): 772
> >
> >
> > Also the bandwidth performance generated with
> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> >
> > .... is acceptable, e.g.
> > fio_write_4m_16 795 MB/s
> > fio_randwrite_8m_128 717 MB/s
> > fio_randwrite_8m_16 714 MB/s
> > fio_randwrite_2m_32 692 MB/s
> >
> >
> > But, the write IOPS seems to be limited around 19k ...
> > RBD 4M 64k (= optimal_io_size)
> > fio_randread_512_128 53286 55925
> > fio_randread_4k_128 51110 44382
> > fio_randread_8k_128 30854 29938
> > fio_randwrite_512_128 18888 2386
> > fio_randwrite_512_64 18844 2582
> > fio_randwrite_8k_64 17350 2445
> > (...)
> > fio_read_4k_128 10073 53151
> > fio_read_4k_64 9500 39757
> > fio_read_4k_32 9220 23650
> > (...)
> > fio_read_4k_16 9122 14322
> > fio_write_4k_128 2190 14306
> > fio_read_8k_32 706 13894
> > fio_write_4k_64 2197 12297
> > fio_write_8k_64 3563 11705
> > fio_write_8k_128 3444 11219
> >
> >
> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> >
> > How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full)
> >
> >
> > Kind Regards,
> > -Dieter
> >
> >
> >
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / parameter doc
2012-08-29 19:29 ` RBD performance - tuning hints / parameter doc Dieter Kasper
@ 2012-08-29 22:34 ` Samuel Just
2012-08-30 15:08 ` Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Samuel Just @ 2012-08-29 22:34 UTC (permalink / raw)
To: Dieter Kasper; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org
filestore [min|max] sync interval:

Periodically, the filestore needs to quiesce writes and do a syncfs in order
to create a consistent commit point up to which it can free journal entries.
Syncing more frequently tends to reduce the time required to do the sync, and
reduces the amount of data that needs to remain in the journal. Less frequent
syncs would allow the backing filesystem to better coalesce small writes and
metadata updates, hopefully resulting in more efficient syncs. 'filestore max
sync interval' defines the maximum time period between syncs, 'filestore min
sync interval' defines the minimum time period between syncs.

filestore flusher:

The filestore flusher forces data from large writes to be written out using
sync_file_range before the sync in order to (hopefully) reduce the cost of the
eventual sync. In practice, disabling 'filestore flusher' seems to improve
performance in some cases.

filestore queue max ops:

'filestore queue max ops' defines the number of in progress ops the filestore
will accept before blocking on queueing new ones. This mostly shouldn't have
much of an effect on performance and should probably be ignored.

filestore op threads:

'filestore op threads' defines the number of threads used to submit filesystem
operations in parallel.

journal dio:

'journal dio' enables using O_DIRECT for writing to the journal. This should
usually be enabled. If possible, 'journal aio' should also be enabled to allow
use of libaio to do asynchronous writes.

osd op threads:

'osd op threads' defines the size of the thread pool used to service OSD
operations such as client requests. Increasing this may increase the rate of
request processing.

osd disk threads:

'osd disk threads' defines the number of threads used to perform background
disk intensive osd operations such as scrubbing and snap trimming.
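
To illustrate how the options described above might be collected into an
[osd] section of ceph.conf, here is a minimal Python sketch. The first four
values are the ones suggested earlier in this thread; the thread counts are
hypothetical placeholders chosen only for illustration, not recommendations,
and the helper name render_osd_section is made up for this example.

```python
import configparser
import io

# The first four values are the ones suggested in this thread.
# The thread counts are hypothetical placeholders, not recommendations.
osd_settings = {
    "filestore max sync interval": "30",
    "filestore min sync interval": "29",
    "filestore flusher": "false",
    "filestore queue max ops": "10000",
    "filestore op threads": "4",  # assumption
    "journal dio": "true",
    "journal aio": "true",
    "osd op threads": "4",        # assumption
    "osd disk threads": "1",      # assumption
}


def render_osd_section(settings):
    """Render the settings as an [osd] section in INI syntax."""
    parser = configparser.ConfigParser()
    parser["osd"] = settings
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


ceph_conf_fragment = render_osd_section(osd_settings)
print(ceph_conf_fragment)
```

The printed fragment could be pasted into the [osd] section of an existing
ceph.conf; whether any of these values help depends on the workload and
should be checked with a benchmark run before and after the change.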
On Wed, Aug 29, 2012 at 12:29 PM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
> Hi Josh,
>
> thanks for the hint.
> Can you please spend a few words on the meaning of these parameters?
> - filestore min/max sync interval = int/float ? seconds ? of what ?
> - filestore flusher = false
> - filestore queue max ops = 10000
> what is 'one op' ? queue in front of what ?
> - filestore op threads =
> what are useful values here ?
>
> - journal dio = true/false
> - osd op threads =
> - osd disk threads =
>
>
> Kind Regards,
> -Dieter
>
>
> On Wed, Aug 29, 2012 at 07:37:36PM +0200, Josh Durgin wrote:
>> On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
>> > Nice results!
>> > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
>> > I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>> >
>> >>> How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y %-full)
>> > I think you can try to tune these values
>> >
>> > filestore max sync interval = 30
>> > filestore min sync interval = 29
>> > filestore flusher = false
>> > filestore queue max ops = 10000
>>
>> Increasing filestore_op_threads might help as well.
>>
>> > ----- Original Message -----
>> >
>> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
>> > To: ceph-devel@vger.kernel.org
>> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
>> > Sent: Tuesday, 28 August 2012 19:48:42
>> > Subject: RBD performance - tuning hints
>> >
>> > Hi,
>> >
>> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
>> > I can observe a pretty nice rados bench performance
>> > (see bench-rados.txt for details):
>> >
>> > Bandwidth (MB/sec): 961.710
>> > Max bandwidth (MB/sec): 1040
>> > Min bandwidth (MB/sec): 772
>> >
>> >
>> > Also the bandwidth performance generated with
>> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>> >
>> > .... is acceptable, e.g.
>> > fio_write_4m_16 795 MB/s
>> > fio_randwrite_8m_128 717 MB/s
>> > fio_randwrite_8m_16 714 MB/s
>> > fio_randwrite_2m_32 692 MB/s
>> >
>> >
>> > But, the write IOPS seems to be limited around 19k ...
>> >                          RBD 4M   RBD 64k (= optimal_io_size)
>> > fio_randread_512_128      53286     55925
>> > fio_randread_4k_128       51110     44382
>> > fio_randread_8k_128       30854     29938
>> > fio_randwrite_512_128     18888      2386
>> > fio_randwrite_512_64      18844      2582
>> > fio_randwrite_8k_64       17350      2445
>> > (...)
>> > fio_read_4k_128           10073     53151
>> > fio_read_4k_64             9500     39757
>> > fio_read_4k_32             9220     23650
>> > (...)
>> > fio_read_4k_16             9122     14322
>> > fio_write_4k_128           2190     14306
>> > fio_read_8k_32              706     13894
>> > fio_write_4k_64            2197     12297
>> > fio_write_8k_64            3563     11705
>> > fio_write_8k_128           3444     11219
>> >
>> >
>> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
>> >
>> > How can I set the variables for when the journal data has to go to the OSD? (after X seconds and/or when Y %-full)
>> >
>> >
>> > Kind Regards,
>> > -Dieter
>> >
>> >
>> >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-29 8:50 ` Alexandre DERUMIER
2012-08-29 17:37 ` Josh Durgin
@ 2012-08-30 14:56 ` Dieter Kasper
2012-08-30 15:28 ` Alexandre DERUMIER
1 sibling, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 14:56 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org
Hi Alexandre,
with the four filestore parameters below, some fio values improved:
filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 10000
###### IOPS
fio_read_4k_64: 9373
fio_read_4k_128: 9939
fio_randwrite_8k_16: 12376
fio_randwrite_4k_16: 13315
fio_randwrite_512_32: 13660
fio_randwrite_8k_32: 17318
fio_randwrite_4k_32: 18057
fio_randwrite_8k_64: 19693
fio_randwrite_512_64: 20015 <<<
fio_randwrite_4k_64: 20024 <<<
fio_randwrite_8k_128: 20547 <<<
fio_randwrite_4k_128: 20839 <<<
fio_randwrite_512_128: 21417 <<<
fio_randread_8k_128: 48872
fio_randread_4k_128: 50002
fio_randread_512_128: 51202
###### MB/s
fio_randread_2m_32: 628
fio_read_4m_64: 630
fio_randread_8m_32: 633
fio_read_2m_32: 637
fio_read_4m_16: 640
fio_randread_4m_16: 652
fio_write_2m_32: 660
fio_randread_4m_32: 677
fio_read_4m_32: 678
(...)
fio_write_4m_64: 771
fio_randwrite_2m_64: 789
fio_write_8m_128: 796
fio_write_4m_32: 802
fio_randwrite_4m_128: 807 <<<
fio_randwrite_2m_32: 811 <<<
fio_write_2m_128: 833 <<<
fio_write_8m_64: 901 <<<
Best Regards,
-Dieter
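
The size of the gain can be checked with simple arithmetic. The sketch below
compares the random-write IOPS reported above with the figures from the first
benchmark in this thread. It assumes the baseline is the RBD 4M column of that
first benchmark and that both runs are otherwise comparable; each figure is a
single run, so small differences may be noise.

```python
# Random-write IOPS taken from this thread.
# Assumption: the baseline is the "RBD 4M" column of the first benchmark.
baseline_iops = {
    "fio_randwrite_512_128": 18888,
    "fio_randwrite_512_64": 18844,
    "fio_randwrite_8k_64": 17350,
}
tuned_iops = {
    "fio_randwrite_512_128": 21417,
    "fio_randwrite_512_64": 20015,
    "fio_randwrite_8k_64": 19693,
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


gains = {
    name: percent_change(baseline_iops[name], tuned_iops[name])
    for name in baseline_iops
}

for name, gain in gains.items():
    print(f"{name}: {baseline_iops[name]} -> {tuned_iops[name]} IOPS ({gain:+.1f}%)")
```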
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / parameter doc
2012-08-29 22:34 ` Samuel Just
@ 2012-08-30 15:08 ` Dieter Kasper
2012-08-30 20:39 ` Samuel Just
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 15:08 UTC (permalink / raw)
To: Samuel Just; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org
Samuel,
thank you very much for this explicit description!
As far as I understand, the journal acts as a ring buffer in front of the OSD.
Using time as the parameter that triggers a sync might not be the best choice
for a dynamic storage subsystem. Under a high workload, e.g. 10/20 for min/max
might be optimal for 4 nodes with 10 OSDs each,
but not after adding 4 additional nodes.
Are there parameters to trigger the syncs to the OSD
in relation to the fill level of the journal?
e.g.
filestore [min|max] sync percent:
Do not sync before min-% full; sync after max-% full
What would happen if I set "filestore [min|max] sync interval" to 999999?
Will the journal sync start at 100% full or at X%?
What is 'X' by default?
How can I set 'X'?
Best Regards,
-Dieter
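
The concern about time-based triggers can be made concrete with rough
arithmetic. Journal entries can only be freed at a commit point, so the
journal has to hold roughly everything written during one sync interval. The
sketch below estimates that fill fraction. The write rates and the 2048 MB
journal size are hypothetical numbers chosen for illustration, the function
name is made up for this example, and per-entry journal overhead is ignored.

```python
def journal_fill_fraction(write_mb_per_s, sync_interval_s, journal_size_mb):
    """Fraction of the journal filled by writes arriving between two syncs."""
    return write_mb_per_s * sync_interval_s / journal_size_mb


# Hypothetical values: per-OSD write rates and journal size are assumptions.
journal_size_mb = 2048

for rate_mb_per_s in (10, 50, 100):
    for interval_s in (5, 30):
        fraction = journal_fill_fraction(rate_mb_per_s, interval_s, journal_size_mb)
        note = "fits" if fraction < 1.0 else "journal would fill before the sync"
        print(f"{rate_mb_per_s:>4} MB/s, {interval_s:>2} s interval: "
              f"{fraction:6.1%} of journal ({note})")
```

With a fixed interval, the fill fraction grows linearly with the per-OSD
write rate, which is why an interval that suits one cluster size may no
longer suit it after nodes are added or the load changes.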
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 14:56 ` RBD performance - tuning hints Dieter Kasper
@ 2012-08-30 15:28 ` Alexandre DERUMIER
2012-08-30 15:33 ` Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 15:28 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel
Thanks for the report!
Compared with your first benchmark, was this run with RBD 4M or 64K?
(How many SSDs per node?)
--
Alexandre Derumier
Systems and Network Engineer
Landline: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 15:28 ` Alexandre DERUMIER
@ 2012-08-30 15:33 ` Dieter Kasper
2012-08-30 15:46 ` Alexandre DERUMIER
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 15:33 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel
[-- Attachment #1: Type: text/plain, Size: 5048 bytes --]
On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> Thanks for the report !
>
> vs your first benchmark, it's with RBD 4M or 64K ?
with 4MB (see attached config info)
Cheers,
-Dieter
>
> (how much ssd by node?)
8x SSD, 200GB each
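
For a rough sense of scale, the cluster-wide write IOPS can be divided over
the OSDs. The sketch below uses the 4 nodes and 8 SSDs per node mentioned in
this thread and assumes one OSD per SSD on every node. The replica count is
not stated in the thread, so the value 2 is only a placeholder assumption.

```python
# Cluster layout from this thread: 4 nodes with 8 SSDs each.
# Assumption: one OSD per SSD on every node.
nodes = 4
osds_per_node = 8

# Best random-write result reported after tuning (fio_randwrite_512_128).
client_write_iops = 21417

# Assumption: the replica count is not stated in the thread; 2 is a placeholder.
replicas = 2

osd_count = nodes * osds_per_node
client_iops_per_osd = client_write_iops / osd_count

# Every client write is applied once per replica, so backend writes scale with it.
backend_writes_per_osd = client_iops_per_osd * replicas

print(f"OSDs in cluster:        {osd_count}")
print(f"client IOPS per OSD:    {client_iops_per_osd:.1f}")
print(f"backend writes per OSD: {backend_writes_per_osd:.1f}")
```

Under these assumptions each SSD handles far fewer small writes per second
than such a device can usually sustain on its own, which suggests the limit
lies in the software path rather than in the drives.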
[-- Attachment #2: hwconf.txt --]
[-- Type: text/plain, Size: 26784 bytes --]
--- RX37-3c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-3 3.0.41-5.1-default #1 SMP Wed Aug 22 00:54:03 UTC 2012 (9c63123) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856332 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdm
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 38 C
Blocks sent to initiator = 257379169992704
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 40 C
Blocks sent to initiator = 238453816033280
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 43 C
Blocks sent to initiator = 297650494636032
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 254438979665920
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 35 C
Blocks sent to initiator = 238876987752448
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 259011676995584
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 41 C
Blocks sent to initiator = 359638046343168
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 31 C
Blocks sent to initiator = 247008082264064
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdm on /data/osd.30 type xfs (rw,noatime)
/dev/sdn on /data/osd.31 type xfs (rw,noatime)
/dev/sdo on /data/osd.32 type xfs (rw,noatime)
/dev/sdp on /data/osd.33 type xfs (rw,noatime)
/dev/sdq on /data/osd.34 type xfs (rw,noatime)
/dev/sdr on /data/osd.35 type xfs (rw,noatime)
/dev/sds on /data/osd.36 type xfs (rw,noatime)
/dev/sdt on /data/osd.37 type xfs (rw,noatime)
--- RX37-4c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-4 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856432 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdd
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sde
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdf
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdg
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdh
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdi
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdj
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdk
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 389173798240256
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 30 C
Blocks sent to initiator = 286249688498176
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 35 C
Blocks sent to initiator = 220455000604672
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 38 C
Blocks sent to initiator = 223169319272448
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 31 C
Blocks sent to initiator = 232096593346560
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 264802534424576
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 27 C
Blocks sent to initiator = 288896512425984
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 32 C
Blocks sent to initiator = 282331621359616
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdd on /data/osd.40 type xfs (rw,noatime)
/dev/sde on /data/osd.41 type xfs (rw,noatime)
/dev/sdf on /data/osd.42 type xfs (rw,noatime)
/dev/sdg on /data/osd.43 type xfs (rw,noatime)
/dev/sdh on /data/osd.44 type xfs (rw,noatime)
/dev/sdi on /data/osd.45 type xfs (rw,noatime)
/dev/sdj on /data/osd.46 type xfs (rw,noatime)
/dev/sdk on /data/osd.47 type xfs (rw,noatime)
--- RX37-5c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-5 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 74226012 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 247461838848000
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 38 C
Blocks sent to initiator = 231320898764800
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 41 C
Blocks sent to initiator = 290086906232832
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 32 C
Blocks sent to initiator = 287719053852672
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 33 C
Blocks sent to initiator = 243922265702400
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 35 C
Blocks sent to initiator = 272285122428928
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 40 C
Blocks sent to initiator = 279561266790400
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 29 C
Blocks sent to initiator = 247978778427392
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdo on /data/osd.50 type xfs (rw,noatime)
/dev/sdp on /data/osd.51 type xfs (rw,noatime)
/dev/sdq on /data/osd.52 type xfs (rw,noatime)
/dev/sdr on /data/osd.53 type xfs (rw,noatime)
/dev/sds on /data/osd.54 type xfs (rw,noatime)
/dev/sdt on /data/osd.55 type xfs (rw,noatime)
/dev/sdu on /data/osd.56 type xfs (rw,noatime)
/dev/sdv on /data/osd.57 type xfs (rw,noatime)
--- RX37-6c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-6 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 32856344 kB
Disk /dev/ram0: 2048 MB, 2048000000 bytes
Disk /dev/ram1: 2048 MB, 2048000000 bytes
Disk /dev/ram2: 2048 MB, 2048000000 bytes
Disk /dev/ram3: 2048 MB, 2048000000 bytes
Disk /dev/ram4: 2048 MB, 2048000000 bytes
Disk /dev/ram5: 2048 MB, 2048000000 bytes
Disk /dev/ram6: 2048 MB, 2048000000 bytes
Disk /dev/ram7: 2048 MB, 2048000000 bytes
[10:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdn
[10:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdo
[10:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdp
[10:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[11:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[11:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[11:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[11:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 41 C
Blocks sent to initiator = 259148495192064
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 36 C
Blocks sent to initiator = 250183472381952
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 43 C
Blocks sent to initiator = 232864704626688
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 46 C
Blocks sent to initiator = 313614921629696
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 37 C
Blocks sent to initiator = 269851218149376
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 34 C
Blocks sent to initiator = 278551060283392
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 43 C
Blocks sent to initiator = 267839076302848
Device: INTEL(R) SSD 910 200GB Version: a411
Current Drive Temperature: 39 C
Blocks sent to initiator = 233988811653120
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
/dev/sdn on /data/osd.60 type xfs (rw,noatime)
/dev/sdo on /data/osd.61 type xfs (rw,noatime)
/dev/sdp on /data/osd.62 type xfs (rw,noatime)
/dev/sdq on /data/osd.63 type xfs (rw,noatime)
/dev/sdr on /data/osd.64 type xfs (rw,noatime)
/dev/sds on /data/osd.65 type xfs (rw,noatime)
/dev/sdt on /data/osd.66 type xfs (rw,noatime)
/dev/sdu on /data/osd.67 type xfs (rw,noatime)
--- RX37-7c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-7 3.0.36-10-default #1 SMP Mon Jul 9 14:42:03 UTC 2012 (595894d) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 1.20 GHz (asserted by call to hardware).
MemTotal: 32856344 kB
optimal_io_size: 4194304
4194304
4194304
scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
--- RX37-8c --------------------------------------------------------------------
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
Linux RX37-8 3.0.36-16-default #1 SMP Wed Jul 18 00:18:54 UTC 2012 (544e41f) x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logial CPUs: 12
current CPU frequency is 2.30 GHz (asserted by call to hardware).
MemTotal: 65952088 kB
optimal_io_size: scheduler: [noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
--------------------------------------------------------------------------------
dumped osdmap epoch 19
epoch 19
fsid 31dc8e8c-45cb-4b94-b581-a9258964f1a6
created 2012-08-29 22:08:58.870313
modifed 2012-08-29 22:09:50.084564
flags
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 4352 pgp_num 4352 last_change 1 owner 0
pool 3 'pbench' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 768 pgp_num 768 last_change 18 owner 0
max_osd 68
osd.30 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6800/24876 192.168.114.52:6800/24876 192.168.114.52:6801/24876 exists,up 0a9a6db3-1c0d-4d66-ac99-bd900076c42c
osd.31 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6801/25090 192.168.114.52:6802/25090 192.168.114.52:6803/25090 exists,up 0adab61b-c1c3-479f-b58e-42bec92bd5b0
osd.32 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6802/25276 192.168.114.52:6804/25276 192.168.114.52:6805/25276 exists,up 331bf096-d785-4ae8-b790-d746a0abb694
osd.33 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6803/25464 192.168.114.52:6806/25464 192.168.114.52:6807/25464 exists,up a1f9ea5b-e0db-474c-b7bc-6cb3d3a213a4
osd.34 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6804/25650 192.168.114.52:6808/25650 192.168.114.52:6809/25650 exists,up dcbe68e7-fef3-430d-a857-560db28de27f
osd.35 up in weight 1 up_from 2 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6805/25838 192.168.114.52:6810/25838 192.168.114.52:6811/25838 exists,up ab1589d0-e725-4484-8f5d-f65bc5c64643
osd.36 up in weight 1 up_from 3 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6806/26026 192.168.114.52:6812/26026 192.168.114.52:6813/26026 exists,up 2eea079f-bcfe-48a4-abb5-a15c7daf80ba
osd.37 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.52:6807/26218 192.168.114.52:6814/26218 192.168.114.52:6815/26218 exists,up 9822d872-79a6-4cd3-898f-2e905fbce44a
osd.40 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6800/18525 192.168.114.53:6800/18525 192.168.114.53:6801/18525 exists,up 0f0c61ea-4d78-429c-9928-b3422ad2dec7
osd.41 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6801/18750 192.168.114.53:6802/18750 192.168.114.53:6803/18750 exists,up 3935c6a7-61ff-4c97-88b9-472051ba8b6c
osd.42 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6802/18946 192.168.114.53:6804/18946 192.168.114.53:6805/18946 exists,up 3efc6383-5097-4e95-9af2-e0e7bc9ddc10
osd.43 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6803/19154 192.168.114.53:6806/19154 192.168.114.53:6807/19154 exists,up cdb8cf82-077b-40c2-adbc-fae29ba41645
osd.44 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6804/19350 192.168.114.53:6808/19350 192.168.114.53:6809/19350 exists,up 5ab69e45-a73a-4cd4-9837-2d54fb4ea4ec
osd.45 up in weight 1 up_from 4 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6805/19546 192.168.114.53:6810/19546 192.168.114.53:6811/19546 exists,up ec3d2118-6f46-4ef8-a431-553710f33a18
osd.46 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6806/19766 192.168.114.53:6812/19766 192.168.114.53:6813/19766 exists,up dcd94df3-b679-46a6-b670-5269a29913c1
osd.47 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.53:6807/19968 192.168.114.53:6814/19968 192.168.114.53:6815/19968 exists,up 41019d97-c4f3-4c8d-9189-bae642c31678
osd.50 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6800/3848 192.168.114.54:6800/3848 192.168.114.54:6801/3848 exists,up 0b9ebe8e-9cb8-440d-948e-d4c8aa16b407
osd.51 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6801/4061 192.168.114.54:6802/4061 192.168.114.54:6803/4061 exists,up 3c2e8031-d01d-4bf9-965e-1b77563d5f8f
osd.52 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6802/4248 192.168.114.54:6804/4248 192.168.114.54:6805/4248 exists,up 4d641c3c-0a7a-4b20-b047-9042b61685bb
osd.53 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6803/4446 192.168.114.54:6806/4446 192.168.114.54:6807/4446 exists,up e335a6e9-9c32-48c6-8f15-11aa84a6287d
osd.54 up in weight 1 up_from 5 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6804/4632 192.168.114.54:6808/4632 192.168.114.54:6809/4632 exists,up 16f3955c-9eee-442b-86d8-cbbc5938efbf
osd.55 up in weight 1 up_from 6 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6805/4836 192.168.114.54:6810/4836 192.168.114.54:6811/4836 exists,up 83e59145-9ff8-4c0b-b066-2b2e4e9c9953
osd.56 up in weight 1 up_from 6 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6806/5029 192.168.114.54:6812/5029 192.168.114.54:6813/5029 exists,up dfdeb186-5c96-4466-b4d3-5f32fa712792
osd.57 up in weight 1 up_from 7 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.54:6807/5351 192.168.114.54:6814/5351 192.168.114.54:6815/5351 exists,up adf7a484-b0f1-4bf7-a8e7-2c1e64dfb77f
osd.60 up in weight 1 up_from 7 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6800/31038 192.168.114.55:6800/31038 192.168.114.55:6801/31038 exists,up e9b949c8-1b47-4749-9408-1e9f7b89b0e6
osd.61 up in weight 1 up_from 8 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6801/31257 192.168.114.55:6802/31257 192.168.114.55:6803/31257 exists,up 19fcad53-d951-4645-a6d5-7dad1deba6fb
osd.62 up in weight 1 up_from 8 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6802/31449 192.168.114.55:6804/31449 192.168.114.55:6805/31449 exists,up 7e98db0e-2ae2-473d-9b03-798ec472b29b
osd.63 up in weight 1 up_from 9 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6803/31641 192.168.114.55:6806/31641 192.168.114.55:6807/31641 exists,up 9abc714c-06e4-40ba-8afe-8465209e0272
osd.64 up in weight 1 up_from 9 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6804/31937 192.168.114.55:6808/31937 192.168.114.55:6809/31937 exists,up 6a20e4b1-d1e9-4f69-b903-b403136ddb1d
osd.65 up in weight 1 up_from 10 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6805/32175 192.168.114.55:6810/32175 192.168.114.55:6811/32175 exists,up e95ad5b2-6866-4161-8060-781a31d7ece2
osd.66 up in weight 1 up_from 10 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6806/32487 192.168.114.55:6812/32487 192.168.114.55:6813/32487 exists,up f3126979-ecd6-45de-b0bf-54cb2b0af042
osd.67 up in weight 1 up_from 11 up_thru 18 down_at 0 last_clean_interval [0,0) 192.168.113.55:6807/32679 192.168.114.55:6814/32679 192.168.114.55:6815/32679 exists,up 37d3f121-b6f4-4c6f-ac9b-30533e8fa60a
ceph.conf
---content---
# global
[global]
# enable secure authentication
auth supported = none
# allow ourselves to open a lot of files
#max open files = 1100000
max open files = 131072
# set log file
log file = /ceph/log/$name.log
# log_to_syslog = true # uncomment this line to log to syslog
# set up pid files
pid file = /var/run/ceph/$name.pid
# If you want to run a IPv6 cluster, set this to true. Dual-stack isn't possible
#ms bind ipv6 = true
public network = 192.168.113.0/24
cluster network = 192.168.114.0/24
# monitors
# You need at least one. You need at least three if you want to
# tolerate any node failures. Always create an odd number.
[mon]
mon data = /ceph/$name
# If you are using for example the RADOS Gateway and want to have your newly created
# pools a higher replication level, you can set a default
#osd pool default size = 3
# You can also specify a CRUSH rule for new pools
# Wiki: http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
#osd pool default crush rule = 0
# Timing is critical for monitors, but if you want to allow the clocks to drift a
# bit more, you can specify the max drift.
#mon clock drift allowed = 1
# Tell the monitor to backoff from this warning for 30 seconds
#mon clock drift warn backoff = 30
# logging, for debugging monitor crashes, in order of
# their likelihood of being helpful :)
#debug ms = 1
#debug mon = 20
#debug paxos = 20
#debug auth = 20
debug optracker = 0
[mon.0]
host = RX37-3c
mon addr = 192.168.113.52:6789
[mon.1]
host = RX37-7c
mon addr = 192.168.113.56:6789
[mon.2]
host = RX37-8c
mon addr = 192.168.113.57:6789
# mds
# You need at least one. Define two to get a standby.
[mds]
# mds data = /ceph/$name
# where the mds keeps its secret encryption keys
#keyring = /data/keyring.$name
# mds logging to debug issues.
#debug ms = 1
#debug mds = 20
debug optracker = 0
[mds.0]
host = RX37-8c
# osd
# You need at least one. Two if you want data to be replicated.
# Define as many as you like.
[osd]
# This is where the btrfs volume will be mounted.
osd data = /data/$name
# journal dio = true
# osd op threads = 24
# osd disk threads = 24
# filestore op threads = 6
# filestore queue max ops = 24
filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 10000
# Ideally, make this a separate disk or partition. A few
# hundred MB should be enough; more if you have fast or many
# disks. You can use a file under the osd data dir if need be
# (e.g. /data/$name/journal), but it will be slower than a
# separate disk or partition.
# This is an example of a file-based journal.
# osd journal = /ceph/$name/journal
# osd journal size = 2048
# journal size, in megabytes
# If you want to run the journal on a tmpfs, disable DirectIO
#journal dio = false
# You can change the number of recovery operations to speed up recovery
# or slow it down if your machines can't handle it
# osd recovery max active = 3
# osd logging to debug osd issues, in order of likelihood of being
# helpful
#debug ms = 1
#debug osd = 20
#debug filestore = 20
#debug journal = 20
debug optracker = 0
fstype = xfs
[osd.30]
host = RX37-3c
devs = /dev/sdm
osd journal = /dev/ram0
[osd.31]
host = RX37-3c
devs = /dev/sdn
osd journal = /dev/ram1
[osd.32]
host = RX37-3c
devs = /dev/sdo
osd journal = /dev/ram2
[osd.33]
host = RX37-3c
devs = /dev/sdp
osd journal = /dev/ram3
[osd.34]
host = RX37-3c
devs = /dev/sdq
osd journal = /dev/ram4
[osd.35]
host = RX37-3c
devs = /dev/sdr
osd journal = /dev/ram5
[osd.36]
host = RX37-3c
devs = /dev/sds
osd journal = /dev/ram6
[osd.37]
host = RX37-3c
devs = /dev/sdt
osd journal = /dev/ram7
[osd.40]
host = RX37-4c
devs = /dev/sdd
osd journal = /dev/ram0
[osd.41]
host = RX37-4c
devs = /dev/sde
osd journal = /dev/ram1
[osd.42]
host = RX37-4c
devs = /dev/sdf
osd journal = /dev/ram2
[osd.43]
host = RX37-4c
devs = /dev/sdg
osd journal = /dev/ram3
[osd.44]
host = RX37-4c
devs = /dev/sdh
osd journal = /dev/ram4
[osd.45]
host = RX37-4c
devs = /dev/sdi
osd journal = /dev/ram5
[osd.46]
host = RX37-4c
devs = /dev/sdj
osd journal = /dev/ram6
[osd.47]
host = RX37-4c
devs = /dev/sdk
osd journal = /dev/ram7
[osd.50]
host = RX37-5c
devs = /dev/sdo
osd journal = /dev/ram0
[osd.51]
host = RX37-5c
devs = /dev/sdp
osd journal = /dev/ram1
[osd.52]
host = RX37-5c
devs = /dev/sdq
osd journal = /dev/ram2
[osd.53]
host = RX37-5c
devs = /dev/sdr
osd journal = /dev/ram3
[osd.54]
host = RX37-5c
devs = /dev/sds
osd journal = /dev/ram4
[osd.55]
host = RX37-5c
devs = /dev/sdt
osd journal = /dev/ram5
[osd.56]
host = RX37-5c
devs = /dev/sdu
osd journal = /dev/ram6
[osd.57]
host = RX37-5c
devs = /dev/sdv
osd journal = /dev/ram7
[osd.60]
host = RX37-6c
devs = /dev/sdn
osd journal = /dev/ram0
[osd.61]
host = RX37-6c
devs = /dev/sdo
osd journal = /dev/ram1
[osd.62]
host = RX37-6c
devs = /dev/sdp
osd journal = /dev/ram2
[osd.63]
host = RX37-6c
devs = /dev/sdq
osd journal = /dev/ram3
[osd.64]
host = RX37-6c
devs = /dev/sdr
osd journal = /dev/ram4
[osd.65]
host = RX37-6c
devs = /dev/sds
osd journal = /dev/ram5
[osd.66]
host = RX37-6c
devs = /dev/sdt
osd journal = /dev/ram6
[osd.67]
host = RX37-6c
devs = /dev/sdu
osd journal = /dev/ram7
devs = /dev/sdc
[client.01]
client hostname = RX37-7c
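The per-OSD stanzas in the config above follow a regular pattern: eight OSDs per host, numbered from a per-host base, each with one data device and one ramdisk journal. A minimal sketch of generating such stanzas, assuming that pattern holds. The function name and the `layout` table are made up for illustration, and only two hosts' device lists are copied from the config.

```python
# Illustrative generator for the repetitive [osd.N] sections.
# Host names, OSD numbers and devices are copied from the config
# above; the function and the layout structure are hypothetical.

def osd_sections(host: str, first_id: int, devices: list[str]) -> str:
    """Return ceph.conf stanzas: one OSD per device, journal on /dev/ramK."""
    lines = []
    for k, dev in enumerate(devices):
        lines.append(f"[osd.{first_id + k}]")
        lines.append(f"host = {host}")
        lines.append(f"devs = {dev}")
        lines.append(f"osd journal = /dev/ram{k}")
    return "\n".join(lines)

layout = {
    "RX37-3c": (30, [f"/dev/sd{c}" for c in "mnopqrst"]),
    "RX37-4c": (40, [f"/dev/sd{c}" for c in "defghijk"]),
}

for host, (first_id, devices) in layout.items():
    print(osd_sections(host, first_id, devices))
```

Generating the stanzas from one table makes it harder for a device or journal index to drift out of step with the OSD number when hosts are added.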
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 15:33 ` Dieter Kasper
@ 2012-08-30 15:46 ` Alexandre DERUMIER
2012-08-30 16:02 ` Dieter Kasper
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 15:46 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel
Thanks.
>> 8x SSD, 200GB each
20000 IOPS seems pretty low, no?
For Inktank:
Is there a bottleneck somewhere in Ceph?
I ask because I would like to know whether it scales by adding new nodes.
Has Inktank already done any random IOPS benchmarks? (I only ever see sequential throughput benchmarks on the mailing list.)
----- Original Message -----
From: "Dieter Kasper" <d.kasper@kabelmail.de>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org
Sent: Thursday, 30 August 2012 17:33:42
Subject: Re: RBD performance - tuning hints
On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> Thanks for the report!
>
> Compared with your first benchmark, is this with RBD 4M or 64K?
with 4MB (see attached config info)
Cheers,
-Dieter
>
> (How many SSDs per node?)
8x SSD, 200GB each
>
>
>
> ----- Original Message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, 30 August 2012 16:56:34
> Subject: Re: RBD performance - tuning hints
>
> Hi Alexandre,
>
> with the 4 filestore parameters below, some fio values could be increased:
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000
>
> ###### IOPS
> fio_read_4k_64: 9373
> fio_read_4k_128: 9939
> fio_randwrite_8k_16: 12376
> fio_randwrite_4k_16: 13315
> fio_randwrite_512_32: 13660
> fio_randwrite_8k_32: 17318
> fio_randwrite_4k_32: 18057
> fio_randwrite_8k_64: 19693
> fio_randwrite_512_64: 20015 <<<
> fio_randwrite_4k_64: 20024 <<<
> fio_randwrite_8k_128: 20547 <<<
> fio_randwrite_4k_128: 20839 <<<
> fio_randwrite_512_128: 21417 <<<
> fio_randread_8k_128: 48872
> fio_randread_4k_128: 50002
> fio_randread_512_128: 51202
>
> ###### MB/s
> fio_randread_2m_32: 628
> fio_read_4m_64: 630
> fio_randread_8m_32: 633
> fio_read_2m_32: 637
> fio_read_4m_16: 640
> fio_randread_4m_16: 652
> fio_write_2m_32: 660
> fio_randread_4m_32: 677
> fio_read_4m_32: 678
> (...)
> fio_write_4m_64: 771
> fio_randwrite_2m_64: 789
> fio_write_8m_128: 796
> fio_write_4m_32: 802
> fio_randwrite_4m_128: 807 <<<
> fio_randwrite_2m_32: 811 <<<
> fio_write_2m_128: 833 <<<
> fio_write_8m_64: 901 <<<
>
> Best Regards,
> -Dieter
>
>
> On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> > Nice results!
> > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
> > I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
> >
> > >>How can I set the variables when the Journal data have to go to the OSD ? (after X seconds and/or when Y %-full)
> > I think you can try tuning these values:
> >
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> >
> >
> >
> > ----- Original Message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: ceph-devel@vger.kernel.org
> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > Sent: Tuesday, 28 August 2012 19:48:42
> > Subject: RBD performance - tuning hints
> >
> > Hi,
> >
> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > I can observe a pretty nice rados bench performance
> > (see bench-rados.txt for details):
> >
> > Bandwidth (MB/sec): 961.710
> > Max bandwidth (MB/sec): 1040
> > Min bandwidth (MB/sec): 772
> >
> >
> > Also the bandwidth performance generated with
> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> >
> > .... is acceptable, e.g.
> > fio_write_4m_16 795 MB/s
> > fio_randwrite_8m_128 717 MB/s
> > fio_randwrite_8m_16 714 MB/s
> > fio_randwrite_2m_32 692 MB/s
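The result names in these listings follow the `--output` pattern of the fio command quoted above, so `fio_randwrite_4k_64` means rw=randwrite, bs=4k, iodepth=64. A small sketch of that mapping, for readers matching names to parameters. The two function names are hypothetical; the fio flags are exactly the ones in the quoted command, and nothing here runs fio.

```python
# Illustrative only: build the fio command line used in this thread
# for one (io pattern, block size, iodepth) combination, and recover
# those parameters from a result name such as fio_randwrite_4k_64.
# The function names are hypothetical; the flags are the ones quoted.

def fio_args(io: str, bs: str, threads: int) -> list[str]:
    """Return the argv list for one run; the command is not executed."""
    return [
        "fio", "--filename=/dev/rbd1", "--direct=1", f"--rw={io}",
        f"--bs={bs}", "--size=2G", f"--iodepth={threads}",
        "--ioengine=libaio", "--runtime=60", "--group_reporting",
        "--name=file1", f"--output=fio_{io}_{bs}_{threads}",
    ]

def parse_result_name(name: str) -> tuple[str, str, int]:
    """Split 'fio_<io>_<bs>_<iodepth>' into its three parameters."""
    prefix, io, bs, threads = name.split("_")
    if prefix != "fio":
        raise ValueError(f"unexpected result name: {name}")
    return io, bs, int(threads)

print(" ".join(fio_args("randwrite", "4k", 64)))
print(parse_result_name("fio_randwrite_512_128"))
```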
> >
> >
> > But, the write IOPS seems to be limited around 19k ...
> >                       RBD 4M    64k (= optimal_io_size)
> > fio_randread_512_128   53286     55925
> > fio_randread_4k_128    51110     44382
> > fio_randread_8k_128    30854     29938
> > fio_randwrite_512_128  18888      2386
> > fio_randwrite_512_64   18844      2582
> > fio_randwrite_8k_64    17350      2445
> > (...)
> > fio_read_4k_128        10073     53151
> > fio_read_4k_64          9500     39757
> > fio_read_4k_32          9220     23650
> > (...)
> > fio_read_4k_16          9122     14322
> > fio_write_4k_128        2190     14306
> > fio_read_8k_32           706     13894
> > fio_write_4k_64         2197     12297
> > fio_write_8k_64         3563     11705
> > fio_write_8k_128        3444     11219
> >
> >
> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> >
> > How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> >
> >
> > Kind Regards,
> > -Dieter
--
Alexandre Derumier
Systems and Network Engineer
Landline: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 15:46 ` Alexandre DERUMIER
@ 2012-08-30 16:02 ` Dieter Kasper
2012-08-30 16:12 ` Alexandre DERUMIER
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 16:02 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org, Andreas Bluemle
On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
> Thanks
>
> >> 8x SSD, 200GB each
>
> 20000 IOPS seems pretty low, no?
Well, you have to compare
- a pure SSD (via PCIe or SAS-6G) vs.
- the Ceph journal, which goes 2x over 10GbE with IP:
  Client -> primary-copy -> 2nd-copy
  (= redundancy over Ethernet distance)
I'm curious about the answer from Inktank,
-Dieter
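As a rough cross-check of the comparison above, Little's law (mean latency = outstanding requests / throughput) turns the fio results quoted in this thread into an implied per-request latency. This is only a sketch: it assumes the queue of `iodepth` requests stayed full for the whole run, and the helper name is made up for illustration.

```python
# Illustrative only: estimate mean per-request latency from fio IOPS
# via Little's law (latency = outstanding requests / throughput).
# Assumes the iodepth queue stayed full during the run.

def implied_latency_ms(iodepth: int, iops: float) -> float:
    """Mean time one request spends in flight, in milliseconds."""
    return iodepth / iops * 1000.0

# (test name, iodepth, IOPS) taken from the results quoted in this thread
results = [
    ("fio_randwrite_4k_16", 16, 13315),
    ("fio_randwrite_4k_32", 32, 18057),
    ("fio_randwrite_4k_64", 64, 20024),
    ("fio_randwrite_4k_128", 128, 20839),
]

for name, depth, iops in results:
    print(f"{name}: {implied_latency_ms(depth, iops):.2f} ms")
```

Under that assumption, going from iodepth 16 to 128 raises the implied latency from roughly 1.2 ms to about 6.1 ms while IOPS grows by only a little over half, which would be consistent with requests queueing somewhere in the write path rather than the SSDs being the limit.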
>
>
> For Inktank:
>
> Is there a bottleneck somewhere in Ceph?
Maybe the thread "SimpleMessenger dispatching: cause of performance problems?"
from Thu, 16 Aug 2012 18:08:39 +0200
by <andreas.bluemle@itxperts.de>
can be an answer,
especially if a small number of OSDs is used.
>
> I ask because I would like to know whether it scales by adding new nodes.
>
> Has Inktank already done any random IOPS benchmarks? (I only ever see sequential throughput benchmarks on the mailing list.)
>
>
> ----- Mail original -----
>
> De: "Dieter Kasper" <d.kasper@kabelmail.de>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Envoyé: Jeudi 30 Août 2012 17:33:42
> Objet: Re: RBD performance - tuning hints
>
> On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> > Thanks for the report !
> >
> > vs your first benchmark, it's with RBD 4M or 64K ?
> with 4MB (see attached config info)
>
> Cheers,
> -Dieter
>
> >
> > (how much ssd by node?)
> 8x SSD, 200GB each
>
> >
> >
> >
> > ----- Mail original -----
> >
> > De: "Dieter Kasper" <d.kasper@kabelmail.de>
> > À: "Alexandre DERUMIER" <aderumier@odiso.com>
> > Cc: ceph-devel@vger.kernel.org
> > Envoyé: Jeudi 30 Août 2012 16:56:34
> > Objet: Re: RBD performance - tuning hints
> >
> > Hi Alexandre,
> >
> > with the 4 filestore parameter below some fio values could be increased:
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> >
> > ###### IOPS
> > fio_read_4k_64: 9373
> > fio_read_4k_128: 9939
> > fio_randwrite_8k_16: 12376
> > fio_randwrite_4k_16: 13315
> > fio_randwrite_512_32: 13660
> > fio_randwrite_8k_32: 17318
> > fio_randwrite_4k_32: 18057
> > fio_randwrite_8k_64: 19693
> > fio_randwrite_512_64: 20015 <<<
> > fio_randwrite_4k_64: 20024 <<<
> > fio_randwrite_8k_128: 20547 <<<
> > fio_randwrite_4k_128: 20839 <<<
> > fio_randwrite_512_128: 21417 <<<
> > fio_randread_8k_128: 48872
> > fio_randread_4k_128: 50002
> > fio_randread_512_128: 51202
> >
> > ###### MB/s
> > fio_randread_2m_32: 628
> > fio_read_4m_64: 630
> > fio_randread_8m_32: 633
> > fio_read_2m_32: 637
> > fio_read_4m_16: 640
> > fio_randread_4m_16: 652
> > fio_write_2m_32: 660
> > fio_randread_4m_32: 677
> > fio_read_4m_32: 678
> > (...)
> > fio_write_4m_64: 771
> > fio_randwrite_2m_64: 789
> > fio_write_8m_128: 796
> > fio_write_4m_32: 802
> > fio_randwrite_4m_128: 807 <<<
> > fio_randwrite_2m_32: 811 <<<
> > fio_write_2m_128: 833 <<<
> > fio_write_8m_64: 901 <<<
> >
> > Best Regards,
> > -Dieter
> >
> >
> > On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> > > Nice results !
> > > (can you make same benchmark from a qemu-kvm guest with virtio-driver ?
> > > I have made some bench some month ago with stephan priebe, and we never be able to have more than 20000iops, with a full ssd 3nodes cluster)
> > >
> > > >>How can I set the variables when the Journal data have go to the OSD ? (after X seconds and/or when Y %-full)
> > > I think you can try to tune these values
> > >
> > > filestore max sync interval = 30
> > > filestore min sync interval = 29
> > > filestore flusher = false
> > > filestore queue max ops = 10000
> > >
> > >
> > >
> > > ----- Mail original -----
> > >
> > > De: "Dieter Kasper" <d.kasper@kabelmail.de>
> > > À: ceph-devel@vger.kernel.org
> > > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
> > > Envoyé: Mardi 28 Août 2012 19:48:42
> > > Objet: RBD performance - tuning hints
> > >
> > > Hi,
> > >
> > > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
> > > I can observe a pretty nice rados bench performance
> > > (see bench-rados.txt for details):
> > >
> > > Bandwidth (MB/sec): 961.710
> > > Max bandwidth (MB/sec): 1040
> > > Min bandwidth (MB/sec): 772
> > >
> > >
> > > Also the bandwidth performance generated with
> > > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
> > >
> > > .... is acceptable, e.g.
> > > fio_write_4m_16 795 MB/s
> > > fio_randwrite_8m_128 717 MB/s
> > > fio_randwrite_8m_16 714 MB/s
> > > fio_randwrite_2m_32 692 MB/s
> > >
> > >
> > > But, the write IOPS seems to be limited around 19k ...
> > > test (IOPS)               RBD 4M    RBD 64k (= optimal_io_size)
> > > fio_randread_512_128      53286     55925
> > > fio_randread_4k_128       51110     44382
> > > fio_randread_8k_128       30854     29938
> > > fio_randwrite_512_128     18888      2386
> > > fio_randwrite_512_64      18844      2582
> > > fio_randwrite_8k_64       17350      2445
> > > (...)
> > > fio_read_4k_128           10073     53151
> > > fio_read_4k_64             9500     39757
> > > fio_read_4k_32             9220     23650
> > > (...)
> > > fio_read_4k_16             9122     14322
> > > fio_write_4k_128           2190     14306
> > > fio_read_8k_32              706     13894
> > > fio_write_4k_64            2197     12297
> > > fio_write_8k_64            3563     11705
> > > fio_write_8k_128           3444     11219
> > >
> > >
> > > Any hints for tuning the IOPS (read and/or write) would be appreciated.
> > >
> > > How can I set the variables that control when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
> > >
> > >
> > > Kind Regards,
> > > -Dieter
> > >
> > >
> > >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:02 ` Dieter Kasper
@ 2012-08-30 16:12 ` Alexandre DERUMIER
2012-08-30 16:16 ` Josh Durgin
2012-08-30 16:48 ` Dieter Kasper
0 siblings, 2 replies; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-30 16:12 UTC (permalink / raw)
To: Dieter Kasper; +Cc: ceph-devel, Andreas Bluemle
>>well, you have to compare
>>- a pure SSD (via PCIe or SAS-6G) vs.
>>- Ceph-Journal, which goes 2x over 10GbE with IP
>> Client -> primary-copy -> 2nd-copy
>> (= redundancy over Ethernet distance)
Sure, but doesn't the first OSD ack to the client before replicating to the other OSDs?
Client -> primary-copy -> 2nd-copy
<-ack
primary-copy -> 2nd-copy
             -> 3rd-copy
Or am I wrong?
----- Original message -----
From: "Dieter Kasper" <d.kasper@kabelmail.de>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org, "Andreas Bluemle" <andreas.bluemle@itxperts.de>
Sent: Thursday, 30 August 2012 18:02:05
Subject: Re: RBD performance - tuning hints
On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
> Thanks
>
> >> 8x SSD, 200GB each
>
> 20000 IOPS seems pretty low, no?
well, you have to compare
- a pure SSD (via PCIe or SAS-6G) vs.
- Ceph-Journal, which goes 2x over 10GbE with IP
Client -> primary-copy -> 2nd-copy
(= redundancy over Ethernet distance)
I'm curious about the answer from Inktank,
-Dieter
>
>
> For Inktank:
>
> Is there a bottleneck somewhere in Ceph?
Maybe "SimpleMessenger dispatching: cause of performance problems?"
from Thu, 16 Aug 2012 18:08:39 +0200
by <andreas.bluemle@itxperts.de>
can be an answer.
Especially if a small number of OSDs is used.
>
> I ask because I would like to know whether it scales when adding new nodes.
>
> Has Inktank already done any random-IOPS benchmarks? (I always see sequential throughput benchmarks on the mailing list.)
>
>
> ----- Original message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, 30 August 2012 17:33:42
> Subject: Re: RBD performance - tuning hints
>
> On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
> > Thanks for the report !
> >
> > Compared with your first benchmark, is this with RBD 4M or 64K?
> with 4MB (see attached config info)
>
> Cheers,
> -Dieter
>
> >
> > (How many SSDs per node?)
> 8x SSD, 200GB each
>
> >
> >
> >
> > ----- Original message -----
> >
> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
> > To: "Alexandre DERUMIER" <aderumier@odiso.com>
> > Cc: ceph-devel@vger.kernel.org
> > Sent: Thursday, 30 August 2012 16:56:34
> > Subject: Re: RBD performance - tuning hints
> >
> > Hi Alexandre,
> >
> > with the 4 filestore parameters below, some fio values could be increased:
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> >
> > ###### IOPS
> > fio_read_4k_64: 9373
> > fio_read_4k_128: 9939
> > fio_randwrite_8k_16: 12376
> > fio_randwrite_4k_16: 13315
> > fio_randwrite_512_32: 13660
> > fio_randwrite_8k_32: 17318
> > fio_randwrite_4k_32: 18057
> > fio_randwrite_8k_64: 19693
> > fio_randwrite_512_64: 20015 <<<
> > fio_randwrite_4k_64: 20024 <<<
> > fio_randwrite_8k_128: 20547 <<<
> > fio_randwrite_4k_128: 20839 <<<
> > fio_randwrite_512_128: 21417 <<<
> > fio_randread_8k_128: 48872
> > fio_randread_4k_128: 50002
> > fio_randread_512_128: 51202
> >
> > ###### MB/s
> > fio_randread_2m_32: 628
> > fio_read_4m_64: 630
> > fio_randread_8m_32: 633
> > fio_read_2m_32: 637
> > fio_read_4m_16: 640
> > fio_randread_4m_16: 652
> > fio_write_2m_32: 660
> > fio_randread_4m_32: 677
> > fio_read_4m_32: 678
> > (...)
> > fio_write_4m_64: 771
> > fio_randwrite_2m_64: 789
> > fio_write_8m_128: 796
> > fio_write_4m_32: 802
> > fio_randwrite_4m_128: 807 <<<
> > fio_randwrite_2m_32: 811 <<<
> > fio_write_2m_128: 833 <<<
> > fio_write_8m_64: 901 <<<
> >
> > Best Regards,
> > -Dieter
> >
> >
> > On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
> > > Nice results !
--
--
Alexandre D e rumier
Ingénieur Systèmes et Réseaux
Fixe : 03 20 68 88 85
Fax : 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:12 ` Alexandre DERUMIER
@ 2012-08-30 16:16 ` Josh Durgin
2012-08-31 7:46 ` Alexandre DERUMIER
2012-08-30 16:48 ` Dieter Kasper
1 sibling, 1 reply; 31+ messages in thread
From: Josh Durgin @ 2012-08-30 16:16 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Dieter Kasper, ceph-devel, Andreas Bluemle
On 08/30/2012 09:12 AM, Alexandre DERUMIER wrote:
>>> well, you have to compare
>>> - a pure SSD (via PCIe or SAS-6G) vs.
>>> - Ceph-Journal, which goes 2x over 10GbE with IP
>>> Client -> primary-copy -> 2nd-copy
>>> (= redundancy over Ethernet distance)
>
> Sure, but doesn't the first OSD ack to the client before replicating to the other OSDs?
>
> Client -> primary-copy -> 2nd-copy
> <-ack
> primary-copy -> 2nd-copy
>              -> 3rd-copy
>
> Or am I wrong?
RBD waits for the data to be on disk on all replicas. It's pretty easy
to relax this to in memory on all replicas, but there's no option for
that right now.
Josh
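
To make the difference between these completion rules concrete, here is a minimal sketch of a toy latency model. It is not Ceph code: the function names, the per-hop network latency and the per-OSD persist time are all assumptions chosen for illustration, and the "primary only" variant is the hypothetical behaviour asked about above, not something RBD offers.

```python
def completion_latency_ms(net_hop_ms, persist_ms, replicas, wait_for_disk=True):
    """Toy model: time until the client is told that a write is complete.

    net_hop_ms    -- one-way network latency per hop (assumed, not measured)
    persist_ms    -- time for one OSD to make the write durable (assumed)
    replicas      -- total number of copies, including the primary
    wait_for_disk -- True:  data must be on disk on all replicas
                     False: data in memory on all replicas is enough
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    persist = persist_ms if wait_for_disk else 0.0
    # Request from client to primary, completion from primary back to client.
    client_leg = 2 * net_hop_ms
    if replicas == 1:
        return client_leg + persist
    # The primary persists locally while the other copies are written in
    # parallel; each of those costs a network round trip plus its own persist.
    replica_leg = 2 * net_hop_ms + persist
    return client_leg + max(persist, replica_leg)


def primary_only_latency_ms(net_hop_ms, persist_ms):
    """Hypothetical 'ack before replicating' rule, shown only for contrast."""
    return 2 * net_hop_ms + persist_ms


if __name__ == "__main__":
    hop, persist = 0.1, 0.05  # made-up values in milliseconds
    print("on disk on all replicas  :", completion_latency_ms(hop, persist, 2))
    print("in memory on all replicas:", completion_latency_ms(hop, persist, 2, False))
    print("primary only (hypothetical):", primary_only_latency_ms(hop, persist))
```

In this model every replicated write pays at least two network round trips before the client can continue, whichever of the two real completion rules is used, which is why the comparison with a locally attached SSD is not like for like.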
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:12 ` Alexandre DERUMIER
2012-08-30 16:16 ` Josh Durgin
@ 2012-08-30 16:48 ` Dieter Kasper
2012-08-30 18:10 ` Gregory Farnum
1 sibling, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-30 16:48 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel@vger.kernel.org, Andreas Bluemle
[-- Attachment #1: Type: text/plain, Size: 10043 bytes --]
On Thu, Aug 30, 2012 at 06:12:11PM +0200, Alexandre DERUMIER wrote:
> >>well, you have to compare
> >>- a pure SSD (via PCIe or SAS-6G) vs.
> >>- Ceph-Journal, which goes 2x over 10GbE with IP
> >> Client -> primary-copy -> 2nd-copy
> >> (= redundancy over Ethernet distance)
>
> Sure, but doesn't the first OSD ack to the client before replicating to the other OSDs?
No.
>
> Client -> primary-copy -> 2nd-copy
> <-ack
> primary-copy -> 2nd-copy
>              -> 3rd-copy
>
> Or am I wrong?
Yes.
Please have a look at the attached file: ceph-replication-acks.png
The client will usually continue on 'ACK' and not wait for the 'commit'.
BTW, all my journals are in RAM (/dev/ramX):
32x 2GB = 32GB of data with replica 2x
If "filestore min/max sync interval" is set to 99999999,
data should 'never' be written to the OSD
('never' at least during the tests, as long as the written data is < 32GB).
In such a configuration only the Ceph code and the interconnect (10GbE/IP) would be the limiting factor.
-Dieter
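
One rough way to judge how much time the software stack and the interconnect add per request is Little's law: outstanding requests = throughput x latency. The sketch below applies it to three IOPS figures reported earlier in this thread. It assumes fio kept the full iodepth outstanding for the whole run (the last field of each result name is the --iodepth value); the helper name is made up for illustration.

```python
def mean_latency_ms(iops, iodepth):
    """Mean per-request latency implied by Little's law, in milliseconds.

    Assumes `iodepth` requests were in flight for the whole measurement.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return iodepth / iops * 1000.0


# (test name, iodepth, IOPS) taken from the results posted in this thread.
RESULTS = [
    ("fio_randwrite_4k_16", 16, 13315),
    ("fio_randwrite_4k_128", 128, 20839),
    ("fio_randread_4k_128", 128, 50002),
]

if __name__ == "__main__":
    for name, depth, iops in RESULTS:
        print(f"{name}: {mean_latency_ms(iops, depth):.2f} ms per request")
```

Under that assumption, the 4k random writes at iodepth 128 spend several milliseconds per request even with journals in RAM, which points at the code path and network round trips rather than the storage media.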
--
Principal Consultant, Data Center Storage Architecture and Technology
FTS CTO
FUJITSU TECHNOLOGY SOLUTIONS GMBH
Mies-van-der-Rohe-Straße 8 / 4F
80807 München
Germany
Telephone: +49 89 62060 1898
Telefax: +49 89 62060 329 1898
Mobile: +49 170 8563173
Email: dieter.kasper@ts.fujitsu.com
Internet: http://ts.fujitsu.com
Company Details: http://ts.fujitsu.com/imprint.html
[-- Attachment #2: ceph-replication-acks.png --]
[-- Type: image/png, Size: 18144 bytes --]
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:48 ` Dieter Kasper
@ 2012-08-30 18:10 ` Gregory Farnum
0 siblings, 0 replies; 31+ messages in thread
From: Gregory Farnum @ 2012-08-30 18:10 UTC (permalink / raw)
To: Dieter Kasper
Cc: Alexandre DERUMIER, ceph-devel@vger.kernel.org, Andreas Bluemle,
Samuel Just
On Thu, Aug 30, 2012 at 9:48 AM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
> On Thu, Aug 30, 2012 at 06:12:11PM +0200, Alexandre DERUMIER wrote:
>> >>well, you have to compare
>> >>- a pure SSD (via PCIe or SAS-6G) vs.
>> >>- Ceph-Journal, which goes 2x over 10GbE with IP
>> >> Client -> primary-copy -> 2nd-copy
>> >> (= redundancy over Ethernet distance)
>>
>> Sure, but doesn't the first OSD ack to the client before replicating to the other OSDs?
> No.
>
>>
>> Client -> primary-copy -> 2nd-copy
>> <-ack
>> primary-copy -> 2nd-copy
>>              -> 3rd-copy
>>
>> Or am I wrong?
> Yes.
> Please have a look at the attached file: ceph-replication-acks.png
> The client will usually continue on 'ACK' and not wait for the 'commit'.
>
> BTW, all my journals are in RAM (/dev/ramX):
> 32x 2GB = 32GB of data with replica 2x
>
> If "filestore min/max sync interval" is set to 99999999,
> data should 'never' be written to the OSD
> ('never' at least during the tests, as long as the written data is < 32GB).
I believe it actually will start syncing to disk when the journal is
half full (right, Sam?) — and even if it doesn't sync, there's a
reasonable chance that some of the data will be written out to disk in
the background (though that shouldn't slow anything down, of course).
:)
-Greg
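
If syncing really does start at half full (that threshold is only a belief stated above, not a confirmed figure), a back-of-the-envelope estimate shows which tests would reach it within the 60-second fio runtime. The sketch assumes writes spread evenly over all 32 journals, treats the 2GB journals as 2 GiB and fio's MB/s as MiB/s, and uses made-up helper names.

```python
MIB = 1024 * 1024


def iops_to_mib_per_s(iops, block_bytes):
    """Convert an IOPS figure at a given block size to MiB/s."""
    return iops * block_bytes / MIB


def seconds_until_fill(journal_count, journal_gib, replicas,
                       client_mib_per_s, fill_fraction=0.5):
    """Time until the journals reach `fill_fraction`, assuming even spread.

    Every byte written by the client lands in `replicas` journals, so the
    journals fill `replicas` times faster than the client writes.
    """
    if client_mib_per_s <= 0:
        raise ValueError("client_mib_per_s must be positive")
    total_journal_mib = journal_count * journal_gib * 1024
    journal_ingest_mib_per_s = client_mib_per_s * replicas
    return total_journal_mib * fill_fraction / journal_ingest_mib_per_s


if __name__ == "__main__":
    # Large sequential writes: fio_write_4m_32 reported 802 MB/s.
    large = seconds_until_fill(32, 2, 2, 802)
    # Small random writes: fio_randwrite_4k_128 reported 20839 IOPS.
    small = seconds_until_fill(32, 2, 2, iops_to_mib_per_s(20839, 4096))
    print(f"4m sequential write: half full after about {large:.0f} s")
    print(f"4k random write:     half full after about {small:.0f} s")
```

With these assumptions the large-block write tests would cross the half-full mark well inside a 60-second run, while the 4k random-write tests would not, so only the bandwidth figures are likely to include sync activity.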
>
> In such a configuration only the Ceph code and the interconnect (10GbE/IP) would be the limiting factor.
>
> Cheers,
> -Dieter
>
>
>>
>>
>> ----- Mail original -----
>>
>> De: "Dieter Kasper" <d.kasper@kabelmail.de>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: ceph-devel@vger.kernel.org, "Andreas Bluemle" <andreas.bluemle@itxperts.de>
>> Envoyé: Jeudi 30 Août 2012 18:02:05
>> Objet: Re: RBD performance - tuning hints
>>
>> On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
>> > Thanks
>> >
>> > >> 8x SSD, 200GB each
>> >
>> > 20000 iops seem pretty low,no ?
>> well, you have to compare
>> - pure a SSD (via PCIe or SAS-6G) vs.
>> - Ceph-Journal, which goes 2x over 10GbE with IP
>> Client -> primary-copy -> 2nd-copy
>> (= redundancy over Ethernet distance)
>>
>> I'm curious about the answer from Inktank,
>>
>> -Dieter
>>
>> >
>> >
>> > for @intank:
>> >
>> > Is their a bottleneck somewhere in ceph ?
>> Maybe "SimpleMessenger dispatching: cause of performance problems?"
>> from Thu, 16 Aug 2012 18:08:39 +0200
>> by <andreas.bluemle@itxperts.de>
>> can be an answer.
>> Especially if a small number of OSDs is used.
>>
>> >
>> > I said that, because I would like to know if it's scale by adding new nodes.
>> >
>> > Does Intank have already done some random iops benchmark ? (I always see sequential throughput bench in the mailing list)
>> >
>> >
>> > ----- Mail original -----
>> >
>> > De: "Dieter Kasper" <d.kasper@kabelmail.de>
>> > À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> > Cc: ceph-devel@vger.kernel.org
>> > Envoyé: Jeudi 30 Août 2012 17:33:42
>> > Objet: Re: RBD performance - tuning hints
>> >
>> > On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
>> > > Thanks for the report !
>> > >
>> > > vs your first benchmark, it's with RBD 4M or 64K ?
>> > with 4MB (see attached config info)
>> >
>> > Cheers,
>> > -Dieter
>> >
>> > >
>> > > (how much ssd by node?)
>> > 8x SSD, 200GB each
>> >
>> > >
>> > >
>> > >
>> > > ----- Mail original -----
>> > >
>> > > De: "Dieter Kasper" <d.kasper@kabelmail.de>
>> > > À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> > > Cc: ceph-devel@vger.kernel.org
>> > > Envoyé: Jeudi 30 Août 2012 16:56:34
>> > > Objet: Re: RBD performance - tuning hints
>> > >
>> > > Hi Alexandre,
>> > >
>> > > with the 4 filestore parameters below, some fio values could be increased:
>> > > filestore max sync interval = 30
>> > > filestore min sync interval = 29
>> > > filestore flusher = false
>> > > filestore queue max ops = 10000
>> > >
>> > > ###### IOPS
>> > > fio_read_4k_64: 9373
>> > > fio_read_4k_128: 9939
>> > > fio_randwrite_8k_16: 12376
>> > > fio_randwrite_4k_16: 13315
>> > > fio_randwrite_512_32: 13660
>> > > fio_randwrite_8k_32: 17318
>> > > fio_randwrite_4k_32: 18057
>> > > fio_randwrite_8k_64: 19693
>> > > fio_randwrite_512_64: 20015 <<<
>> > > fio_randwrite_4k_64: 20024 <<<
>> > > fio_randwrite_8k_128: 20547 <<<
>> > > fio_randwrite_4k_128: 20839 <<<
>> > > fio_randwrite_512_128: 21417 <<<
>> > > fio_randread_8k_128: 48872
>> > > fio_randread_4k_128: 50002
>> > > fio_randread_512_128: 51202
>> > >
>> > > ###### MB/s
>> > > fio_randread_2m_32: 628
>> > > fio_read_4m_64: 630
>> > > fio_randread_8m_32: 633
>> > > fio_read_2m_32: 637
>> > > fio_read_4m_16: 640
>> > > fio_randread_4m_16: 652
>> > > fio_write_2m_32: 660
>> > > fio_randread_4m_32: 677
>> > > fio_read_4m_32: 678
>> > > (...)
>> > > fio_write_4m_64: 771
>> > > fio_randwrite_2m_64: 789
>> > > fio_write_8m_128: 796
>> > > fio_write_4m_32: 802
>> > > fio_randwrite_4m_128: 807 <<<
>> > > fio_randwrite_2m_32: 811 <<<
>> > > fio_write_2m_128: 833 <<<
>> > > fio_write_8m_64: 901 <<<
>> > >
>> > > Best Regards,
>> > > -Dieter
>> > >
>> > >
>> > > On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
>> > > > Nice results !
>> > > > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
>> > > > I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>> > > >
>> > > > >>How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>> > > > I think you can try to tune these values
>> > > >
>> > > > filestore max sync interval = 30
>> > > > filestore min sync interval = 29
>> > > > filestore flusher = false
>> > > > filestore queue max ops = 10000
>> > > >
>> > > >
>> > > >
>> > > > ----- Original message -----
>> > > >
>> > > > From: "Dieter Kasper" <d.kasper@kabelmail.de>
>> > > > To: ceph-devel@vger.kernel.org
>> > > > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
>> > > > Sent: Tuesday, 28 August 2012 19:48:42
>> > > > Subject: RBD performance - tuning hints
>> > > >
>> > > > Hi,
>> > > >
>> > > > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
>> > > > I can observe a pretty nice rados bench performance
>> > > > (see bench-rados.txt for details):
>> > > >
>> > > > Bandwidth (MB/sec): 961.710
>> > > > Max bandwidth (MB/sec): 1040
>> > > > Min bandwidth (MB/sec): 772
>> > > >
>> > > >
>> > > > Also the bandwidth performance generated with
>> > > > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>> > > >
>> > > > .... is acceptable, e.g.
>> > > > fio_write_4m_16 795 MB/s
>> > > > fio_randwrite_8m_128 717 MB/s
>> > > > fio_randwrite_8m_16 714 MB/s
>> > > > fio_randwrite_2m_32 692 MB/s
>> > > >
>> > > >
>> > > > But, the write IOPS seems to be limited around 19k ...
>> > > >                         RBD 4M   64k (= optimal_io_size)
>> > > > fio_randread_512_128     53286    55925
>> > > > fio_randread_4k_128      51110    44382
>> > > > fio_randread_8k_128      30854    29938
>> > > > fio_randwrite_512_128    18888     2386
>> > > > fio_randwrite_512_64     18844     2582
>> > > > fio_randwrite_8k_64      17350     2445
>> > > > (...)
>> > > > fio_read_4k_128          10073    53151
>> > > > fio_read_4k_64            9500    39757
>> > > > fio_read_4k_32            9220    23650
>> > > > (...)
>> > > > fio_read_4k_16            9122    14322
>> > > > fio_write_4k_128          2190    14306
>> > > > fio_read_8k_32             706    13894
>> > > > fio_write_4k_64           2197    12297
>> > > > fio_write_8k_64           3563    11705
>> > > > fio_write_8k_128          3444    11219
>> > > >
>> > > >
>> > > > Any hints for tuning the IOPS (read and/or write) would be appreciated.
>> > > >
>> > > > How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>> > > >
>> > > >
>> > > > Kind Regards,
>> > > > -Dieter
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > >
>> > > > --
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Alexandre Derumier
>> > > >
>> > > > Systems and Network Engineer
>> > > >
>> > > >
>> > > > Phone: 03 20 68 88 85
>> > > >
>> > > > Fax: 03 20 68 90 88
>> > > >
>> > > >
>> > > > 45 Bvd du Général Leclerc 59100 Roubaix
>> > > > 12 rue Marivaux 75002 Paris
>> > > > --
>> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > > the body of a message to majordomo@vger.kernel.org
>> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> > >
>
> --
> Principal Consultant, Data Center Storage Architecture and Technology
> FTS CTO
> FUJITSU TECHNOLOGY SOLUTIONS GMBH
> Mies-van-der-Rohe-Straße 8 / 4F
> 80807 München
> Germany
>
> Telephone: +49 89 62060 1898
> Telefax: +49 89 62060 329 1898
> Mobile: +49 170 8563173
> Email: dieter.kasper@ts.fujitsu.com
> Internet: http://ts.fujitsu.com
> Company Details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / parameter doc
2012-08-30 15:08 ` Dieter Kasper
@ 2012-08-30 20:39 ` Samuel Just
0 siblings, 0 replies; 31+ messages in thread
From: Samuel Just @ 2012-08-30 20:39 UTC (permalink / raw)
To: Dieter Kasper; +Cc: Josh Durgin, Alexandre DERUMIER, ceph-devel@vger.kernel.org
Ah, those are just min and max. Sync is also triggered when the
journal hits the half-full mark. We could make the percentage
configurable in the future.
-Sam
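The trigger rules Sam describes can be sketched as a small decision function. This is a simplified model, not Ceph's code: the function name and arguments are hypothetical, and it assumes the half-full trigger still waits for the minimum interval, which the thread does not state.

```python
def should_sync(elapsed_s, journal_used, journal_size,
                min_interval_s, max_interval_s):
    """Simplified model of the filestore sync triggers from this thread."""
    if elapsed_s >= max_interval_s:
        # 'filestore max sync interval': never wait longer than this.
        return True
    if elapsed_s < min_interval_s:
        # 'filestore min sync interval': ASSUMED to hold back even the
        # journal-fill trigger; the thread does not say either way.
        return False
    # Between min and max: sync once the journal is at least half full.
    return journal_used * 2 >= journal_size


if __name__ == "__main__":
    # min/max = 29/30 s as tried in the thread; journal sizes are illustrative.
    print(should_sync(10, 600, 1000, 29, 30))    # half full, before min
    print(should_sync(29.5, 600, 1000, 29, 30))  # half full, min elapsed
    print(should_sync(29.5, 100, 1000, 29, 30))  # nearly empty, before max
    print(should_sync(30, 100, 1000, 29, 30))    # max interval reached
```

Under that assumption, with min/max set to 29/30 seconds, the window in which the journal fill level can change the outcome is only one second wide.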
On Thu, Aug 30, 2012 at 8:08 AM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
> Samuel,
>
> thank you very much for this explicit description!
>
> As far as I understand the journal acts as a ringbuffer in front of the OSD.
> Using time as a parameter to trigger sync might not be the best for
> a dynamic storage subsystem. Under a high workload, e.g. 10/20 for min/max
> might be optimal for 4 nodes with 10 OSDs each,
> but not after adding 4 additional nodes.
>
> Are there parameters to trigger the syncs to OSD
> in relation to the fill level of the journal?
> e.g.
> filestore [min|max] sync percent:
>
> Do not sync before min-% full; sync after max-% full
>
> What would happen if I set "filestore [min|max] sync interval" to 999999 ?
> Will the journal sync start at 100% full or at X% ?
> What is 'X' by default?
> How can I set 'X' ?
>
> Best Regards,
> -Dieter
>
>
> On Thu, Aug 30, 2012 at 12:34:43AM +0200, Samuel Just wrote:
>> filestore [min|max] sync interval:
>>
>> Periodically, the filestore needs to quiesce writes and do a syncfs in
>> order to create
>> a consistent commit point up to which it can free journal entries. Syncing more
>> frequently tends to reduce the time required to do the sync, and
>> reduces the amount
>> of data that needs to remain in the journal. Less frequent syncs
>> would allow the
>> backing filesystem to better coalesce small writes and metadata
>> updates hopefully
>> resulting in more efficient syncs. 'filestore max sync interval'
>> defines the maximum
>> time period between syncs, 'filestore min sync interval' defines the
>> minimum time
>> period between syncs.
>>
>> filestore flusher:
>>
>> The filestore flusher forces data from large writes to be written out
>> using sync_file_range
>> before the sync in order to (hopefully) reduce the cost of the
>> eventual sync. In practice,
>> disabling 'filestore flusher' seems to improve performance in some cases.
>>
>> filestore queue max ops:
>>
>> 'filestore queue max ops' defines the number of in progress ops the
>> filestore will accept
>> before blocking on queueing new ones. This mostly shouldn't have much
>> of an effect
>> on performance and should probably be ignored.
>>
>> filestore op threads:
>>
>> 'filestore op threads' defines the number of threads used to submit
>> filesystem operations
>> in parallel.
>>
>> journal dio:
>>
>> 'journal dio' enables using O_DIRECT for writing to the journal. This
>> should usually
>> be enabled. If possible, 'journal aio' should also be enabled to
>> allow use of libaio
>> to do asynchronous writes.
>>
>> osd op threads:
>>
>> 'osd op threads' defines the size of the thread pool used to service
>> OSD operations
>> such as client requests. Increasing this may increase the rate of
>> request processing.
>>
>> osd disk threads:
>>
>> 'osd disk threads' defines the number of threads used to perform background disk
>> intensive osd operations such as scrubbing and snap trimming.
>>
>> On Wed, Aug 29, 2012 at 12:29 PM, Dieter Kasper <d.kasper@kabelmail.de> wrote:
>> > Hi Josh,
>> >
>> > thanks for the hint.
>> > Can you please say a few words about the meaning of these parameters?
>> > - filestore min/max sync interval = int/float ? seconds ? of what ?
>> > - filestore flusher = false
>> > - filestore queue max ops = 10000
>> > what is 'one op' ? queue in front of what ?
>> > - filestore op threads =
>> > what are useful values here ?
>> >
>> > - journal dio = true/false
>> > - osd op threads =
>> > - osd disk threads =
>> >
>> >
>> > Kind Regards,
>> > -Dieter
>> >
>> >
>> > On Wed, Aug 29, 2012 at 07:37:36PM +0200, Josh Durgin wrote:
>> >> On 08/29/2012 01:50 AM, Alexandre DERUMIER wrote:
>> >> > Nice results !
>> >> > (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
>> >> > I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>> >> >
>> >> >>> How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>> >> > I think you can try to tune these values
>> >> >
>> >> > filestore max sync interval = 30
>> >> > filestore min sync interval = 29
>> >> > filestore flusher = false
>> >> > filestore queue max ops = 10000
>> >>
>> >> Increasing filestore_op_threads might help as well.
>> >>
>> >> > ----- Original message -----
>> >> >
>> >> > From: "Dieter Kasper" <d.kasper@kabelmail.de>
>> >> > To: ceph-devel@vger.kernel.org
>> >> > Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
>> >> > Sent: Tuesday, 28 August 2012 19:48:42
>> >> > Subject: RBD performance - tuning hints
>> >> >
>> >> > Hi,
>> >> >
>> >> > on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
>> >> > I can observe a pretty nice rados bench performance
>> >> > (see bench-rados.txt for details):
>> >> >
>> >> > Bandwidth (MB/sec): 961.710
>> >> > Max bandwidth (MB/sec): 1040
>> >> > Min bandwidth (MB/sec): 772
>> >> >
>> >> >
>> >> > Also the bandwidth performance generated with
>> >> > fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>> >> >
>> >> > .... is acceptable, e.g.
>> >> > fio_write_4m_16 795 MB/s
>> >> > fio_randwrite_8m_128 717 MB/s
>> >> > fio_randwrite_8m_16 714 MB/s
>> >> > fio_randwrite_2m_32 692 MB/s
>> >> >
>> >> >
>> >> > But, the write IOPS seems to be limited around 19k ...
>> >> >                         RBD 4M   64k (= optimal_io_size)
>> >> > fio_randread_512_128     53286    55925
>> >> > fio_randread_4k_128      51110    44382
>> >> > fio_randread_8k_128      30854    29938
>> >> > fio_randwrite_512_128    18888     2386
>> >> > fio_randwrite_512_64     18844     2582
>> >> > fio_randwrite_8k_64      17350     2445
>> >> > (...)
>> >> > fio_read_4k_128          10073    53151
>> >> > fio_read_4k_64            9500    39757
>> >> > fio_read_4k_32            9220    23650
>> >> > (...)
>> >> > fio_read_4k_16            9122    14322
>> >> > fio_write_4k_128          2190    14306
>> >> > fio_read_8k_32             706    13894
>> >> > fio_write_4k_64           2197    12297
>> >> > fio_write_8k_64           3563    11705
>> >> > fio_write_8k_128          3444    11219
>> >> >
>> >> >
>> >> > Any hints for tuning the IOPS (read and/or write) would be appreciated.
>> >> >
>> >> > How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>> >> >
>> >> >
>> >> > Kind Regards,
>> >> > -Dieter
>> >> >
>> >> >
>> >> >
>> >>
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-30 16:16 ` Josh Durgin
@ 2012-08-31 7:46 ` Alexandre DERUMIER
2012-08-31 8:11 ` Dietmar Maurer
0 siblings, 1 reply; 31+ messages in thread
From: Alexandre DERUMIER @ 2012-08-31 7:46 UTC (permalink / raw)
To: Josh Durgin; +Cc: Dieter Kasper, ceph-devel, Andreas Bluemle
>>RBD waits for the data to be on disk on all replicas. It's pretty easy
>>to relax this to in memory on all replicas, but there's no option for
>>that right now.
OK, thanks, I missed that.
When you say disk, do you mean the journal?
----- Original message -----
From: "Josh Durgin" <josh.durgin@inktank.com>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Dieter Kasper" <d.kasper@kabelmail.de>, ceph-devel@vger.kernel.org, "Andreas Bluemle" <andreas.bluemle@itxperts.de>
Sent: Thursday, 30 August 2012 18:16:47
Subject: Re: RBD performance - tuning hints
On 08/30/2012 09:12 AM, Alexandre DERUMIER wrote:
>>> well, you have to compare
>>> - a pure SSD (via PCIe or SAS-6G) vs.
>>> - Ceph-Journal, which goes 2x over 10GbE with IP
>>> Client -> primary-copy -> 2nd-copy
>>> (= redundancy over Ethernet distance)
>
> Sure, but the first OSD acks to the client before replicating to the other OSDs.
>
> Client -> primary-copy -> 2nd-copy
> <-ack
> primary-copy -> 2nd-copy
> -> 3rd-copy
>
> Or am I wrong?
RBD waits for the data to be on disk on all replicas. It's pretty easy
to relax this to in memory on all replicas, but there's no option for
that right now.
Josh
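Because each write is acknowledged only after every replica has it on disk, per-request latency bounds the IOPS that a fixed queue depth can reach. A rough back-of-the-envelope check using Little's law, assuming fio really keeps the full iodepth of 128 requests in flight; the function name is illustrative, and the IOPS values are the ones Dieter posted on 30 Aug:

```python
def mean_latency_ms(iodepth, iops):
    """Little's law: requests in flight = throughput x mean latency."""
    return iodepth / iops * 1000.0


if __name__ == "__main__":
    # IOPS figures quoted from the thread, both measured at iodepth=128.
    for name, iops in [("fio_randwrite_4k_128", 20839),
                       ("fio_randread_4k_128", 50002)]:
        print(f"{name}: {mean_latency_ms(128, iops):.2f} ms per request")
```

This only converts the posted numbers into an implied mean latency per request; it does not say where in the write path that time is spent.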
>
> ----- Original message -----
>
> From: "Dieter Kasper" <d.kasper@kabelmail.de>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org, "Andreas Bluemle" <andreas.bluemle@itxperts.de>
> Sent: Thursday, 30 August 2012 18:02:05
> Subject: Re: RBD performance - tuning hints
>
> On Thu, Aug 30, 2012 at 05:46:35PM +0200, Alexandre DERUMIER wrote:
>> Thanks
>>
>>>> 8x SSD, 200GB each
>>
>> 20000 IOPS seems pretty low, no?
> well, you have to compare
> - a pure SSD (via PCIe or SAS-6G) vs.
> - Ceph-Journal, which goes 2x over 10GbE with IP
> Client -> primary-copy -> 2nd-copy
> (= redundancy over Ethernet distance)
>
> I'm curious about the answer from Inktank,
>
> -Dieter
>
>>
>>
>> for @Inktank:
>>
>> Is there a bottleneck somewhere in Ceph?
> Maybe "SimpleMessenger dispatching: cause of performance problems?"
> from Thu, 16 Aug 2012 18:08:39 +0200
> by <andreas.bluemle@itxperts.de>
> can be an answer.
> Especially if a small number of OSDs is used.
>
>>
>> I ask because I would like to know whether it scales when new nodes are added.
>>
>> Has Inktank already run any random IOPS benchmarks? (I always see sequential throughput benchmarks on the mailing list.)
>>
>>
>> ----- Original message -----
>>
>> From: "Dieter Kasper" <d.kasper@kabelmail.de>
>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: ceph-devel@vger.kernel.org
>> Sent: Thursday, 30 August 2012 17:33:42
>> Subject: Re: RBD performance - tuning hints
>>
>> On Thu, Aug 30, 2012 at 05:28:02PM +0200, Alexandre DERUMIER wrote:
>>> Thanks for the report !
>>>
>>> Compared with your first benchmark, is this with RBD 4M or 64K?
>> with 4MB (see attached config info)
>>
>> Cheers,
>> -Dieter
>>
>>>
>>> (how many SSDs per node?)
>> 8x SSD, 200GB each
>>
>>>
>>>
>>>
>>> ----- Original message -----
>>>
>>> From: "Dieter Kasper" <d.kasper@kabelmail.de>
>>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Sent: Thursday, 30 August 2012 16:56:34
>>> Subject: Re: RBD performance - tuning hints
>>>
>>> Hi Alexandre,
>>>
>>> with the 4 filestore parameters below, some fio values could be increased:
>>> filestore max sync interval = 30
>>> filestore min sync interval = 29
>>> filestore flusher = false
>>> filestore queue max ops = 10000
>>>
>>> ###### IOPS
>>> fio_read_4k_64: 9373
>>> fio_read_4k_128: 9939
>>> fio_randwrite_8k_16: 12376
>>> fio_randwrite_4k_16: 13315
>>> fio_randwrite_512_32: 13660
>>> fio_randwrite_8k_32: 17318
>>> fio_randwrite_4k_32: 18057
>>> fio_randwrite_8k_64: 19693
>>> fio_randwrite_512_64: 20015 <<<
>>> fio_randwrite_4k_64: 20024 <<<
>>> fio_randwrite_8k_128: 20547 <<<
>>> fio_randwrite_4k_128: 20839 <<<
>>> fio_randwrite_512_128: 21417 <<<
>>> fio_randread_8k_128: 48872
>>> fio_randread_4k_128: 50002
>>> fio_randread_512_128: 51202
>>>
>>> ###### MB/s
>>> fio_randread_2m_32: 628
>>> fio_read_4m_64: 630
>>> fio_randread_8m_32: 633
>>> fio_read_2m_32: 637
>>> fio_read_4m_16: 640
>>> fio_randread_4m_16: 652
>>> fio_write_2m_32: 660
>>> fio_randread_4m_32: 677
>>> fio_read_4m_32: 678
>>> (...)
>>> fio_write_4m_64: 771
>>> fio_randwrite_2m_64: 789
>>> fio_write_8m_128: 796
>>> fio_write_4m_32: 802
>>> fio_randwrite_4m_128: 807 <<<
>>> fio_randwrite_2m_32: 811 <<<
>>> fio_write_2m_128: 833 <<<
>>> fio_write_8m_64: 901 <<<
>>>
>>> Best Regards,
>>> -Dieter
>>>
>>>
>>> On Wed, Aug 29, 2012 at 10:50:12AM +0200, Alexandre DERUMIER wrote:
>>>> Nice results !
>>>> (Can you run the same benchmark from a qemu-kvm guest with the virtio driver?
>>>> I ran some benchmarks a few months ago with Stephan Priebe, and we were never able to get more than 20000 IOPS with a full-SSD 3-node cluster.)
>>>>
>>>>>> How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>>>> I think you can try to tune these values
>>>>
>>>> filestore max sync interval = 30
>>>> filestore min sync interval = 29
>>>> filestore flusher = false
>>>> filestore queue max ops = 10000
>>>>
>>>>
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Dieter Kasper" <d.kasper@kabelmail.de>
>>>> To: ceph-devel@vger.kernel.org
>>>> Cc: "Dieter Kasper (KD)" <d.kasper@kabelmail.de>
>>>> Sent: Tuesday, 28 August 2012 19:48:42
>>>> Subject: RBD performance - tuning hints
>>>>
>>>> Hi,
>>>>
>>>> on my 4-node system (SSD + 10GbE, see bench-config.txt for details)
>>>> I can observe a pretty nice rados bench performance
>>>> (see bench-rados.txt for details):
>>>>
>>>> Bandwidth (MB/sec): 961.710
>>>> Max bandwidth (MB/sec): 1040
>>>> Min bandwidth (MB/sec): 772
>>>>
>>>>
>>>> Also the bandwidth performance generated with
>>>> fio --filename=/dev/rbd1 --direct=1 --rw=$io --bs=$bs --size=2G --iodepth=$threads --ioengine=libaio --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
>>>>
>>>> .... is acceptable, e.g.
>>>> fio_write_4m_16 795 MB/s
>>>> fio_randwrite_8m_128 717 MB/s
>>>> fio_randwrite_8m_16 714 MB/s
>>>> fio_randwrite_2m_32 692 MB/s
>>>>
>>>>
>>>> But, the write IOPS seems to be limited around 19k ...
>>>>                         RBD 4M   64k (= optimal_io_size)
>>>> fio_randread_512_128     53286    55925
>>>> fio_randread_4k_128      51110    44382
>>>> fio_randread_8k_128      30854    29938
>>>> fio_randwrite_512_128    18888     2386
>>>> fio_randwrite_512_64     18844     2582
>>>> fio_randwrite_8k_64      17350     2445
>>>> (...)
>>>> fio_read_4k_128          10073    53151
>>>> fio_read_4k_64            9500    39757
>>>> fio_read_4k_32            9220    23650
>>>> (...)
>>>> fio_read_4k_16            9122    14322
>>>> fio_write_4k_128          2190    14306
>>>> fio_read_8k_32             706    13894
>>>> fio_write_4k_64           2197    12297
>>>> fio_write_8k_64           3563    11705
>>>> fio_write_8k_128          3444    11219
>>>>
>>>>
>>>> Any hints for tuning the IOPS (read and/or write) would be appreciated.
>>>>
>>>> How can I configure when the journal data has to go to the OSD? (after X seconds and/or when Y % full)
>>>>
>>>>
>>>> Kind Regards,
>>>> -Dieter
>>>>
--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: RBD performance - tuning hints
2012-08-31 7:46 ` Alexandre DERUMIER
@ 2012-08-31 8:11 ` Dietmar Maurer
2012-08-31 8:48 ` Mark Kirkwood
2012-08-31 10:58 ` RBD performance - tuning hints Jerker Nyberg
0 siblings, 2 replies; 31+ messages in thread
From: Dietmar Maurer @ 2012-08-31 8:11 UTC (permalink / raw)
To: Alexandre DERUMIER, Josh Durgin
Cc: Dieter Kasper, ceph-devel@vger.kernel.org, Andreas Bluemle
>>RBD waits for the data to be on disk on all replicas. It's pretty easy
>>to relax this to in memory on all replicas, but there's no option for
>>that right now.
I thought that is dangerous, because you can lose data?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints
2012-08-31 8:11 ` Dietmar Maurer
@ 2012-08-31 8:48 ` Mark Kirkwood
2012-08-31 9:49 ` RBD performance - tuning hints / major slowdown effect(s) Dieter Kasper
2012-08-31 10:58 ` RBD performance - tuning hints Jerker Nyberg
1 sibling, 1 reply; 31+ messages in thread
From: Mark Kirkwood @ 2012-08-31 8:48 UTC (permalink / raw)
To: Dietmar Maurer
Cc: Alexandre DERUMIER, Josh Durgin, Dieter Kasper,
ceph-devel@vger.kernel.org, Andreas Bluemle
On 31/08/12 20:11, Dietmar Maurer wrote:
>>> RBD waits for the data to be on disk on all replicas. It's pretty easy
>>> to relax this to in memory on all replicas, but there's no option for
>>> that right now.
> I thought that is dangerous, because you can lose data?
And it is not immediately obvious that this is the bottleneck - from
what I can see the 'sync' call being used (sync_file_range) is extremely
fast and is *not* the major slowdown effect...
Regards
Mark
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / major slowdown effect(s)
2012-08-31 8:48 ` Mark Kirkwood
@ 2012-08-31 9:49 ` Dieter Kasper
2012-08-31 10:16 ` Mark Kirkwood
0 siblings, 1 reply; 31+ messages in thread
From: Dieter Kasper @ 2012-08-31 9:49 UTC (permalink / raw)
To: Mark Kirkwood
Cc: Dietmar Maurer, Alexandre DERUMIER, Josh Durgin,
ceph-devel@vger.kernel.org, Andreas Bluemle
Mark, Inktank,
OK, it is very likely that 'sync_file_range' is not the major slowdown 'culprit'.
But, which areas (design, current implementation, protocol, interconnect, tuning parameter, ...)
would you rate as 'major slowdown effect(s)' ?
Best Regards,
-Dieter
On Fri, Aug 31, 2012 at 08:48:34PM +1200, Mark Kirkwood wrote:
> On 31/08/12 20:11, Dietmar Maurer wrote:
> >>>RBD waits for the data to be on disk on all replicas. It's pretty easy
> >>>to relax this to in memory on all replicas, but there's no option for
> >>>that right now.
> >I thought that is dangerous, because you can lose data?
>
> And it is not immediately obvious that this is the bottleneck - from
> what I can see the 'sync' call being used (sync_file_range) is
> extremely fast and is *not* the major slowdown effect...
>
> Regards
>
> Mark
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RBD performance - tuning hints / major slowdown effect(s)
2012-08-31 9:49 ` RBD performance - tuning hints / major slowdown effect(s) Dieter Kasper
@ 2012-08-31 10:16 ` Mark Kirkwood
0 siblings, 0 replies; 31+ messages in thread
From: Mark Kirkwood @ 2012-08-31 10:16 UTC (permalink / raw)
To: Dieter Kasper
Cc: Dietmar Maurer, Alexandre DERUMIER, Josh Durgin,
ceph-devel@vger.kernel.org, Andreas Bluemle
Sorry Dieter,
Not trying to say "you are wrong" or anything like that - just trying to
add to the problem-solving body of knowledge: from what *I* have
tried out, the 'sync' issue does not look to be the bad guy here - although
more analysis is always welcome (usual story - my findings should be
confirmable by others doing similar tests)!
regards
Mark
On 31/08/12 21:49, Dieter Kasper wrote:
> Mark, Inktank,
>
> OK, it is very likely that 'sync_file_range' is not the major slowdown 'culprit'.
>
> But, which areas (design, current implementation, protocol, interconnect, tuning parameter, ...)
> would you rate as 'major slowdown effect(s)' ?
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: RBD performance - tuning hints
2012-08-31 8:11 ` Dietmar Maurer
2012-08-31 8:48 ` Mark Kirkwood
@ 2012-08-31 10:58 ` Jerker Nyberg
1 sibling, 0 replies; 31+ messages in thread
From: Jerker Nyberg @ 2012-08-31 10:58 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org
On Fri, 31 Aug 2012, Dietmar Maurer wrote:
>>> RBD waits for the data to be on disk on all replicas. It's pretty easy
>>> to relax this to in memory on all replicas, but there's no option for
>>> that right now.
>
> I thought that is dangerous, because you can lose data?
If you put the journal in a tmpfs, data written to the journal does
not hit disk. If all replicas fail, data will be lost.
For some use cases that might be OK, for example incremental backups,
fast scratch space, or volatile virtual machines.
Also see this previous discussion:
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg06070.html
--jerker
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2012-08-31 10:58 UTC | newest]