poor write performance

* poor write performance
@ 2013-04-18 11:46 James Harper
  2013-04-18 12:15 ` Wolfgang Hennerbichler
  2013-04-18 13:43 ` Mark Nelson
  0 siblings, 2 replies; 30+ messages in thread
From: James Harper @ 2013-04-18 11:46 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong.

Using dd to test gives me write performance of only kbytes/second for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read testing and oflag=direct for write testing.

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4

For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators, which run Xen on top of (C)LVM volumes on top of the iSCSI. Performance is not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Thanks

James


* Re: poor write performance
  2013-04-18 11:46 poor write performance James Harper
@ 2013-04-18 12:15 ` Wolfgang Hennerbichler
  2013-04-18 23:11   ` James Harper
  2013-04-18 13:43 ` Mark Nelson
  1 sibling, 1 reply; 30+ messages in thread
From: Wolfgang Hennerbichler @ 2013-04-18 12:15 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

Hi James,

This is just pure speculation, but can you make sure that the bonding works
correctly? Maybe you have issues there. I have seen a lot of incorrectly
configured bonding throughout my life as a Unix admin.

Maybe this could help you a little:
http://www.wogri.at/Port-Channeling-802-3ad.338.0.html

On 04/18/2013 01:46 PM, James Harper wrote:
> I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong.
> 
> Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.
> 
> My setup, approximately, is:
> 
> Two OSD's
> . 1 x 7200RPM SATA disk each
> . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch)
> . 1 x gigabit public network
> . journal on another spindle
> 
> Three MON's
> . 1 each on the OSD's
> . 1 on another server, which is also the one used for testing performance
> 
> I'm using debian packages from ceph which are version 0.56.4
> 
> For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on.
> 
> Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...
> 
> Thanks
> 
> James
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler@risc-software.at
http://www.risc-software.at


* Re: poor write performance
  2013-04-18 11:46 poor write performance James Harper
  2013-04-18 12:15 ` Wolfgang Hennerbichler
@ 2013-04-18 13:43 ` Mark Nelson
  2013-04-18 16:46   ` Andrey Korolyov
  2013-04-18 23:23   ` James Harper
  1 sibling, 2 replies; 30+ messages in thread
From: Mark Nelson @ 2013-04-18 13:43 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

On 04/18/2013 06:46 AM, James Harper wrote:
> I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong.
>
> Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.
>
> My setup, approximately, is:
>
> Two OSD's
> . 1 x 7200RPM SATA disk each
> . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch)
> . 1 x gigabit public network
> . journal on another spindle
>
> Three MON's
> . 1 each on the OSD's
> . 1 on another server, which is also the one used for testing performance
>
> I'm using debian packages from ceph which are version 0.56.4
>
> For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on.
>
> Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Hi James!  Sorry to hear about the performance trouble!  Is it just 
sequential 4KB direct IO writes that are giving you troubles?  If you 
are using the kernel version of RBD, we don't have any kind of cache 
implemented there and since you are bypassing the pagecache on the 
client, those writes are being sent to the different OSDs in 4KB chunks 
over the network.  RBD stores data in blocks that are represented by 4MB 
objects on one of the OSDs, so without cache a lot of sequential 4KB 
writes will be hitting 1 OSD repeatedly and then moving on to the next 
one.  Hopefully those writes would get aggregated at the OSD level, but 
clearly that's not really happening here given your performance.
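
A minimal Python sketch of the layout described above, assuming the 4MB object size and a simple unstriped mapping from image offset to object. The helper `object_index` is hypothetical and is not part of any Ceph API; it only illustrates why a run of sequential 4KB writes keeps landing in the same object.

```python
OBJECT_SIZE = 4 * 1024 * 1024   # 4MB object size, as described above
WRITE_SIZE = 4 * 1024           # 4KB direct writes issued by dd


def object_index(offset, object_size=OBJECT_SIZE):
    """Return which object a byte offset of the image falls into (hypothetical helper)."""
    return offset // object_size


# Number of back-to-back 4KB writes that land in the same object
writes_per_object = OBJECT_SIZE // WRITE_SIZE

# Object touched by each of the first 2048 sequential 4KB writes
touched = [object_index(i * WRITE_SIZE) for i in range(2 * writes_per_object)]

print(writes_per_object)           # 1024
print(touched[0], touched[1023])   # 0 0
print(touched[1024])               # 1
```

Under these assumptions, 1024 consecutive writes go to one object (and so to one primary OSD) before the next object is touched.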

Here's a couple of thoughts:

1) If you are working with VMs, using the QEMU/KVM interface with virtio 
drivers and RBD cache enabled will give you a huge jump in small 
sequential write performance relative to what you are seeing now.

2) You may want to try upgrading to 0.60.  We made a change to how the 
pg_log works that causes fewer disk seeks during small IO, especially 
with XFS.

3) If you are still having trouble, testing your network, disk speeds, 
and using rados bench to test the object store all may be helpful.

>
> Thanks
>
> James

Good luck!

>
>


* Re: poor write performance
  2013-04-18 13:43 ` Mark Nelson
@ 2013-04-18 16:46   ` Andrey Korolyov
  2013-04-18 17:01     ` Mark Nelson
  2013-04-18 23:23   ` James Harper
  1 sibling, 1 reply; 30+ messages in thread
From: Andrey Korolyov @ 2013-04-18 16:46 UTC (permalink / raw)
  To: Mark Nelson; +Cc: James Harper, ceph-devel@vger.kernel.org

On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 04/18/2013 06:46 AM, James Harper wrote:
>>
>> I'm doing some basic testing so I'm not really fussed about poor
>> performance, but my write performance appears to be so bad I think I'm doing
>> something wrong.
>>
>> Using dd to test gives me kbytes/second for write performance for 4kb
>> block sizes, while read performance is acceptable (for testing at least).
>> For dd I'm using iflag=direct for read and oflag=direct for write testing.
>>
>> My setup, approximately, is:
>>
>> Two OSD's
>> . 1 x 7200RPM SATA disk each
>> . 2 x gigabit cluster network interfaces each in a bonded configuration
>> directly attached (osd to osd, no switch)
>> . 1 x gigabit public network
>> . journal on another spindle
>>
>> Three MON's
>> . 1 each on the OSD's
>> . 1 on another server, which is also the one used for testing performance
>>
>> I'm using debian packages from ceph which are version 0.56.4
>>
>> For comparison, my existing production storage is 2 servers running DRBD
>> with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top
>> of the iSCSI. Performance not spectacular but acceptable. The servers in
>> question are the same specs as the servers I'm testing on.
>>
>> Where should I start looking for performance problems? I've tried running
>> some of the benchmark stuff in the documentation but I haven't gotten very
>> far...
>
>
> Hi James!  Sorry to hear about the performance trouble!  Is it just
> sequential 4KB direct IO writes that are giving you troubles?  If you are
> using the kernel version of RBD, we don't have any kind of cache implemented
> there and since you are bypassing the pagecache on the client, those writes
> are being sent to the different OSDs in 4KB chunks over the network.  RBD
> stores data in blocks that are represented by 4MB objects on one of the
> OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD
> repeatedly and then moving on to the next one.  Hopefully those writes would
> get aggregated at the OSD level, but clearly that's not really happening
> here given your performance.
>
> Here's a couple of thoughts:
>
> 1) If you are working with VMs, using the QEMU/KVM interface with virtio
> drivers and RBD cache enabled will give you a huge jump in small sequential
> write performance relative to what you are seeing now.
>
> 2) You may want to try upgrading to 0.60.  We made a change to how the
> pg_log works that causes fewer disk seeks during small IO, especially with
> XFS.

Can you point to the related commits, if possible?

>
> 3) If you are still having trouble, testing your network, disk speeds, and
> using rados bench to test the object store all may be helpful.
>
>>
>> Thanks
>>
>> James
>
>
> Good luck!
>
>
>>


* Re: poor write performance
  2013-04-18 16:46   ` Andrey Korolyov
@ 2013-04-18 17:01     ` Mark Nelson
  0 siblings, 0 replies; 30+ messages in thread
From: Mark Nelson @ 2013-04-18 17:01 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: James Harper, ceph-devel@vger.kernel.org

On 04/18/2013 11:46 AM, Andrey Korolyov wrote:
> On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
>> On 04/18/2013 06:46 AM, James Harper wrote:
>>>
>>> I'm doing some basic testing so I'm not really fussed about poor
>>> performance, but my write performance appears to be so bad I think I'm doing
>>> something wrong.
>>>
>>> Using dd to test gives me kbytes/second for write performance for 4kb
>>> block sizes, while read performance is acceptable (for testing at least).
>>> For dd I'm using iflag=direct for read and oflag=direct for write testing.
>>>
>>> My setup, approximately, is:
>>>
>>> Two OSD's
>>> . 1 x 7200RPM SATA disk each
>>> . 2 x gigabit cluster network interfaces each in a bonded configuration
>>> directly attached (osd to osd, no switch)
>>> . 1 x gigabit public network
>>> . journal on another spindle
>>>
>>> Three MON's
>>> . 1 each on the OSD's
>>> . 1 on another server, which is also the one used for testing performance
>>>
>>> I'm using debian packages from ceph which are version 0.56.4
>>>
>>> For comparison, my existing production storage is 2 servers running DRBD
>>> with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top
>>> of the iSCSI. Performance not spectacular but acceptable. The servers in
>>> question are the same specs as the servers I'm testing on.
>>>
>>> Where should I start looking for performance problems? I've tried running
>>> some of the benchmark stuff in the documentation but I haven't gotten very
>>> far...
>>
>>
>> Hi James!  Sorry to hear about the performance trouble!  Is it just
>> sequential 4KB direct IO writes that are giving you troubles?  If you are
>> using the kernel version of RBD, we don't have any kind of cache implemented
>> there and since you are bypassing the pagecache on the client, those writes
>> are being sent to the different OSDs in 4KB chunks over the network.  RBD
>> stores data in blocks that are represented by 4MB objects on one of the
>> OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD
>> repeatedly and then moving on to the next one.  Hopefully those writes would
>> get aggregated at the OSD level, but clearly that's not really happening
>> here given your performance.
>>
>> Here's a couple of thoughts:
>>
>> 1) If you are working with VMs, using the QEMU/KVM interface with virtio
>> drivers and RBD cache enabled will give you a huge jump in small sequential
>> write performance relative to what you are seeing now.
>>
>> 2) You may want to try upgrading to 0.60.  We made a change to how the
>> pg_log works that causes fewer disk seeks during small IO, especially with
>> XFS.
>
> Can you point to the related commits, if possible?

here you go:

http://tracker.ceph.com/projects/ceph/repository/revisions/188f3ea6867eeb6e950f6efed18d53ff17522bbc


>
>>
>> 3) If you are still having trouble, testing your network, disk speeds, and
>> using rados bench to test the object store all may be helpful.
>>
>>>
>>> Thanks
>>>
>>> James
>>
>>
>> Good luck!
>>
>>
>>>



* RE: poor write performance
  2013-04-18 12:15 ` Wolfgang Hennerbichler
@ 2013-04-18 23:11   ` James Harper
  2013-04-20 10:52     ` Harald Rößler
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-18 23:11 UTC (permalink / raw)
  To: Wolfgang Hennerbichler; +Cc: ceph-devel@vger.kernel.org

> 
> Hi James,
> 
> This is just pure speculation, but can you make sure that the bonding works
> correctly? Maybe you have issues there. I have seen a lot of incorrectly
> configured bonding throughout my life as a Unix admin.
> 

The bonding gives me iperf performance consistent with 2 x 1Gb links so I think it's okay.

James


* RE: poor write performance
  2013-04-18 13:43 ` Mark Nelson
  2013-04-18 16:46   ` Andrey Korolyov
@ 2013-04-18 23:23   ` James Harper
  2013-04-19  7:21     ` James Harper
  1 sibling, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-18 23:23 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> > Where should I start looking for performance problems? I've tried running
> > some of the benchmark stuff in the documentation but I haven't gotten very
> > far...
> 
> Hi James!  Sorry to hear about the performance trouble!  Is it just
> sequential 4KB direct IO writes that are giving you troubles?  If you
> are using the kernel version of RBD, we don't have any kind of cache
> implemented there and since you are bypassing the pagecache on the
> client, those writes are being sent to the different OSDs in 4KB chunks
> over the network.  RBD stores data in blocks that are represented by 4MB
> objects on one of the OSDs, so without cache a lot of sequential 4KB
> writes will be hitting 1 OSD repeatedly and then moving on to the next
> one.  Hopefully those writes would get aggregated at the OSD level, but
> clearly that's not really happening here given your performance.

Using dd I tried various block sizes. With 4kb I was getting a rate of around 500kbytes/second. With 1MB I was getting a few mbytes/second. Read performance seems great though.

> Here's a couple of thoughts:
> 
> 1) If you are working with VMs, using the QEMU/KVM interface with virtio
> drivers and RBD cache enabled will give you a huge jump in small
> sequential write performance relative to what you are seeing now.

I'm using Xen so that won't work for me right now, although I did notice someone posted some blktap code to support ceph.

I'm trying a Windows restore of a physical machine into a VM under Xen and performance matches what I am seeing with dd: very, very slow.

> 2) You may want to try upgrading to 0.60.  We made a change to how the
> pg_log works that causes fewer disk seeks during small IO, especially
> with XFS.

Do packages for this exist for Debian? At the moment my sources.list contains "ceph.com/debian-bobtail wheezy main".

> 3) If you are still having trouble, testing your network, disk speeds,
> and using rados bench to test the object store all may be helpful.
> 

I tried that, and while the write test worked, the seq test always said I had to do a write test first.

While running my Xen restore, /var/log/ceph/ceph.log looks like:

pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, 1742 GB / 1862 GB avail; 824KB/s wr, 12op/s
pgmap v18317: 832 pgs: 832 active+clean; 61446 MB data, 119 GB used, 1742 GB / 1862 GB avail; 649KB/s wr, 10op/s
pgmap v18318: 832 pgs: 832 active+clean; 61449 MB data, 119 GB used, 1742 GB / 1862 GB avail; 652KB/s wr, 10op/s
pgmap v18319: 832 pgs: 832 active+clean; 61452 MB data, 119 GB used, 1742 GB / 1862 GB avail; 614KB/s wr, 9op/s
pgmap v18320: 832 pgs: 832 active+clean; 61454 MB data, 119 GB used, 1742 GB / 1862 GB avail; 537KB/s wr, 8op/s
pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, 1742 GB / 1862 GB avail; 511KB/s wr, 7op/s
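
One rough way to read these lines is to divide the write rate by the operation rate, which gives the average amount written per operation. The Python sketch below does that for the first and last lines above. It assumes the exact line format shown here and that the op/s figure counts only these writes; `kb_per_op` is a hypothetical helper, not a Ceph tool.

```python
import re

# Two lines copied from the ceph.log excerpt above
LOG_LINES = [
    "pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, "
    "1742 GB / 1862 GB avail; 824KB/s wr, 12op/s",
    "pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, "
    "1742 GB / 1862 GB avail; 511KB/s wr, 7op/s",
]

# Matches the trailing "<n>KB/s wr, <m>op/s" part of a pgmap line
RATE_RE = re.compile(r"(\d+)KB/s wr, (\d+)op/s")


def kb_per_op(line):
    """Return the average KB written per operation for one pgmap line."""
    match = RATE_RE.search(line)
    if match is None:
        raise ValueError("no write rate found in line")
    kb_per_sec, ops_per_sec = (int(group) for group in match.groups())
    return kb_per_sec / ops_per_sec


for line in LOG_LINES:
    print(round(kb_per_op(line), 1))   # 68.7, then 73.0
```

On these two samples the cluster is completing roughly 7 to 12 operations per second at around 70 KB each.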

James



* RE: poor write performance
  2013-04-18 23:23   ` James Harper
@ 2013-04-19  7:21     ` James Harper
  2013-04-19  7:30       ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-19  7:21 UTC (permalink / raw)
  To: James Harper, Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> 
> > > Where should I start looking for performance problems? I've tried
> running
> > > some of the benchmark stuff in the documentation but I haven't gotten
> very
> > > far...
> >
> > Hi James!  Sorry to hear about the performance trouble!  Is it just
> > sequential 4KB direct IO writes that are giving you troubles?  If you
> > are using the kernel version of RBD, we don't have any kind of cache
> > implemented there and since you are bypassing the pagecache on the
> > client, those writes are being sent to the different OSDs in 4KB chunks
> > over the network.  RBD stores data in blocks that are represented by 4MB
> > objects on one of the OSDs, so without cache a lot of sequential 4KB
> > writes will be hitting 1 OSD repeatedly and then moving on to the next
> > one.  Hopefully those writes would get aggregated at the OSD level, but
> > clearly that's not really happening here given your performance.
> 
> Using dd I tried various block sizes. With 4kb I was getting around
> 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read
> performance seems great though.
> 

I did an strace -c to gather some performance info, if that helps:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.13   39.589549        2750     14398       967 futex
 12.45    6.308784        4200      1502           poll
  7.99    4.048253      224903        18         9 restart_syscall
  0.65    0.331042         635       521           writev
  0.34    0.172011       57337         3           SYS_344
  0.22    0.110395         117       944           close
  0.08    0.040002         310       129           truncate64
  0.07    0.036003       12001         3           fsync
  0.02    0.010611           1     10263           gettimeofday
  0.02    0.008000        1333         6           pwrite64
  0.01    0.004941           9       521           fsetxattr
  0.01    0.004256          33       129           sync_file_range
  0.01    0.002779           1      3660       814 stat64
  0.00    0.001775           4       442           sendmsg
  0.00    0.001266           1      1507           recv
  0.00    0.001103           1       948         4 open
  0.00    0.000640           1       979           time
  0.00    0.000493           1       409           clock_gettime
  0.00    0.000375           1       522           _llseek
  0.00    0.000111          11        10           read
  0.00    0.000000           0         1           setxattr
  0.00    0.000000           0         1           getxattr
  0.00    0.000000           0        32         8 fgetxattr
  0.00    0.000000           0         5           statfs64
  0.00    0.000000           0         5         5 fallocate
------ ----------- ----------- --------- --------- ----------------
100.00   50.672389                 36958      1807 total

Does that look about what you'd expect?
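
A quick way to summarise the table is to group the syscalls and add up their share of the total. The Python sketch below uses rows copied from the table; the split into "waiting" and "writing" calls is an assumption about how the process uses them, not something strace reports.

```python
# (seconds, syscall) pairs copied from the strace -c table above
ROWS = [
    (39.589549, "futex"),
    (6.308784, "poll"),
    (4.048253, "restart_syscall"),
    (0.331042, "writev"),
    (0.036003, "fsync"),
    (0.008000, "pwrite64"),
    (0.004256, "sync_file_range"),
]
TOTAL_SECONDS = 50.672389  # "total" row of the table

# Assumption: calls that mostly block on a lock, a timer or a socket
WAITING = {"futex", "poll", "restart_syscall"}
# Assumption: calls that write or flush file data
WRITING = {"writev", "fsync", "pwrite64", "sync_file_range"}


def share(names):
    """Percentage of the total syscall time spent in the given syscalls."""
    seconds = sum(sec for sec, name in ROWS if name in names)
    return 100.0 * seconds / TOTAL_SECONDS


print(round(share(WAITING), 2))   # 98.57
print(round(share(WRITING), 2))   # 0.75
```

With this grouping, well under 1% of the recorded syscall time is in calls that write or flush data.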

James


* RE: poor write performance
  2013-04-19  7:21     ` James Harper
@ 2013-04-19  7:30       ` James Harper
  2013-04-19 11:09         ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-19  7:30 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> 
> I did an strace -c to gather some performance info, if that helps:
> 

Oops. Forgot to say that that's an strace -c of the osd process!

> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.13   39.589549        2750     14398       967 futex
>  12.45    6.308784        4200      1502           poll
>   7.99    4.048253      224903        18         9 restart_syscall
>   0.65    0.331042         635       521           writev
>   0.34    0.172011       57337         3           SYS_344
>   0.22    0.110395         117       944           close
>   0.08    0.040002         310       129           truncate64
>   0.07    0.036003       12001         3           fsync
>   0.02    0.010611           1     10263           gettimeofday
>   0.02    0.008000        1333         6           pwrite64
>   0.01    0.004941           9       521           fsetxattr
>   0.01    0.004256          33       129           sync_file_range
>   0.01    0.002779           1      3660       814 stat64
>   0.00    0.001775           4       442           sendmsg
>   0.00    0.001266           1      1507           recv
>   0.00    0.001103           1       948         4 open
>   0.00    0.000640           1       979           time
>   0.00    0.000493           1       409           clock_gettime
>   0.00    0.000375           1       522           _llseek
>   0.00    0.000111          11        10           read
>   0.00    0.000000           0         1           setxattr
>   0.00    0.000000           0         1           getxattr
>   0.00    0.000000           0        32         8 fgetxattr
>   0.00    0.000000           0         5           statfs64
>   0.00    0.000000           0         5         5 fallocate
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   50.672389                 36958      1807 total
> 
> Does that look about what you'd expect?
> 



* RE: poor write performance
  2013-04-19  7:30       ` James Harper
@ 2013-04-19 11:09         ` James Harper
  2013-04-19 14:50           ` Mark Nelson
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-19 11:09 UTC (permalink / raw)
  To: James Harper, Mark Nelson; +Cc: ceph-devel@vger.kernel.org

I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with a 4mb block size, instead of the 700kbytes/second I was getting with the debian 3.2 kernel.

I'm still getting 120kbytes/second with a dd 4kb block size though... is that expected?
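
Since dd with oflag=direct waits for each write to complete before issuing the next, these rates can be turned into an approximate time per write. The Python sketch below does the arithmetic. It assumes decimal kbytes and mbytes as dd reports them, and exactly one write in flight; the function names are hypothetical.

```python
def writes_per_second(bytes_per_second, block_size):
    """Number of completed writes per second at a given throughput."""
    return bytes_per_second / block_size


def ms_per_write(bytes_per_second, block_size):
    """Average time per write in milliseconds, assuming one write in flight."""
    return 1000.0 / writes_per_second(bytes_per_second, block_size)


KB = 1000          # assumption: dd-style decimal kilobytes
MB = 1000 * 1000   # assumption: dd-style decimal megabytes

small = ms_per_write(120 * KB, 4 * 1024)          # 4kb blocks at 120 kbytes/second
large = ms_per_write(25 * MB, 4 * 1024 * 1024)    # 4mb blocks at 25 mbytes/second

print(round(small, 1))   # 34.1
print(round(large, 1))   # 167.8
```

Under those assumptions each 4kb write takes about 34 ms, roughly 29 writes per second, while each 4mb write takes about 168 ms.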

James


* Re: poor write performance
  2013-04-19 11:09         ` James Harper
@ 2013-04-19 14:50           ` Mark Nelson
  2013-04-20  0:33             ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Nelson @ 2013-04-19 14:50 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

On 04/19/2013 06:09 AM, James Harper wrote:
> I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with a 4mb block size, instead of the 700kbytes/second I was getting with the debian 3.2 kernel.

That's.... unexpected.  Was this the kernel on the client, the OSDs, or 
both?

>
> I'm still getting 120kbytes/second with a dd 4kb block size though... is that expected?

that's still quite a bit lower than I'd expect as well.  What were your 
fs mount options on the OSDs?  Can you try some rados bench read/write 
tests on your pool?  Something like:

rados -p <pool> -b 4096 bench 300 write --no-cleanup -t 64
rados -p <pool> -b 4096 bench 300 seq -t 64

with 2 drives and 2x replication I wouldn't expect much without RBD 
cache, but 120kb/s is rather excessively bad. :)

>
> James
>

Mark


* RE: poor write performance
  2013-04-19 14:50           ` Mark Nelson
@ 2013-04-20  0:33             ` James Harper
  2013-04-20  1:30               ` James Harper
  2013-04-21 17:56               ` Sylvain Munaut
  0 siblings, 2 replies; 30+ messages in thread
From: James Harper @ 2013-04-20  0:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> 
> On 04/19/2013 06:09 AM, James Harper wrote:
> > I just tried a 3.8 series kernel and can now get 25mbytes/second using dd
> with a 4mb block size, instead of the 700kbytes/second I was getting with the
> debian 3.2 kernel.
> 
> That's.... unexpected.  Was this the kernel on the client, the OSDs, or
> both?

Kernel on the client. I can't easily change the kernel on the OSD's although if you think it will make a big difference I can arrange it.

> >
> > I'm still getting 120kbytes/second with a dd 4kb block size though... is that
> expected?
> 
> that's still quite a bit lower than I'd expect as well.  What were your
> fs mount options on the OSDs?

I didn't explicitly set any, so I guess these are the defaults:

xfs (rw,noatime,attr2,delaylog,inode64,noquota)

> Can you try some rados bench read/write
> tests on your pool?  Something like:
> 
> rados -p <pool> -b 4096 bench 300 write --no-cleanup -t 64

Ah. It's the --no-cleanup that explains why my previous seq tests didn't work!

Total time run:         300.430516
Total writes made:      26726
Write size:             4096
Bandwidth (MB/sec):     0.347

Stddev Bandwidth:       0.322983
Max bandwidth (MB/sec): 1.34375
Min bandwidth (MB/sec): 0
Average Latency:        0.719337
Stddev Latency:         0.985265
Max latency:            7.2241
Min latency:            0.018218

But then it just hung and I had to hit ctrl-c

What is the unit of measure for latency and for write size?
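
The figures in the output can be checked against each other. The Python sketch below is only a consistency check, not a statement from the rados documentation: it tests the assumption that the write size is in bytes, that MB means 2^20 bytes, and that time and latency are in seconds.

```python
TOTAL_TIME = 300.430516    # "Total time run"
WRITES = 26726             # "Total writes made"
WRITE_SIZE = 4096          # "Write size"
AVG_LATENCY = 0.719337     # "Average Latency"
CONCURRENT_OPS = 64        # from the -t 64 option

# If the write size is in bytes and the run time in seconds, bandwidth in MB/sec is:
bandwidth_mb = WRITES * WRITE_SIZE / TOTAL_TIME / (1024 * 1024)

# Little's law: ops in flight = throughput * latency. If the latency is in
# seconds, the throughput should be close to CONCURRENT_OPS / AVG_LATENCY.
measured_ops = WRITES / TOTAL_TIME
expected_ops = CONCURRENT_OPS / AVG_LATENCY

print(round(bandwidth_mb, 3))
print(round(measured_ops, 1), round(expected_ops, 1))
```

The computed bandwidth comes out at about 0.347, matching the reported value, and both throughput estimates are about 89 writes per second, so the numbers are consistent with bytes and seconds.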

> rados -p <pool> -b 4096 bench 300 seq -t 64

sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
read got -2
error during benchmark: -5
error 5: (5) Input/output error

not sure what that's about...

> 
> with 2 drives and 2x replication I wouldn't expect much without RBD
> cache, but 120kb/s is rather excessively bad. :)
> 

What is rbd cache? I've seen it mentioned but haven't found documentation for it anywhere...

My goal is 4 OSD's, each on separate machines, with 1 drive in each for a start, but I want to see performance of at least the same order of magnitude as the theoretical maximum on my hardware before I think about replacing my existing setup.

Thanks

James



* RE: poor write performance
  2013-04-20  0:33             ` James Harper
@ 2013-04-20  1:30               ` James Harper
  2013-04-21 13:52                 ` Mark Nelson
  2013-04-21 17:56               ` Sylvain Munaut
  1 sibling, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-20  1:30 UTC (permalink / raw)
  To: James Harper, Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> > rados -p <pool> -b 4096 bench 300 seq -t 64
> 
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
> read got -2
> error during benchmark: -5
> error 5: (5) Input/output error
> 
> not sure what that's about...
> 

Oops... I typo'd --no-cleanup. Now I get:

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
 Total time run:        0.243709
Total reads made:     1292
Read size:            4096
Bandwidth (MB/sec):    20.709

Average Latency:       0.0118838
Max latency:           0.031942
Min latency:           0.001445

So it finishes instantly without seeming to do much actual testing...
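
The amount of data involved can be worked out from the output itself. A small Python sketch, assuming the read size is in bytes, the run time is in seconds and MB means 2^20 bytes:

```python
READS = 1292            # "Total reads made"
READ_SIZE = 4096        # "Read size", assumed to be bytes
TOTAL_TIME = 0.243709   # "Total time run", assumed to be seconds

MIB = 1024 * 1024

total_mb = READS * READ_SIZE / MIB      # data actually read during the run
bandwidth_mb = total_mb / TOTAL_TIME    # should be close to "Bandwidth (MB/sec)"

print(round(total_mb, 2))       # 5.05
print(round(bandwidth_mb, 1))   # 20.7
```

So the whole run read only about 5 MB, which fits with it finishing in a quarter of a second.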

James



* Re: poor write performance
  2013-04-18 23:11   ` James Harper
@ 2013-04-20 10:52     ` Harald Rößler
  2013-04-20 11:12       ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: Harald Rößler @ 2013-04-20 10:52 UTC (permalink / raw)
  To: James Harper; +Cc: Wolfgang Hennerbichler, ceph-devel@vger.kernel.org

Hi James,

do you have VLAN interfaces configured on your bonding interfaces? Because
I saw a similar situation in my setup.

Kind Regards
Harald Roessler


On Fri, 2013-04-19 at 01:11 +0200, James Harper wrote:
> > 
> > Hi James,
> > 
> > This is just pure speculation, but can you make sure that the bonding works
> > correctly? Maybe you have issues there. I have seen a lot of incorrectly
> > configured bonding throughout my life as a Unix admin.
> > 
> 
> The bonding gives me iperf performance consistent with 2 x 1Gb links so I think it's okay.
> 
> James

-- 
With kind regards,
Harald Rößler
 
. . . . . . . . . . . . . . . .
 
BTD System GmbH
Tel.: +49 (89) - 20 05 - 44 30
Tel.: +49 (89) - 660 291 - 251
Mob.: +49 (151) - 11 70 17 59
Fax:  +49 (89) 89 - 20 05 - 44 11
harald.roessler@btd.de
www.btd.de
Projektbüro Allianz-Arena • Ebene 4
Werner-Heisenberg-Allee 25 • D-80939 München 
Goethestraße 34 • D-80336 München
 
HRB München 154370
Managing directors: Stefan Leibhard, Kersten Kröhl, Harald Rößler
 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-20 10:52     ` Harald Rößler
@ 2013-04-20 11:12       ` James Harper
  2013-04-20 21:04         ` Jeff Mitchell
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-20 11:12 UTC (permalink / raw)
  To: Harald Rößler
  Cc: Wolfgang Hennerbichler, ceph-devel@vger.kernel.org

> 
> Hi James,
> 
> Do you have VLAN interfaces configured on your bonding interfaces? I ask
> because I saw a similar situation in my setup.
> 

No VLANs on my bonding interface, although they are used extensively elsewhere.

Thanks

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-20 11:12       ` James Harper
@ 2013-04-20 21:04         ` Jeff Mitchell
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff Mitchell @ 2013-04-20 21:04 UTC (permalink / raw)
  To: James Harper
  Cc: Harald Rößler, Wolfgang Hennerbichler,
	ceph-devel@vger.kernel.org

James Harper wrote:
>> Hi James,
>>
>> Do you have VLAN interfaces configured on your bonding interfaces? I ask
>> because I saw a similar situation in my setup.
>>
>
> No VLANs on my bonding interface, although they are used extensively elsewhere.

What the OP described is *exactly* like a problem I've been struggling 
with. I thought the blame lay elsewhere but maybe not.

My setup:

4 Ceph nodes, with 6 OSDs each and dual (bonded) 10GbE, with VLANs, 
running Precise. OSDs are using XFS. Replica count of 3. 3 of these are 
mons.
4 compute nodes, with dual (bonded) 10GbE, with VLANs, running a base of 
Precise along with a 3.6.3 Ceph-provided kernel, running KVM-based VMs. 
2 of these are also mons. VMs are Precise and accessing RBD through the 
kernel client.

(Eventually there will be 12 Ceph nodes. 5 mons seemed an appropriate 
number and when I've run into issues in the past I've actually gotten to 
cases where > 3 mons were knocked out, so 5 is a comfortable number 
unless it's problematic.)

In the VMs, I/O with ext4 is fine -- 10-15MB/s sustained. However, using 
ZFS (via ZFSonLinux, not FUSE), I see write speeds of about 150kB/sec, 
just like the OP.

I had figured that the problem lay with ZFS inside the VM (I've used 
ZFSonLinux on many bare metal machines without a problem for a couple of 
years now). The VMs were using virtio, and I'd heard that it was found 
that pre-1.4 Qemu versions could have some serious problems with virtio 
(which I didn't know at the time); also, I know that the kernel client 
is not the preferred client, and the version I'm using is a rather older 
version of the Ceph-provided builds. As a result, my plan was to try the 
updated Qemu version along with native Qemu librados RBD support once 
Raring was out, as I figured that the problem was either something in 
ZFSonLinux (though I reported the issue and nobody had ever heard of any 
such problem, or had any idea why it would be happening) or something 
specifically about ZFS running inside Qemu, as ext4 in the VMs is fine.

But, this thread has made me wonder if what's actually happening is in 
fact something else -- either something, as someone else saw, to do with 
using VLANs on the bonded interface (although I don't see such a write 
problem with any other traffic going through these VLANs); or, something 
about how ZFS inside the VM is writing to the RBD disk causing some kind 
of giant slowdown in Ceph. The numbers that the OP cited were exactly in 
line with what I was seeing.

I don't know offhand what the block sizes are that the kernel client was 
using, or that the different filesystems inside the VMs might be using 
when trying to write to their virtual disks (I'm guessing that if you 
are using virtio, as I am, it potentially could be anything). But 
perhaps ZFS writes extremely small blocks and ext4 doesn't.

Unfortunately, I don't have access to this testbed for the next few 
weeks, so for the moment I can only recount my experience and not 
actually test out any suggestions (unless I can corral someone with 
access to it to run tests).

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-20  1:30               ` James Harper
@ 2013-04-21 13:52                 ` Mark Nelson
  2013-04-22  5:32                   ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Nelson @ 2013-04-21 13:52 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

On 04/19/2013 08:30 PM, James Harper wrote:
>>> rados -p <pool> -b 4096 bench 300 seq -t 64
>>
>> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>       0       0         0         0         0         0         -         0
>> read got -2
>> error during benchmark: -5
>> error 5: (5) Input/output error
>>
>> not sure what that's about...
>>
>
> Oops... I typo'd --no-cleanup. Now I get:
>
>     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>       0       0         0         0         0         0         -         0
>   Total time run:        0.243709
> Total reads made:     1292
> Read size:            4096
> Bandwidth (MB/sec):    20.709
>
> Average Latency:       0.0118838
> Max latency:           0.031942
> Min latency:           0.001445
>
> So it finishes instantly without seeming to do much actual testing...

My bad.  I forgot to tell you to do a sync/flush on the OSDs after the 
write test.  All of those reads are probably coming from pagecache.  The 
good news is that this is demonstrating that reading 4k objects from 
pagecache isn't insanely bad on your setup (for larger sustained loads I 
see 4k object reads from pagecache hit up to around 100MB/s with 
multiple clients on my test nodes).

On your OSD nodes try:

sync
echo 3 > /proc/sys/vm/drop_caches

right before you run the read test.

Whatever issue you are facing is probably down at the filestore level or 
possibly lower still.

How do your drives benchmark with something like fio doing random 4k 
writes?  Are your drives dedicated for ceph?  What filesystem?  Also 
what is the journal device you are using?

Mark

>
> James
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-20  0:33             ` James Harper
  2013-04-20  1:30               ` James Harper
@ 2013-04-21 17:56               ` Sylvain Munaut
  2013-04-21 23:04                 ` James Harper
  1 sibling, 1 reply; 30+ messages in thread
From: Sylvain Munaut @ 2013-04-21 17:56 UTC (permalink / raw)
  To: James Harper; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

Hi,

> My goal is 4 OSD's, each on separate machines, with 1 drive in each for a start, but I want to see performance of at least the same order of magnitude as the theoretical maximum on my hardware before I think about replacing my existing setup.

My current understanding is that it's not even possible: you always
have at least a 2-3x slowdown, even in the best case.

If you run a sustained sequential write benchmark and have a single
drive, then that drive ends up writing the data twice (journal + final
storage area) which, with the seeks, will more than halve the peak
performance of the drive. And since it's sequential, it will only write
to 1 PG at a time (so the load is not divided among several OSDs).

Also, AFAIU, the OSD receiving the data will also have to send the data
to the other OSDs in the PG and wait for them to say everything is
written before confirming the write, which slows it even more.
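
The reasoning above can be turned into a rough bound. This is only a sketch: the function name, the even-spread model and the 100 MB/s drive figure are illustrative assumptions, not anything taken from Ceph. Seeks, replication round trips and the single-PG effect mentioned above are ignored, and all of them push the real figure lower.

```python
def client_write_bound(disk_mb_s, num_osds, replicas, journal_on_data_disk=True):
    """Optimistic upper bound on aggregate client write throughput (MB/s).

    Assumed model: each replica is written once to the journal and once to
    the data area, and the load spreads evenly over all OSD disks.
    """
    writes_per_replica = 2 if journal_on_data_disk else 1
    return disk_mb_s * num_osds / (replicas * writes_per_replica)


# Hypothetical 100 MB/s drive, one OSD, one copy: half the raw speed at best.
print(client_write_bound(100.0, num_osds=1, replicas=1))
# Two OSDs holding two replicas: still half of one drive, not two drives' worth.
print(client_write_bound(100.0, num_osds=2, replicas=2))
```

Moving the journal to a separate device corresponds to `journal_on_data_disk=False` in this model, which doubles the bound.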

Cheers,

    Sylvain

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-21 17:56               ` Sylvain Munaut
@ 2013-04-21 23:04                 ` James Harper
  2013-04-22  8:34                   ` Sylvain Munaut
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-21 23:04 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

> Hi,
> 
> > My goal is 4 OSD's, each on separate machines, with 1 drive in each for a
> start, but I want to see performance of at least the same order of magnitude
> as the theoretical maximum on my hardware before I think about replacing
> my existing setup.
> 
> My current understanding is that it's not even possible, you always
> have a min 2/3x slow down in the best case.
> 
> If you do sustained sequential write benchmark, and have a single
> drive, then that drive ends up writing the data twice (journal + final
> storage area) which with the seeks will more than divide by 2 the peak
> perf of the drive. And since it's sequential, it will only write to 1
> PG at a time (so not divided among several OSD).
> 
> Also AFAIU the OSD receiving the data will also have to send the data
> to the other OSD in the PG and wait for them to say everything is
> written before confirming the write, which slows it even more.
> 

Correct, but that's the theoretical maximum I was referring to. If I calculate that I should be able to get 50MB/second then 30MB/second is acceptable but 500KB/second is not :)

James


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-21 13:52                 ` Mark Nelson
@ 2013-04-22  5:32                   ` James Harper
  2013-04-22 11:34                     ` Mark Nelson
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-22  5:32 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> 
> On 04/19/2013 08:30 PM, James Harper wrote:
> >>> rados -p <pool> -b 4096 bench 300 seq -t 64
> >>
> >> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>       0       0         0         0         0         0         -         0
> >> read got -2
> >> error during benchmark: -5
> >> error 5: (5) Input/output error
> >>
> >> not sure what that's about...
> >>
> >
> > Oops... I typo'd --no-cleanup. Now I get:
> >
> >     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >       0       0         0         0         0         0         -         0
> >   Total time run:        0.243709
> > Total reads made:     1292
> > Read size:            4096
> > Bandwidth (MB/sec):    20.709
> >
> > Average Latency:       0.0118838
> > Max latency:           0.031942
> > Min latency:           0.001445
> >
> > So it finishes instantly without seeming to do much actual testing...
> 
> My bad.  I forgot to tell you to do a sync/flush on the OSDs after the
> write test.  All of those reads are probably coming from pagecache.  The
> good news is that this is demonstrating that reading 4k objects from
> pagecache isn't insanely bad on your setup (for larger sustained loads I
> see 4k object reads from pagecache hit up to around 100MB/s with
> multiple clients on my test nodes).
> 
> On your OSD nodes try:
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> right before you run the read test.
> 

I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing something else wrong.

> Whatever issue you are facing is probably down at the filestore level or
> possibly lower still.
> 
> How do your drives benchmark with something like fio doing random 4k
> writes?  Are your drives dedicated for ceph?  What filesystem?  Also
> what is the journal device you are using?
> 

Drives are dedicated for ceph. I originally put my journals on /, but that was ext3 and my throughput went down even further so the journal shares the osd disk for now.

I upgraded to 0.60 and that seems to have made a big difference. If I kill off one of my OSD's I get around 20MB/second throughput in live testing (test restore of Xen Windows VM from USB backup), which is pretty much the limit of the USB disk. If I reactivate the second OSD throughput drops back to ~10MB/second which isn't as good but is much better than I was getting.

Thanks

James


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-21 23:04                 ` James Harper
@ 2013-04-22  8:34                   ` Sylvain Munaut
  2013-04-22 11:34                     ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: Sylvain Munaut @ 2013-04-22  8:34 UTC (permalink / raw)
  To: James Harper; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

Hi,

> Correct, but that's the theoretical maximum I was referring to. If I calculate that I should be able to get 50MB/second then 30MB/second is acceptable but 500KB/second is not :)

I have written a small benchmark for RBD :

https://gist.github.com/smunaut/5433222

It uses the librbd API directly, without the kernel client, and queues
requests well in advance, so it should give an "upper" bound on what
you can get at best.
It reads and writes the whole image, so I usually just create a 1 or 2
GB image for testing.

Using two OSDs on two distinct recent 7200rpm drives (with journal on
the same disk as data), I get :

Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)

The raw disk does about 45 MB/s when written in 1M chunks. But when
written in 4k chunks, this falls to ~500 kB/s ...

# dd if=/dev/zero of=/dev/xen-disks/test bs=1M oflag=direct
2049+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 49.3943 s, 43.5 MB/s

# dd if=/dev/zero of=/dev/xen-disks/test bs=4k oflag=direct
^C61667+0 records in
61667+0 records out
252588032 bytes (253 MB) copied, 539.123 s, 469 kB/s
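
A quick sanity check of the 4k figures above, as a sketch. The comparison with the platter rotation rate is one plausible reading and is not stated in the thread: with direct I/O each 4k write is issued only after the previous one completes, so the drive may miss the next sector and wait most of a revolution. The function name is illustrative.

```python
def dd_rates(bytes_copied, seconds, block_size):
    """Return (bytes per second, writes per second) for a dd run."""
    return bytes_copied / seconds, bytes_copied / block_size / seconds


# Figures taken from the 4k dd output above.
rate, iops = dd_rates(252588032, 539.123, 4096)
rev_per_s = 7200 / 60  # a 7200rpm platter completes 120 revolutions per second

print(f"{rate / 1000:.0f} kB/s, {iops:.0f} writes/s, {rev_per_s:.0f} rev/s")
```

The write rate lands just under the rotation rate, which would fit a pattern of roughly one small write per revolution.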

Cheers,

    Sylvain

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22  5:32                   ` James Harper
@ 2013-04-22 11:34                     ` Mark Nelson
  2013-04-22 11:40                       ` James Harper
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Nelson @ 2013-04-22 11:34 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

On 04/22/2013 12:32 AM, James Harper wrote:
>>
>> On 04/19/2013 08:30 PM, James Harper wrote:
>>>>> rados -p <pool> -b 4096 bench 300 seq -t 64
>>>>
>>>> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>        0       0         0         0         0         0         -         0
>>>> read got -2
>>>> error during benchmark: -5
>>>> error 5: (5) Input/output error
>>>>
>>>> not sure what that's about...
>>>>
>>>
>>> Oops... I typo'd --no-cleanup. Now I get:
>>>
>>>      sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>        0       0         0         0         0         0         -         0
>>>    Total time run:        0.243709
>>> Total reads made:     1292
>>> Read size:            4096
>>> Bandwidth (MB/sec):    20.709
>>>
>>> Average Latency:       0.0118838
>>> Max latency:           0.031942
>>> Min latency:           0.001445
>>>
>>> So it finishes instantly without seeming to do much actual testing...
>>
>> My bad.  I forgot to tell you to do a sync/flush on the OSDs after the
>> write test.  All of those reads are probably coming from pagecache.  The
>> good news is that this is demonstrating that reading 4k objects from
>> pagecache isn't insanely bad on your setup (for larger sustained loads I
>> see 4k object reads from pagecache hit up to around 100MB/s with
>> multiple clients on my test nodes).
>>
>> On your OSD nodes try:
>>
>> sync
>> echo 3 > /proc/sys/vm/drop_caches
>>
>> right before you run the read test.
>>
>
> I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing something else wrong.
>

It will try to read for up to 300 seconds, but if it runs out of data it 
stops.  Since you only wrote out something like 1300 4k objects, and you 
were reading at 20+MB/s, the test ran for under a second.
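
The arithmetic behind that explanation, as a small sketch. The helper names are illustrative. The reported bandwidth only matches the other figures if "MB" in the bench output means 2^20 bytes, which is what the constant below assumes.

```python
MIB = 1024 * 1024


def read_bench_duration(num_objects, object_size, bandwidth_mib_s):
    """Seconds a sequential read test can run before it runs out of data."""
    return num_objects * object_size / (bandwidth_mib_s * MIB)


def objects_needed(seconds, object_size, bandwidth_mib_s):
    """Objects the write phase must leave behind to keep reads busy that long."""
    return int(seconds * bandwidth_mib_s * MIB / object_size) + 1


# 1292 objects of 4096 bytes read at 20.709 MiB/s, as in the quoted output.
print(f"{read_bench_duration(1292, 4096, 20.709):.3f} s")
# Objects required for a full 300 second read test at the same rate.
print(objects_needed(300, 4096, 20.709))
```

At that read rate a 300 second test needs on the order of 1.6 million 4k objects (about 6 GiB), so the preceding write test has to run long enough to produce them.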

>> Whatever issue you are facing is probably down at the filestore level or
>> possibly lower still.
>>
>> How do your drives benchmark with something like fio doing random 4k
>> writes?  Are your drives dedicated for ceph?  What filesystem?  Also
>> what is the journal device you are using?
>>
>
> Drives are dedicated for ceph. I originally put my journals on /, but that was ext3 and my throughput went down even further so the journal shares the osd disk for now.
>
> I upgraded to 0.60 and that seems to have made a big difference. If I kill off one of my OSD's I get around 20MB/second throughput in live testing (test restore of Xen Windows VM from USB backup), which is pretty much the limit of the USB disk. If I reactivate the second OSD throughput drops back to ~10MB/second which isn't as good but is much better than I was getting.
>

Ah, are these disks both connected through USB(2?)?

> Thanks
>
> James
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-22  8:34                   ` Sylvain Munaut
@ 2013-04-22 11:34                     ` James Harper
  2013-04-22 11:39                       ` Mark Nelson
  0 siblings, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-22 11:34 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

> Hi,
> 
> > Correct, but that's the theoretical maximum I was referring to. If I calculate
> that I should be able to get 50MB/second then 30MB/second is acceptable
> but 500KB/second is not :)
> 
> I have written a small benchmark for RBD :
> 
> https://gist.github.com/smunaut/5433222
> 
> It uses the librbd API directly without kernel client and queue
> requests long in advance and this should give an "upper" bound to what
> you can get at best.
> It reads and writes the whole image, so I usually just create a 1 or 2
> G image for testing.
> 
> Using two OSDs on two distinct recent 7200rpm drives (with journal on
> the same disk as data), I get :
> 
> Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
> Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)
> 

I like your benchmark tool!

How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas I get:

# ./a.out admin xen test
Read: 111.99 Mb/s (1073741824 bytes in 9144 ms)
Write: 29.68 Mb/s (1073741824 bytes in 34507 ms)

Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my public network (single gigabit interface). After dropping caches I consistently get:

# ./a.out admin xen test
Read: 39.98 Mb/s (1073741824 bytes in 25614 ms)
Write: 23.11 Mb/s (1073741824 bytes in 44316 ms)

Journal is on the same disk. Network is... confusing :) but is basically public on a single gigabit and cluster on a bonded pair of gigabit links. The whole network thing is shared with my existing drbd cluster so performance may vary over time.

My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read...

While running, iostat on each osd reports a read rate of around 20MB/second (1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on each) during write test, which is pretty much exactly right.
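
The "1/2 total" and "~2x total" figures can be checked with a short sketch. It assumes traffic spreads evenly over both OSDs and that each replica is written twice (journal plus data) because the journal shares the data disk. The function name is illustrative.

```python
def per_osd_rates(client_mib_s, num_osds, replicas, journal_on_data_disk=True):
    """Expected (read, write) disk traffic per OSD for a given client rate."""
    read = client_mib_s / num_osds
    journal_factor = 2 if journal_on_data_disk else 1
    write = client_mib_s * replicas * journal_factor / num_osds
    return read, write


# Client rates measured above after dropping caches: 39.98 read, 23.11 write.
read_rate, _ = per_osd_rates(39.98, num_osds=2, replicas=2)
_, write_rate = per_osd_rates(23.11, num_osds=2, replicas=2)
print(f"read ~{read_rate:.1f} per OSD, write ~{write_rate:.1f} per OSD")
```

Both results fall inside the ranges iostat reported (around 20 for reads, 40-60 for writes).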

iperf on the cluster network (pair of gigabits bonded) gives me about 1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second.
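
To compare the benchmark output with these iperf numbers the units have to be lined up. The sketch below assumes the tool's "Mb/s" is really bytes / 2^20 per second, which is the only reading that reproduces its printed figures from the byte and millisecond counts. Function names are illustrative and protocol overhead is ignored.

```python
MIB = 1024 * 1024


def tool_rate_mib_s(num_bytes, millis):
    """Rate as the benchmark tool appears to print it (2**20 bytes per second)."""
    return num_bytes / MIB / (millis / 1000)


def mib_s_to_gbit_s(mib_s):
    """Convert a payload rate in MiB/s to Gbit/s."""
    return mib_s * MIB * 8 / 1e9


# The cached read run above: 1073741824 bytes in 9144 ms.
cached_read = tool_rate_mib_s(1073741824, 9144)
print(f"{cached_read:.2f} MiB/s = {mib_s_to_gbit_s(cached_read):.2f} Gbit/s")
```

The cached read works out to about the same rate as the osd-to-client iperf result, consistent with the single gigabit public link being the limit in that run.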

changing the scheduler on the harddisk doesn't seem to make any difference, even when I set it to cfq which normally really sucks.

What ceph version are you using and what filesystem?

Thanks

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22 11:34                     ` James Harper
@ 2013-04-22 11:39                       ` Mark Nelson
  2013-04-22 11:48                         ` James Harper
  2013-04-22 15:20                         ` Sage Weil
  0 siblings, 2 replies; 30+ messages in thread
From: Mark Nelson @ 2013-04-22 11:39 UTC (permalink / raw)
  To: James Harper; +Cc: Sylvain Munaut, ceph-devel@vger.kernel.org

On 04/22/2013 06:34 AM, James Harper wrote:
>> Hi,
>>
>>> Correct, but that's the theoretical maximum I was referring to. If I calculate
>> that I should be able to get 50MB/second then 30MB/second is acceptable
>> but 500KB/second is not :)
>>
>> I have written a small benchmark for RBD :
>>
>> https://gist.github.com/smunaut/5433222
>>
>> It uses the librbd API directly without kernel client and queue
>> requests long in advance and this should give an "upper" bound to what
>> you can get at best.
>> It reads and writes the whole image, so I usually just create a 1 or 2
>> G image for testing.
>>
>> Using two OSDs on two distinct recent 7200rpm drives (with journal on
>> the same disk as data), I get :
>>
>> Read: 89.52 Mb/s (2147483648 bytes in 22877 ms)
>> Write: 10.62 Mb/s (2147483648 bytes in 192874 ms)
>>
>
> I like your benchmark tool!
>
> How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas I get:
>
> # ./a.out admin xen test
> Read: 111.99 Mb/s (1073741824 bytes in 9144 ms)
> Write: 29.68 Mb/s (1073741824 bytes in 34507 ms)
>
> Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my public network (single gigabit interface). After dropping caches I consistently get:
>
> # ./a.out admin xen test
> Read: 39.98 Mb/s (1073741824 bytes in 25614 ms)
> Write: 23.11 Mb/s (1073741824 bytes in 44316 ms)
>
> Journal is on the same disk. Network is... confusing :) but is basically public on a single gigabit and cluster on a bonded pair of gigabit links. The whole network thing is shared with my existing drbd cluster so performance may vary over time.
>
> My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read...

You may want to try increasing your read_ahead_kb on the OSD data disks 
and see if that helps read speeds.
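
For reference, read_ahead_kb is a per-device setting exposed through sysfs. A minimal sketch of reading and changing it follows; the device name `sdb` is a placeholder, writing requires root, and the `sysfs_root` parameter exists only so the helpers can be tried against a scratch directory.

```python
from pathlib import Path


def read_ahead_path(device, sysfs_root="/sys"):
    """Location of the readahead setting for a block device such as 'sdb'."""
    return Path(sysfs_root) / "block" / device / "queue" / "read_ahead_kb"


def get_read_ahead_kb(device, sysfs_root="/sys"):
    """Current readahead window in KiB."""
    return int(read_ahead_path(device, sysfs_root).read_text().strip())


def set_read_ahead_kb(device, kb, sysfs_root="/sys"):
    """Set the readahead window in KiB (needs root on a real system)."""
    read_ahead_path(device, sysfs_root).write_text(f"{kb}\n")


# Example use on an OSD node (placeholder device name):
#   print(get_read_ahead_kb("sdb"))
#   set_read_ahead_kb("sdb", 512)
```

The value does not persist across reboots, so a setting that proves useful has to be reapplied at boot by whatever mechanism the distribution provides.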

>
> While running, iostat on each osd reports a read rate of around 20MB/second (1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on each) during write test, which is pretty much exactly right.
>
> iperf on the cluster network (pair of gigabits bonded) gives me about 1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second.
>
> changing the scheduler on the harddisk doesn't seem to make any difference, even when I set it to cfq which normally really sucks.
>
> What ceph version are you using and what filesystem?
>
> Thanks
>
> James
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-22 11:34                     ` Mark Nelson
@ 2013-04-22 11:40                       ` James Harper
  0 siblings, 0 replies; 30+ messages in thread
From: James Harper @ 2013-04-22 11:40 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> > I upgraded to 0.60 and that seems to have made a big difference. If I kill off
> > one of my OSD's I get around 20MB/second throughput in live testing (test
> > restore of Xen Windows VM from USB backup), which is pretty much the
> > limit of the USB disk. If I reactivate the second OSD throughput drops back to
> > ~10MB/second which isn't as good but is much better than I was getting.
> >
> 
> Ah, are these disks both connected through USB(2?)?
> 

I guess I was a bit brief :)

Both my OSD disks are SATA attached. Inside a VM I have attached another disk which is attached to the host via USB. This disk contains a backup of a server (using Windows Server Backup) and I am doing a test restore of it, with ceph holding the C: drive of the virtual server (i.e. the write target). What I was saying is that I would never expect more than about 20-30MB/s write speed in this test because that is going to be approximately the limit of the USB interface that the data is coming from. This is more a production test than a benchmark, and I was just using iostat to monitor the throughput of the /dev/rbdX interfaces while doing the restore.

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: poor write performance
  2013-04-22 11:39                       ` Mark Nelson
@ 2013-04-22 11:48                         ` James Harper
  2013-04-22 12:01                           ` Mark Nelson
  2013-04-22 15:20                         ` Sage Weil
  1 sibling, 1 reply; 30+ messages in thread
From: James Harper @ 2013-04-22 11:48 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sylvain Munaut, ceph-devel@vger.kernel.org

> > My read speed is consistently around 40MB/second, and my write speed is
> > consistently around 22MB/second. I had expected better of read...
> 
> You may want to try increasing your read_ahead_kb on the OSD data disks
> and see if that helps read speeds.
> 

Default appears to be 128 and I was getting 40MB/second
Increasing to 256 takes me up to 48MB/second
Increasing to 512 takes me up to 53MB/second

Any further increases don't do anything that I can measure

Is increasing read_ahead_kb good for general performance, or just for impressing people with benchmarks? If the kernel spent time reading ahead, would it hurt random read/write performance?

Thanks
 
James


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22 11:48                         ` James Harper
@ 2013-04-22 12:01                           ` Mark Nelson
  2013-04-22 13:47                             ` Mark Nelson
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Nelson @ 2013-04-22 12:01 UTC (permalink / raw)
  To: James Harper; +Cc: Sylvain Munaut, ceph-devel@vger.kernel.org

On 04/22/2013 06:48 AM, James Harper wrote:
>>> My read speed is consistently around 40MB/second, and my write speed is
>>> consistently around 22MB/second. I had expected better of read...
>>
>> You may want to try increasing your read_ahead_kb on the OSD data disks
>> and see if that helps read speeds.
>>
>
> Default appears to be 128 and I was getting 40MB/second
> Increasing to 256 takes me up to 48MB/second
> Increasing to 512 takes me up to 53MB/second
>
> Any further increases don't do anything that I can measure
>
> Is increasing read_ahead_kb good for general performance, or just for impressing people with benchmarks? If the kernel spent time reading ahead, would it hurt random read/write performance?

Potentially yes, but it depends on a lot of factors.  I suspect that 
increasing it may be acceptable on modern drives, but you'll need to do 
some testing to see how it goes in practice.

If anyone on the list knows how many sectors per track is typical for 
modern 1-3TB drives I'm dying to know. That would help us guess at how 
much data can be written/read on average without imposing any head 
movement. :)

>
> Thanks
>
> James
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22 12:01                           ` Mark Nelson
@ 2013-04-22 13:47                             ` Mark Nelson
  0 siblings, 0 replies; 30+ messages in thread
From: Mark Nelson @ 2013-04-22 13:47 UTC (permalink / raw)
  To: James Harper; +Cc: Sylvain Munaut, ceph-devel@vger.kernel.org

On 04/22/2013 07:01 AM, Mark Nelson wrote:
> On 04/22/2013 06:48 AM, James Harper wrote:
>>>> My read speed is consistently around 40MB/second, and my write speed is
>>>> consistently around 22MB/second. I had expected better of read...
>>>
>>> You may want to try increasing your read_ahead_kb on the OSD data disks
>>> and see if that helps read speeds.
>>>
>>
>> Default appears to be 128 and I was getting 40MB/second
>> Increasing to 256 takes me up to 48MB/second
>> Increasing to 512 takes me up to 53MB/second
>>
>> Any further increases don't do anything that I can measure
>>
>> Is increasing read_ahead_kb good for general performance, or just for
>> impressing people with benchmarks? If the kernel spent time reading
>> ahead, would it hurt random read/write performance?
>
> Potentially yes, but it depends on a lot of factors.  I suspect that
> increasing it may be acceptable on modern drives, but you'll need to do
> some testing to see how it goes in practice.
>
> If anyone on the list knows how many sectors per track is typical for
> modern 1-3TB drives I'm dying to know. That would help us guess at how
> much data can be written/read on average without imposing any head
> movement. :)
>

Aha, sorry to reply to my own mail.  I found some specifications for 
Hitachi drives at least:

http://www.hgst.com/tech/techlib.nsf/products/Ultrastar_7K4000

look at section 4.2 of the "Ultrastar 7K4000 OEM Specification" document.

It specifies 310ktpi, or 310,000 tracks/inch.

Via google I found that this drive is using 5 800GB platters, meaning 
there are 10 heads in this drive.  Using hitachi's specifications:

(7,814,037,168 sectors / (310,000 tracks / inch * 3.5 inches)) / 10 
heads * 512 bytes / sector = ~360KB/track head

So assuming my math is right, it looks like we can read up to around 
360KB of data before hitting a head switch.  Now unfortunately (or maybe 
fortunately!) this is just the average case.  Outer tracks will store 
more data than inner tracks, so depending on what portion of the disk 
you are doing the read from, you might introduce head switches more or 
less often.  It looks like even with a 256k or 512k read_ahead you 
probably won't introduce a next-cylinder seek that often, though from 
what I can find it's not going to be all that much more expensive vs a 
head switch (2-3ms vs 1-2ms).
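
The estimate above can be reproduced with a few lines. The inputs are the figures quoted in the message; the function name is illustrative. Note that the 3.5 inch radial band is the message's own simplification. The recording band on a real platter is narrower than the platter diameter, so the true average per track is probably larger, which would only strengthen the conclusion.

```python
def bytes_per_track(total_sectors, tracks_per_inch, band_inches, heads,
                    sector_bytes=512):
    """Average number of bytes stored on one track of one surface."""
    tracks_per_surface = tracks_per_inch * band_inches
    sectors_per_surface = total_sectors / heads
    return sectors_per_surface / tracks_per_surface * sector_bytes


# 7,814,037,168 sectors, 310 ktpi, 3.5 inch band, 10 heads.
estimate = bytes_per_track(7_814_037_168, 310_000, 3.5, 10)
print(f"{estimate / 1024:.0f} KiB per track")
```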

Mark

>>
>> Thanks
>>
>> James
>>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22 11:39                       ` Mark Nelson
  2013-04-22 11:48                         ` James Harper
@ 2013-04-22 15:20                         ` Sage Weil
  2013-04-22 15:35                           ` Sylvain Munaut
  1 sibling, 1 reply; 30+ messages in thread
From: Sage Weil @ 2013-04-22 15:20 UTC (permalink / raw)
  To: Mark Nelson; +Cc: James Harper, Sylvain Munaut, ceph-devel@vger.kernel.org

> You may want to try increasing your read_ahead_kb on the OSD data disks and
> see if that helps read speeds.

Jumping into this thread late, so I'm not sure if this was covered, but:

Remember that readahead on the OSDs will only help up to the size of the 
object (4MB).  To get good read performance in general what is really 
needed is for the librbd user to do readahead so that the next object(s)
are being fetched before they are needed.  I don't think this happens with 
'dd' (opening a block device as a file does not trigger the kernel VM 
readahead code, IIRC).  Unless Sylvain implemented this in his tool 
explicitly, it won't happen there either.
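
A small sketch of why OSD-side readahead stops at the object boundary. It assumes the image is cut into contiguous 4MB objects, the figure given above, with no other striping. The function names are illustrative and are not part of the librbd API.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4MB objects, as stated in the message above


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Indices of the objects a read of `length` bytes at `offset` falls into."""
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


def prefetch_candidates(offset, length, depth=2, object_size=OBJECT_SIZE):
    """Next `depth` objects that a client-side readahead could request early."""
    last = objects_touched(offset, length, object_size)[-1]
    return [last + 1 + i for i in range(depth)]


MIB = 1024 * 1024
print(objects_touched(0, MIB))            # a 1M read inside the first object
print(objects_touched(3 * MIB + 512 * 1024, MIB))  # a 1M read crossing a boundary
print(prefetch_candidates(0, MIB))        # objects a client could fetch ahead
```

A sequential reader moves to a new object every 4MB, and each object may live on a different OSD, so only the client knows which object comes next and can request it early.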

sage

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: poor write performance
  2013-04-22 15:20                         ` Sage Weil
@ 2013-04-22 15:35                           ` Sylvain Munaut
  0 siblings, 0 replies; 30+ messages in thread
From: Sylvain Munaut @ 2013-04-22 15:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, James Harper, ceph-devel@vger.kernel.org

Hi,

> Unless Sylvain implemented this in his tool
> explicitly, it won't happen there either.

The small bench tool submits requests using the asynchronous API as
fast as possible, using a 1M chunk.
Then it just waits for all the completions to be done.

    Sylvain

^ permalink raw reply	[flat|nested] 30+ messages in thread

