From mboxrd@z Thu Jan  1 00:00:00 1970
From: Malcolm Haak <malcolm@sgi.com>
Subject: Re: RBD Read performance
Date: Fri, 19 Apr 2013 10:27:44 +1000
Message-ID: <51708F80.8090803@sgi.com>
References: <516F77FF.4060401@sgi.com> <516F9AEB.7000706@inktank.com> <516F9F35.1030507@sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from relay1.sgi.com ([192.48.179.29]:37004 "EHLO relay.sgi.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S967671Ab3DSA1t (ORCPT <rfc822;ceph-devel@vger.kernel.org>);
	Thu, 18 Apr 2013 20:27:49 -0400
In-Reply-To: <516F9F35.1030507@sgi.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Mark Nelson <mark.nelson@inktank.com>
Cc: ceph-devel@vger.kernel.org

Morning all,

I ran the echo (setting tcp_moderate_rcvbuf to 0) on all boxes involved, and the results are in:

[root@dogbreath ~]#
[root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s
[root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 316.025 s, 133 MB/s
[root@dogbreath ~]#
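
For anyone wanting to line these numbers up against the equivalent 4M reads from the earlier message (150.774 s direct, 382.334 s buffered, quoted further down), here is a rough sketch of the arithmetic. The function names mbps and pct_change are made up for illustration, and it assumes both runs read the same 41943040000 bytes under comparable conditions, which may not strictly hold.

```shell
# Sketch: recompute dd throughput in decimal MB/s (the unit dd reports) and
# the percentage change in throughput between two runs of the same size.
# Function names are illustrative only.
mbps() {
    # $1 = bytes copied, $2 = elapsed seconds
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

pct_change() {
    # $1 = old elapsed seconds, $2 = new elapsed seconds (same byte count),
    # so the throughput ratio is simply old_time / new_time
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%+.1f\n", (o / n - 1) * 100 }'
}

bytes=41943040000
echo "direct:   $(mbps "$bytes" 150.774) -> $(mbps "$bytes" 144.083) MB/s ($(pct_change 150.774 144.083)%)"
echo "buffered: $(mbps "$bytes" 382.334) -> $(mbps "$bytes" 316.025) MB/s ($(pct_change 382.334 316.025)%)"
```

Either way, both read figures remain far below the roughly 700 MB/s single-client write rates reported earlier.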

No change, which is a shame. What other information should I gather, or 
what testing should I start on next?
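
One thing that could be ruled out first is whether the setting actually stuck on every node, since a value written under /proc/sys only lasts until the next reboot. A minimal sketch of such a check follows. The function name check_rcvbuf and its optional path argument are hypothetical, added only so the logic can be tried against a scratch file.

```shell
# Sketch: report whether TCP receive-buffer auto-tuning is disabled on this
# node. The function name and the optional path argument are illustrative.
check_rcvbuf() {
    f="${1:-/proc/sys/net/ipv4/tcp_moderate_rcvbuf}"
    # Bail out with a distinct status if the file cannot be read
    val="$(cat "$f" 2>/dev/null)" || { echo "unreadable: $f"; return 2; }
    if [ "$val" = "0" ]; then
        echo "autotuning disabled"
    else
        echo "autotuning still enabled (value: $val)"
        return 1
    fi
}
```

Running check_rcvbuf with no argument on each client and server node should print "autotuning disabled" if the earlier echo is still in effect.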

Regards

Malcolm Haak

On 18/04/13 17:22, Malcolm Haak wrote:
> Hi Mark!
>
> Thanks for the quick reply!
>
> I'll reply inline below.
>
> On 18/04/13 17:04, Mark Nelson wrote:
>> On 04/17/2013 11:35 PM, Malcolm Haak wrote:
>>> Hi all,
>>
>> Hi Malcolm!
>>
>>>
>>> I jumped into the IRC channel yesterday and they said to email
>>> ceph-devel. I have been having some read performance issues, with reads
>>> being slower than writes by a factor of ~5-8.
>>
>> I recently saw this kind of behaviour (writes were fine, but reads were
>> terrible) on an IPoIB based cluster and it was caused by the same TCP
>> auto tune issues that Jim Schutt saw last year. It's worth a try at
>> least to see if it helps.
>>
>> echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>>
>> on all of the clients and server nodes should be enough to test it out.
>>   Sage added an option in more recent Ceph builds that lets you work
>> around it too.
>>
> Awesome I will test this first up tomorrow.
>>>
>>> First info:
>>> Server
>>> SLES 11 SP2
>>> Ceph 0.56.4.
>>> 12 OSD's  that are Hardware Raid 5 each of the twelve is made from 5
>>> NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s
>>> stream write and the same if not better read) Connected via 2xQDR IB
>>> OSD's/MDS and such all on same box (for testing)
>>> Box is a Quad AMD Opteron 6234
>>> Ram is 256Gb
>>> 10GB Journals
>>> osd_op_threads: 8
>>> osd_disk_threads: 2
>>> filestore_op_threads: 4
>>> OSD's are all XFS
>>
>> Interesting setup!  QUAD socket Opteron boxes have somewhat slow and
>> slightly oversubscribed hypertransport links don't they?  I wonder if on
>> a system with so many disks and QDR-IB if that could become a problem...
>>
>> We typically like smaller nodes where we can reasonably do 1 OSD per
>> drive, but we've tested on a couple of 60 drive chassis in RAID configs
>> too.  Should be interesting to hear what kind of aggregate performance
>> you can eventually get.
>
> We are also going to try this out with 6 luns on a dual xeon box. The
> Opteron box was the biggest scariest thing we had that was doing nothing.
>
>>
>>>
>>> All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP
>>> performance tests between the nodes.
>>>
>>> Clients: One is FC17, the other is Ubuntu 12.10; they only have around
>>> 32GB-70GB ram.
>>>
>>> We ran into an odd issue where the OSD's would all start in the same NUMA
>>> node and pretty much on the same processor core. We fixed that up with
>>> some cpuset magic.
>>
>> Strange!  Was that more due to cpuset or Ceph?  I can't imagine that we
>> are doing anything that would cause that.
>>
>
> More than likely it is an odd quirk in the SLES kernel.. but when I have
> time I'll do some more poking. We were seeing insane CPU usage on some
> cores because all the OSD's were piled up in one place.
>
>>>
>>> Performance testing we have done: (Note oflag=direct was yielding
>>> results within 5% of cached results)
>>>
>>>
>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
>>> 3200+0 records in
>>> 3200+0 records out
>>> 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
>>> root@ty3:~#
>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>> root@ty3:~#
>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
>>> 4800+0 records in
>>> 4800+0 records out
>>> 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s
>>>
>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
>>> count=2400
>>> 2400+0 records in
>>> 2400+0 records out
>>> 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
>>> count=9600
>>> 9600+0 records in
>>> 9600+0 records out
>>> 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s
>>>
>>> Both clients each doing a 140GB write (2x dogbreath's RAM) at the same
>>> time to two different rbds in the same pool.
>>>
>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
>>> 14000+0 records in
>>> 14000+0 records out
>>> 146800640000 bytes (147 GB) copied, 412.404 s, 356 MB/s
>>> root@ty3:~#
>>>
>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
>>> count=14000
>>> 14000+0 records in
>>> 14000+0 records out
>>> 146800640000 bytes (147 GB) copied, 433.351 s, 339 MB/s
>>> [root@dogbreath ~]#
>>>
>>> Onto reads...
>>> Also we found that doing iflag=direct increased read performance.
>>>
>>> [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M
>>> count=160
>>> 160+0 records in
>>> 160+0 records out
>>> 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
>>> [root@dogbreath ~]#
>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M
>>> count=10000
>>> 10000+0 records in
>>> 10000+0 records out
>>> 41943040000 bytes (42 GB) copied, 382.334 s, 110 MB/s
>>> [root@dogbreath ~]#
>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M
>>> count=10000 iflag=direct
>>> 10000+0 records in
>>> 10000+0 records out
>>> 41943040000 bytes (42 GB) copied, 150.774 s, 278 MB/s
>>> [root@dogbreath ~]#
>>>
>>>
>>> So what info do you want/where do I start hunting for my wumpus?
>>
>> might also be worth looking at the size of the reads to see if there's a
>> lot of fragmentation.  Also, is this kernel rbd or qemu-kvm?
>>
>
> What got us was that the back-end storage was showing very low read
> rates, whereas when writing we could see almost a 2x write rate back to
> physical disk (we assume that is journal+data, as the 2x is not there from
> the word go but ramps up around the 3-5 second mark).
>
> It is kernel rbd at the moment, we will be testing qemu-kvm after things
> make sense.
>
>>>
>>> Regards
>>>
>>> Malcolm Haak
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>