From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: RBD Read performance
Date: Thu, 18 Apr 2013 19:40:24 -0500
Message-ID: <51709278.8050602@inktank.com>
References: <516F77FF.4060401@sgi.com> <516F9AEB.7000706@inktank.com> <516F9F35.1030507@sgi.com> <51708F80.8090803@sgi.com>
In-Reply-To: <51708F80.8090803@sgi.com>
To: Malcolm Haak
Cc: ceph-devel@vger.kernel.org

On 04/18/2013 07:27 PM, Malcolm Haak wrote:
> Morning all,
>
> Did the echos on all boxes involved... and the results are in..
>
> [root@dogbreath ~]#
> [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
> 10000+0 records in
> 10000+0 records out
> 41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s
> [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
> 10000+0 records in
> 10000+0 records out
> 41943040000 bytes (42 GB) copied, 316.025 s, 133 MB/s
> [root@dogbreath ~]#

Boo!

>
> No change which is a shame. What other information or testing should I
> start?

Any chance you can try out a quick rados bench test from the client
against the pool for writes and reads and see how that works?

rados -p <pool> bench 300 write --no-cleanup
rados -p <pool> bench 300 seq

>
> Regards
>
> Malcolm Haak
>
> On 18/04/13 17:22, Malcolm Haak wrote:
>> Hi Mark!
>>
>> Thanks for the quick reply!
>>
>> I'll reply inline below.
>>
>> On 18/04/13 17:04, Mark Nelson wrote:
>>> On 04/17/2013 11:35 PM, Malcolm Haak wrote:
>>>> Hi all,
>>>
>>> Hi Malcolm!
>>>
>>>>
>>>> I jumped into the IRC channel yesterday and they said to email
>>>> ceph-devel. I have been having some read performance issues, with
>>>> reads being slower than writes by a factor of ~5-8.
>>>
>>> I recently saw this kind of behaviour (writes were fine, but reads were
>>> terrible) on an IPoIB based cluster and it was caused by the same TCP
>>> auto tune issues that Jim Schutt saw last year. It's worth a try at
>>> least to see if it helps.
>>>
>>> echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>>>
>>> on all of the clients and server nodes should be enough to test it out.
>>> Sage added an option in more recent Ceph builds that lets you work
>>> around it too.
>>>
>> Awesome, I will test this first up tomorrow.
>>>>
>>>> First info:
>>>> Server
>>>> SLES 11 SP2
>>>> Ceph 0.56.4.
>>>> 12 OSDs, each a hardware RAID 5 LUN made from 5 NL-SAS disks, for a
>>>> total of 60 disks (each LUN can do around 320MB/s stream write and
>>>> the same if not better read). Connected via 2xQDR IB
>>>> OSDs/MDS and such all on the same box (for testing)
>>>> Box is a Quad AMD Opteron 6234
>>>> RAM is 256GB
>>>> 10GB Journals
>>>> osd_op_threads: 8
>>>> osd_disk_threads: 2
>>>> filestore_op_threads: 4
>>>> OSDs are all XFS
>>>
>>> Interesting setup! Quad socket Opteron boxes have somewhat slow and
>>> slightly oversubscribed HyperTransport links, don't they? I wonder if,
>>> on a system with so many disks and QDR IB, that could become a problem...
>>>
>>> We typically like smaller nodes where we can reasonably do 1 OSD per
>>> drive, but we've tested on a couple of 60 drive chassis in RAID configs
>>> too. Should be interesting to hear what kind of aggregate performance
>>> you can eventually get.
>>
>> We are also going to try this out with 6 LUNs on a dual Xeon box. The
>> Opteron box was the biggest scariest thing we had that was doing nothing.
>>
>>>
>>>>
>>>> All nodes are connected via QDR IB using IPoIB.
>>>> We get 1.7GB/s on TCP performance tests between the nodes.
>>>>
>>>> Clients: one is FC17, the other is Ubuntu 12.10; they only have
>>>> around 32GB-70GB RAM.
>>>>
>>>> We ran into an odd issue where the OSDs would all start in the same
>>>> NUMA node and pretty much on the same processor core. We fixed that
>>>> up with some cpuset magic.
>>>
>>> Strange! Was that more due to cpuset or Ceph? I can't imagine that we
>>> are doing anything that would cause that.
>>>
>>
>> More than likely it is an odd quirk in the SLES kernel.. but when I have
>> time I'll do some more poking. We were seeing insane CPU usage on some
>> cores because all the OSDs were piled up in one place.
>>
>>>>
>>>> Performance testing we have done: (Note oflag=direct was yielding
>>>> results within 5% of cached results)
>>>>
>>>>
>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
>>>> 3200+0 records in
>>>> 3200+0 records out
>>>> 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
>>>> root@ty3:~#
>>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>>> root@ty3:~#
>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
>>>> 4800+0 records in
>>>> 4800+0 records out
>>>> 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s
>>>>
>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400
>>>> 2400+0 records in
>>>> 2400+0 records out
>>>> 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
>>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600
>>>> 9600+0 records in
>>>> 9600+0 records out
>>>> 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s
>>>>
>>>> Both clients each doing a 140GB write (2x dogbreath's RAM) at the same
>>>> time to two different rbds in the same pool.
>>>>
>>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
>>>> 14000+0 records in
>>>> 14000+0 records out
>>>> 146800640000 bytes (147 GB) copied, 412.404 s, 356 MB/s
>>>> root@ty3:~#
>>>>
>>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000
>>>> 14000+0 records in
>>>> 14000+0 records out
>>>> 146800640000 bytes (147 GB) copied, 433.351 s, 339 MB/s
>>>> [root@dogbreath ~]#
>>>>
>>>> Onto reads...
>>>> Also we found that doing iflag=direct increased read performance.
>>>>
>>>> [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160
>>>> 160+0 records in
>>>> 160+0 records out
>>>> 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
>>>> [root@dogbreath ~]#
>>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 41943040000 bytes (42 GB) copied, 382.334 s, 110 MB/s
>>>> [root@dogbreath ~]#
>>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 41943040000 bytes (42 GB) copied, 150.774 s, 278 MB/s
>>>> [root@dogbreath ~]#
>>>>
>>>>
>>>> So what info do you want/where do I start hunting for my wumpus?
>>>
>>> might also be worth looking at the size of the reads to see if there's a
>>> lot of fragmentation. Also, is this kernel rbd or qemu-kvm?
>>>
>>
>> Thing that got us was the back-end storage was showing very low read
>> rates.
>> Whereas when writing we could see almost 2x the write rate going back
>> to physical disk (we assume that is journal+data, as the 2x is not
>> there from the word go but ramps up around the 3-5 second mark)
>>
>> It is kernel rbd at the moment; we will be testing qemu-kvm after things
>> make sense.
>>
>>>>
>>>> Regards
>>>>
>>>> Malcolm Haak
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
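
A minimal sketch, assuming a POSIX shell with awk, for turning the dd byte
counts and elapsed times quoted in this thread into MB/s (10^6 bytes per
second, the unit dd prints) and a write-to-read ratio. The helper names
mbps and ratio are made up for illustration and are not part of any tool
mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: throughput in MB/s (10^6 bytes/s, as dd reports it)
# from a byte count ($1) and elapsed seconds ($2), rounded to an integer.
mbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# Hypothetical helper: ratio of write rate ($1) to read rate ($2),
# printed with one decimal place.
ratio() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", w / r }'
}

# Figures copied from the dd runs on dogbreath quoted in the thread.
write_rate=$(mbps 100663296000 145.212)   # 9600 x 10M write
read_rate=$(mbps 41943040000 382.334)     # 10000 x 4M buffered read

echo "write ${write_rate} MB/s, read ${read_rate} MB/s," \
     "ratio $(ratio "$write_rate" "$read_rate")"
```

Feeding in other pairs of runs from the thread (for example the
iflag=direct read) gives a quick way to compare before and after a tuning
change such as the tcp_moderate_rcvbuf setting.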