From mboxrd@z Thu Jan 1 00:00:00 1970
From: Malcolm Haak
Subject: Re: RBD Read performance
Date: Thu, 18 Apr 2013 17:22:29 +1000
Message-ID: <516F9F35.1030507@sgi.com>
References: <516F77FF.4060401@sgi.com> <516F9AEB.7000706@inktank.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from relay1.sgi.com ([192.48.179.29]:42494 "EHLO relay.sgi.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S965073Ab3DRHWe (ORCPT ); Thu, 18 Apr 2013 03:22:34 -0400
In-Reply-To: <516F9AEB.7000706@inktank.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Mark Nelson
Cc: ceph-devel@vger.kernel.org

Hi Mark!

Thanks for the quick reply!

I'll reply inline below.

On 18/04/13 17:04, Mark Nelson wrote:
> On 04/17/2013 11:35 PM, Malcolm Haak wrote:
>> Hi all,
>
> Hi Malcolm!
>
>>
>> I jumped into the IRC channel yesterday and they said to email
>> ceph-devel. I have been having some read performance issues, with
>> reads being slower than writes by a factor of ~5-8.
>
> I recently saw this kind of behaviour (writes were fine, but reads were
> terrible) on an IPoIB based cluster and it was caused by the same TCP
> auto tune issues that Jim Schutt saw last year. It's worth a try at
> least to see if it helps.
>
> echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>
> on all of the clients and server nodes should be enough to test it out.
> Sage added an option in more recent Ceph builds that lets you work
> around it too.
>

Awesome, I will test this first up tomorrow.

>>
>> First info:
>> Server
>> SLES 11 SP2
>> Ceph 0.56.4.
>> 12 OSDs, each a hardware RAID 5 LUN made from 5 NL-SAS disks, for a
>> total of 60 disks (each LUN can do around 320MB/s stream write and
>> the same if not better read). Connected via 2xQDR IB
>> OSDs/MDS and such all on the same box (for testing)
>> Box is a Quad AMD Opteron 6234
>> RAM is 256GB
>> 10GB Journals
>> osd_op_threads: 8
>> osd_disk_threads: 2
>> filestore_op_threads: 4
>> OSDs are all XFS
>
> Interesting setup! Quad socket Opteron boxes have somewhat slow and
> slightly oversubscribed HyperTransport links, don't they? I wonder if,
> on a system with so many disks and QDR-IB, that could become a problem...
>
> We typically like smaller nodes where we can reasonably do 1 OSD per
> drive, but we've tested on a couple of 60 drive chassis in RAID configs
> too. Should be interesting to hear what kind of aggregate performance
> you can eventually get.

We are also going to try this out with 6 LUNs on a dual Xeon box. The
Opteron box was the biggest, scariest thing we had that was doing nothing.

>
>>
>> All nodes are connected via QDR IB using IPoIB. We get 1.7GB/s on TCP
>> performance tests between the nodes.
>>
>> Clients: One is FC17, the other is Ubuntu 12.10. They only have around
>> 32GB-70GB RAM.
>>
>> We ran into an odd issue where the OSDs would all start in the same NUMA
>> node and pretty much on the same processor core. We fixed that up with
>> some cpuset magic.
>
> Strange! Was that more due to cpuset or Ceph? I can't imagine that we
> are doing anything that would cause that.
>

More than likely it is an odd quirk in the SLES kernel, but when I have
time I'll do some more poking. We were seeing insane CPU usage on some
cores because all the OSDs were piled up in one place.
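
One way to confirm that a cpuset fix actually spread the daemons out is to
compare the CPUs each OSD process is allowed to run on. The following is a
minimal sketch, not the method used in this thread: it assumes a Linux host
where /proc/<pid>/status exposes a Cpus_allowed_list field, and the helper
names (parse_cpu_list, allowed_cpus, group_by_cpuset) are hypothetical.
It reports allowed CPUs only, not where the scheduler placed each thread.

```python
from collections import defaultdict


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid):
    """Read the Cpus_allowed_list field for a PID from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()


def group_by_cpuset(pid_to_cpus):
    """Group PIDs that share exactly the same allowed CPU set."""
    groups = defaultdict(list)
    for pid, cpus in pid_to_cpus.items():
        groups[frozenset(cpus)].append(pid)
    return dict(groups)
```

Feeding the OSD PIDs through allowed_cpus() and then group_by_cpuset() shows
at a glance whether several daemons still share one narrow CPU set; a single
group holding every PID would suggest the pinning did not take effect.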
>>
>> Performance testing we have done: (Note oflag=direct was yielding
>> results within 5% of cached results)
>>
>>
>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
>> 3200+0 records in
>> 3200+0 records out
>> 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
>> root@ty3:~#
>> root@ty3:~# rm /test-rbd-fs/DELETEME
>> root@ty3:~#
>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
>> 4800+0 records in
>> 4800+0 records out
>> 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s
>>
>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400
>> 2400+0 records in
>> 2400+0 records out
>> 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600
>> 9600+0 records in
>> 9600+0 records out
>> 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s
>>
>> Both clients each doing a 140GB write (2x dogbreath's RAM) at the same
>> time to two different rbds in the same pool.
>>
>> root@ty3:~# rm /test-rbd-fs/DELETEME
>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
>> 14000+0 records in
>> 14000+0 records out
>> 146800640000 bytes (147 GB) copied, 412.404 s, 356 MB/s
>> root@ty3:~#
>>
>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000
>> 14000+0 records in
>> 14000+0 records out
>> 146800640000 bytes (147 GB) copied, 433.351 s, 339 MB/s
>> [root@dogbreath ~]#
>>
>> Onto reads...
>> Also we found that doing iflag=direct increased read performance.
>>
>> [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160
>> 160+0 records in
>> 160+0 records out
>> 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
>> [root@dogbreath ~]#
>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 41943040000 bytes (42 GB) copied, 382.334 s, 110 MB/s
>> [root@dogbreath ~]#
>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
>> 10000+0 records in
>> 10000+0 records out
>> 41943040000 bytes (42 GB) copied, 150.774 s, 278 MB/s
>> [root@dogbreath ~]#
>>
>>
>> So what info do you want/where do I start hunting for my wumpus?
>
> might also be worth looking at the size of the reads to see if there's a
> lot of fragmentation. Also, is this kernel rbd or qemu-kvm?
>

The thing that got us was that the back-end storage was showing very low
read rates, whereas when writing we could see almost a 2x write rate back
to physical disk (we assume that is journal+data, as the 2x is not there
from the word go but ramps up around the 3-5 second mark).

It is kernel rbd at the moment; we will be testing qemu-kvm after things
make sense.

>>
>> Regards
>>
>> Malcolm Haak
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
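
For anyone comparing the numbers in this thread, the read/write gap can be
recomputed directly from the dd byte counts and elapsed times. The sketch
below is only illustrative arithmetic: the figures are copied from the
transcripts above, the run labels and function names are made up for this
example, and it assumes dd's convention of 1 MB = 10**6 bytes.

```python
# (label, bytes copied, elapsed seconds), copied from the dd output above.
RUNS = [
    ("write ty3 bs=10M", 33554432000, 47.6685),
    ("read dogbreath bs=10M buffered", 1677721600, 29.4242),
    ("read dogbreath bs=4M buffered", 41943040000, 382.334),
    ("read dogbreath bs=4M iflag=direct", 41943040000, 150.774),
]


def mb_per_s(nbytes, seconds):
    """Throughput in MB/s, with 1 MB = 10**6 bytes as dd reports it."""
    return nbytes / seconds / 1e6


def slowdown(baseline_rate, other_rate):
    """How many times slower other_rate is than baseline_rate."""
    return baseline_rate / other_rate


if __name__ == "__main__":
    rates = {label: mb_per_s(nbytes, secs) for label, nbytes, secs in RUNS}
    write_rate = rates["write ty3 bs=10M"]
    for label, rate in rates.items():
        ratio = slowdown(write_rate, rate)
        print(f"{label:36s} {rate:7.1f} MB/s  write is {ratio:4.1f}x this")
```

Using the single-client write on ty3 as the baseline is an arbitrary choice
for the example; swapping in one of the other write runs changes the ratios
but not the overall picture that buffered reads trail writes by a wide margin
and iflag=direct narrows the gap.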