From: Mark Nelson
Subject: Re: Ceph RBD performance - random writes
Date: Wed, 08 Aug 2012 16:58:57 -0500
Message-ID: <5022E121.4070004@inktank.com>
References: <5021F6D1.7000004@catalyst.net.nz> <5022B3F4.5050109@inktank.com>
In-Reply-To: <5022B3F4.5050109@inktank.com>
To: Josh Durgin
Cc: Mark Kirkwood, ceph-devel@vger.kernel.org

On 8/8/12 1:46 PM, Josh Durgin wrote:
> On 08/07/2012 10:19 PM, Mark Kirkwood wrote:
>> I've been looking at using Ceph RBD as a block store for database use.
>> As part of this I'm looking at how (particularly random) IO of smallish
>> (4K, 8K) block sizes performs.
>>
>> I've set up Ceph with a single osd and mon spread over two SSDs (Intel
>> 520) - 2G journal on one and the osd data on the other (xfs filesystem).
>> The Intels are pretty fast, and (despite being shackled by a crappy
>> Nvidia SATA controller) fly for random IO.
>>
>> However I am not seeing that reflected in the RBD case. I have the
>> device mounted on the local machine where the osd and mon are running
>> (so network performance should not be a factor here).
>>
>> Here is what I did:
>>
>> Create a rbd device of 10G and mount on /mnt/vol0:
>>
>> $ rbd create --size 10240 vol0
>> $ rbd map vol0
>> $ mkfs.xfs /dev/rbd0
>> $ mount /dev/rbd0 /mnt/vol0
>>
>> Make a file:
>>
>> $ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4k count=300000 conv=fsync
>> 1228800000 bytes (1.2 GB) copied, 13.4361 s, 91.5 MB/s
>>
>> Performance ok if file size < journal (2G).
>>
>> $ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4096k count=200 conv=fsync
>> 838860800 bytes (839 MB) copied, 9.47086 s, 88.6 MB/s
>>
>> Not so good if file size > journal.
>>
>> $ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4096k count=1000 conv=fsync
>> 4194304000 bytes (4.2 GB) copied, 279.891 s, 15.0 MB/s
>>
>> Random writes (see attached file) sync'ed with sync_file_range are ok if
>> block size big:
>>
>> $ ./writetest /mnt/vol0/dump/file 4194304 0 1
>> random writes: 292 of: 4194304 bytes elapsed: 9.8397s io rate: 30/s
>> (118.70 MB/s)
>>
>> $ ./writetest /mnt/vol0/dump/file 1048576 0 1
>> random writes: 1171 of: 1048576 bytes elapsed: 10.6042s io rate: 110/s
>> (110.43 MB/s)
>>
>> $ ./writetest /mnt/vol0/dump/file 131072 0 1
>> random writes: 9375 of: 131072 bytes elapsed: 15.8075s io rate: 593/s
>> (74.13 MB/s)
>>
>>
>> However smallish block size is suicide (trigger suicide assert after a
>> while), I see 100 IOPS or less on actual devices, all 100% util:
>>
>> $ ./writetest /mnt/vol0/dump/file 8192 0 1
>>
>> I am running into http://tracker.newdream.net/issues/2784 here I think.
>
> This can be a sign of a bug in the underlying filesystem or hardware -
> maybe your controller? That assert is hit when a single operation to
> the filesystem beneath the osd takes longer than 180 seconds (by
> default).
>
>> Note that the actual SSD are very fast for this when accessed directly:
>>
>> $ ./writetest /data1/ceph/1/file 8192 0 1
>> random writes: 1000000 of: 8192 bytes elapsed: 125.7907s io rate: 7950/s
>> (62.11 MB/s)
>>
>>
>> Thanks for your patience in reading so far - some actual questions now
>> :-)
>>
>> 1/ Why is the appending write from dd when the size of file > journal so
>> slow, despite reasonably capable storage devices?
>
> It's possible you need to use more threads to have more operations in
> flight in to the filestore (the main storage for the osd).
> Try something like this in your ceph configuration for the osds:
>
> osd op threads = 24
> osd disk threads = 24
> filestore op threads = 6
> filestore queue max ops = 24
>
> (from http://www.spinics.net/lists/ceph-devel/msg07128.html)
>

It's probably worth giving it a try, but I haven't had much luck improving
small IO performance with more threads/queued ops.

>> 2/ Is the sudden dramatic drop in random write performance a
>> manifestation of the "small requests are slow" issue? or is this
>> something else?
>
> It's probably that. Sam's actively looking into it, and once he has
> something it will be interesting to see how well it works on your
> hardware.

For what it's worth, with mostly default settings I was seeing about
8 MB/s to Dell-branded Samsung SSDs with 4k IOs using rados bench. That
was with 256 concurrent client requests. This is definitely something we
are working hard on tracking down.

>
> Josh
>
>> Thanks
>>
>> Mark
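
The writetest program attached to the quoted message is not part of this
archive. As a rough stand-in, the Python sketch below times block-sized
writes at random aligned offsets and prints a line in the same shape as the
quoted output. It is an approximation under stated assumptions, not the
original tool: the names io_rate and random_write_test and the script name
are made up here, os.fsync is swapped in for sync_file_range, and the
trailing "0 1" arguments of writetest are not reproduced because their
meaning is not given in the thread.

```python
import os
import random
import sys
import time


def io_rate(writes, block_size, elapsed):
    """Return (writes per second, MB/s), with MB taken as 2**20 bytes."""
    per_second = writes / elapsed
    return per_second, per_second * block_size / 2**20


def random_write_test(path, block_size, file_size=None, seconds=10.0, seed=0):
    """Write block_size-byte blocks at random aligned offsets for ~seconds.

    If file_size is given the file is resized to it first; otherwise the
    current size of the file is used. Returns (writes, elapsed_seconds).
    """
    rng = random.Random(seed)
    buf = os.urandom(block_size)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        if file_size is not None:
            os.ftruncate(fd, file_size)
        blocks = os.fstat(fd).st_size // block_size
        if blocks < 1:
            raise ValueError("file must hold at least one block")
        writes = 0
        start = time.monotonic()
        while True:
            os.pwrite(fd, buf, rng.randrange(blocks) * block_size)
            # Assumption: a full fsync per write stands in for the
            # sync_file_range call mentioned in the thread.
            os.fsync(fd)
            writes += 1
            elapsed = time.monotonic() - start
            if elapsed >= seconds:
                break
    finally:
        os.close(fd)
    return writes, max(elapsed, 1e-9)


if __name__ == "__main__":
    # Hypothetical usage: python3 writetest_sketch.py /mnt/vol0/dump/file 8192
    path, block_size = sys.argv[1], int(sys.argv[2])
    writes, elapsed = random_write_test(path, block_size)
    per_second, mb_per_second = io_rate(writes, block_size, elapsed)
    print(f"random writes: {writes} of: {block_size} bytes "
          f"elapsed: {elapsed:.4f}s io rate: {per_second:.0f}/s "
          f"({mb_per_second:.2f} MB/s)")
```

MB/s here means 2**20 bytes per second, which is consistent with the quoted
figures: 1171 writes of 1048576 bytes in 10.6042 s works out to about 110.43
writes/s and 110.43 MB/s. Because fsync flushes more than sync_file_range
does, absolute numbers from this sketch should not be compared directly with
the ones in the thread.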