From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xing Lin <xinglin@cs.utah.edu>
Subject: Re: bandwidth with Ceph - v0.59 (Bobtail)
Date: Fri, 25 Apr 2014 17:47:12 -0600
Message-ID: <535AF400.1000007@cs.utah.edu>
References: <2C899773-4518-4704-86B3-F27B035F7E13@cs.utah.edu> <CAPYLRzgT0MRrdmPPaRii6h5nXoueJsKtfoMsKK3Og8Fc1-2bjw@mail.gmail.com> <89EC8837-F1DF-4B41-86CD-334BEA4226A2@cs.utah.edu> <535AD563.9010504@inktank.com> <535AD894.3020709@cs.utah.edu> <535ADEC6.4090205@inktank.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from rio.cs.utah.edu ([155.98.64.241]:37875 "EHLO
	mail-svr1.cs.utah.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751297AbaDYXrb (ORCPT
	<rfc822;ceph-devel@vger.kernel.org>); Fri, 25 Apr 2014 19:47:31 -0400
In-Reply-To: <535ADEC6.4090205@inktank.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Mark Nelson <mark.nelson@inktank.com>, Gregory Farnum <greg@inktank.com>
Cc: xinglin@cs.utah.edu, "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>, ceph-users <ceph-users@ceph.com>

Hi Mark,

Thanks for sharing this. I did read these blog posts earlier. Looking at
the aggregate bandwidth, 600-700 MB/s of reads from 6 disks is quite
good. But since it is shared among 256 concurrent read streams, each
stream gets only 2-3 MB/s. That does not sound right.
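
To show where the 2-3 MB/s figure comes from, here is a rough sketch of
the arithmetic. It assumes the aggregate is split evenly across the 256
streams, which may not be exactly true in practice; the function name is
just something I made up for illustration.

```python
def per_stream_mb_s(aggregate_mb_s: float, streams: int) -> float:
    """Average bandwidth per stream, assuming an even split (an assumption)."""
    return aggregate_mb_s / streams


STREAMS = 256  # concurrent 4MB rados bench operations in the blog post

for aggregate in (600, 700):  # reported aggregate read bandwidth, MB/s
    share = per_stream_mb_s(aggregate, STREAMS)
    print(f"{aggregate} MB/s / {STREAMS} streams = {share:.2f} MB/s per stream")
```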

I would expect the read bandwidth of a disk to be close to its write
bandwidth. But just to double-check: what sequential read bandwidth can
your disks provide?

I also read your follow-up blog posts comparing bobtail and cuttlefish.
One thing I could not get from those experiments is the disk limit,
because they hit the network bottleneck well before being bottlenecked
by the disks. Could you set up a smaller cluster (e.g. with 8 disks
rather than 24), such that a 10 Gb/s link does not become the
bottleneck, and then test how much disk bandwidth can be achieved,
preferably with a newer release of Ceph? My other concern is that I am
not sure how close RADOS bench results are to kernel RBD performance.
I would appreciate it if you could do that. Thanks,

Xing
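
P.S. A back-of-the-envelope sketch of why I suggested 8 disks rather
than 24 for a 10 Gb/s link. The numbers are assumptions, not
measurements: I take 150 MB/s per disk (the top of the 140-150 MB/s
write figure you quoted, assumed to hold for reads too), use decimal
units, and ignore protocol overhead. The function names are made up for
illustration.

```python
# Back-of-the-envelope check: can N disks together outrun one network link?
# Assumptions (not measurements): decimal units, no protocol overhead,
# and 150 MB/s of sequential bandwidth per disk.

def link_mb_s(gbit_s: float) -> float:
    """Convert a link speed in Gb/s to MB/s (1 Gb/s = 125 MB/s)."""
    return gbit_s * 1000 / 8


def network_bound(n_disks: int, disk_mb_s: float = 150.0,
                  gbit_s: float = 10.0) -> bool:
    """True if the combined disk bandwidth exceeds the link bandwidth."""
    return n_disks * disk_mb_s > link_mb_s(gbit_s)


for n in (8, 24):
    total = n * 150.0
    print(f"{n} disks: {total:.0f} MB/s of disk vs "
          f"{link_mb_s(10):.0f} MB/s of link, "
          f"network-bound: {network_bound(n)}")
```

Under these assumptions the link would be the limit with 24 disks but
not with 8, so the smaller setup should expose the disk limit instead.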

On 04/25/2014 04:16 PM, Mark Nelson wrote:
> I don't have any recent results published, but you can see some of the 
> older results from bobtail here:
>
> http://ceph.com/performance-2/argonaut-vs-bobtail-performance-preview/
>
> Specifically, look at the 256 concurrent 4MB rados bench tests. In a 6 
> disk, 2 SSD configuration we could push about 800MB/s for writes (no 
> replication) and around 600-700MB/s for reads with BTRFS.  On this 
> controller using a RAID0 configuration with WB cache helps quite a 
> bit, but in other tests I've seen similar results with a 9207-8i that 
> doesn't have WB cache when BTRFS filestores and SSD journals are used.
>
> Regarding the drives, they can do somewhere around 140-150MB/s large 
> block writes with fio.
>
> Replication definitely adds additional latency so aggregate write 
> throughput goes down, though it seems the penalty is worst after the 
> first replica and doesn't hurt as much with subsequent ones.
>
> Mark
>
>
> On 04/25/2014 04:50 PM, Xing Lin wrote:
>> Hi Mark,
>>
>> That seems pretty good. What is the block level sequential read
>> bandwidth of your disks? What configuration did you use? What were the
>> replica size and read_ahead for your rbds, and how many workloads did
>> you use? I used btrfs in my experiments as well.
>>
>> Thanks,
>> Xing
>>
>> On 04/25/2014 03:36 PM, Mark Nelson wrote:
>>> For what it's worth, I've been able to achieve up to around 120MB/s
>>> with btrfs before things fragment.
>>>
>>> Mark
>>>
>>> On 04/25/2014 03:59 PM, Xing wrote:
>>>> Hi Gregory,
>>>>
>>>> Thanks very much for your quick reply. When I started to look into
>>>> Ceph, Bobtail was the latest stable release and that was why I picked
>>>> that version and started to make a few modifications. I have not
>>>> ported my changes to 0.79 yet. The plan is if v-0.79 can provide a
>>>> higher disk bandwidth efficiency, I will switch to 0.79.
>>>> Unfortunately, that does not seem to be the case.
>>>>
>>>> The futex trace was done with version 0.79, not 0.59. I did a profile
>>>> in 0.59 too. There are some improvements, such as the introduction of
>>>> fd cache. But lots of futex calls are still there in v-0.79. I also
>>>> measured the maximum bandwidth from each disk we can get in Version
>>>> 0.79. It does not improve significantly: we can still only get 90~100
>>>> MB/s from each disk.
>>>>
>>>> Thanks,
>>>> Xing
>>>>
>>>>
>>>> On Apr 25, 2014, at 2:42 PM, Gregory Farnum <greg@inktank.com> wrote:
>>>>
>>>>> Bobtail is really too old to draw any meaningful conclusions from;
>>>>> why did you choose it?
>>>>>
>>>>> That's not to say that performance on current code will be better
>>>>> (though it very much might be), but the internal architecture has
>>>>> changed in some ways that will be particularly important for the
>>>>> futex profiling you did, and are probably important for these
>>>>> throughput results as well.
>>>>> -Greg
>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>
>>>>>
>>>>> On Fri, Apr 25, 2014 at 1:38 PM, Xing <xinglin@cs.utah.edu> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I also did a few other experiments to find the maximum bandwidth
>>>>>> we can get from each data disk. The results are not encouraging:
>>>>>> from disks that provide 150 MB/s of block-level sequential read
>>>>>> bandwidth, we can only get about 90 MB/s each. Something that is
>>>>>> particularly interesting is that the replica size also affects
>>>>>> the bandwidth we can get from the cluster. I have not seen this
>>>>>> observation discussed in the Ceph community, so I think it may
>>>>>> be helpful to share my findings.
>>>>>>
>>>>>> The experiment was run with two d820 machines in Emulab at the
>>>>>> University of Utah. One is used as the data node and the other
>>>>>> as the client. They are connected by 10 Gb/s Ethernet. The data
>>>>>> node has 7 disks: one for the OS and the remaining 6 for OSDs.
>>>>>> Each OSD uses a pair of those 6 disks, one for its journal and
>>>>>> the other for its data, so in total we have 3 OSDs. The network
>>>>>> bandwidth is sufficient to support reading from 3 disks at full
>>>>>> bandwidth.
>>>>>>
>>>>>> I varied the read-ahead size for the rbd block device (exp1), the
>>>>>> number of osd op threads for each osd (exp2), the replica size
>>>>>> (exp3), and the object size (exp4). The most interesting is
>>>>>> varying the replica size. As I varied the replica size from 1 to
>>>>>> 2 and then to 3, the aggregate bandwidth dropped from 267 MB/s
>>>>>> to 211 MB/s and 180 MB/s. I believe the reason for the drop is
>>>>>> that as we increase the number of replicas, we store more data
>>>>>> on each OSD; when we read it back, we have to read from a
>>>>>> larger range (more seeks). The fundamental problem is likely
>>>>>> that we do replication synchronously and thus lay out object
>>>>>> files in a raid 10 near format rather than the far format. For
>>>>>> the difference between the near and far formats for raid 10,
>>>>>> have a look at the link below.
>>>>>>
>>>>>> http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt 
>>>>>>
>>>>>>
>>>>>>
>>>>>> For results about other experiments, you could download my slides
>>>>>> at the link provided below.
>>>>>> http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx
>>>>>>
>>>>>>
>>>>>> I do not know why Ceph only gets about 60% of the disk bandwidth.
>>>>>> As a comparison, I ran tar to read every rbd object file into a
>>>>>> tarball to see how much bandwidth that workload gets.
>>>>>> Interestingly, the tar workload actually gets a higher bandwidth
>>>>>> (80% of block-level bandwidth), even though it accesses the disk
>>>>>> more randomly (tar reads the object files in a directory
>>>>>> sequentially, while the object files were created in a different
>>>>>> order). For more detail, please see my blog post.
>>>>>> http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html 
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here are a few questions.
>>>>>> 1. What is the maximum bandwidth people can get from each disk? I
>>>>>> found that Jiangang from Intel also reported 57% efficiency for
>>>>>> disk bandwidth. He suggested one reason: interference among so
>>>>>> many sequential read workloads. I agree, but when I ran a single
>>>>>> workload, I still did not get a higher efficiency.
>>>>>> 2. If the efficiency is about 60%, what causes this? Could it be
>>>>>> the locks (futex, as I mentioned in my previous email) or
>>>>>> something else?
>>>>>>
>>>>>> Thanks very much for any feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Xing
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>