Re: bandwidth with Ceph - v0.59 (Bobtail)


From: Mark Nelson <mark.nelson@inktank.com>
To: Xing <xinglin@cs.utah.edu>, Gregory Farnum <greg@inktank.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	ceph-users <ceph-users@ceph.com>
Subject: Re: bandwidth with Ceph - v0.59 (Bobtail)
Date: Fri, 25 Apr 2014 16:36:35 -0500
Message-ID: <535AD563.9010504@inktank.com>
In-Reply-To: <89EC8837-F1DF-4B41-86CD-334BEA4226A2@cs.utah.edu>

For what it's worth, I've been able to achieve up to around 120MB/s with 
btrfs before things fragment.

Mark

On 04/25/2014 03:59 PM, Xing wrote:
> Hi Gregory,
>
> Thanks very much for your quick reply. When I started looking into Ceph, Bobtail was the latest stable release, which is why I picked that version and started making a few modifications. I have not ported my changes to 0.79 yet. The plan was to switch to 0.79 if it provided higher disk bandwidth efficiency. Unfortunately, that does not seem to be the case.
>
> The futex trace was done with version 0.79, not 0.59. I profiled 0.59 too. There are some improvements in 0.79, such as the introduction of the fd cache, but lots of futex calls are still there. I also measured the maximum bandwidth we can get from each disk in version 0.79. It does not improve significantly: we still only get 90-100 MB/s from each disk.
>
> Thanks,
> Xing
>
>
> On Apr 25, 2014, at 2:42 PM, Gregory Farnum <greg@inktank.com> wrote:
>
>> Bobtail is really too old to draw any meaningful conclusions from; why
>> did you choose it?
>>
>> That's not to say that performance on current code will be better
>> (though it very much might be), but the internal architecture has
>> changed in some ways that will be particularly important for the futex
>> profiling you did, and are probably important for these throughput
>> results as well.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Fri, Apr 25, 2014 at 1:38 PM, Xing <xinglin@cs.utah.edu> wrote:
>>> Hi,
>>>
>>> I also ran a few other experiments to find the maximum bandwidth we can get from each data disk. The results are not encouraging: for disks that provide 150 MB/s of block-level sequential read bandwidth, we only get about 90 MB/s from each disk. Something particularly interesting is that the replica size also affects the bandwidth we get from the cluster. I have not seen this observation discussed in the Ceph community, so I think it may be helpful to share my findings.
>>>
>>> The experiment was run on two d820 machines in Emulab at the University of Utah. One is used as the data node and the other as the client. They are connected by 10 Gb/s Ethernet. The data node has 7 disks: one for the OS and the remaining 6 for OSDs. Each OSD uses a pair of those disks, one for the journal and one for data, so we have 3 OSDs in total. The network bandwidth is sufficient to read from all 3 data disks at full bandwidth.
>>>
>>> I varied the read-ahead size for the rbd block device (exp1), the number of osd op threads for each OSD (exp2), the replica size (exp3), and the object size (exp4). The most interesting result comes from varying the replica size. As I increased the replica size from 1 to 2 and then to 3, the aggregate bandwidth dropped from 267 MB/s to 211 MB/s and then to 180 MB/s. I believe the reason for the drop is that as we increase the number of replicas, we store more data on each OSD. When we read the data back, we then have to read from a larger range of the disk (more seeks). The fundamental problem is likely that we replicate synchronously, and thus lay out object files in something like the RAID 10 near format rather than the far format. For the difference between the near and far formats for RAID 10, see the link below.
>>>
>>> http://lxr.free-electrons.com/source/Documentation/device-mapper/dm-raid.txt
>>>
>>> For results about other experiments, you could download my slides at the link provided below.
>>> http://www.cs.utah.edu/~xinglin/slides/ceph-bandiwdth-exp.pptx
>>>
>>>
>>> I do not know why Ceph only gets about 60% of the disk bandwidth. For comparison, I ran tar to read every rbd object file into a tarball and measured the bandwidth of that workload. Interestingly, the tar workload gets a higher bandwidth (80% of the block-level bandwidth), even though it accesses the disk more randomly (tar reads the object files in a directory sequentially, while the object files were created in a different order). For more detail, please see my blog post.
>>> http://xinglin-system.blogspot.com/2014/04/ceph-lab-note-1-disk-read-bandwidth-in.html
>>>
>>> Here are a few questions.
>>> 1. What is the maximum bandwidth people can get from each disk? I found that Jiangang from Intel also reported 57% efficiency for disk bandwidth. He suggested one reason: interference among so many sequential read workloads. I agree, but when I ran a single workload, I still did not get a higher efficiency.
>>> 2. If the efficiency is about 60%, what causes this? Could it be the locks (futex, as I mentioned in my previous email) or something else?
>>>
>>> Thanks very much for any feedback.
>>>
>>> Thanks,
>>> Xing
>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
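
As a rough cross-check of the figures quoted above, the per-disk efficiency can be worked out from the aggregate numbers. This is only a sketch under stated assumptions: it assumes reads are spread evenly across the 3 OSDs, takes 150 MB/s as the block-level baseline for every data disk, and treats the link as 10 Gb/s Ethernet. The variable and function names are made up for illustration and do not come from the thread.

```python
# Figures quoted in the thread; all bandwidths are in MB/s.
BLOCK_LEVEL_MBPS = 150.0  # block-level sequential read bandwidth of one disk
NUM_OSDS = 3              # three data disks, one OSD each

# Aggregate read bandwidth reported for replica sizes 1, 2 and 3.
aggregate_by_replicas = {1: 267.0, 2: 211.0, 3: 180.0}


def per_disk_efficiency(aggregate_mbps, num_osds=NUM_OSDS, raw_mbps=BLOCK_LEVEL_MBPS):
    """Return (per-disk MB/s, fraction of block-level bandwidth).

    Assumption: the aggregate is split evenly over the OSDs.
    """
    per_disk = aggregate_mbps / num_osds
    return per_disk, per_disk / raw_mbps


for replicas, aggregate in sorted(aggregate_by_replicas.items()):
    per_disk, efficiency = per_disk_efficiency(aggregate)
    print(f"replicas={replicas}: {per_disk:.0f} MB/s per disk, "
          f"{efficiency:.0%} of block-level bandwidth")

# Network headroom, assuming a 10 Gb/s link (10,000 Mb/s / 8 bits per byte).
link_mbps = 10_000 / 8
disks_mbps = NUM_OSDS * BLOCK_LEVEL_MBPS
print(f"link: {link_mbps:.0f} MB/s, disks at full speed: {disks_mbps:.0f} MB/s")
```

With the numbers from the thread this works out to roughly 89, 70 and 60 MB/s per disk, or about 59%, 47% and 40% of the block-level bandwidth, and it shows that the link (1250 MB/s under the 10 Gb/s assumption) is well above the 450 MB/s the three data disks could deliver together.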
