From: Mark Kampe <mark.kampe@inktank.com>
To: "Sébastien Han" <han.sebastien@gmail.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: RBD fio Performance concerns
Date: Fri, 16 Nov 2012 14:59:26 -0800
Message-ID: <50A6C54E.9010805@inktank.com>
In-Reply-To: <CAOLwVUmQa4C_vs_Mbi3b2LeO=wx8_EMVWX5Pyu0y-JnG8nyz+Q@mail.gmail.com>
On 11/15/2012 12:23 PM, Sébastien Han wrote:
> First of all, I would like to thank you for this well explained,
> structured and clear answer. I guess I got better IOPS thanks to the 10K disks.
10K RPM would bring your per-drive throughput (for 4K random writes)
up to about 142 IOPS (4ms seek + 3ms half rotation = 7ms) and your
aggregate cluster throughput up to about 1700 IOPS (12 x 142).
This would predict a corresponding RADOSbench throughput somewhere
above 425 IOPS (1700/4; how much better depends on write aggregation
and cylinder affinity). Your RADOSbench 708 now seems even more
reasonable.
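
For anyone who wants to replay the back-of-envelope numbers, here is a
minimal Python sketch. The function names are made up for illustration,
and the 4ms seek and ~0.03ms transfer times are the same crude
assumptions as in the estimate quoted below, not measurements.

```python
def drive_iops(rpm, seek_ms=4.0, xfer_ms=0.03):
    """Crude sustained 4K random-write IOPS for one spinning drive.

    seek_ms and xfer_ms are assumed values, not measured ones.
    """
    half_rotation_ms = 0.5 * 60_000.0 / rpm  # average rotational latency
    return 1000.0 / (seek_ms + half_rotation_ms + xfer_ms)


def cluster_create_iops(rpm, drives=12, writes_per_create=4):
    """Aggregate drive IOPS divided by the writes each 4K object create costs."""
    return drives * drive_iops(rpm) / writes_per_create


for rpm in (7200, 10_000):
    print(rpm, round(drive_iops(rpm)), round(cluster_create_iops(rpm)))
```

The results land close to the rounded figures used in this thread; the
small differences come from not rounding the rotational latency.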
> To be really honest I wasn't so concerned about the RADOS benchmarks
> but more about the RBD fio benchmarks and the amount of IOPS that comes
> out of it, which I found a bit too low.
Sticking with 4K random writes, it looks to me like you were running
fio with libaio (which, to be truly asynchronous, means direct I/O
with no buffer cache). Because it
is direct, every I/O operation is really happening and the best
sustained throughput you should expect from this cluster is
the aggregate raw fio 4K write throughput (1700 IOPS) divided
by two copies = 850 random 4K writes per second. If I read the
output correctly you got 763 or about 90% of back-of-envelope.
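
The same comparison as a few lines of Python, as a sketch only. The
function name is hypothetical; the 1700 IOPS raw figure and the 2-copy
replication are the assumptions stated above.

```python
def rbd_write_ceiling(raw_iops, copies=2):
    """Best sustained client write rate if every copy costs one raw write.

    Assumes replication is the only overhead, which is optimistic.
    """
    return raw_iops / copies


ceiling = rbd_write_ceiling(1700)
measured = 763  # 4K random-write IOPS read from the fio output
print(f"ceiling={ceiling:.0f} IOPS, measured={measured} "
      f"({measured / ceiling:.0%} of ceiling)")
```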
BUT, there are some footnotes (there always are with performance):
If you had been doing buffered I/O you would have seen a lot more
(up front) benefit from page caching ... but you wouldn't have been
measuring real (and hence sustainable) I/O throughput ... which is
ultimately limited by the heads on those twelve disk drives, where
all of those writes ultimately wind up. It is easy to be fast
if you aren't really doing the writes :-)
I would have expected write aggregation and cylinder affinity to
have eliminated some seeks and improved rotational latency resulting
in better than theoretical random write throughput. Against those
expectations 763/850 IOPS is not so impressive. But, it looks to
me like you were running fio in a 1G file with 100 parallel requests.
The default RBD stripe width is 4M. This means that those 100
parallel requests were being spread across 256 (1G/4M) objects.
People in the know tell me that writes to a single object are
serialized, which means that many of those (potentially) parallel
writes were to the same object, and hence serialized. This would
increase the average request time for the colliding operations,
and reduce the aggregate throughput correspondingly. Use a
bigger file (or a narrower stripe) and this will get better.
Thus, getting 763 random 4K write IOPS out of those 12 drives
still sounds about right to me.
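
To get a feel for how often those 100 parallel requests land on the
same object, here is a rough Python sketch. It assumes the offsets are
uniformly random and independent, and it only looks at a single
snapshot of the in-flight requests, so treat it as an illustration
rather than a model of what the OSDs actually do. The function names
are hypothetical.

```python
def objects_in_file(file_bytes, object_bytes=4 * 2**20):
    """Number of RBD objects backing a file of the given size (4M objects)."""
    return file_bytes // object_bytes


def expected_distinct_objects(objects, in_flight):
    """Expected number of distinct objects hit by uniformly random requests."""
    return objects * (1.0 - (1.0 - 1.0 / objects) ** in_flight)


GIB = 2**30
for size_gib in (1, 10, 100):
    n = objects_in_file(size_gib * GIB)
    distinct = expected_distinct_objects(n, 100)
    # Requests beyond the distinct-object count are queued behind another
    # request to the same object (under the assumptions above).
    print(f"{size_gib:>3} GiB: {n} objects, "
          f"~{100 - distinct:.1f} of 100 requests serialized")
```

Under these assumptions the serialized share shrinks quickly as the
file grows, which is why a bigger file (or a narrower stripe) should
help.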
> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@inktank.com> wrote:
>
>> Dear Sebastien,
>>
>> Ross Turn forwarded me your e-mail. You sent a great deal
>> of information, but it was not immediately obvious to me
>> what your specific concern was.
>>
>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
>> radosbench (4K object creation) throughput of 2.9MB/s
>> (or 708 IOPS). I infer that you were disappointed by
>> this number, but it looks right to me.
>>
>> Assuming typical 7200 RPM drives, I would guess that each
>> of them would deliver a sustained direct 4K random write
>> performance in the general neighborhood of:
>>     4ms seek    (short seeks with write-settle-downs)
>>     4ms latency (1/2 rotation)
>>     0ms write   (4K/144MB/s ~ 30us)
>>     -----
>>     8ms or about 125 IOPS
>>
>> Your twelve drives should therefore have a sustainable
>> aggregate direct 4K random write throughput of 1500 IOPS.
>>
>> Each 4K object create involves four writes (two copies,
>> each getting one data write and one data update). Thus
>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>
>> You are getting almost twice the expected raw IOPS ...
>> and we should expect that a large number of parallel
>> operations would realize some write/seek aggregation
>> benefits ... so these numbers look right to me.
>>
>> Is this the number you were concerned about, or have I
>> misunderstood?