From: Mark Nelson <mark.nelson@inktank.com>
To: Eric_YH_Chen@wiwynn.com
Cc: ceph-devel@vger.kernel.org
Subject: Re: Performance benchmark of rbd
Date: Wed, 13 Jun 2012 07:29:37 -0500
Message-ID: <4FD887B1.2070202@inktank.com>
In-Reply-To: <8512670932FB654F81AF0FEF1BE6D49D01838FC3@WHQBEMAIL1.whq.wistron>
Hi Eric!
On 6/13/12 5:06 AM, Eric_YH_Chen@wiwynn.com wrote:
> Hi, all:
>
> I am doing some benchmark of rbd.
> The platform is on a NAS storage.
>
> CPU: Intel E5640 2.67GHz
> Memory: 192 GB
> Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12 , 7200rpm
> (H1~ H12)
> RAID Card: LSI 9260-4i
> OS: Ubuntu12.04 with Kernel 3.2.0-24
> Network: 1 Gb/s
>
> We created 12 OSDs on H1 ~ H12, with the journals placed on H0.
Just to make sure I understand, you have a single node with 12 OSDs and
3 mons, and all 12 OSDs are using the H0 disk for their journals? What
filesystem are you using for the OSDs? How much replication?
> We also create 3 MON in the cluster.
> In brief, we set up an all-in-one ceph cluster with 3 monitors and
> 12 OSDs.
>
> The benchmark tool we used is fio 2.0.3. We ran 7 basic test cases:
> 1) sequential write with bs=64k
> 2) sequential read with bs=64k
> 3) random write with bs=4k
> 4) random write with bs=16k
> 5) mix read/write with bs=4k
> 6) mix read/write with bs=8k
> 7) mix read/write with bs=16k
>
> We created several rbd images with different object sizes for the benchmark.
>
> 1. size = 20G, object size = 32KB
> 2. size = 20G, object size = 512KB
> 3. size = 20G, object size = 4MB
> 4. size = 20G, object size = 32MB
Given how much memory you have, you may want to increase the amount of
data you are writing during each test to rule out caching.
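To put the caching concern in numbers, here is a minimal Python sketch. The 192 GB of RAM and the 20G image size come from the original post; the 2x-RAM target is only an assumed rule of thumb, not something stated in this thread, and the function names are hypothetical.

```python
# Compare the benchmark working set against RAM to see whether the page
# cache could absorb the whole test.

RAM_GB = 192.0    # from the original post
IMAGE_GB = 20.0   # size of each rbd image in the original post


def fits_in_ram(dataset_gb, ram_gb):
    """True if the whole working set could sit in the page cache."""
    return dataset_gb <= ram_gb


def suggested_min_dataset_gb(ram_gb, factor=2.0):
    """ASSUMPTION: 'factor' times RAM is a rule-of-thumb target, not a Ceph rule."""
    return ram_gb * factor


print("20G image fits in RAM:", fits_in_ram(IMAGE_GB, RAM_GB))
print("suggested minimum data written (GB):", suggested_min_dataset_gb(RAM_GB))
```

With these inputs the 20G image is roughly a tenth of the available memory, so a run confined to it could in principle be served largely from cache.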
>
> We reached some conclusions after the benchmark.
>
> a. We get better sequential read/write performance when the
> object size is bigger.
>
> Object size   Seq-read   Seq-write
> 32 KB         23 MB/s     690 MB/s
> 512 KB        26 MB/s     960 MB/s
> 4 MB          27 MB/s    1290 MB/s
> 32 MB         36 MB/s    1435 MB/s
Which test are these results from? I'm suspicious that the write
numbers are so high. Figure that even with a local client and 1X
replication, your journals and data partitions are each writing out a
copy of the data. You don't have enough disk in that box to sustain
1.4GB/s to both even under perfectly ideal conditions. Given that it
sounds like you are using a single 7200rpm disk for 12 journals, I would
expect far lower numbers...
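The arithmetic behind that suspicion can be sketched in a few lines of Python. The disk counts and the reported figure come from the original post; the 150 MB/s per-disk rate is an assumption (a deliberately generous sequential figure for a 7200rpm SATA drive), and the two function names are hypothetical.

```python
# Back-of-the-envelope ceiling on sustained client write throughput when
# every client byte is written twice: once to a journal, once to a data disk.

PER_DISK_MB_S = 150.0   # ASSUMPTION: optimistic sequential rate per 7200rpm disk
DATA_DISKS = 12         # H1..H12 from the original post
JOURNAL_DISKS = 1       # H0 from the original post
REPORTED_MB_S = 1435.0  # best sequential-write figure reported above


def ideal_ceiling(total_disks, per_disk_mb_s, replication=1):
    """Best case: journal and data writes spread evenly over all disks."""
    copies_per_client_byte = 2 * replication  # journal copy + data copy
    return total_disks * per_disk_mb_s / copies_per_client_byte


def shared_journal_ceiling(journal_disks, data_disks, per_disk_mb_s, replication=1):
    """All journals on dedicated journal disk(s): the slower side is the limit."""
    journal_bw = journal_disks * per_disk_mb_s
    data_bw = data_disks * per_disk_mb_s
    return min(journal_bw, data_bw) / replication


ideal = ideal_ceiling(DATA_DISKS + JOURNAL_DISKS, PER_DISK_MB_S)
shared = shared_journal_ceiling(JOURNAL_DISKS, DATA_DISKS, PER_DISK_MB_S)
print(f"ideal ceiling:          {ideal:.0f} MB/s")
print(f"shared-journal ceiling: {shared:.0f} MB/s")
print(f"reported:               {REPORTED_MB_S:.0f} MB/s")
```

Under these assumptions both ceilings land well below the reported 1435 MB/s, and the shared-journal case is limited to what a single disk can write.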
>
> b. Object size has no obvious influence on random read/write.
> All the results fall within a range of no more than 10%.
>
> rand-write-4K  rand-write-16K  mix-4K     mix-8k     mix-16k
> 881 iops       564 iops        1462 iops  1127 iops  1044 iops
>
> c. If we change the environment so that every 3 hard drives are bound
> together as RAID0 (LSI 9260-4i RAID card),
> the ceph cluster becomes 3 MONs and 4 OSDs (3T each).
> We get better performance on all items, an improvement of around
> 10% ~ 20%.
Those IOPs numbers are more what I would expect. Using HW raid0 may
provide some benefit depending on the number of OSDs per node. It's
something we haven't had time to look at yet in detail, but is on our list.
>
> d. If we change H0 to an SSD device and put all the journals
> on it, we get better performance on sequential write.
> It reaches 135MB/s. However, there is no difference for the other
> test items.
>
> We want to check with you whether all these conclusions seem reasonable
> to you. Or does anything seem strange? Thanks!
When you say that using an SSD device increases the sequential-write
speeds to 135MB/s, what are you comparing that to? Incidentally that
level of performance is entirely believable with 12 OSDs sharing a
single SSD for journals.
The write results with the 7200rpm journal disk do look strange to me,
but it's tough to say what's going on. If the numbers are accurate, I'd
say writes aren't getting to the disks. If they are mislabeled (i.e.
1.435MB/s or 1435Mb/s instead of 1435MB/s), then things are more believable
and I'd try putting your journals on a small partition on each disk
(causes some extra seek behavior and lower OSD throughput, but better
than stacking the journals up on a single slow disk).
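For reference, the two possible mislabelings work out as follows. This is a small Python sketch of the unit conversions only; it says nothing about which reading is actually correct, and the helper name is hypothetical.

```python
# Convert the reported figure under each possible reading of its units.

REPORTED = 1435.0  # the number as printed in the results table


def mbit_to_mbyte(mbit_per_s):
    """Megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8.0


readings_mb_s = {
    "1435 MB/s (as reported)": REPORTED,
    "1435 Mb/s (megabits)": mbit_to_mbyte(REPORTED),
    "1.435 MB/s (decimal shift)": REPORTED / 1000.0,
}

for label, mb_s in readings_mb_s.items():
    print(f"{label:28s} -> {mb_s:9.3f} MB/s")
```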
>
> ====
>
> Here is some data from the benchmark command provided by rados:
> rados -p rbd bench 120 write -t 8
>
> Total time run: 120.751713
> Total writes made: 930
> Write size: 4194304
> Bandwidth (MB/sec): 30.807
>
> Average Latency: 1.03807
> Max latency: 2.63197
> Min latency: 0.205726
>
> [INF] bench: wrote 1024 MB in blocks of 4096 KB in 13.219819 sec
> at 79318 KB/sec
That looks much closer to what I would expect if you have 12 journals
all sharing a single 7200rpm drive.
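The quoted figures are also internally consistent, which can be checked with a short Python sketch. All inputs are copied from the output above; the Little's-law estimate assumes all 8 requests stay in flight for the whole run, which is an assumption about how the benchmark behaves rather than something the output states.

```python
# Cross-check the quoted rados bench output against its own numbers.

MIB = 1024 * 1024

total_time_s = 120.751713
writes_made = 930
write_size_bytes = 4194304
concurrency = 8            # from "-t 8"
avg_latency_s = 1.03807

# Bandwidth from total bytes written over total time.
bandwidth_mb_s = writes_made * write_size_bytes / MIB / total_time_s

# Little's law: throughput = requests in flight / average latency.
# ASSUMPTION: the queue stays full at 8 requests for the whole run.
littles_law_mb_s = concurrency / avg_latency_s * write_size_bytes / MIB

# Second quoted line: 1024 MB in 13.219819 s, expressed in KB/s.
second_line_kb_s = 1024 * 1024 / 13.219819

print(f"bandwidth from totals:   {bandwidth_mb_s:.3f} MB/s")
print(f"bandwidth from latency:  {littles_law_mb_s:.3f} MB/s")
print(f"second line throughput:  {second_line_kb_s:.0f} KB/s")
```

The first value reproduces the reported 30.807 MB/s, and the latency-based estimate agrees with it to within a fraction of a percent.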
>
Mark