Re: RBD fio Performance concerns


From: Mark Nelson <mark.nelson@inktank.com>
To: "Sébastien Han" <han.sebastien@gmail.com>
Cc: Alexandre DERUMIER <aderumier@odiso.com>,
	ceph-devel <ceph-devel@vger.kernel.org>,
	Mark Kampe <mark.kampe@inktank.com>
Subject: Re: RBD fio Performance concerns
Date: Wed, 21 Nov 2012 09:52:40 -0600
Message-ID: <50ACF8C8.7020908@inktank.com>
In-Reply-To: <CAOLwVUn5EAC4QaWgU+TkodcJea7ayKUANUpfzWUG+cTBunao5g@mail.gmail.com>

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are 
getting higher performance with random reads/writes vs sequential!  It 
would be interesting to see what kind of throughput smalliobench reports 
(should be packaged in bobtail) and also see if this behavior happens 
with cephfs.  It's still too early in the morning for me right now to 
come up with a reasonable explanation for what's going on.  It might be 
worth running blktrace and seekwatcher to see what the io patterns on 
the underlying disk look like in each case.  Maybe something unexpected 
is going on.

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:
> Which iodepth did you use for those benchmarks?
>
>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>
> Me neither, hope to get some clarification from the Inktank guys. It
> doesn't make any sense to me...
> --
> Best regards,
> Sébastien HAN.
>
>
> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> rand read 4K : 6000 iops
>> seq read 4K : 3500 iops
>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>> rand write 4K : 6000 iops (tmpfs journal)
>> seq write 4K : 1600 iops
>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>>
>> I tried a high-end CPU on the client; it doesn't change anything.
>> The test cluster uses old 8-core E5420 @ 2.50GHz machines (but CPU usage is around 15% on the cluster during the read benchmark).
>>
>>
>> ----- Original message -----
>>
>> From: "Sébastien Han" <han.sebastien@gmail.com>
>> To: "Mark Kampe" <mark.kampe@inktank.com>
>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 19 November 2012 19:03:40
>> Subject: Re: RBD fio Performance concerns
>>
>> @Sage, thanks for the info :)
>> @Mark:
>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>
>> The original benchmark has been performed with 4M block size. And as
>> you can see I still get more IOPS with rand than seq... I just tried
>> with 4M without direct I/O, still the same. I can print fio results if
>> it's needed.
>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>
>> I know why I use direct I/O. These are synthetic benchmarks, far
>> from a real-life scenario and from how common applications work. I'm just
>> trying to see the maximum I/O throughput that I can get from my RBD. All
>> my applications use buffered I/O.
>>
>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> Thanks to all of you..
>>
>>
>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
>>> Recall:
>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>> 2. distinct writes to a single RADOS object are serialized
>>>
>>> Your sequential 4K writes are direct, depth=256, so there are
>>> (at all times) 256 writes queued to the same object. All of
>>> your writes are waiting through a very long line, which is adding
>>> horrendous latency.
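
The two points above can be illustrated with a small sketch. It assumes the simple layout in which byte offset N of the image lives in object N // 4M (no fancy striping), and it uses a hypothetical 10 GiB image size that does not come from this thread. It only counts how many distinct objects one queue's worth of requests would touch; it does not talk to a cluster.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024  # 4M per RADOS object, as stated above
BLOCK_SIZE = 4 * 1024          # 4K requests, as in the fio job
QUEUE_DEPTH = 256              # iodepth=256
IMAGE_SIZE = 10 * 1024**3      # hypothetical 10 GiB image (assumption)


def object_index(offset):
    """Return the index of the object backing a byte offset (assumed layout)."""
    return offset // OBJECT_SIZE


def objects_touched(offsets):
    """Return the set of distinct object indices hit by a batch of requests."""
    return {object_index(off) for off in offsets}


# One queue's worth of sequential 4K requests starting at offset 0.
sequential = [i * BLOCK_SIZE for i in range(QUEUE_DEPTH)]

# One queue's worth of random, block-aligned 4K requests.
rng = random.Random(0)
blocks_in_image = IMAGE_SIZE // BLOCK_SIZE
random_offsets = [
    rng.randrange(blocks_in_image) * BLOCK_SIZE for _ in range(QUEUE_DEPTH)
]

print("sequential batch touches", len(objects_touched(sequential)), "object(s)")
print("random batch touches", len(objects_touched(random_offsets)), "object(s)")
```

Under these assumptions the 256 sequential requests cover only 1M and so all fall into a single object, where they would be serialized, while the random requests spread over many objects and can proceed in parallel.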
>>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>>
>>>
>>>> That's correct for some of the benchmarks. However even with 4K for
>>>> seq, I still get less IOPS. See below my last fio:
>>>>
>>>> # fio rbd-bench.fio
>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> fio 1.59
>>>> Starting 4 processes
>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>
>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>
>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>
>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>
>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
>>>>
>>>> Run status group 1 (all jobs):
>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>
>>>> Run status group 2 (all jobs):
>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
>>>>
>>>> Run status group 3 (all jobs):
>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec
>>>>
>>>> Disk stats (read/write):
>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>> in_queue=33434120, util=99.79%
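
As a rough cross-check of the fio numbers above, here is a minimal sketch that applies Little's law (throughput is roughly outstanding requests divided by mean latency) to each job. The latencies and IOPS are copied from the output above; the assumption is that the queue stays full at iodepth=256 for the whole run, which the "IO depths" lines suggest but do not prove.

```python
IODEPTH = 256  # from the job headers above

# (job name, mean total latency in seconds, IOPS reported by fio),
# taken from the "lat" and "iops" fields of the output above.
jobs = [
    ("seq-read", 76.67e-3, 3338),
    ("rand-read", 9408.00e-6, 27203),
    ("seq-write", 1389.62e-3, 183),
    ("rand-write", 298.52e-3, 857),
]


def expected_iops(depth, mean_latency_s):
    """Little's law: completions per second for a queue kept at `depth`."""
    return depth / mean_latency_s


for name, latency, reported in jobs:
    estimate = expected_iops(IODEPTH, latency)
    print(f"{name:10s} estimated {estimate:8.0f} iops, fio reported {reported}")
```

The estimates land within about one percent of the reported IOPS for all four jobs, which is consistent with the run being latency-bound at a fixed queue depth: the lower sequential IOPS correspond directly to the higher per-request latency.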
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

